generator

sqllocks_spindle.engine.generator

Main Spindle generator — the public API entry point.

Classes

ColumnLineage dataclass

Tracks which strategy produced a column's values.

GenerationResult dataclass

Result of a generation run.

Attributes
table_names property

Return list of table names in generation order.

Methods:
get_lineage(table, column)

Look up lineage for a specific column.

verify_integrity()

Verify referential integrity across all tables in parallel.
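
As an illustration of what a parallel referential-integrity check involves, here is a minimal stdlib-only sketch. It is not Spindle's implementation: the names `count_orphans` and `check_foreign_keys`, the foreign-key tuple layout, and the use of lists of dicts in place of DataFrames are all assumptions made for the example. The return value of `verify_integrity()` itself is not documented here, so the orphan-count report below is hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Tuple

Table = List[dict]                      # stand-in for a DataFrame
ForeignKey = Tuple[str, str, str, str]  # (child, child_col, parent, parent_col)


def count_orphans(tables: Dict[str, Table], fk: ForeignKey) -> int:
    """Count child rows whose foreign-key value has no matching parent key."""
    child, child_col, parent, parent_col = fk
    parent_keys = {row[parent_col] for row in tables[parent]}
    return sum(1 for row in tables[child] if row[child_col] not in parent_keys)


def check_foreign_keys(
    tables: Dict[str, Table],
    foreign_keys: List[ForeignKey],
    max_workers: int = 4,
) -> Dict[ForeignKey, int]:
    """Check every foreign key on a thread pool; map each key to its orphan count."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        counts = pool.map(lambda fk: count_orphans(tables, fk), foreign_keys)
        return dict(zip(foreign_keys, counts))


tables = {
    "customer": [{"id": 1}, {"id": 2}],
    "order": [
        {"id": 10, "customer_id": 1},
        {"id": 11, "customer_id": 99},  # deliberately broken reference
    ],
}
foreign_keys = [("order", "customer_id", "customer", "id")]
report = check_foreign_keys(tables, foreign_keys)
```

In this sketch a count of zero for every foreign key means the data set is referentially intact.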

__len__()

Return total row count across all tables.

__contains__(table_name)

Check if a table exists in the result.

__iter__()

Iterate over (table_name, DataFrame) pairs, in generation order.

Matches the documented quickstart pattern::

for table_name, df in result:
    spark.createDataFrame(df).write.saveAsTable(table_name)

Note: indexing and membership are name-keyed (result["order"], "order" in result); iteration yields pairs (like dict.items()), not bare keys.
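
The container semantics described above (name-keyed indexing and membership, pair-yielding iteration, and a row-counting `__len__`) can be mimicked with a small stand-in class. This is a sketch for illustration only: `ResultSketch` is a hypothetical name, and lists of dicts stand in for DataFrames.

```python
from typing import Dict, Iterator, List, Tuple

Rows = List[dict]  # stand-in for a DataFrame


class ResultSketch:
    """Hypothetical stand-in that mimics the documented container behaviour."""

    def __init__(self, tables: Dict[str, Rows]) -> None:
        # dict preserves insertion order, used here as the generation order
        self._tables = dict(tables)

    def __getitem__(self, table_name: str) -> Rows:
        return self._tables[table_name]

    def __contains__(self, table_name: object) -> bool:
        return table_name in self._tables

    def __len__(self) -> int:
        # total row count across all tables, not the number of tables
        return sum(len(rows) for rows in self._tables.values())

    def __iter__(self) -> Iterator[Tuple[str, Rows]]:
        # yields (table_name, rows) pairs, like dict.items()
        return iter(self._tables.items())


result = ResultSketch({
    "customer": [{"id": 1}, {"id": 2}],
    "order": [{"id": 10, "customer_id": 1}],
})
names = [name for name, _rows in result]
```

Note one consequence of these semantics: `len(result)` counts rows, so it generally differs from the number of pairs that iteration yields.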

items()

Return (table_name, DataFrame) pairs. Explicit alias for iteration.

keys()

Return table names.

values()

Return the DataFrames.

to_csv(output_dir, max_workers=4, **kwargs)

Write all tables to CSV files in parallel. Returns list of file paths.
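
The general pattern of writing one file per table on a thread pool and returning the paths can be sketched as follows. This is an assumption-laden illustration, not the library's code: `write_table` and `write_all_csv` are hypothetical helpers, the `<table_name>.csv` naming is a guess, and lists of dicts stand in for DataFrames.

```python
import csv
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Dict, List


def write_table(output_dir: Path, name: str, rows: List[dict]) -> Path:
    """Write one table to <output_dir>/<name>.csv and return the path."""
    path = output_dir / f"{name}.csv"
    fieldnames = list(rows[0].keys()) if rows else []
    with path.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return path


def write_all_csv(
    tables: Dict[str, List[dict]], output_dir: str, max_workers: int = 4
) -> List[Path]:
    """Write every table concurrently; return paths in table order."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(write_table, out, name, rows)
            for name, rows in tables.items()
        ]
        return [future.result() for future in futures]


tables = {
    "customer": [{"id": 1}, {"id": 2}],
    "order": [{"id": 10, "customer_id": 1}],
}
with tempfile.TemporaryDirectory() as tmp:
    paths = write_all_csv(tables, tmp)
    written = [p.name for p in paths]
    header = paths[0].read_text(encoding="utf-8").splitlines()[0]
```

Collecting results from the futures in submission order keeps the returned paths aligned with the table order even though the writes may finish in any order.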

to_parquet(output_dir, max_workers=4, **kwargs)

Write all tables to Parquet files in parallel. Requires pyarrow.

to_jsonl(output_dir)

Write all tables to JSON Lines files.

to_excel(output_path)

Write all tables to a single Excel file (one sheet per table). Requires openpyxl.

to_sql(output_dir, **kwargs)

Write all tables as SQL INSERT files.

to_dataframe(table_name)

Return the DataFrame for a given table. Alias for self[table_name].

Spindle

Main entry point for Spindle data generation.

Methods:
estimate_memory(domain=None, schema=None, scale=None, scale_overrides=None)

Estimate RAM usage in bytes, per table and in total.

generate(domain=None, schema=None, scale=None, scale_overrides=None, seed=None, on_progress=None, enforce_correlations=True, fidelity_profile=None)

Generate synthetic data.

Parameters:

domain (default: None)
    A Domain instance (e.g., RetailDomain()) with built-in schema.

schema: str | Path | dict | SpindleSchema | None (default: None)
    Path to .spindle.json, raw dict, or parsed SpindleSchema.

scale: str | None (default: None)
    Scale preset name (small, medium, large, xlarge).

scale_overrides: dict[str, int] | None (default: None)
    Override row counts for specific tables.

seed: int | None (default: None)
    Random seed for reproducibility.

on_progress: Callable[[str, int, int], None] | None (default: None)
    Optional callback(table_name, tables_done, tables_total).

enforce_correlations: bool (default: True)
    If True (default), apply GaussianCopula post-pass when correlated_columns metadata is present in the schema.

fidelity_profile (default: None)
    Optional DatasetProfile. When provided, returns a (GenerationResult, FidelityReport) tuple instead of GenerationResult.
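
A callback matching the documented `on_progress` shape, `callback(table_name, tables_done, tables_total)`, might look like the sketch below. To keep the example runnable without the library, a hypothetical `fake_generate` driver stands in for `generate()`. It assumes the callback fires once per completed table, which the reference above does not state explicitly.

```python
from typing import Callable, List, Optional

ProgressCallback = Callable[[str, int, int], None]

progress_log: List[str] = []


def log_progress(table_name: str, tables_done: int, tables_total: int) -> None:
    """Record a one-line progress message per table."""
    pct = 100 * tables_done // tables_total
    progress_log.append(f"{table_name}: {tables_done}/{tables_total} ({pct}%)")


def fake_generate(
    table_names: List[str], on_progress: Optional[ProgressCallback] = None
) -> None:
    """Hypothetical driver that only exercises the callback contract."""
    total = len(table_names)
    for done, name in enumerate(table_names, start=1):
        # ... a real generator would build the table here ...
        if on_progress is not None:
            on_progress(name, done, total)


fake_generate(["customer", "order"], on_progress=log_progress)

# With the real API the callback would be passed the same way, e.g.:
# result = Spindle().generate(domain=RetailDomain(), scale="small",
#                             seed=42, on_progress=log_progress)
```

The commented call uses only names that appear in this reference; import paths and the `Spindle()` constructor arguments are not documented here and are left out.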

generate_stream(domain=None, schema=None, scale=None, scale_overrides=None, seed=None, on_progress=None)

Generate synthetic data and yield each table as it completes.

Accepts the same arguments as generate() except enforce_correlations and fidelity_profile. Yields (table_name, DataFrame) tuples in dependency order, so callers can write table N to a store while table N+1 is still being generated.

Example::

for table_name, df in spindle.generate_stream(domain=RetailDomain(), scale="medium"):
    writer.write(table_name, df)
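
To illustrate what "dependency order" means for a streaming generator, here is a minimal sketch built on the standard library's `graphlib`. It is not how Spindle is implemented: `stream_tables`, the dependency mapping, and the placeholder rows are all hypothetical.

```python
from graphlib import TopologicalSorter
from typing import Dict, Iterator, List, Set, Tuple


def stream_tables(
    dependencies: Dict[str, Set[str]]
) -> Iterator[Tuple[str, List[dict]]]:
    """Yield (table_name, rows) so that parents always precede children.

    `dependencies` maps each table to the set of tables it references.
    """
    # static_order() emits a table only after all tables it depends on
    for table_name in TopologicalSorter(dependencies).static_order():
        rows = [{"table": table_name, "row": i} for i in range(2)]  # placeholder
        yield table_name, rows


deps = {
    "customer": set(),
    "product": set(),
    "order": {"customer"},
    "order_line": {"order", "product"},
}
order = [name for name, _rows in stream_tables(deps)]
```

Because the function is a generator, the consumer can persist each table before the next one is built, which is the benefit the description above refers to.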

describe(domain=None, schema=None)

Parse and return schema without generating data.

Functions:

calculate_row_counts(schema, overrides=None)

Return per-table row counts derived from the schema's generation config.

Exposed at module level so ScaleRouter and ChunkWorker can use it without instantiating a full Spindle object.
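
The idea of layering per-table overrides on top of schema-derived counts can be sketched as below. The layout of the schema's generation config is not documented here, so the sketch starts from a plain dict of base counts. `merge_row_counts` is a hypothetical helper, and rejecting overrides for unknown tables is a choice made for this example, not documented behaviour.

```python
from typing import Dict, Optional


def merge_row_counts(
    base_counts: Dict[str, int], overrides: Optional[Dict[str, int]] = None
) -> Dict[str, int]:
    """Return a new dict of per-table row counts with overrides applied."""
    counts = dict(base_counts)  # copy so the caller's dict is not mutated
    for table, rows in (overrides or {}).items():
        if table not in counts:
            # sketch-only policy: fail loudly on a misspelled table name
            raise KeyError(f"unknown table in overrides: {table!r}")
        counts[table] = rows
    return counts


base = {"customer": 1_000, "order": 5_000}
merged = merge_row_counts(base, {"order": 200})
```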

apply_compute_phase(tables, schema)

Module-level helper that ChunkWorker can call without a Spindle instance.