engine

`sqllocks_spindle.engine` ¶

Core generation engine.

Classes¶

`GenerationResult` `dataclass` ¶

Result of a generation run.

Attributes¶

`table_names` `property` ¶

Return list of table names in generation order.

Methods:¶

`get_lineage(table, column)` ¶

Look up lineage for a specific column.

`verify_integrity()` ¶

Verify referential integrity across all tables in parallel.
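What such a check involves can be sketched as follows. This is an illustration only, not Spindle's implementation: tables are plain lists of dicts standing in for DataFrames, and `check_relationship`, `find_orphans`, the `(child, fk_column, parent, pk_column)` tuples, and the use of a thread pool are all assumptions introduced here.

```python
from concurrent.futures import ThreadPoolExecutor


def check_relationship(tables, rel):
    """Return ((child, fk_column), sorted orphan values) for one relationship."""
    child, fk_col, parent, pk_col = rel
    parent_keys = {row[pk_col] for row in tables[parent]}
    orphans = sorted(
        {
            row[fk_col]
            for row in tables[child]
            if row[fk_col] is not None and row[fk_col] not in parent_keys
        }
    )
    return (child, fk_col), orphans


def find_orphans(tables, relationships, max_workers=4):
    """Check every relationship in a thread pool; keep only the broken ones."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(lambda rel: check_relationship(tables, rel), relationships)
        return {key: orphans for key, orphans in results if orphans}


# Hypothetical data: order 12 points at a customer that does not exist.
tables = {
    "customer": [{"id": 1}, {"id": 2}],
    "order": [
        {"id": 10, "customer_id": 1},
        {"id": 11, "customer_id": 2},
        {"id": 12, "customer_id": 3},
    ],
}
relationships = [("order", "customer_id", "customer", "id")]
problems = find_orphans(tables, relationships)
```

An empty `problems` dict would mean every foreign key value resolves to a parent row.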

`__len__()` ¶

Return total row count across all tables.

`__contains__(table_name)` ¶

Check if a table exists in the result.

`__iter__()` ¶

Iterate over (table_name, DataFrame) pairs, in generation order.

Matches the documented quickstart pattern::

for table_name, df in result:
    spark.createDataFrame(df).write.saveAsTable(table_name)

Note: indexing and membership are name-keyed (result["order"], "order" in result); iteration yields pairs (like dict.items()), not bare keys.
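The container semantics described in the note can be sketched with a small stand-in class. `ResultSketch` is hypothetical and uses lists of dicts in place of DataFrames; it only mirrors the documented behavior (name-keyed indexing and membership, pair-yielding iteration, `len()` as a row total).

```python
class ResultSketch:
    """Hypothetical stand-in for a result container; not Spindle's class."""

    def __init__(self, tables):
        # dicts preserve insertion order, used here as "generation order".
        self._tables = dict(tables)

    def __getitem__(self, table_name):
        return self._tables[table_name]

    def __contains__(self, table_name):
        return table_name in self._tables

    def __iter__(self):
        # Yields (name, table) pairs, unlike a plain dict, which yields keys.
        return iter(self._tables.items())

    def __len__(self):
        # Total row count across all tables, not the number of tables.
        return sum(len(rows) for rows in self._tables.values())

    def keys(self):
        return list(self._tables)


result = ResultSketch(
    {
        "customer": [{"id": 1}, {"id": 2}],
        "order": [
            {"id": 10, "customer_id": 1},
            {"id": 11, "customer_id": 1},
            {"id": 12, "customer_id": 2},
        ],
    }
)
pairs = [(name, len(rows)) for name, rows in result]
```

Because `len()` counts rows under this design, the number of tables would come from `len(result.keys())` instead.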

`items()` ¶

Return (table_name, DataFrame) pairs. Explicit alias for iteration.

`keys()` ¶

Return table names.

`values()` ¶

Return the DataFrames.

`to_csv(output_dir, max_workers=4, **kwargs)` ¶

Write all tables to CSV files in parallel. Returns list of file paths.

`to_parquet(output_dir, max_workers=4, **kwargs)` ¶

Write all tables to Parquet files in parallel. Requires pyarrow.

`to_jsonl(output_dir)` ¶

Write all tables to JSON Lines files.

`to_excel(output_path)` ¶

Write all tables to a single Excel file (one sheet per table). Requires openpyxl.

`to_sql(output_dir, **kwargs)` ¶

Write all tables as SQL INSERT files.

`to_dataframe(table_name)` ¶

Return the DataFrame for a given table. Alias for self[table_name].

`Spindle` ¶

Main entry point for Spindle data generation.

Methods:¶

`estimate_memory(domain=None, schema=None, scale=None, scale_overrides=None)` ¶

Estimate RAM usage in bytes per table and total.

`generate(domain=None, schema=None, scale=None, scale_overrides=None, seed=None, on_progress=None, enforce_correlations=True, fidelity_profile=None)` ¶

Generate synthetic data.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `domain` | | A Domain instance (e.g., RetailDomain()) with built-in schema. | `None` |
| `schema` | `str \| Path \| dict \| SpindleSchema \| None` | Path to .spindle.json, raw dict, or parsed SpindleSchema. | `None` |
| `scale` | `str \| None` | Scale preset name (small, medium, large, xlarge). | `None` |
| `scale_overrides` | `dict[str, int] \| None` | Override row counts for specific tables. | `None` |
| `seed` | `int \| None` | Random seed for reproducibility. | `None` |
| `on_progress` | `Callable[[str, int, int], None] \| None` | Optional callback(table_name, tables_done, tables_total). | `None` |
| `enforce_correlations` | `bool` | If True (default), apply GaussianCopula post-pass when correlated_columns metadata is present in the schema. | `True` |
| `fidelity_profile` | | Optional DatasetProfile. When provided, returns a (GenerationResult, FidelityReport) tuple instead of GenerationResult. | `None` |
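A callback matching the documented `on_progress` shape, `callback(table_name, tables_done, tables_total)`, might look like the sketch below. `make_progress_logger` is a hypothetical helper, and the loop only simulates the calls; whether the real engine invokes the callback before or after each table completes is not stated here, so the sketch assumes "after".

```python
def make_progress_logger(lines):
    """Build a callback that records one formatted line per call."""

    def on_progress(table_name, tables_done, tables_total):
        pct = 100 * tables_done // tables_total
        lines.append(f"[{tables_done}/{tables_total}] {table_name} ({pct}%)")

    return on_progress


log = []
callback = make_progress_logger(log)

# Simulated calls, standing in for the ones a generation run would make.
table_order = ["customer", "product", "order"]
for done, name in enumerate(table_order, start=1):
    callback(name, done, len(table_order))
```

The same function object would be passed as `on_progress=callback`.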

`generate_stream(domain=None, schema=None, scale=None, scale_overrides=None, seed=None, on_progress=None)` ¶

Generate synthetic data and yield each table as it completes.

Takes the same arguments as generate(), except enforce_correlations and fidelity_profile. Yields (table_name, DataFrame) tuples in dependency order, so callers can write table N to a store while table N+1 is still being generated.

Example::

for table_name, df in spindle.generate_stream(domain=RetailDomain(), scale="medium"):
    writer.write(table_name, df)
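"Dependency order" means a parent table is yielded before any table that references it. The sketch below shows that ordering with the standard library's `graphlib.TopologicalSorter`; `fake_stream` and the dependency map are hypothetical, and Spindle may compute its order differently.

```python
from graphlib import TopologicalSorter


def fake_stream(dependencies, make_rows):
    """Yield (table_name, rows) so that parents always come before children."""
    # Each key maps to the set of tables it depends on (its parents).
    for table_name in TopologicalSorter(dependencies).static_order():
        yield table_name, make_rows(table_name)


dependencies = {
    "customer": set(),
    "product": set(),
    "order": {"customer"},
    "order_line": {"order", "product"},
}

written = []
for table_name, rows in fake_stream(dependencies, lambda name: [{"table": name}]):
    # A real consumer would hand the table to a writer here.
    written.append(table_name)
```

The relative order of unrelated tables (here `customer` and `product`) is not fixed by the dependency graph, so consumers should rely only on parent-before-child ordering.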

`describe(domain=None, schema=None)` ¶

Parse and return schema without generating data.

`ChunkedGenerationResult` `dataclass` ¶

Result of a chunked generation run.

Parent tables are fully materialized because they are typically small. Child tables are available only via iter_chunks() to keep memory bounded.

Methods:¶

`iter_chunks(table_name)` ¶

Yield DataFrames of chunk_size rows for a child table.

Must be called in dependency order. Each table can only be iterated once.
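The single-pass behavior can be illustrated with a small generator-based stand-in. `ChunkSourceSketch` is hypothetical; in particular, raising `RuntimeError` on a second pass and emitting a shorter final chunk are assumptions of this sketch, not documented behavior.

```python
class ChunkSourceSketch:
    """Hypothetical one-shot chunk source; lists of dicts stand in for DataFrames."""

    def __init__(self, total_rows, chunk_size):
        self.total_rows = total_rows
        self.chunk_size = chunk_size
        self._consumed = False

    def iter_chunks(self):
        # The check runs when iteration starts, since this is a generator.
        if self._consumed:
            raise RuntimeError("chunks were already iterated once")
        self._consumed = True
        for start in range(0, self.total_rows, self.chunk_size):
            stop = min(start + self.chunk_size, self.total_rows)
            # Only one chunk is alive at a time, which bounds memory use.
            yield [{"id": i} for i in range(start, stop)]


source = ChunkSourceSketch(total_rows=10, chunk_size=4)
sizes = [len(chunk) for chunk in source.iter_chunks()]

try:
    list(source.iter_chunks())
    second_pass_failed = False
except RuntimeError:
    second_pass_failed = True
```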

`write_with(writer, **kwargs)` ¶

Convenience: write parent tables, then stream child chunks through a writer.

The writer must implement either:

- write_table(table_name, df, **kwargs) for individual DataFrames, or
- stage_chunk(table_name, chunk_df, idx) + copy_into(table_name) for bulk writers.
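The two writer shapes can be sketched as below. The method names and arguments come from the description above; the classes, the `write_chunks` helper, and the choice to prefer the bulk path when both are present are assumptions made for illustration.

```python
class TableWriter:
    """Writer that handles one DataFrame-like object per call."""

    def __init__(self):
        self.calls = []

    def write_table(self, table_name, df, **kwargs):
        self.calls.append(("write_table", table_name, len(df)))


class BulkWriter:
    """Writer that stages chunks, then loads them in one final step."""

    def __init__(self):
        self.calls = []

    def stage_chunk(self, table_name, chunk_df, idx):
        self.calls.append(("stage_chunk", table_name, idx))

    def copy_into(self, table_name):
        self.calls.append(("copy_into", table_name))


def write_chunks(writer, table_name, chunks, **kwargs):
    """Hypothetical dispatcher: pick the protocol the writer supports."""
    if hasattr(writer, "stage_chunk") and hasattr(writer, "copy_into"):
        for idx, chunk in enumerate(chunks):
            writer.stage_chunk(table_name, chunk, idx)
        writer.copy_into(table_name)
    elif hasattr(writer, "write_table"):
        for chunk in chunks:
            writer.write_table(table_name, chunk, **kwargs)
    else:
        raise TypeError("writer implements neither supported protocol")


chunks = [[{"id": 1}, {"id": 2}], [{"id": 3}]]
simple, bulk = TableWriter(), BulkWriter()
write_chunks(simple, "order", chunks)
write_chunks(bulk, "order", chunks)
```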

`ChunkedSpindle` ¶

Generate billion-row datasets in bounded memory.

Uses a two-pass approach:

1. Parent tables generated fully in-memory (typically small).
2. Child tables generated in chunks of chunk_size rows.

Example::

cs = ChunkedSpindle()
result = cs.generate_chunked(
    domain=FinancialDomain(),
    scale="warehouse",
    chunk_size=1_000_000,
)

# Parent tables are immediately available
for name, df in result.parent_tables.items():
    print(f"{name}: {len(df)} rows")

# Child tables stream via iterator
for table_name in result.child_table_names:
    for chunk in result.iter_chunks(table_name):
        writer.write(chunk)

Methods:¶

`generate_chunked(domain=None, schema=None, scale=None, scale_overrides=None, seed=None, chunk_size=1000000, target_table=None, target_count=None)` ¶

Generate data with chunked child tables.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `domain` | | A Domain instance. | `None` |
| `schema` | `Any` | Path to .spindle.json, raw dict, or parsed SpindleSchema. | `None` |
| `scale` | `str \| None` | Scale preset name. | `None` |
| `scale_overrides` | `dict[str, int] \| None` | Override row counts for specific tables. | `None` |
| `seed` | `int \| None` | Random seed for reproducibility. | `None` |
| `chunk_size` | `int` | Rows per chunk for child tables. | `1000000` |
| `target_table` | `str \| None` | Anchor table name. All other table counts are derived proportionally from this table's target_count. | `None` |
| `target_count` | `int \| None` | Number of rows for the anchor table. Required when target_table is provided. | `None` |

Returns:

| Type | Description |
| --- | --- |
| `ChunkedGenerationResult` | ChunkedGenerationResult with parent tables materialized and child tables available via iter_chunks(). |
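The proportional derivation behind target_table and target_count can be worked through with a small example. `derive_counts`, the base row counts, and the rounding rule are assumptions of this sketch; the library's exact arithmetic is not specified here.

```python
def derive_counts(base_counts, target_table, target_count):
    """Scale every table by the ratio target_count / base count of the anchor."""
    factor = target_count / base_counts[target_table]
    return {
        name: target_count if name == target_table else max(1, round(count * factor))
        for name, count in base_counts.items()
    }


# Hypothetical preset: 5 orders per customer, 3 lines per order.
base_counts = {"customer": 1_000, "order": 5_000, "order_line": 15_000}
scaled = derive_counts(base_counts, target_table="order", target_count=1_000_000)
```

Here the anchor grows by a factor of 200, so the 1:5:3 ratios between the tables are preserved.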

`SinkRegistry` ¶

Fan-out coordinator — dispatches each chunk to all registered sinks in parallel.

`SinkError` ¶

Bases: Exception

Raised when one or more sinks fail during write_chunk.
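Fan-out with aggregated failures can be sketched with a thread pool. Only the `write_chunk` method name appears above; its arguments, the `FanOutError`, `ListSink`, and `BrokenSink` classes, and the `fan_out` helper are hypothetical and do not describe the actual SinkRegistry or SinkError internals.

```python
from concurrent.futures import ThreadPoolExecutor


class FanOutError(Exception):
    """Hypothetical aggregate error: carries one exception per failed sink."""

    def __init__(self, failures):
        self.failures = failures
        super().__init__(f"{len(failures)} sink(s) failed: {sorted(failures)}")


class ListSink:
    def __init__(self, name):
        self.name = name
        self.chunks = []

    def write_chunk(self, table_name, chunk):
        self.chunks.append((table_name, chunk))


class BrokenSink:
    name = "broken"

    def write_chunk(self, table_name, chunk):
        raise IOError("disk full")


def fan_out(sinks, table_name, chunk):
    """Send one chunk to every sink concurrently; raise once if any failed."""
    failures = {}
    with ThreadPoolExecutor(max_workers=len(sinks)) as pool:
        futures = {
            pool.submit(sink.write_chunk, table_name, chunk): sink.name
            for sink in sinks
        }
        for future, name in futures.items():
            exc = future.exception()  # blocks until that sink finishes
            if exc is not None:
                failures[name] = exc
    if failures:
        raise FanOutError(failures)


good = ListSink("memory")
try:
    fan_out([good, BrokenSink()], "order", [{"id": 1}])
    error = None
except FanOutError as exc:
    error = exc
```

Collecting every failure before raising lets healthy sinks finish their write even when another sink fails.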

`ScaleRouter` ¶

Entry point for multi-process chunked generation with multi-sink fan-out.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `schema_path` | `str` | Path to a .json file containing a serialized SpindleSchema. | required |
| `sinks` | `list[Sink]` | List of Sink instances to receive generated data. | required |
| `chunk_size` | `int` | Rows per chunk. Default 500_000. | `500000` |
| `max_workers` | `int \| None` | Subprocess count. Default os.cpu_count() - 1. Capped automatically if the estimated working set would exceed 80% of available RAM. | `None` |

Methods:¶

`run(total_rows, seed=42)` ¶

Generate total_rows rows and fan out to all sinks.

Tables whose schema-derived row count is < chunk_size are treated as static (reference/dimension) tables: they are generated once with their natural cardinality and written to the sinks a single time. Their PK data is broadcast into every chunk worker so FK references resolve correctly without replication.

Tables whose schema-derived row count is >= chunk_size are dynamic (fact) tables: they are generated chunk_size rows at a time, across ceil(total_rows / chunk_size) chunks.

Returns:

| Type | Description |
| --- | --- |
| `dict` | Stats dict: rows_generated, elapsed_seconds, throughput_rows_per_sec, memory_peak_gb (estimated). |
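The static/dynamic split and the chunk count described for run() reduce to a threshold test and a ceiling division. The sketch below works one example; `plan_run` and the row counts are hypothetical.

```python
import math


def plan_run(schema_counts, total_rows, chunk_size):
    """Classify tables by the chunk_size threshold and count the chunks."""
    static = [name for name, count in schema_counts.items() if count < chunk_size]
    dynamic = [name for name, count in schema_counts.items() if count >= chunk_size]
    n_chunks = math.ceil(total_rows / chunk_size)
    return static, dynamic, n_chunks


# Hypothetical schema-derived row counts.
schema_counts = {"country": 250, "customer": 100_000, "order": 2_000_000}
static, dynamic, n_chunks = plan_run(
    schema_counts, total_rows=1_200_000, chunk_size=500_000
)
```

With these numbers, `country` and `customer` fall below the threshold and are written once, while `order` is spread over ceil(1_200_000 / 500_000) = 3 chunks.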
