engine
sqllocks_spindle.engine
Core generation engine.
Classes
GenerationResult
dataclass
Result of a generation run.
Attributes
table_names
property
Return list of table names in generation order.
Methods:
get_lineage(table, column)
Look up lineage for a specific column.
verify_integrity()
Verify referential integrity across all tables in parallel.
__len__()
Return total row count across all tables.
__contains__(table_name)
Check if a table exists in the result.
__iter__()
Iterate over (table_name, DataFrame) pairs, in generation order.
Matches the documented quickstart pattern::

    for table_name, df in result:
        spark.createDataFrame(df).write.saveAsTable(table_name)

Note: indexing and membership are name-keyed (result["order"],
"order" in result); iteration yields pairs (like dict.items()),
not bare keys.
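The container semantics described above can be illustrated with a small stand-in. This is only a sketch: `MiniResult` is a hypothetical class, not the library's `GenerationResult`, and plain lists of dicts stand in for DataFrames so the example runs without pandas or Spark.

```python
from dataclasses import dataclass, field


@dataclass
class MiniResult:
    """Hypothetical stand-in for GenerationResult; lists of rows replace DataFrames."""

    tables: dict = field(default_factory=dict)  # insertion order = generation order

    def __getitem__(self, table_name):
        # Indexing is keyed by table name.
        return self.tables[table_name]

    def __contains__(self, table_name):
        # Membership is keyed by table name as well.
        return table_name in self.tables

    def __iter__(self):
        # Yields (name, table) pairs, like dict.items(), not bare keys.
        return iter(self.tables.items())

    def __len__(self):
        # Total row count across all tables, not the number of tables.
        return sum(len(rows) for rows in self.tables.values())


result = MiniResult(
    {
        "customer": [{"id": 1}, {"id": 2}],
        "order": [{"id": 10, "customer_id": 1}],
    }
)

names_in_order = [name for name, _rows in result]
```

Because iteration yields pairs, `list(result)` produces tuples; use `keys()` when only the table names are needed.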
items()
Return (table_name, DataFrame) pairs. Explicit alias for iteration.
keys()
Return table names.
values()
Return the DataFrames.
to_csv(output_dir, max_workers=4, **kwargs)
Write all tables to CSV files in parallel. Returns list of file paths.
to_parquet(output_dir, max_workers=4, **kwargs)
Write all tables to Parquet files in parallel. Requires pyarrow.
to_jsonl(output_dir)
Write all tables to JSON Lines files.
to_excel(output_path)
Write all tables to a single Excel file (one sheet per table). Requires openpyxl.
to_sql(output_dir, **kwargs)
Write all tables as SQL INSERT files.
to_dataframe(table_name)
Return the DataFrame for a given table. Alias for self[table_name].
Spindle
Main entry point for Spindle data generation.
Methods:
estimate_memory(domain=None, schema=None, scale=None, scale_overrides=None)
Estimate RAM usage in bytes per table and total.
generate(domain=None, schema=None, scale=None, scale_overrides=None, seed=None, on_progress=None, enforce_correlations=True, fidelity_profile=None)
Generate synthetic data.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| domain | | A Domain instance (e.g., RetailDomain()) with built-in schema. | None |
| schema | `str \| Path \| dict \| SpindleSchema \| None` | Path to .spindle.json, raw dict, or parsed SpindleSchema. | None |
| scale | `str \| None` | Scale preset name (small, medium, large, xlarge). | None |
| scale_overrides | `dict[str, int] \| None` | Override row counts for specific tables. | None |
| seed | `int \| None` | Random seed for reproducibility. | None |
| on_progress | `Callable[[str, int, int], None] \| None` | Optional callback(table_name, tables_done, tables_total). | None |
| enforce_correlations | `bool` | If True (default), apply GaussianCopula post-pass when correlated_columns metadata is present in the schema. | True |
| fidelity_profile | | Optional DatasetProfile. When provided, returns a (GenerationResult, FidelityReport) tuple instead of GenerationResult. | None |
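As an illustration of the `on_progress` contract, the sketch below builds a callback with the documented `(table_name, tables_done, tables_total)` shape. `make_progress_logger` and `fake_generate` are hypothetical helpers; in particular, `fake_generate` only simulates how a generator might invoke the callback, and the assumption that `tables_done` counts completed tables starting from 1 is not confirmed by the reference above.

```python
from typing import Callable, List, Optional

ProgressCallback = Callable[[str, int, int], None]


def make_progress_logger(log: List[str]) -> ProgressCallback:
    """Build a callback matching on_progress(table_name, tables_done, tables_total)."""

    def on_progress(table_name: str, tables_done: int, tables_total: int) -> None:
        pct = 100 * tables_done // tables_total
        log.append(f"[{tables_done}/{tables_total}] {table_name} ({pct}%)")

    return on_progress


def fake_generate(
    table_names: List[str], on_progress: Optional[ProgressCallback] = None
) -> None:
    """Hypothetical driver: calls the callback once per finished table (assumed)."""
    total = len(table_names)
    for done, name in enumerate(table_names, start=1):
        if on_progress is not None:
            on_progress(name, done, total)


messages: List[str] = []
fake_generate(["customer", "product", "order"], on_progress=make_progress_logger(messages))
```

With the real API, the same callback would presumably be passed as `on_progress=` to `generate()`.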
generate_stream(domain=None, schema=None, scale=None, scale_overrides=None, seed=None, on_progress=None)
Generate synthetic data and yield each table as it completes.
Takes the same arguments as generate(), except enforce_correlations and
fidelity_profile, which do not appear in this signature. Yields
(table_name, DataFrame) tuples in dependency order, so callers can write
table N to a store while table N+1 is still being generated.
Example::

    for table_name, df in spindle.generate_stream(domain=RetailDomain(), scale="medium"):
        writer.write(table_name, df)
describe(domain=None, schema=None)
Parse and return schema without generating data.
ChunkedGenerationResult
dataclass
Result of a chunked generation run.
Parent tables are fully materialized (small). Child tables are available
only via iter_chunks() to keep memory bounded.
Methods:
iter_chunks(table_name)
Yield DataFrames of chunk_size rows for a child table.
Must be called in dependency order. Each table can only be iterated once.
write_with(writer, **kwargs)
Convenience: write parent tables, then stream child chunks through a writer.
The writer must implement either write_table(table_name, df, **kwargs) for individual DataFrames, or stage_chunk(table_name, chunk_df, idx) + copy_into(table_name) for bulk writers.
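The two writer shapes named above can be sketched as follows. Everything here is hypothetical: `PerTableWriter`, `BulkWriter`, and `stream_table` are illustrative names, and the dispatch rule (prefer the bulk interface when both bulk methods exist, then call `copy_into` once after staging) is an assumption about how such a convenience method might behave, not the library's confirmed logic.

```python
from typing import Any, Dict, Iterable, List


class PerTableWriter:
    """Hypothetical writer exposing write_table(table_name, df, **kwargs)."""

    def __init__(self) -> None:
        self.written: Dict[str, List[Any]] = {}

    def write_table(self, table_name: str, df: Any, **kwargs: Any) -> None:
        self.written.setdefault(table_name, []).append(df)


class BulkWriter:
    """Hypothetical bulk writer exposing stage_chunk(...) and copy_into(...)."""

    def __init__(self) -> None:
        self.staged: Dict[str, List[int]] = {}
        self.copied: List[str] = []

    def stage_chunk(self, table_name: str, chunk_df: Any, idx: int) -> None:
        self.staged.setdefault(table_name, []).append(idx)

    def copy_into(self, table_name: str) -> None:
        self.copied.append(table_name)


def stream_table(writer: Any, table_name: str, chunks: Iterable[Any], **kwargs: Any) -> None:
    """Assumed dispatch: use the bulk interface when present, else write each chunk."""
    if hasattr(writer, "stage_chunk") and hasattr(writer, "copy_into"):
        for idx, chunk in enumerate(chunks):
            writer.stage_chunk(table_name, chunk, idx)
        writer.copy_into(table_name)  # one bulk load after all chunks are staged
    else:
        for chunk in chunks:
            writer.write_table(table_name, chunk, **kwargs)


per_table, bulk = PerTableWriter(), BulkWriter()
stream_table(per_table, "order", [["row1", "row2"], ["row3"]])
stream_table(bulk, "order", [["row1", "row2"], ["row3"]])
```

Lists stand in for DataFrame chunks so the sketch runs without pandas.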
ChunkedSpindle
Generate billion-row datasets in bounded memory.
Uses a two-pass approach:
1. Parent tables generated fully in-memory (typically small).
2. Child tables generated in chunks of chunk_size rows.
Example::

    cs = ChunkedSpindle()
    result = cs.generate_chunked(
        domain=FinancialDomain(),
        scale="warehouse",
        chunk_size=1_000_000,
    )

    # Parent tables are immediately available
    for name, df in result.parent_tables.items():
        print(f"{name}: {len(df)} rows")

    # Child tables stream via iterator
    for table_name in result.child_table_names:
        for chunk in result.iter_chunks(table_name):
            writer.write(chunk)
Methods:
generate_chunked(domain=None, schema=None, scale=None, scale_overrides=None, seed=None, chunk_size=1000000, target_table=None, target_count=None)
Generate data with chunked child tables.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| domain | | A Domain instance. | None |
| schema | `Any` | Path to .spindle.json, raw dict, or parsed SpindleSchema. | None |
| scale | `str \| None` | Scale preset name. | None |
| scale_overrides | `dict[str, int] \| None` | Override row counts for specific tables. | None |
| seed | `int \| None` | Random seed for reproducibility. | None |
| chunk_size | `int` | Rows per chunk for child tables. | 1000000 |
| target_table | `str \| None` | Anchor table name; all other table counts are derived proportionally from this table's target_count. | None |
| target_count | `int \| None` | Number of rows for the anchor table. Required when target_table is provided. | None |
Returns:
| Type | Description |
|---|---|
| `ChunkedGenerationResult` | ChunkedGenerationResult with parent tables materialized and child tables available via iter_chunks(). |
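One plausible reading of "derive all other table counts proportionally" is a single scale factor applied to every table. The sketch below shows that reading only; `derive_counts`, the base counts, and the one-row floor are all assumptions for illustration, and the library may compute the counts differently.

```python
from typing import Dict


def derive_counts(
    base_counts: Dict[str, int], target_table: str, target_count: int
) -> Dict[str, int]:
    """Scale every table by target_count / base_counts[target_table] (assumed rule)."""
    if target_table not in base_counts:
        raise KeyError(f"unknown anchor table: {target_table}")
    factor = target_count / base_counts[target_table]
    # Assumed floor: keep at least one row so child tables always have a parent row.
    return {name: max(1, round(count * factor)) for name, count in base_counts.items()}


# Hypothetical preset row counts for three related tables.
base = {"customer": 1_000, "order": 5_000, "order_line": 15_000}
scaled = derive_counts(base, target_table="order", target_count=1_000_000)
```

Here the anchor grows by a factor of 200, so the other tables grow by the same factor and the original ratios between tables are preserved.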
SinkRegistry
Fan-out coordinator — dispatches each chunk to all registered sinks in parallel.
SinkError
Bases: Exception
Raised when one or more sinks fail during write_chunk.
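The fan-out pattern described for SinkRegistry and SinkError can be sketched with a thread pool. This is a generic illustration under stated assumptions, not the library's implementation: `ListSink`, `FanOutError`, and `fan_out` are hypothetical names, and the `write_chunk(table_name, chunk)` argument list is assumed (the reference only names the method).

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Dict, List


class FanOutError(Exception):
    """Hypothetical analogue of SinkError: carries every per-sink failure."""

    def __init__(self, failures: Dict[str, BaseException]) -> None:
        super().__init__(f"{len(failures)} sink(s) failed: {sorted(failures)}")
        self.failures = failures


class ListSink:
    """Hypothetical in-memory sink; fail=True simulates an unavailable target."""

    def __init__(self, name: str, fail: bool = False) -> None:
        self.name = name
        self.fail = fail
        self.chunks: List[Any] = []

    def write_chunk(self, table_name: str, chunk: Any) -> None:
        if self.fail:
            raise IOError(f"{self.name} unavailable")
        self.chunks.append((table_name, chunk))


def fan_out(sinks: List[ListSink], table_name: str, chunk: Any) -> None:
    """Send one chunk to every sink concurrently; raise once if any sink failed."""
    failures: Dict[str, BaseException] = {}
    with ThreadPoolExecutor(max_workers=max(1, len(sinks))) as pool:
        futures = {s.name: pool.submit(s.write_chunk, table_name, chunk) for s in sinks}
        for name, future in futures.items():
            error = future.exception()  # blocks until that sink finishes
            if error is not None:
                failures[name] = error
    if failures:
        raise FanOutError(failures)


good, bad = ListSink("parquet"), ListSink("warehouse", fail=True)
try:
    fan_out([good, bad], "order", [1, 2, 3])
    failed_sinks: List[str] = []
except FanOutError as exc:
    failed_sinks = sorted(exc.failures)
```

Collecting all failures before raising means a healthy sink still receives the chunk even when another sink fails.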
ScaleRouter
Entry point for multi-process chunked generation with multi-sink fan-out.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| schema_path | `str` | Path to a .json file containing a serialized SpindleSchema. | required |
| sinks | `list[Sink]` | List of Sink instances to receive generated data. | required |
| chunk_size | `int` | Rows per chunk. Default 500_000. | 500000 |
| max_workers | `int \| None` | Subprocess count. Default os.cpu_count() - 1. Capped automatically if the estimated working set would exceed 80% of available RAM. | None |
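The `max_workers` default and RAM cap might be resolved along the following lines. This is a sketch only: `resolve_workers` is a hypothetical helper, the per-worker memory estimate and available-RAM figure are passed in as plain numbers (how the library measures them is not documented here), and the floor of one worker is an assumption.

```python
import os
from typing import Optional


def resolve_workers(
    max_workers: Optional[int],
    per_worker_bytes: int,
    available_bytes: int,
    ram_fraction: float = 0.80,
) -> int:
    """Pick a worker count: default cpu_count() - 1, then cap by a RAM budget."""
    if max_workers is None:
        # os.cpu_count() can return None; fall back to 2 so the default is 1 worker.
        max_workers = (os.cpu_count() or 2) - 1
    budget = int(available_bytes * ram_fraction)
    ram_cap = budget // per_worker_bytes  # how many workers fit inside the budget
    return max(1, min(max_workers, ram_cap))


GIB = 1024 ** 3
# 16 requested workers at ~2 GiB each do not fit in 80% of 11 GiB, so the count is capped.
capped = resolve_workers(max_workers=16, per_worker_bytes=2 * GIB, available_bytes=11 * GIB)
# 2 requested workers at ~1 GiB each fit easily in 64 GiB, so the request is honored.
uncapped = resolve_workers(max_workers=2, per_worker_bytes=1 * GIB, available_bytes=64 * GIB)
```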
Methods:
run(total_rows, seed=42)
Generate total_rows rows and fan out to all sinks.
Tables whose schema-derived row count is < chunk_size are treated as static (reference/dimension) tables: they are generated once with their natural cardinality and written to the sinks a single time. Their PK data is broadcast into every chunk worker so FK references resolve correctly without replication.
Tables whose schema-derived row count >= chunk_size are dynamic (fact) tables: they are generated chunk_size rows per chunk across ceil(total_rows / chunk_size) chunks.
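The static/dynamic split and the chunk count described above reduce to a threshold comparison and a ceiling division. The sketch below restates that rule; `plan_run` and the example row counts are hypothetical and only mirror the description, not the library's internals.

```python
import math
from typing import Dict, List, Tuple


def plan_run(
    schema_counts: Dict[str, int], chunk_size: int, total_rows: int
) -> Tuple[List[str], List[str], int]:
    """Split tables into static and dynamic sets and compute the chunk count."""
    # Static (reference/dimension): schema-derived row count below chunk_size.
    static = [name for name, n in schema_counts.items() if n < chunk_size]
    # Dynamic (fact): schema-derived row count at or above chunk_size.
    dynamic = [name for name, n in schema_counts.items() if n >= chunk_size]
    n_chunks = math.ceil(total_rows / chunk_size)
    return static, dynamic, n_chunks


# Hypothetical schema-derived row counts.
counts = {"country": 250, "store": 5_000, "sale": 50_000_000}
static_tables, dynamic_tables, n_chunks = plan_run(
    counts, chunk_size=500_000, total_rows=1_200_000
)
```

With these numbers, 1,200,000 rows at 500,000 rows per chunk need three chunks, the last one partially filled.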
Returns:
| Type | Description |
|---|---|
| `dict` | Stats dict: rows_generated, elapsed_seconds, throughput_rows_per_sec, memory_peak_gb (estimated). |