generator
sqllocks_spindle.engine.generator
¶
Main Spindle generator — the public API entry point.
Classes¶
ColumnLineage
dataclass
¶
Tracks which strategy produced a column's values.
GenerationResult
dataclass
¶
Result of a generation run.
Attributes¶
table_names
property
¶
Return list of table names in generation order.
Methods:¶
get_lineage(table, column)
¶
Look up lineage for a specific column.
verify_integrity()
¶
Verify referential integrity across all tables in parallel.
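To illustrate what a referential integrity check involves, here is a minimal sketch. It assumes a hypothetical `relationships` list of `(child, fk, parent, pk)` tuples and uses plain lists of dicts in place of DataFrames. The return shape of the real `verify_integrity()` is not documented here, so the `report` dict below is purely illustrative.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in data: lists of dicts instead of DataFrames (assumption for the sketch).
tables = {
    "customer": [{"id": 1}, {"id": 2}],
    "order": [{"id": 10, "customer_id": 1}, {"id": 11, "customer_id": 99}],
}

# Hypothetical relationship format: (child table, FK column, parent table, PK column).
relationships = [("order", "customer_id", "customer", "id")]


def check(rel):
    """Return a label and the sorted FK values that have no matching parent key."""
    child, fk, parent, pk = rel
    parent_keys = {row[pk] for row in tables[parent]}
    orphans = {row[fk] for row in tables[child]} - parent_keys
    return f"{child}.{fk} -> {parent}.{pk}", sorted(orphans)


# Each relationship is checked independently, so the checks can run in a pool.
with ThreadPoolExecutor() as pool:
    report = dict(pool.map(check, relationships))
```

An empty list for a relationship would mean every foreign key value resolves to a parent row; here order 11 points at a customer that does not exist.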
__len__()
¶
Return total row count across all tables.
__contains__(table_name)
¶
Check if a table exists in the result.
__iter__()
¶
Iterate over (table_name, DataFrame) pairs, in generation order.
Matches the documented quickstart pattern::

    for table_name, df in result:
        spark.createDataFrame(df).write.saveAsTable(table_name)

Note: indexing and membership are name-keyed (result["order"],
"order" in result); iteration yields pairs (like dict.items()),
not bare keys.
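The container semantics described above differ from a plain dict, where iteration yields keys. The sketch below reproduces that behaviour with a hypothetical `MiniResult` class. It is a stand-in written for illustration, not the real `GenerationResult`, and it stores lists of dicts rather than DataFrames.

```python
from typing import Iterator


class MiniResult:
    """Hypothetical stand-in that mimics the documented container behaviour."""

    def __init__(self, tables: dict) -> None:
        # Dict insertion order stands in for generation order.
        self._tables = dict(tables)

    def __getitem__(self, table_name: str) -> list:
        return self._tables[table_name]

    def __contains__(self, table_name: object) -> bool:
        # Membership is keyed by table name.
        return table_name in self._tables

    def __iter__(self) -> Iterator:
        # Yields (table_name, rows) pairs, unlike dict, which yields bare keys.
        return iter(self._tables.items())

    def __len__(self) -> int:
        # Total row count across all tables, not the number of tables.
        return sum(len(rows) for rows in self._tables.values())


result = MiniResult({
    "customer": [{"id": 1}, {"id": 2}],
    "order": [{"id": 10, "customer_id": 1}],
})

pairs = [(name, len(rows)) for name, rows in result]
```

Note that `len(result)` is 3 here (rows), even though there are only two tables.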
items()
¶
Return (table_name, DataFrame) pairs. Explicit alias for iteration.
keys()
¶
Return table names.
values()
¶
Return the DataFrames.
to_csv(output_dir, max_workers=4, **kwargs)
¶
Write all tables to CSV files in parallel. Returns list of file paths.
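The general pattern of writing one file per table through a worker pool can be sketched as follows. This is an illustration under assumptions, not the library's implementation: the `<table_name>.csv` file naming, the use of threads rather than processes, and the helper names `write_table` and `write_all_csv` are all hypothetical, and lists of dicts stand in for DataFrames.

```python
import csv
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def write_table(output_dir: Path, name: str, rows: list) -> Path:
    """Write one non-empty table to <output_dir>/<name>.csv and return its path."""
    path = output_dir / f"{name}.csv"
    with path.open("w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    return path


def write_all_csv(tables: dict, output_dir, max_workers: int = 4) -> list:
    """Write every table concurrently; return paths in the tables' original order."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(write_table, out, name, rows)
            for name, rows in tables.items()
        ]
        # Collecting results in submission order keeps the output deterministic.
        return [f.result() for f in futures]


tables = {
    "customer": [{"id": 1, "name": "Ada"}],
    "order": [{"id": 10, "customer_id": 1}],
}

with tempfile.TemporaryDirectory() as tmp:
    paths = write_all_csv(tables, tmp)
    names = [p.name for p in paths]
    first_line = paths[0].read_text().splitlines()[0]
```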
to_parquet(output_dir, max_workers=4, **kwargs)
¶
Write all tables to Parquet files in parallel. Requires pyarrow.
to_jsonl(output_dir)
¶
Write all tables to JSON Lines files.
to_excel(output_path)
¶
Write all tables to a single Excel file (one sheet per table). Requires openpyxl.
to_sql(output_dir, **kwargs)
¶
Write all tables as SQL INSERT files.
to_dataframe(table_name)
¶
Return the DataFrame for a given table. Alias for self[table_name].
Spindle
¶
Main entry point for Spindle data generation.
Methods:¶
estimate_memory(domain=None, schema=None, scale=None, scale_overrides=None)
¶
Estimate RAM usage in bytes per table and total.
generate(domain=None, schema=None, scale=None, scale_overrides=None, seed=None, on_progress=None, enforce_correlations=True, fidelity_profile=None)
¶
Generate synthetic data.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| domain | | A Domain instance (e.g., RetailDomain()) with built-in schema. | None |
| schema | `str \| Path \| dict \| SpindleSchema \| None` | Path to .spindle.json, raw dict, or parsed SpindleSchema. | None |
| scale | `str \| None` | Scale preset name (small, medium, large, xlarge). | None |
| scale_overrides | `dict[str, int] \| None` | Override row counts for specific tables. | None |
| seed | `int \| None` | Random seed for reproducibility. | None |
| on_progress | `Callable[[str, int, int], None] \| None` | Optional callback(table_name, tables_done, tables_total). | None |
| enforce_correlations | `bool` | If True (default), apply GaussianCopula post-pass when correlated_columns metadata is present in the schema. | True |
| fidelity_profile | | Optional DatasetProfile. When provided, returns a (GenerationResult, FidelityReport) tuple instead of GenerationResult. | None |
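A callback matching the `on_progress` type can be sketched as below. The calls are simulated so the example runs without the library. It assumes `tables_done` counts tables completed so far, which the table above does not state, and `make_progress_logger` is a hypothetical helper name.

```python
from typing import Callable

# Same shape as the documented on_progress type.
ProgressCallback = Callable[[str, int, int], None]


def make_progress_logger(log: list) -> ProgressCallback:
    """Build a callback that appends one formatted progress line per table."""

    def on_progress(table_name: str, tables_done: int, tables_total: int) -> None:
        pct = 100 * tables_done // tables_total
        log.append(f"{table_name}: {tables_done}/{tables_total} ({pct}%)")

    return on_progress


log = []
callback = make_progress_logger(log)

# Simulated invocations standing in for what generate() would do.
for done, name in enumerate(["customer", "order"], start=1):
    callback(name, done, 2)

# With the library installed, the callback would be passed along these lines
# (the no-argument Spindle() constructor is an assumption):
#   Spindle().generate(domain=RetailDomain(), scale="small", seed=42,
#                      on_progress=callback)
```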
generate_stream(domain=None, schema=None, scale=None, scale_overrides=None, seed=None, on_progress=None)
¶
Generate synthetic data and yield each table as it completes.
Accepts the same parameters as generate() except enforce_correlations
and fidelity_profile. Yields (table_name, DataFrame)
tuples in dependency order, allowing callers to write table N to a
store while table N+1 is still being generated.
Example::

    for table_name, df in spindle.generate_stream(domain=RetailDomain(), scale="medium"):
        writer.write(table_name, df)
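The consumer side of this streaming pattern can be sketched with a plain Python generator. `fake_stream` and `ListWriter` are hypothetical stand-ins, and the sketch shows only lazy interleaving (each table is handed over before the next one is produced). Whether the real method overlaps generation and writing concurrently depends on its implementation.

```python
def fake_stream(log: list):
    """Hypothetical stand-in for generate_stream(): yields tables one at a time."""
    tables = [
        ("customer", [{"id": 1}]),
        ("order", [{"id": 10, "customer_id": 1}]),
    ]
    for name, rows in tables:
        log.append(f"generated {name}")
        yield name, rows


class ListWriter:
    """Hypothetical writer that records each write instead of persisting data."""

    def __init__(self, log: list) -> None:
        self.log = log

    def write(self, table_name: str, rows: list) -> None:
        self.log.append(f"wrote {table_name}")


events = []
writer = ListWriter(events)

# Each table is written as soon as it is yielded, before the next is produced.
for table_name, rows in fake_stream(events):
    writer.write(table_name, rows)
```

Because parents are yielded before children, a writer that enforces foreign keys can load each table as it arrives without deferring constraints.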
describe(domain=None, schema=None)
¶
Parse and return schema without generating data.
Functions:¶
calculate_row_counts(schema, overrides=None)
¶
Return per-table row counts derived from the schema's generation config.
Exposed at module level so ScaleRouter and ChunkWorker can use it without instantiating a full Spindle object.
apply_compute_phase(tables, schema)
¶
Module-level helper so ChunkWorker can call this without a Spindle instance.