chunked_generator
sqllocks_spindle.engine.chunked_generator
Chunked generation engine for billion-row scale.
Two-pass approach:
1. Generate ALL parent/dimension/reference tables fully (in-memory).
2. For each child table in dependency order, yield chunk_size rows at a time.
Each chunk shares the same IDManager so FK references are valid.
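The two passes can be illustrated with a minimal, self-contained sketch. It does not use the library: `ToyIDManager`, `generate_parent`, and `iter_child_chunks` are hypothetical names invented for this illustration, and plain lists of dicts stand in for DataFrames. The point it shows is that every child chunk draws its foreign keys from one shared key registry.

```python
import random


class ToyIDManager:
    """Hypothetical stand-in for a shared ID manager: records primary keys per table."""

    def __init__(self):
        self.ids = {}

    def register(self, table, keys):
        self.ids.setdefault(table, []).extend(keys)

    def pick(self, table, rng):
        # Return one already-registered key, so the reference always resolves.
        return rng.choice(self.ids[table])


def generate_parent(n_rows, id_manager):
    """Pass 1: build the whole parent table in memory and register its keys."""
    rows = [{"customer_id": i} for i in range(1, n_rows + 1)]
    id_manager.register("customer", [r["customer_id"] for r in rows])
    return rows


def iter_child_chunks(total_rows, chunk_size, id_manager, seed=0):
    """Pass 2: yield child rows one chunk at a time, all using the same manager."""
    rng = random.Random(seed)
    next_id = 1
    while next_id <= total_rows:
        stop = min(next_id + chunk_size, total_rows + 1)
        yield [
            {"order_id": i, "customer_id": id_manager.pick("customer", rng)}
            for i in range(next_id, stop)
        ]
        next_id = stop


ids = ToyIDManager()
parents = generate_parent(5, ids)
# list() is used only to inspect this toy; a real consumer would process each chunk as it arrives.
chunks = list(iter_child_chunks(total_rows=12, chunk_size=5, id_manager=ids))
chunk_sizes = [len(c) for c in chunks]
```

Only one chunk needs to be alive at a time in pass 2, which is what keeps memory bounded regardless of the child table's total size.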
Classes
ChunkedGenerationResult
dataclass
Result of a chunked generation run.
Parent tables are fully materialized (small). Child tables are available
only via iter_chunks() to keep memory bounded.
Methods:
iter_chunks(table_name)
Yield DataFrames of chunk_size rows for a child table.
Must be called in dependency order. Each table can only be iterated once.
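Because a table's chunks can be consumed only once, any per-chunk work (writing, counting, validating) has to happen in the same pass. The sketch below uses an ordinary Python generator, `fake_iter_chunks`, as a hypothetical stand-in for `iter_chunks()`; what the real method does on a second call is not stated here, so the exhausted-generator behavior shown is that of plain Python generators only.

```python
def fake_iter_chunks(total_rows, chunk_size):
    """Hypothetical stand-in for iter_chunks(): yields lists of row ids, once."""
    for start in range(0, total_rows, chunk_size):
        yield list(range(start, min(start + chunk_size, total_rows)))


stream = fake_iter_chunks(total_rows=10, chunk_size=4)

row_count = 0
largest_chunk = 0
for chunk in stream:
    # Do everything that needs this chunk now; it will not be produced again.
    row_count += len(chunk)
    largest_chunk = max(largest_chunk, len(chunk))

# A second pass over the same plain-Python generator yields nothing.
leftover = list(stream)
```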
write_with(writer, **kwargs)
Convenience: write parent tables, then stream child chunks through a writer.
The writer must implement either write_table(table_name, df, **kwargs) for individual DataFrames, or stage_chunk(table_name, chunk_df, idx) + copy_into(table_name) for bulk writers.
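The two writer shapes described above might look like the following sketch. Both classes are hypothetical and exist only to show the method signatures; `stream_table` is an illustrative driver and makes no claim about how `write_with` is actually implemented. Plain lists stand in for DataFrames, since only `len()` is used.

```python
class RowCountingWriter:
    """Hypothetical per-table writer: implements write_table only."""

    def __init__(self):
        self.rows_written = {}

    def write_table(self, table_name, df, **kwargs):
        # A real writer would persist df here; this one only counts rows.
        self.rows_written[table_name] = self.rows_written.get(table_name, 0) + len(df)


class StagingWriter:
    """Hypothetical bulk writer: implements stage_chunk + copy_into."""

    def __init__(self):
        self.staged = {}
        self.loaded = {}

    def stage_chunk(self, table_name, chunk_df, idx):
        # Stage each chunk under its index, e.g. as one file per chunk.
        self.staged.setdefault(table_name, {})[idx] = chunk_df

    def copy_into(self, table_name):
        # Load every staged chunk for the table in index order, then clear the stage.
        chunks = self.staged.pop(table_name, {})
        self.loaded[table_name] = sum(len(chunks[i]) for i in sorted(chunks))


def stream_table(writer, table_name, chunks):
    """Illustrative driver only; not the library's write_with implementation."""
    if hasattr(writer, "stage_chunk") and hasattr(writer, "copy_into"):
        for idx, chunk in enumerate(chunks):
            writer.stage_chunk(table_name, chunk, idx)
        writer.copy_into(table_name)
    else:
        for chunk in chunks:
            writer.write_table(table_name, chunk)


chunks = [[1, 2, 3], [4, 5]]  # lists stand in for DataFrames; only len() is used
simple, bulk = RowCountingWriter(), StagingWriter()
stream_table(simple, "orders", iter(chunks))
stream_table(bulk, "orders", iter(chunks))
```

The staged form suits targets where one bulk load per table is cheaper than many small writes; the per-table form is simpler when each chunk can be written independently.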
ChunkedSpindle
Generate billion-row datasets in bounded memory.
Uses a two-pass approach:
1. Parent tables generated fully in-memory (typically small).
2. Child tables generated in chunks of chunk_size rows.
Example::
    cs = ChunkedSpindle()
    result = cs.generate_chunked(
        domain=FinancialDomain(),
        scale="warehouse",
        chunk_size=1_000_000,
    )

    # Parent tables are immediately available
    for name, df in result.parent_tables.items():
        print(f"{name}: {len(df)} rows")

    # Child tables stream via iterator
    for table_name in result.child_table_names:
        for chunk in result.iter_chunks(table_name):
            writer.write(chunk)
Methods:
generate_chunked(domain=None, schema=None, scale=None, scale_overrides=None, seed=None, chunk_size=1000000, target_table=None, target_count=None)
Generate data with chunked child tables.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| domain | | A Domain instance. | None |
| schema | Any | Path to .spindle.json, raw dict, or parsed SpindleSchema. | None |
| scale | str \| None | Scale preset name. | None |
| scale_overrides | dict[str, int] \| None | Override row counts for specific tables. | None |
| seed | int \| None | Random seed for reproducibility. | None |
| chunk_size | int | Rows per chunk for child tables. | 1000000 |
| target_table | str \| None | Anchor table name. All other table counts are derived proportionally from this table's target_count. | None |
| target_count | int \| None | Number of rows for the anchor table. Required when target_table is provided. | None |
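As a rough illustration of what proportional derivation from an anchor table could mean, the sketch below scales a set of baseline row counts by the ratio between target_count and the anchor table's baseline. The function `derive_counts`, the baseline numbers, the rounding, and the minimum of one row are all assumptions made for this example; the library's actual rule may differ.

```python
def derive_counts(base_counts, target_table, target_count):
    """Hypothetical helper: scale every table by target_count / base_counts[target_table]."""
    if target_count is None:
        raise ValueError("target_count is required when target_table is provided")
    if target_table not in base_counts:
        raise KeyError(f"unknown table: {target_table}")

    factor = target_count / base_counts[target_table]
    return {
        # The anchor gets exactly target_count; other tables are scaled and kept >= 1.
        name: target_count if name == target_table else max(1, round(n * factor))
        for name, n in base_counts.items()
    }


# Made-up baseline counts, for illustration only.
base = {"customer": 1_000, "account": 2_500, "transaction": 50_000}
derived = derive_counts(base, "transaction", 1_000_000)
```

With these made-up numbers the scaling factor is 20, so the ratios between tables are preserved while the anchor table lands exactly on the requested count.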
Returns:
| Type | Description |
|---|---|
| ChunkedGenerationResult | ChunkedGenerationResult with parent tables materialized and child tables available via iter_chunks(). |