chunked_generator

sqllocks_spindle.engine.chunked_generator

Chunked generation engine for billion-row scale.

Two-pass approach:
  1. Generate ALL parent/dimension/reference tables fully (in-memory).
  2. For each child table in dependency order, yield chunk_size rows at a time. Each chunk shares the same IDManager so FK references are valid.
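The two passes can be illustrated with a minimal, standard-library sketch. It is not the engine's implementation: ToyIDManager, generate_parents and iter_child_chunks are hypothetical names, and plain lists of dicts stand in for DataFrames. It only shows why sharing one ID registry across chunks keeps every foreign key valid.

```python
import random
from typing import Dict, Iterator, List


class ToyIDManager:
    """Hypothetical stand-in for the shared IDManager: remembers parent keys."""

    def __init__(self) -> None:
        self._ids: Dict[str, List[int]] = {}

    def register(self, table: str, ids: List[int]) -> None:
        self._ids[table] = ids

    def sample(self, table: str, rng: random.Random) -> int:
        # Pick an existing parent key so the FK always resolves.
        return rng.choice(self._ids[table])


def generate_parents(ids: ToyIDManager, n_customers: int) -> List[dict]:
    # Pass 1: materialize the whole parent table and record its keys.
    rows = [{"customer_id": i} for i in range(1, n_customers + 1)]
    ids.register("customer", [r["customer_id"] for r in rows])
    return rows


def iter_child_chunks(
    ids: ToyIDManager, total_rows: int, chunk_size: int, seed: int
) -> Iterator[List[dict]]:
    # Pass 2: yield at most chunk_size rows at a time; every chunk draws
    # its foreign keys from the same ToyIDManager instance.
    rng = random.Random(seed)
    for start in range(0, total_rows, chunk_size):
        stop = min(start + chunk_size, total_rows)
        yield [
            {"order_id": i + 1, "customer_id": ids.sample("customer", rng)}
            for i in range(start, stop)
        ]
```

Only one child chunk exists in memory at a time, while the parent keys stay resident for the whole run.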

Classes

ChunkedGenerationResult dataclass

Result of a chunked generation run.

Parent tables are fully materialized (small). Child tables are available only via iter_chunks() to keep memory bounded.

Methods:
iter_chunks(table_name)

Yield DataFrames of chunk_size rows for a child table.

Must be called in dependency order. Each table can only be iterated once.
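The "iterate once" rule can be pictured with a small guard object. This is a sketch under assumptions: OneShotChunks is a hypothetical class, and raising RuntimeError on a second call is an illustrative choice, since the page does not say how the real iter_chunks() reacts to repeated iteration.

```python
from typing import Callable, Dict, Iterator, List, Set


class OneShotChunks:
    """Hypothetical holder that hands out each table's chunk stream once."""

    def __init__(
        self, factories: Dict[str, Callable[[], Iterator[List[int]]]]
    ) -> None:
        self._factories = factories
        self._consumed: Set[str] = set()

    def iter_chunks(self, table_name: str) -> Iterator[List[int]]:
        # Checks run eagerly (this is a regular method, not a generator),
        # so misuse fails at call time rather than on the first next().
        if table_name not in self._factories:
            raise KeyError(f"unknown child table: {table_name}")
        if table_name in self._consumed:
            raise RuntimeError(f"{table_name} has already been iterated")
        self._consumed.add(table_name)
        return self._factories[table_name]()
```

A caller that needs the same chunks twice would have to persist them on the first pass or rerun generation with the same seed.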

write_with(writer, **kwargs)

Convenience: write parent tables, then stream child chunks through a writer.

The writer must implement either
  • write_table(table_name, df, **kwargs) for individual DataFrames, or
  • stage_chunk(table_name, chunk_df, idx) + copy_into(table_name) for bulk writers.
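The two writer shapes can be sketched as follows. The method names and argument order come from the list above; the class names, the stream_table helper, and the choice to prefer the bulk interface when both are present are assumptions made for illustration. Lists stand in for DataFrames.

```python
from typing import Any, Iterable, List, Tuple


class TableWriter:
    """Hypothetical per-DataFrame writer: one write_table call per chunk."""

    def __init__(self) -> None:
        self.calls: List[Tuple[Any, ...]] = []

    def write_table(self, table_name: str, df: Any, **kwargs: Any) -> None:
        self.calls.append(("write_table", table_name, len(df)))


class BulkWriter:
    """Hypothetical bulk writer: stage every chunk, then load once."""

    def __init__(self) -> None:
        self.calls: List[Tuple[Any, ...]] = []

    def stage_chunk(self, table_name: str, chunk_df: Any, idx: int) -> None:
        self.calls.append(("stage_chunk", table_name, idx))

    def copy_into(self, table_name: str) -> None:
        self.calls.append(("copy_into", table_name))


def stream_table(
    writer: Any, table_name: str, chunks: Iterable[Any], **kwargs: Any
) -> None:
    # Duck-typed dispatch; preferring the bulk path is an assumption.
    if hasattr(writer, "stage_chunk") and hasattr(writer, "copy_into"):
        for idx, chunk in enumerate(chunks):
            writer.stage_chunk(table_name, chunk, idx)
        writer.copy_into(table_name)
    elif hasattr(writer, "write_table"):
        for chunk in chunks:
            writer.write_table(table_name, chunk, **kwargs)
    else:
        raise TypeError("writer implements neither supported interface")
```

With the bulk shape, copy_into runs once per table after the last chunk is staged, which suits warehouse-style loaders that ingest staged files in one statement.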

ChunkedSpindle

Generate billion-row datasets in bounded memory.

Uses a two-pass approach:
  1. Parent tables generated fully in-memory (typically small).
  2. Child tables generated in chunks of chunk_size rows.
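To make the memory bound concrete, the following sketch computes the row ranges a child table would be split into. chunk_bounds is a hypothetical helper, not part of the library; it assumes chunks are contiguous and that only the last one may be shorter than chunk_size.

```python
from typing import List, Tuple


def chunk_bounds(total_rows: int, chunk_size: int) -> List[Tuple[int, int]]:
    """Return half-open (start, stop) row ranges covering total_rows."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        (start, min(start + chunk_size, total_rows))
        for start in range(0, total_rows, chunk_size)
    ]


# 2.5 million rows at the default chunk size give two full chunks and one
# half-size chunk; no more than chunk_size child rows are held at once.
bounds = chunk_bounds(2_500_000, 1_000_000)
```

Under these assumptions, peak child-table memory depends on chunk_size rather than on the total row count.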

Example::

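from sqllocks_spindle.engine.chunked_generator import ChunkedSpindle

# FinancialDomain and writer are assumed to be defined elsewhere;
# writer is any object exposing write_table(table_name, df).
cs = ChunkedSpindle()
result = cs.generate_chunked(
    domain=FinancialDomain(),
    scale="warehouse",
    chunk_size=1_000_000,
)

# Parent tables are immediately available
for name, df in result.parent_tables.items():
    print(f"{name}: {len(df)} rows")

# Child tables stream via iterator (each table can be iterated only once)
for table_name in result.child_table_names:
    for chunk in result.iter_chunks(table_name):
        writer.write_table(table_name, chunk)

Methods: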
generate_chunked(domain=None, schema=None, scale=None, scale_overrides=None, seed=None, chunk_size=1000000, target_table=None, target_count=None)

Generate data with chunked child tables.

Parameters:

Name (type, default): description
  • domain (default None): A Domain instance.
  • schema (Any, default None): Path to .spindle.json, raw dict, or parsed SpindleSchema.
  • scale (str | None, default None): Scale preset name.
  • scale_overrides (dict[str, int] | None, default None): Override row counts for specific tables.
  • seed (int | None, default None): Random seed for reproducibility.
  • chunk_size (int, default 1000000): Rows per chunk for child tables.
  • target_table (str | None, default None): Anchor table name. All other table counts are derived proportionally from this table's target_count.
  • target_count (int | None, default None): Number of rows for the anchor table. Required when target_table is provided.
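The anchor-table behaviour of target_table and target_count can be sketched as simple proportional scaling. derive_counts is a hypothetical helper: the base counts, the rounding rule, and the floor of one row per table are assumptions for illustration, not the library's documented arithmetic.

```python
from typing import Dict, Optional


def derive_counts(
    base_counts: Dict[str, int],
    target_table: str,
    target_count: Optional[int],
) -> Dict[str, int]:
    """Scale every table by the ratio target_count / base anchor count."""
    if target_table not in base_counts:
        raise KeyError(f"unknown anchor table: {target_table}")
    if target_count is None:
        raise ValueError("target_count is required when target_table is set")
    ratio = target_count / base_counts[target_table]
    return {
        # The anchor gets exactly target_count; other tables keep their
        # relative size, rounded and never dropping below one row.
        name: target_count if name == target_table else max(1, round(n * ratio))
        for name, n in base_counts.items()
    }


# Hypothetical preset: 1 customer : 5 orders : 20 order lines.
base = {"customer": 1_000, "order": 5_000, "order_line": 20_000}
scaled = derive_counts(base, "order", 1_000_000)
```

In this sketch, anchoring on order with one million rows scales the other two tables by the same factor of 200.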

Returns:

  • ChunkedGenerationResult: parent tables are materialized and child tables are available via iter_chunks().