scd2_file_drops

sqllocks_spindle.simulation.scd2_file_drops

SCD Type 2 file-drop simulator — generate initial full loads and daily deltas with SCD2-style versioning (valid_from / valid_to / is_current tracking).

Produces an initial snapshot followed by num_delta_days daily delta files, each containing INSERT rows for new business entities and UPDATE pairs for changed entities (expired old row + new current row).
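
As a rough illustration of the UPDATE pair described above, the sketch below builds the two rows that one changed entity might contribute to a delta file. It uses plain dicts rather than DataFrames to stay dependency-free. The helper `make_update_pair` is hypothetical and not part of the module, and two details are assumptions: that open-ended rows carry a null `valid_to`, and that the expired row's `valid_to` equals the change date. The simulator itself may represent these differently.

```python
def make_update_pair(current_row, changes, change_date):
    """Return (expired_row, new_current_row) for one changed entity.

    Hypothetical helper for illustration only; not part of the module.
    """
    # Close the old version: stamp its end date and clear the current flag.
    expired = {**current_row, "valid_to": change_date, "is_current": False}
    # Open the new version: apply the changed attributes and restart validity.
    replacement = {
        **current_row,
        **changes,
        "valid_from": change_date,
        "valid_to": None,  # assumption: null marks an open-ended row
        "is_current": True,
    }
    return expired, replacement


current = {
    "customer_id": 7,
    "tier": "silver",
    "valid_from": "2024-01-01",
    "valid_to": None,
    "is_current": True,
}
expired, replacement = make_update_pair(current, {"tier": "gold"}, "2024-01-05")
```

An INSERT row for a brand-new entity would look like `replacement` alone: a single current row with no expired counterpart.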

Classes

SCD2FileDropConfig dataclass

Configuration for SCD2 file-drop simulation.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| domain | str | Domain name used in path construction (e.g. "retail"). | 'default' |
| base_path | str | Root directory for landing files. | 'Files/landing' |
| business_key_column | str | Column that identifies the business entity. | 'id' |
| scd2_columns | list[str] | Columns to track for changes. | [] |
| effective_date_column | str | Name of the valid-from column. | 'valid_from' |
| end_date_column | str | Name of the valid-to column. | 'valid_to' |
| is_current_column | str | Name of the is-current flag column. | 'is_current' |
| initial_load_date | str | Date string for the initial snapshot (YYYY-MM-DD). | '2024-01-01' |
| num_delta_days | int | Number of daily delta files to generate. | 30 |
| daily_change_rate | float | Fraction of records that change per day. | 0.05 |
| daily_new_rate | float | Fraction of new records per day (relative to initial count). | 0.02 |
| formats | list[str] | File formats to write ("parquet", "csv", "jsonl"). | ['parquet'] |
| manifest_enabled | bool | Write a _manifest.json alongside each drop. | True |
| seed | int | Random seed for reproducibility. | 42 |
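
To get a feel for how the date and rate parameters interact, the following sketch estimates what each daily delta might contain. It is not the simulator's actual logic. `plan_deltas` is a hypothetical helper, and it assumes that the first delta lands the day after `initial_load_date`, that both rates are applied to the initial row count, and that counts are rounded to the nearest integer.

```python
from datetime import date, timedelta


def plan_deltas(
    initial_count,
    initial_load_date="2024-01-01",
    num_delta_days=30,
    daily_change_rate=0.05,
    daily_new_rate=0.02,
):
    """Estimate per-day delta contents (hypothetical helper, illustration only)."""
    start = date.fromisoformat(initial_load_date)
    plan = []
    for day in range(1, num_delta_days + 1):
        plan.append(
            {
                # Assumption: one delta per calendar day after the initial load.
                "delta_date": (start + timedelta(days=day)).isoformat(),
                # Each update contributes two rows (expired + new current).
                "updates": round(initial_count * daily_change_rate),
                "inserts": round(initial_count * daily_new_rate),
            }
        )
    return plan


plan = plan_deltas(initial_count=1000, num_delta_days=3)
```

Under these assumptions, a 1,000-row initial load with the default rates would see roughly 50 changed entities and 20 new entities per day.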

SCD2FileDropResult dataclass

Result of an SCD2 file-drop simulation run.

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| initial_load_path | Path | Path to the initial full-load file. |
| delta_paths | list[Path] | Paths to daily delta files. |
| manifest_paths | list[Path] | Paths to _manifest.json files. |
| stats | dict[str, Any] | Aggregate statistics for the simulation run. |
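
Once the files behind `initial_load_path` and `delta_paths` have been loaded, a downstream consumer would typically fold each delta into a history table and check that every business key still has exactly one current row. The sketch below shows one way to do that on rows held as plain dicts. Both functions are hypothetical and not part of the module. They assume the default column names (`id`, `valid_from`, `is_current`) and that an expired row in a delta shares its business key and `valid_from` with the version it closes.

```python
from collections import Counter


def apply_delta(history, delta, key="id"):
    """Fold one delta's rows into an SCD2 history (lists of dicts)."""
    # Index existing versions by (business key, valid_from).
    merged = {(row[key], row["valid_from"]): row for row in history}
    for row in delta:
        # An expired row overwrites the version it closes; a new current row
        # (an INSERT, or the second half of an UPDATE pair) is added.
        merged[(row[key], row["valid_from"])] = row
    return list(merged.values())


def keys_without_single_current(history, key="id"):
    """Return business keys that do not have exactly one current row."""
    current_counts = Counter(row[key] for row in history if row["is_current"])
    all_keys = {row[key] for row in history}
    return sorted(k for k in all_keys if current_counts[k] != 1)


history = [
    {"id": 1, "tier": "silver", "valid_from": "2024-01-01", "valid_to": None, "is_current": True},
]
delta = [
    # UPDATE pair for id 1: expired old row + new current row.
    {"id": 1, "tier": "silver", "valid_from": "2024-01-01", "valid_to": "2024-01-02", "is_current": False},
    {"id": 1, "tier": "gold", "valid_from": "2024-01-02", "valid_to": None, "is_current": True},
    # INSERT for a new entity.
    {"id": 2, "tier": "bronze", "valid_from": "2024-01-02", "valid_to": None, "is_current": True},
]
history = apply_delta(history, delta)
violations = keys_without_single_current(history)
```

Applying deltas in date order and running the check after each one makes it easier to see which day's file introduced a problem.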

SCD2FileDropSimulator

Simulate an upstream source landing SCD2-versioned files over time.

Generates an initial full snapshot and then daily delta files containing INSERT rows (new entities) and UPDATE rows (changed entities with valid_from / valid_to / is_current tracking).

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| tables | dict[str, DataFrame] | Mapping of entity_name -> DataFrame from a generation result. | required |
| config | SCD2FileDropConfig | SCD2FileDropConfig instance controlling paths and simulation parameters. | required |

Example::

    from sqllocks_spindle.simulation.scd2_file_drops import (
        SCD2FileDropSimulator,
        SCD2FileDropConfig,
    )

    cfg = SCD2FileDropConfig(
        domain="retail",
        business_key_column="customer_id",
        scd2_columns=["status", "address", "tier"],
    )
    # gen_result is the result of an earlier generation run; its .tables
    # attribute supplies the entity_name -> DataFrame mapping.
    result = SCD2FileDropSimulator(tables=gen_result.tables, config=cfg).run()

Methods:
run()

Execute the SCD2 file-drop simulation and return results.