Incremental & Day 2 Generation¶

Generate realistic Day 2 data: CDC deltas (inserts, updates, deletes) and monthly point-in-time snapshots with configurable growth, churn, and seasonality.

Continue: CDC Deltas¶

Generate incremental changes from existing data — tagged with _delta_type and _delta_timestamp for downstream merge logic.

Quick Start¶

# Generate baseline
spindle generate retail --scale small --format csv --output ./day1/

# Generate Day 2 deltas
spindle continue retail --input ./day1/ --output ./deltas/ --inserts 100 --seed 42

from sqllocks_spindle import Spindle, RetailDomain, ContinueEngine, ContinueConfig

# Day 1 baseline
result = Spindle().generate(domain=RetailDomain(), scale="small", seed=42)

# Day 2 deltas
engine = ContinueEngine()
deltas = engine.continue_from(
    existing=result,
    config=ContinueConfig(
        insert_count=100,
        update_fraction=0.10,
        delete_fraction=0.02,
        state_transitions={
            "order.status": {
                "pending":   {"shipped": 0.7, "cancelled": 0.3},
                "shipped":   {"delivered": 0.9, "returned": 0.1},
            },
        },
        seed=42,
    ),
)

print(deltas.summary())
# Inserts: 100, Updates: 500, Deletes: 100

# Access combined deltas with metadata columns
combined = deltas.combined["order"]
print(combined[["order_id", "_delta_type", "_delta_timestamp"]].head())

Delta Semantics¶

Delta Type	Behavior
INSERT	New rows with PKs offset above existing max. FKs reference both existing and newly-inserted parents.
UPDATE	Existing rows with Markov state transitions applied first, then non-key columns perturbed (±10% numeric, shuffled categorical).
DELETE	Existing rows returned as-is (soft delete). Full row preserved with `_delta_type="DELETE"`.

Metadata Columns¶

Every delta row gets two columns:

Column	Description
`_delta_type`	`"INSERT"`, `"UPDATE"`, or `"DELETE"`
`_delta_timestamp`	Batch-level timestamp (`pd.Timestamp.now()`)

State Transitions (Markov)¶

Model realistic business workflow evolution:

state_transitions={
    "order.status": {
        "pending":   {"shipped": 0.7, "cancelled": 0.3},
        "shipped":   {"delivered": 0.9, "returned": 0.1},
    },
    "claim.status": {
        "submitted": {"under_review": 0.8, "rejected": 0.2},
        "under_review": {"approved": 0.7, "rejected": 0.3},
    },
}

Rows with no matching transition rule are left unchanged.

ContinueConfig Options¶

ContinueConfig(
    insert_count=100,               # New rows per anchor table
    update_fraction=0.1,            # 10% of existing rows updated
    delete_fraction=0.02,           # 2% of existing rows soft-deleted
    state_transitions={...},        # Markov state evolution
    timestamp_column="_delta_timestamp",  # Customizable column name
    delta_type_column="_delta_type",      # Customizable column name
    seed=42,                        # Reproducibility
)

CLI Reference¶

spindle continue DOMAIN_NAME [OPTIONS]

Option	Default	Description
`DOMAIN_NAME`	—	Domain to continue (required)
`--input`	—	Directory with existing data files (required)
`--output, -o`	—	Output directory for delta files
`--format`	`csv`	Output format: `csv`, `parquet`, `jsonl`
`--inserts`	`100`	New rows per anchor table
`--update-fraction`	`0.1`	Fraction of rows to update (0.0–1.0)
`--delete-fraction`	`0.02`	Fraction of rows to soft-delete (0.0–1.0)
`--seed`	`42`	Random seed

Time-Travel: Monthly Snapshots¶

Generate N+1 monthly point-in-time snapshots with configurable growth, churn, and seasonality — ideal for testing Lakehouse temporal queries and partitioned storage.

Quick Start¶

spindle time-travel retail --months 12 --output ./snapshots/ --format parquet \
  --growth-rate 0.08 --churn-rate 0.03

from sqllocks_spindle import Spindle, RetailDomain, TimeTravelEngine, TimeTravelConfig

engine = TimeTravelEngine()
result = engine.generate(
    domain=RetailDomain(),
    config=TimeTravelConfig(
        months=12,
        start_date="2025-01-01",
        growth_rate=0.05,        # 5% net growth per month
        churn_rate=0.02,         # 2% soft-delete per month
        seasonality={11: 1.5, 12: 2.0},  # Holiday spike
        seed=42,
    ),
    scale="small",
)

print(result.summary())

# Access individual snapshots
jan = result.get_snapshot(0)
dec = result.get_snapshot(12)
print(f"Jan customers: {jan.row_counts['customer']}")
print(f"Dec customers: {dec.row_counts['customer']}")

Monthly Evolution¶

Each month applies three transformations per table:

Growth — add len(df) * growth_rate * seasonality_multiplier new rows
Churn — remove len(df) * churn_rate rows
Updates — perturb len(df) * update_fraction numeric columns by ±10%

Partitioned Output¶

Combine all snapshots into single DataFrames with a _snapshot_date partition column:

partitioned = result.to_partitioned_dfs()

for table_name, df in partitioned.items():
    df.to_parquet(f"output/{table_name}.parquet")
    # Each row has _snapshot_date column for downstream partitioning

TimeTravelConfig Options¶

TimeTravelConfig(
    months=12,                      # Number of monthly snapshots (produces months+1 total)
    start_date="2025-01-01",        # Month 0 date
    growth_rate=0.05,               # 5% net growth per month
    seasonality={11: 1.5, 12: 2.0}, # Month → multiplier (default 1.0)
    churn_rate=0.02,                # 2% soft-delete per month
    update_fraction=0.1,            # 10% of rows modified per month
    seed=42,                        # Reproducibility
)

CLI Reference¶

spindle time-travel DOMAIN_NAME [OPTIONS]

Option	Default	Description
`DOMAIN_NAME`	—	Domain to snapshot (required)
`--months`	`12`	Number of monthly snapshots
`--scale, -s`	`small`	Scale preset
`--output, -o`	—	Output directory (required)
`--format`	`parquet`	Output format: `csv`, `parquet`
`--growth-rate`	`0.05`	Monthly growth rate (0.0–1.0)
`--churn-rate`	`0.02`	Monthly churn rate (0.0–1.0)
`--seed`	`42`	Random seed

Output Structure (CLI)¶

./snapshots/
  month_0/
    customer.parquet
    order.parquet
  month_1/
    customer.parquet
    order.parquet
  ...
  month_12/