Tutorial 14: Scenario Packs¶

Run pre-built YAML-defined end-to-end data generation workflows that bundle domain, scale, simulation mode, chaos, validation, and Fabric targets into a single declarative file.

Prerequisites¶

Python 3.10 or later
pip install sqllocks-spindle
Completed Tutorial 13: Medallion Architecture
Familiarity with YAML syntax

What You'll Learn¶

How to browse and inspect Spindle's 44 built-in scenario packs
How to validate a pack against a domain before running it
How to run file-drop, streaming, and hybrid packs with PackRunner
How to run a schema-drift pack with chaos injection
How to write custom packs from inline YAML
How to batch-run all packs for a domain (useful for CI)

What Are Scenario Packs?¶

A scenario pack is a YAML file that bundles an entire data generation workflow:

Domain + scale -- which data to generate and how much
Simulation mode -- file_drop, stream, or hybrid
Chaos configuration -- optional schema drift, orphan FKs, etc.
Validation gates -- which quality checks to run
Fabric target paths -- where to write output

Spindle ships 44 built-in packs covering 11 industry verticals and 4 simulation types:

Type	Description
`fd_daily_batch`	Daily file-drop with partitioning, manifest, done flag
`fd_schema_drift`	File-drop with chaos-injected schema drift
`st_realtime_events`	Pure streaming via EventEnvelope
`hy_stream_plus_microbatch`	Hybrid: batch files + stream events

Verticals: retail, healthcare, financial, supply_chain, iot, hr, insurance, marketing, education, real_estate, manufacturing

Step 1: Browse Built-in Packs¶

Use list_builtin() to see every pack that ships with Spindle:

import pandas as pd
from sqllocks_spindle.packs.loader import PackLoader, list_builtin, _BUILTIN_PACKS_ROOT

packs = list_builtin()
print(f"Total built-in packs: {len(packs)}")

pack_index = pd.DataFrame([
    {"domain": p.domain, "kind": p.kind, "id": p.id, "description": p.description}
    for p in packs
]).sort_values(["domain", "kind"])
pack_index

This returns a DataFrame with all 44 packs, sorted by domain and simulation type.

Step 2: Inspect a Pack¶

Load a specific pack from disk and examine its configuration:

from pathlib import Path

retail_pack_path = Path(_BUILTIN_PACKS_ROOT) / "retail" / "fd_daily_batch.yaml"
pack = PackLoader().load(retail_pack_path)

print(f"ID:          {pack.id}")
print(f"Kind:        {pack.kind}")
print(f"Domain:      {pack.domain}")
print(f"Description: {pack.description}")
print(f"Version:     {pack.pack_version}")

if pack.file_drop:
    print(f"\nFile-drop config:")
    print(f"  cadence:  {pack.file_drop.cadence}")
    print(f"  formats:  {pack.file_drop.formats}")

if pack.validation:
    print(f"\nValidation gates: {pack.validation.required_gates}")

Step 3: Validate a Pack Against a Domain¶

Before running, verify that a pack's configuration is compatible with the target domain. This catches issues like referencing tables that do not exist in the domain schema.

from sqllocks_spindle.packs.validator import PackValidator
from sqllocks_spindle.domains.retail.retail import RetailDomain

vr = PackValidator().validate(pack, RetailDomain())
print(f"Valid:    {vr.is_valid}")
print(f"Errors:   {vr.errors}")
print(f"Warnings: {vr.warnings}")

Step 4: Run a File-Drop Pack¶

The PackRunner orchestrates the full workflow: generate data, simulate the delivery pattern, validate, and write output.

from sqllocks_spindle.packs.runner import PackRunner

runner = PackRunner()

run_result = runner.run(
    pack=pack,
    domain=RetailDomain(),
    scale="fabric_demo",
    seed=42,
    base_path="/lakehouse/default/Files",
)

print(run_result.summary())
print(f"\nSuccess:       {run_result.is_success}")
print(f"Files written: {len(run_result.files_written)}")
print(f"Events:        {run_result.events_emitted:,}")
print(f"Elapsed:       {run_result.elapsed_time:.2f}s")

Step 5: Run a Schema-Drift Pack (Chaos Enabled)¶

The fd_schema_drift pack includes chaos injection. When you run it, the runner injects schema drift into the generated data and then runs validation gates to detect the issues.

drift_pack = PackLoader().load(
    Path(_BUILTIN_PACKS_ROOT) / "retail" / "fd_schema_drift.yaml"
)
print(f"Pack: {drift_pack.id}")
if drift_pack.failure_injection:
    print(f"Failure injection enabled: {drift_pack.failure_injection.enabled}")

drift_result = runner.run(
    pack=drift_pack,
    domain=RetailDomain(),
    scale="fabric_demo",
    seed=99,
    base_path="/lakehouse/default/Files",
)

print(f"\nSuccess:          {drift_result.is_success}")
print(f"Validation gates: {drift_result.validation_results}")
print(f"Errors:           {drift_result.errors}")

Step 6: Run a Streaming Pack¶

Streaming packs generate events instead of files:

stream_pack = PackLoader().load(
    Path(_BUILTIN_PACKS_ROOT) / "retail" / "st_realtime_events.yaml"
)

stream_result = runner.run(
    pack=stream_pack,
    domain=RetailDomain(),
    scale="fabric_demo",
    seed=42,
    base_path="/lakehouse/default/Files",
)

print(f"Success:        {stream_result.is_success}")
print(f"Events emitted: {stream_result.events_emitted:,}")

Step 7: Run a Hybrid Pack¶

Hybrid packs produce both batch files and streaming events simultaneously:

hybrid_pack = PackLoader().load(
    Path(_BUILTIN_PACKS_ROOT) / "retail" / "hy_stream_plus_microbatch.yaml"
)

hybrid_result = runner.run(
    pack=hybrid_pack,
    domain=RetailDomain(),
    scale="fabric_demo",
    seed=42,
    base_path="/lakehouse/default/Files",
)

print(f"Success:        {hybrid_result.is_success}")
print(f"Files written:  {len(hybrid_result.files_written)}")
print(f"Events emitted: {hybrid_result.events_emitted:,}")

Step 8: Batch-Run All Packs for a Domain¶

A useful pattern for CI: loop all 4 pack types for a given domain and report which pass.

domain = RetailDomain()
domain_name = "retail"
pack_types = [
    "fd_daily_batch", "fd_schema_drift",
    "st_realtime_events", "hy_stream_plus_microbatch",
]

report_rows = []
for pack_type in pack_types:
    p = PackLoader().load(
        Path(_BUILTIN_PACKS_ROOT) / domain_name / f"{pack_type}.yaml"
    )
    r = runner.run(
        pack=p, domain=domain, scale="fabric_demo", seed=42,
        base_path="/lakehouse/default/Files",
    )
    report_rows.append({
        "pack":    pack_type,
        "success": r.is_success,
        "files":   len(r.files_written),
        "events":  r.events_emitted,
        "elapsed": round(r.elapsed_time, 2),
        "errors":  r.errors,
    })

report_df = pd.DataFrame(report_rows)
report_df

Step 9: Custom Pack from Inline YAML¶

Write your own pack spec for one-off testing scenarios:

import tempfile, textwrap

CUSTOM_PACK_YAML = textwrap.dedent("""\
    pack_version: 1
    id: my_custom_pack
    kind: file_drop
    domain: retail
    description: Custom daily batch for demo

    fabric_targets:
      lakehouse_files_root: Files/landing/retail

    file_drop:
      cadence: daily
      partitioning: dt=YYYY-MM-DD
      formats: [parquet]
      entities: [customer, order]
      manifest:
        enabled: true
      done_flag:
        enabled: true
      lateness:
        enabled: false
      duplicates:
        enabled: false

    validation:
      required_gates: [schema_conformance]
""")

with tempfile.NamedTemporaryFile(mode="w", suffix=".yaml", delete=False) as f:
    f.write(CUSTOM_PACK_YAML)
    tmp_path = f.name

custom_pack = PackLoader().load(tmp_path)
custom_result = runner.run(
    pack=custom_pack,
    domain=RetailDomain(),
    scale="fabric_demo",
    seed=42,
    base_path="/lakehouse/default/Files",
)

print(f"Pack ID:       {custom_pack.id}")
print(f"Success:       {custom_result.is_success}")
print(f"Files written: {len(custom_result.files_written)}")

For production use, save your custom pack as a .yaml file in your repository and load it with PackLoader().load("path/to/my_pack.yaml").

Run It Yourself

Notebook: 08_scenario_packs.ipynb

Script: 19_scenario_packs.py

Simulation guide -- the condensed reference for simulation modes, file-drop cadences, and streaming configuration

Next Step¶

Tutorial 15: GSL Specs -- define reproducible generation pipelines in a single YAML file that ties together schema, chaos, outputs, and validation.