file_drop

sqllocks_spindle.simulation.file_drop

File-drop simulator: simulates upstream sources landing files over a date range.

Classes

FileDropConfig dataclass

Configuration for file-drop simulation.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| domain | str | Domain name used in path construction (e.g. "retail"). | 'default' |
| base_path | str | Root directory for landing files. Maps to Files/landing in a Fabric lakehouse. | 'Files/landing' |
| cadence | str | Drop cadence: "daily", "hourly", or "every_15m". | 'daily' |
| date_range_start | str | Inclusive start date as YYYY-MM-DD. | '' |
| date_range_end | str | Inclusive end date as YYYY-MM-DD. | '' |
| partitioning | str | Partition folder template. | 'dt=YYYY-MM-DD' |
| formats | list[str] | File formats to write ("parquet", "csv", "jsonl"). | ['parquet'] |
| file_naming | str | File naming template. Placeholders: {domain}, {entity}, {dt}, {seq}, {ext}. | '{domain}_{entity}_{dt}_{seq}.{ext}' |
| entities | list[str] | Restrict simulation to these table names. Empty = all tables. | [] |
| manifest_enabled | bool | Write a _manifest.json per partition folder. | True |
| done_flag_enabled | bool | Write a _done sentinel per partition folder. | True |
| lateness_enabled | bool | Inject late-arriving rows (data from previous days). | True |
| lateness_probability | float | Per-row probability of being marked late. | 0.1 |
| max_days_late | int | Maximum staleness for late rows, in days. | 3 |
| duplicates_enabled | bool | Inject duplicate rows. | False |
| duplicate_probability | float | Per-row probability of duplication. | 0.02 |
| backfill_enabled | bool | Re-drop historical partitions. | False |
| max_days_back | int | How far back a backfill can reach, in days. | 0 |
| seed | int | Random seed for reproducibility. | 42 |
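
To make the interplay of cadence, the date range, partitioning, and file_naming more concrete, here is a minimal standalone sketch that does not use the library. The helper names iter_slots and build_path are hypothetical, and so are the cadence step sizes, the base/domain/entity/partition folder layout, and the unpadded {seq} value. The actual simulator may resolve these differently.

```python
from datetime import datetime, timedelta
from pathlib import PurePosixPath

# Assumed step sizes for the three documented cadence values.
CADENCE_STEP = {
    "daily": timedelta(days=1),
    "hourly": timedelta(hours=1),
    "every_15m": timedelta(minutes=15),
}


def iter_slots(start, end, cadence="daily"):
    """Yield slot start times covering the inclusive YYYY-MM-DD range."""
    step = CADENCE_STEP[cadence]
    current = datetime.fromisoformat(start)
    # The end date is inclusive, so stop at midnight of the following day.
    stop = datetime.fromisoformat(end) + timedelta(days=1)
    while current < stop:
        yield current
        current += step


def build_path(base_path, domain, entity, slot, seq, ext,
               partitioning="dt=YYYY-MM-DD",
               file_naming="{domain}_{entity}_{dt}_{seq}.{ext}"):
    """Hypothetical layout: base_path/domain/entity/partition/file name."""
    dt = slot.strftime("%Y-%m-%d")
    # Substitute the date into the partition folder template.
    partition = partitioning.replace("YYYY-MM-DD", dt)
    name = file_naming.format(domain=domain, entity=entity, dt=dt, seq=seq, ext=ext)
    return PurePosixPath(base_path) / domain / entity / partition / name


slots = list(iter_slots("2024-01-01", "2024-01-03", "daily"))
print(build_path("Files/landing", "retail", "orders", slots[0], 1, "parquet"))
```

Under these assumptions a daily cadence over three days yields three slots, and an every_15m cadence yields 96 slots per day.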

FileDropResult dataclass

Result of a file-drop simulation run.

Attributes:

| Name | Type | Description |
|---|---|---|
| files_written | list[Path] | All data file paths written. |
| manifest_paths | list[Path] | Paths to _manifest.json files. |
| done_flag_paths | list[Path] | Paths to _done sentinel files. |
| stats | dict[str, Any] | Per-entity statistics dict. |
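
A downstream consumer would typically treat the _done sentinel as the signal that a partition is complete before reading it. The sketch below shows one way to do that with only the standard library. The helpers ready_partitions and load_manifest are hypothetical and not part of the package, and the manifest is loaded as opaque JSON because its schema is not documented here.

```python
import json
from pathlib import Path


def ready_partitions(landing_root):
    """Return partition folders under landing_root that contain a _done sentinel."""
    return sorted(flag.parent for flag in Path(landing_root).rglob("_done"))


def load_manifest(partition):
    """Load _manifest.json from a partition folder, or return None if absent."""
    manifest = Path(partition) / "_manifest.json"
    if not manifest.exists():
        return None
    # No keys are assumed; the caller decides how to interpret the content.
    return json.loads(manifest.read_text(encoding="utf-8"))
```

Partitions without a _done file are simply skipped, which mirrors how an ingestion job might ignore folders that are still being written.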

FileDropSimulator

Simulate an upstream source dropping files on a cadence over a date range.

For each simulated time slot, the simulator:
  1. Slices rows belonging to that slot (temporal column or round-robin).
  2. Writes partitioned data files to disk.
  3. Optionally writes a manifest and done-flag.
  4. Optionally injects late arrivals, duplicates, and backfills.
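
Steps 1 and 4 can be illustrated with a small standalone sketch over plain dictionaries rather than DataFrames. It is only one possible reading of the behaviour described above: the function names, the event_date key, and the choice to draw late rows from the preceding max_days_late days are all assumptions, not the library's implementation.

```python
import random
from datetime import date, timedelta


def slice_for_slot(rows, slot_date, date_key="event_date"):
    """Step 1 (temporal variant): keep rows whose date equals the slot date."""
    return [r for r in rows if r[date_key] == slot_date]


def late_candidates(rows, slot_date, max_days_late=3, date_key="event_date"):
    """Rows dated within max_days_late days before the slot (slot itself excluded)."""
    earliest = slot_date - timedelta(days=max_days_late)
    return [r for r in rows if earliest <= r[date_key] < slot_date]


def inject_anomalies(rows, prior_rows, rng,
                     lateness_probability=0.1, duplicate_probability=0.02):
    """Step 4 sketch: append late rows and duplicates to the current slot's rows."""
    out = list(rows)
    # Each earlier-dated row independently has a chance of arriving in this drop.
    out += [r for r in prior_rows if rng.random() < lateness_probability]
    # Each current row independently has a chance of being emitted a second time.
    out += [r for r in rows if rng.random() < duplicate_probability]
    return out


# Six rows dated 2024-01-01 through 2024-01-06, one per day.
rows = [{"id": i, "event_date": date(2024, 1, 1) + timedelta(days=i)} for i in range(6)]
slot = date(2024, 1, 5)
rng = random.Random(42)  # a fixed seed keeps the injected anomalies reproducible
drop = inject_anomalies(slice_for_slot(rows, slot), late_candidates(rows, slot), rng)
```

Passing a seeded random.Random instance plays the same role as the seed parameter: two runs with the same seed inject the same late and duplicate rows.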

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| tables | dict[str, DataFrame] | Mapping of table_name -> DataFrame (from sqllocks_spindle.engine.generator.GenerationResult). | required |
| config | FileDropConfig | FileDropConfig controlling paths, cadence, and data-quality anomalies. | required |

Example::

    from sqllocks_spindle.simulation import FileDropSimulator, FileDropConfig

    # gen_result is assumed to be an existing GenerationResult from the generator.
    cfg = FileDropConfig(
        domain="retail",
        date_range_start="2024-01-01",
        date_range_end="2024-01-31",
    )
    result = FileDropSimulator(tables=gen_result.tables, config=cfg).run()

Methods:

run()

Execute the file-drop simulation and return results.