validation

sqllocks_spindle.validation

Validation gates and quarantine for Spindle data generation.

Classes

DistributionGate

Bases: ValidationGate

Check that numeric and enum columns match schema-declared distributions.

For columns with strategy="distribution": runs a Kolmogorov-Smirnov (KS) test comparing the actual data to the fitted scipy distribution from the schema. A KS p-value below alpha produces a warning rather than an error: distribution drift is expected at scale, and hard failures are reserved for broken data.

For columns with strategy="enum": runs a chi-squared test comparing observed category frequencies to expected probabilities. Missing expected values produce a warning.

Requires scipy. Skips gracefully (with a warning) when scipy is not installed. Configure significance threshold via context.config["distribution_alpha"] (default 0.05).
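The two tests described above can be sketched as follows. This is an illustration of the technique, not the gate's actual code: the function names ks_check and enum_check, the return strings, and the reading of "missing expected values" as "an expected category never appears" are all assumptions. Only scipy.stats.kstest and scipy.stats.chisquare are real library calls.

```python
from collections import Counter


def ks_check(values, dist_name, dist_args=(), alpha=0.05):
    """Return "pass", "warning", or "skipped" for one numeric column."""
    try:
        from scipy import stats  # optional dependency
    except ImportError:
        return "skipped"  # mirrors the documented graceful skip
    p_value = stats.kstest(values, dist_name, args=dist_args).pvalue
    # A small p-value means the sample is unlikely under the declared distribution.
    return "warning" if p_value < alpha else "pass"


def enum_check(values, expected_probs, alpha=0.05):
    """Return "pass", "warning", or "skipped" for one enum column."""
    counts = Counter(values)
    if any(counts[category] == 0 for category in expected_probs):
        return "warning"  # assumption: an expected category never appears
    try:
        from scipy import stats
    except ImportError:
        return "skipped"
    observed = [counts[category] for category in expected_probs]
    total = sum(observed)
    expected = [prob * total for prob in expected_probs.values()]
    p_value = stats.chisquare(observed, expected).pvalue
    return "warning" if p_value < alpha else "pass"


# Evenly spaced points in (0, 1) are consistent with a standard uniform distribution,
uniform_like = [(i + 0.5) / 100 for i in range(100)]
ks_uniform = ks_check(uniform_like, "uniform")
# but not with a normal distribution centred at 100.
ks_normal = ks_check(uniform_like, "norm", dist_args=(100, 1))

colors = ["red"] * 50 + ["green"] * 30 + ["blue"] * 20
enum_ok = enum_check(colors, {"red": 0.5, "green": 0.3, "blue": 0.2})
enum_missing = enum_check(colors, {"red": 0.5, "green": 0.3, "yellow": 0.2})
```

Both helpers return a status instead of raising, which matches the warning-only behaviour described for this gate.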

FileFormatGate

Bases: ValidationGate

Validate output files are readable, correct format, and not truncated.

Checks parquet, CSV, and JSONL files. Takes file paths from context.file_paths.
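As a rough idea of what such checks can look like, the sketch below uses only the standard library. It is not the gate's implementation: check_file is a hypothetical helper, and the parquet branch only verifies the "PAR1" magic bytes that open and close every Parquet file rather than parsing the footer.

```python
import csv
import json
import os
import tempfile
from pathlib import Path

PARQUET_MAGIC = b"PAR1"  # Parquet files start and end with these four bytes


def check_file(path):
    """Return a list of problems found in one output file (empty means OK)."""
    path = Path(path)
    if not path.is_file() or path.stat().st_size == 0:
        return [f"{path.name}: missing or empty"]
    suffix = path.suffix.lower()
    try:
        if suffix == ".parquet":
            with path.open("rb") as fh:
                head = fh.read(4)
                fh.seek(-4, os.SEEK_END)  # jump to the last four bytes
                tail = fh.read(4)
            if path.stat().st_size < 12 or head != PARQUET_MAGIC or tail != PARQUET_MAGIC:
                return [f"{path.name}: bad or truncated parquet envelope"]
        elif suffix == ".csv":
            with path.open(newline="", encoding="utf-8") as fh:
                rows = list(csv.reader(fh))
            if not rows or any(len(row) != len(rows[0]) for row in rows):
                return [f"{path.name}: inconsistent column count"]
        elif suffix == ".jsonl":
            with path.open(encoding="utf-8") as fh:
                for line in fh:
                    if line.strip():
                        json.loads(line)  # raises on a malformed or cut-off record
        else:
            return [f"{path.name}: unsupported extension"]
    except (OSError, ValueError, csv.Error) as exc:
        return [f"{path.name}: {exc}"]
    return []


# Write a few small sample files and check each one.
with tempfile.TemporaryDirectory() as tmp:
    samples = {
        "good.jsonl": b'{"id": 1}\n{"id": 2}\n',
        "cut.jsonl": b'{"id": 1}\n{"id": ',
        "good.csv": b"id,total\n1,9.5\n",
        "ragged.csv": b"id,total\n1\n",
        "cut.parquet": b"PAR1" + b"\x00" * 16,
    }
    problems = {}
    for name, payload in samples.items():
        (Path(tmp) / name).write_bytes(payload)
        problems[name] = check_file(Path(tmp) / name)
```

A production check would more likely open parquet files with a real reader such as pyarrow; the magic-byte test is only a cheap truncation signal.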

GateResult dataclass

Result from a single validation gate check.

GateRunner

Run validation gates against a context and collect results.

Methods:
available_gates() staticmethod

Return names of all registered gates.

register_gate(name, gate_cls) staticmethod

Register a custom gate in the global registry.

run_all(context)

Run all configured gates and return results.

run_gate(gate_name, context)

Run a single gate by name.

summary(results) staticmethod

Produce an aggregate summary of gate results.
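The registry-and-runner pattern behind these methods can be sketched in a few lines. Everything below is a stand-in: SketchResult, SketchGate, SketchRunner, NonEmptyGate, the result fields, and the dict-shaped context are hypothetical and do not reflect the real GateResult, ValidationContext, or GateRunner signatures beyond the method names listed above.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class SketchResult:
    """Hypothetical result record; the real GateResult fields may differ."""
    gate_name: str
    passed: bool
    errors: list = field(default_factory=list)


class SketchGate(ABC):
    @abstractmethod
    def check(self, context):
        """Return a SketchResult for the given context."""


class NonEmptyGate(SketchGate):
    """Toy gate: every table in the context must contain at least one row."""

    def check(self, context):
        empty = [name for name, rows in context["tables"].items() if not rows]
        return SketchResult("non_empty", not empty, [f"{t} is empty" for t in empty])


class SketchRunner:
    _registry = {}  # gate name -> gate class, shared by all runners

    @staticmethod
    def register_gate(name, gate_cls):
        SketchRunner._registry[name] = gate_cls

    @staticmethod
    def available_gates():
        return sorted(SketchRunner._registry)

    def run_gate(self, gate_name, context):
        return SketchRunner._registry[gate_name]().check(context)

    def run_all(self, context):
        return [self.run_gate(name, context) for name in self.available_gates()]

    @staticmethod
    def summary(results):
        failed = [r.gate_name for r in results if not r.passed]
        return {"total": len(results), "failed": failed, "passed": not failed}


SketchRunner.register_gate("non_empty", NonEmptyGate)
results = SketchRunner().run_all({"tables": {"orders": [{"id": 1}], "items": []}})
report = SketchRunner.summary(results)
```

Keeping the registry on the class rather than on each instance is what lets register_gate and available_gates work as static methods.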

NullConstraintGate

Bases: ValidationGate

Check that non-nullable columns have no null values.

RangeConstraintGate

Bases: ValidationGate

Check that numeric columns are within expected ranges.

Configure via context.config with a dict of:

    {
        "ranges": {
            "table_name.column_name": {"min": 0, "max": 100},
            ...
        }
    }
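To show how a "ranges" config of this shape can drive a check, here is a minimal sketch. It assumes inclusive bounds and skips nulls, neither of which the description above confirms; check_ranges is a hypothetical helper, and plain lists stand in for DataFrame columns.

```python
def check_ranges(tables, config):
    """Return {"table.column": out_of_range_count} for every configured column."""
    violations = {}
    for key, bounds in config.get("ranges", {}).items():
        table_name, column_name = key.split(".", 1)
        values = [v for v in tables[table_name][column_name] if v is not None]
        low = bounds.get("min", float("-inf"))   # a missing bound means "unbounded"
        high = bounds.get("max", float("inf"))
        bad = sum(1 for v in values if not low <= v <= high)
        if bad:
            violations[key] = bad
    return violations


tables = {"orders": {"quantity": [1, 5, 0, 250, None, -3]}}
config = {"ranges": {"orders.quantity": {"min": 0, "max": 100}}}
range_violations = check_ranges(tables, config)
```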

ReferentialIntegrityGate

Bases: ValidationGate

Check that all FK relationships hold across tables.

Every FK value in a child column must exist in the referenced parent PK column. Reports orphan counts per relationship.
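The orphan count per relationship amounts to a set-membership test. The sketch below is illustrative only: orphan_report and the tuple format for relationships are hypothetical, plain lists stand in for DataFrame columns, and treating a null FK as "not an orphan" is an assumption.

```python
def orphan_report(tables, relationships):
    """Count child FK values that have no matching parent PK value."""
    report = {}
    for child_table, child_col, parent_table, parent_col in relationships:
        parent_keys = set(tables[parent_table][parent_col])
        orphans = sum(
            1
            for value in tables[child_table][child_col]
            if value is not None and value not in parent_keys  # nulls are skipped
        )
        report[f"{child_table}.{child_col} -> {parent_table}.{parent_col}"] = orphans
    return report


tables = {
    "customers": {"id": [1, 2, 3]},
    "orders": {"customer_id": [1, 1, 4, None, 9]},
}
relationships = [("orders", "customer_id", "customers", "id")]
orphans = orphan_report(tables, relationships)
```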

SchemaConformanceGate

Bases: ValidationGate

Check that DataFrames match the expected schema.

Validates column names are present, data types are compatible, and no unexpected columns exist. Uses the SpindleSchema from context or an expected_schema dict from config.
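The three checks named above (missing columns, type mismatches, unexpected columns) can be illustrated with two {column: dtype} mappings. This is a simplified sketch: conformance_issues is a hypothetical helper, and it compares dtype strings for exact equality, whereas the gate is described as checking compatibility.

```python
def conformance_issues(actual, expected):
    """Compare {column: dtype} mappings for one table and list the differences."""
    issues = []
    for column, dtype in expected.items():
        if column not in actual:
            issues.append(f"missing column: {column}")
        elif actual[column] != dtype:  # simplification: exact match, not compatibility
            issues.append(f"type mismatch on {column}: expected {dtype}, got {actual[column]}")
    for column in actual:
        if column not in expected:
            issues.append(f"unexpected column: {column}")
    return issues


expected = {"id": "int64", "email": "object", "age": "int64"}
actual = {"id": "int64", "age": "float64", "nickname": "object"}
issues = conformance_issues(actual, expected)
```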

SchemaDriftGate

Bases: ValidationGate

Detect schema drift between current data and a baseline schema.

Detects:

- Additive changes (new columns, new tables)
- Breaking changes (removed columns, renamed columns, retyped columns)

Configure via context.config:

    {
        "baseline": {
            "table_name": {
                "columns": {"col1": "int64", "col2": "object", ...}
            },
            ...
        }
    }
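A baseline in this shape can be diffed against the current schema with plain dict operations. The sketch below is one possible approach, not the gate's code: detect_drift is a hypothetical helper, and it reports a rename as a removed column plus a new column because it makes no attempt to pair them.

```python
def detect_drift(baseline, current):
    """Classify schema differences as additive or breaking."""
    additive, breaking = [], []
    for table, spec in current.items():
        if table not in baseline:
            additive.append(f"new table: {table}")
            continue
        for column in spec["columns"]:
            if column not in baseline[table]["columns"]:
                additive.append(f"new column: {table}.{column}")
    for table, spec in baseline.items():
        # A table missing from `current` shows up as all of its columns removed.
        current_cols = current.get(table, {}).get("columns", {})
        for column, dtype in spec["columns"].items():
            if column not in current_cols:
                breaking.append(f"removed column: {table}.{column}")
            elif current_cols[column] != dtype:
                breaking.append(
                    f"retyped column: {table}.{column} ({dtype} -> {current_cols[column]})"
                )
    return {"additive": additive, "breaking": breaking}


baseline = {"orders": {"columns": {"id": "int64", "note": "object", "total": "float64"}}}
current = {
    "orders": {"columns": {"id": "int64", "total": "object", "status": "object"}},
    "refunds": {"columns": {"id": "int64"}},
}
drift = detect_drift(baseline, current)
```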

TemporalConsistencyGate

Bases: ValidationGate

Check temporal consistency of date/datetime columns.

Validates:

- Dates are within expected range (configurable)
- No unexpected future dates
- Temporal ordering (e.g., end_date >= start_date)

Configure via context.config:

    {
        "date_range": {"start": "2020-01-01", "end": "2025-12-31"},
        "no_future": ["table.column", ...],
        "ordering": [
            {"table": "orders", "start": "order_date", "end": "ship_date"},
            ...
        ]
    }
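The three rules in this config can be sketched with the standard datetime module. The helper temporal_violations is hypothetical, plain lists of date objects stand in for DataFrame columns, and applying date_range to every date-valued column is an assumption, since the description does not say which columns it covers.

```python
from datetime import date


def temporal_violations(tables, config, today=None):
    """Count violations per rule; `tables` maps table -> column -> list of dates."""
    today = today or date.today()
    found = {"out_of_range": {}, "future": {}, "ordering": {}}

    date_range = config.get("date_range")
    if date_range:
        start = date.fromisoformat(date_range["start"])
        end = date.fromisoformat(date_range["end"])
        for table, columns in tables.items():
            for column, values in columns.items():
                bad = sum(1 for v in values if isinstance(v, date) and not start <= v <= end)
                if bad:
                    found["out_of_range"][f"{table}.{column}"] = bad

    for key in config.get("no_future", []):
        table, column = key.split(".", 1)
        bad = sum(1 for v in tables[table][column] if v is not None and v > today)
        if bad:
            found["future"][key] = bad

    for rule in config.get("ordering", []):
        columns = tables[rule["table"]]
        pairs = zip(columns[rule["start"]], columns[rule["end"]])
        # A row violates the rule when its end date precedes its start date.
        bad = sum(1 for s, e in pairs if s is not None and e is not None and e < s)
        if bad:
            found["ordering"][f'{rule["table"]}.{rule["start"]} <= {rule["end"]}'] = bad
    return found


tables = {"orders": {
    "order_date": [date(2021, 3, 1), date(2019, 12, 31), date(2024, 6, 1)],
    "ship_date": [date(2021, 3, 4), date(2020, 1, 2), date(2024, 5, 30)],
}}
config = {
    "date_range": {"start": "2020-01-01", "end": "2025-12-31"},
    "no_future": ["orders.ship_date"],
    "ordering": [{"table": "orders", "start": "order_date", "end": "ship_date"}],
}
violations = temporal_violations(tables, config, today=date(2024, 1, 1))
```

Passing today explicitly keeps the future-date rule reproducible in tests.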

UniqueConstraintGate

Bases: ValidationGate

Check that primary key columns have no duplicate values.
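Duplicate detection reduces to counting key occurrences. The following sketch is illustrative: duplicate_keys is a hypothetical helper, and composite keys are handled by zipping the key columns into tuples, which is an assumption about how multi-column primary keys might be treated.

```python
from collections import Counter


def duplicate_keys(values):
    """Return {key: occurrences} for every key that appears more than once."""
    return {key: count for key, count in Counter(values).items() if count > 1}


single_key_dupes = duplicate_keys([1, 2, 2, 3, 3, 3])

# Composite key: (order_id, line_no) pairs built from two columns.
order_ids = [10, 10, 11, 11]
line_nos = [1, 2, 1, 1]
composite_dupes = duplicate_keys(zip(order_ids, line_nos))
```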

ValidationContext dataclass

Context passed to each validation gate.

ValidationGate

Bases: ABC

Abstract base class for all validation gates.

Methods:
check(context) abstractmethod

Run this gate's validation checks against the given context.

QuarantineEntry dataclass

Metadata for a single quarantined artifact.

QuarantineManager

Move or copy failed artifacts to a quarantine directory.

Quarantine directory layout:

    <quarantine_root>/<domain>/<run_id>/
        <filename>
        <filename>._quarantine_meta.json

Methods:
quarantine_file(source_path, quarantine_root, run_id, reason, gate_name='unknown')

Copy a file into the quarantine directory with metadata.

Returns the path to the quarantined copy.
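The copy-plus-sidecar idea can be sketched with the standard library. This is not the QuarantineManager implementation: quarantine_copy is a hypothetical helper, the explicit domain argument is an assumption (the documented signature does not show where the domain comes from), and the metadata field names are invented for illustration. Only the directory layout and the ._quarantine_meta.json suffix come from the description above.

```python
import json
import shutil
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def quarantine_copy(source_path, quarantine_root, domain, run_id, reason, gate_name="unknown"):
    """Copy a file into <root>/<domain>/<run_id>/ and write a metadata sidecar."""
    source = Path(source_path)
    target_dir = Path(quarantine_root) / domain / run_id
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / source.name
    shutil.copy2(source, target)  # copy, so the original stays in place
    meta = {  # hypothetical field names
        "source_path": str(source),
        "run_id": run_id,
        "reason": reason,
        "gate_name": gate_name,
        "quarantined_at": datetime.now(timezone.utc).isoformat(),
    }
    meta_path = target_dir / f"{source.name}._quarantine_meta.json"
    meta_path.write_text(json.dumps(meta, indent=2), encoding="utf-8")
    return target


with tempfile.TemporaryDirectory() as tmp:
    bad_file = Path(tmp) / "orders.csv"
    bad_file.write_text("id,total\n1,\n", encoding="utf-8")
    root = Path(tmp) / "quarantine"
    copied = quarantine_copy(bad_file, root, "retail", "run-001",
                             reason="null in non-nullable column",
                             gate_name="null_constraint")
    relative = copied.relative_to(root).as_posix()
    sidecar = json.loads(
        (copied.parent / "orders.csv._quarantine_meta.json").read_text(encoding="utf-8")
    )
    source_still_exists = bad_file.exists()
```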

quarantine_dataframe(df, quarantine_root, run_id, table_name, reason, gate_name='unknown', fmt='parquet')

Write a DataFrame to quarantine with metadata.

Parameters:

Name             Type         Description                                   Default
df               DataFrame    The DataFrame to quarantine.                  required
quarantine_root  str | Path   Root quarantine directory.                    required
run_id           str          Unique identifier for the generation run.     required
table_name       str          Logical table name.                           required
reason           str          Why this artifact was quarantined.            required
gate_name        str          Which validation gate triggered quarantine.   'unknown'
fmt              str          Output format: "parquet", "csv", or "jsonl".  'parquet'

Returns the path to the quarantined file.

list_quarantined(quarantine_root)

List all quarantined items across all domains and runs.

Returns a list of dicts with quarantine metadata.

get_quarantine_report(quarantine_root, run_id)

Get a detailed report for a specific run's quarantined artifacts.

Returns a dict with run-level summary and per-artifact details.
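Given the directory layout described for QuarantineManager, listing and reporting can both be built on a walk over the metadata sidecars. The sketch below is an assumption-laden illustration: list_sidecars and run_report are hypothetical helpers, and the report keys and the gate_name metadata field are invented rather than taken from the library.

```python
import json
import tempfile
from collections import defaultdict
from pathlib import Path

META_SUFFIX = "._quarantine_meta.json"


def list_sidecars(quarantine_root):
    """Collect every metadata sidecar under the root as a list of dicts."""
    root = Path(quarantine_root)
    entries = []
    for meta_path in sorted(root.rglob("*" + META_SUFFIX)):
        entry = json.loads(meta_path.read_text(encoding="utf-8"))
        # Path layout: <domain>/<run_id>/<filename>._quarantine_meta.json
        domain, run_id = meta_path.relative_to(root).parts[:2]
        entry.update(domain=domain, run_id=run_id,
                     file=meta_path.name[: -len(META_SUFFIX)])
        entries.append(entry)
    return entries


def run_report(quarantine_root, run_id):
    """Summarise one run: artifact count, count per gate, and per-artifact details."""
    artifacts = [e for e in list_sidecars(quarantine_root) if e["run_id"] == run_id]
    by_gate = defaultdict(int)
    for entry in artifacts:
        by_gate[entry.get("gate_name", "unknown")] += 1
    return {"run_id": run_id, "artifact_count": len(artifacts),
            "by_gate": dict(by_gate), "artifacts": artifacts}


with tempfile.TemporaryDirectory() as tmp:
    for domain, run_id, name, gate in [
        ("retail", "run-001", "orders.csv", "null_constraint"),
        ("retail", "run-001", "items.csv", "range_constraint"),
        ("retail", "run-002", "orders.csv", "null_constraint"),
    ]:
        run_dir = Path(tmp) / domain / run_id
        run_dir.mkdir(parents=True, exist_ok=True)
        (run_dir / (name + META_SUFFIX)).write_text(
            json.dumps({"gate_name": gate}), encoding="utf-8"
        )
    all_entries = list_sidecars(tmp)
    report = run_report(tmp, "run-001")
```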