Tutorial 08: Validation Gates¶

Run automated quality checks against your data, catch violations before they reach downstream consumers, and quarantine bad records for investigation.

Prerequisites¶

Completed Tutorial 07: Chaos Engineering (or equivalent experience)
Familiarity with Spindle.generate() and domain objects
Understanding of data quality concerns (nulls, duplicates, orphaned FKs)

What You'll Learn¶

What validation gates are and how the gate framework works
How to run individual gates: ReferentialIntegrityGate, NullConstraintGate, UniqueConstraintGate, SchemaConformanceGate
How to orchestrate multiple gates with GateRunner
How to interpret GateResult objects
How to quarantine failed records with QuarantineManager
How to combine chaos injection with validation for end-to-end testing

Time Estimate¶

~15 minutes

The Gate Framework¶

A validation gate is a single, focused check that answers one question about your data:

Gate	Question It Answers
`ReferentialIntegrityGate`	Do all foreign keys point to valid parent records?
`NullConstraintGate`	Are NOT NULL columns free of nulls?
`UniqueConstraintGate`	Are unique/primary key columns actually unique?
`SchemaConformanceGate`	Do the actual columns match the expected schema?

Gates are the safety net between data generation (or ingestion) and consumption. They catch problems early -- before bad data poisons your dashboards, ML models, or reports.

Step 1 -- Generate Clean Data and Create a Validation Context¶

Start by generating clean data and wrapping it in a ValidationContext. Clean data should pass all gates -- verifying this first confirms the gates do not produce false positives:

from sqllocks_spindle import Spindle, RetailDomain
from sqllocks_spindle.validation import (
    GateRunner,
    ValidationContext,
    ReferentialIntegrityGate,
    NullConstraintGate,
    UniqueConstraintGate,
)

result = Spindle().generate(domain=RetailDomain(), scale="small", seed=42)

# Create a validation context
ctx = ValidationContext(tables=result.tables, schema=result.schema)
print(f"Validation context created with {len(ctx.tables)} tables.")

The ValidationContext bundles your tables and schema metadata together so each gate has everything it needs.

Step 2 -- Run the Referential Integrity Gate¶

This gate checks every foreign-key relationship defined in the schema and verifies that all child records reference valid parent records:

ri_gate = ReferentialIntegrityGate()
ri_result = ri_gate.check(ctx)

print(f"Status: {'PASS' if ri_result.passed else 'FAIL'}")
print(f"Checks performed: {ri_result.checks_performed}")
print(f"Violations: {ri_result.violation_count}")

Orphaned records (child rows with no matching parent) cause silent data loss in JOINs. This gate catches them before they reach your queries.

Step 3 -- Run the Null Constraint Gate¶

This gate checks that columns marked as NOT NULL in the schema contain no null values:

null_gate = NullConstraintGate()
null_result = null_gate.check(ctx)

print(f"Status: {'PASS' if null_result.passed else 'FAIL'}")
print(f"Checks performed: {null_result.checks_performed}")
print(f"Violations: {null_result.violation_count}")

Unexpected nulls are the most common data quality issue. They cause calculation errors, broken aggregations, and misleading visualizations.

Step 4 -- Run the Unique Constraint Gate¶

This gate verifies that primary key and unique columns contain no duplicate values:

unique_gate = UniqueConstraintGate()
unique_result = unique_gate.check(ctx)

print(f"Status: {'PASS' if unique_result.passed else 'FAIL'}")
print(f"Checks performed: {unique_result.checks_performed}")
print(f"Violations: {unique_result.violation_count}")

Duplicate keys cause incorrect JOIN fan-outs, double-counting in aggregations, and subtle bugs that are hard to trace.

Step 5 -- Run the Schema Conformance Gate¶

This gate checks whether the actual table columns match the expected schema -- catching schema drift before it causes downstream failures:

from sqllocks_spindle.validation.gates import SchemaConformanceGate

sc_gate = SchemaConformanceGate()
sc_result = sc_gate.check(ctx)

print(f"Status: {'PASS' if sc_result.passed else 'FAIL'}")

Step 6 -- Orchestrate with GateRunner¶

Instead of running gates individually, use the GateRunner to execute all gates in sequence and produce a unified report. This is the pattern you would use in a production pipeline:

runner = GateRunner(
    gates=[ReferentialIntegrityGate(), NullConstraintGate(), UniqueConstraintGate()]
)

runner_result = runner.run(ctx)

print(f"Overall status: {'PASS' if runner_result.all_passed else 'FAIL'}")
print(f"Gates run: {runner_result.gates_run}")
print(f"Gates passed: {runner_result.gates_passed}")
print(f"Gates failed: {runner_result.gates_failed}")

for gate_name, gate_result in runner_result.results.items():
    status = "PASS" if gate_result.passed else "FAIL"
    print(f"  {gate_name}: {status} ({gate_result.violation_count} violations)")

The gate runner gives you a single pass/fail decision for your entire dataset. It also collects all violations across all gates, making it easy to generate a quality report or trigger alerts.

Step 7 -- Inject Chaos and Watch Gates Catch It¶

Now for the payoff: intentionally corrupt the data, then run the gates again. This proves your quality checks work end-to-end:

from sqllocks_spindle.chaos import ChaosConfig, ChaosEngine

# Apply chaos to create bad data
chaos_config = ChaosConfig(enabled=True, intensity="stormy", seed=99)
engine = ChaosEngine(chaos_config)

corrupted_tables = {}
for name, df in result.tables.items():
    corrupted_tables[name] = engine.corrupt_dataframe(df.copy(), day=5)

# Create a new context with corrupted data
corrupted_ctx = ValidationContext(tables=corrupted_tables, schema=result.schema)

# Run all gates
corrupted_result = runner.run(corrupted_ctx)

print(f"Overall status: {'PASS' if corrupted_result.all_passed else 'FAIL'}")
print(f"Gates passed: {corrupted_result.gates_passed}")
print(f"Gates failed: {corrupted_result.gates_failed}")

for gate_name, gate_result in corrupted_result.results.items():
    status = "PASS" if gate_result.passed else "FAIL"
    print(f"  {gate_name}: {status} ({gate_result.violation_count} violations)")
    if not gate_result.passed and gate_result.details:
        for detail in gate_result.details[:3]:
            print(f"    -> {detail}")

Step 8 -- Quarantine Bad Records¶

Quarantining is better than rejecting an entire batch. Good records flow through to downstream consumers, while bad records go to a quarantine area for investigation.

Manual Quarantine by Splitting DataFrames¶

You can split a DataFrame into clean and quarantined subsets based on gate results:

customers_corrupted = corrupted_tables["customers"]
customers_original = result.tables["customers"]

# Identify NOT NULL columns from the original schema
not_null_cols = [col for col in customers_original.columns
                 if customers_original[col].isnull().sum() == 0]

# Split into clean and quarantine
has_violation = customers_corrupted[not_null_cols].isnull().any(axis=1)
clean_df = customers_corrupted[~has_violation].copy()
quarantine_df = customers_corrupted[has_violation].copy()

print(f"Total records: {len(customers_corrupted)}")
print(f"Clean records: {len(clean_df)}")
print(f"Quarantined records: {len(quarantine_df)}")

QuarantineManager for Persistent Storage¶

For production workflows, use QuarantineManager to write quarantined records to a dedicated folder with metadata:

from sqllocks_spindle.validation.quarantine import QuarantineManager
from pathlib import Path

QUARANTINE_DIR = Path("./demo_quarantine")
RUN_ID = "run_20250115_001"

qm = QuarantineManager(domain="retail")

# Quarantine the failed table
path = qm.quarantine_dataframe(
    df=corrupted_tables["order"],
    quarantine_root=QUARANTINE_DIR,
    run_id=RUN_ID,
    table_name="order",
    reason="Referential integrity violations detected",
    gate_name="referential_integrity",
    fmt="parquet",
)
print(f"Quarantined order table -> {path}")

# List all quarantined items
inventory = qm.list_quarantined(QUARANTINE_DIR)
print(f"Items in quarantine: {len(inventory)}")
for item in inventory:
    print(f"  - {item}")

Step 9 -- Inspect GateResult Properties¶

Each GateResult provides structured information about what passed, what failed, and why:

print(f"gate_name: {ri_bad.gate_name}")
print(f"passed:    {ri_bad.passed}")
print(f"errors:    {ri_bad.errors[:2]}")
print(f"warnings:  {ri_bad.warnings[:2]}")
print(f"details:   {ri_bad.details}")

Use these properties to build automated alerting, write quality reports, or drive remediation workflows.

Run It Yourself

Notebook: T15_validation_gates.ipynb

Script: 18_validation_gates.py

Validation Guide -- full reference for all gate types, custom gates, and quarantine configuration

Next Step¶

Continue to Tutorial 09: Composite Domains to learn how to generate multi-domain datasets with cross-domain foreign-key relationships.