Schema Learning¶

Spindle's inference pipeline lets you profile existing data, generate synthetic copies, compare fidelity, and mask PII — all from the CLI or Python API.

Pipeline Overview¶

Real Data (CSV/Parquet)
    │
    ▼
spindle learn     →  .spindle.json schema
    │
    ▼
spindle generate  →  Synthetic data
    │
    ▼
spindle compare   →  Fidelity report (real vs synthetic)
    │
    ▼
spindle mask      →  Masked copy of real data

Learn: Infer Schema from Data¶

Profile existing CSV, Parquet, or JSONL files and produce a .spindle.json schema with fitted distributions.

Quick Start¶

spindle learn ./real_data/ --format csv --output inferred.spindle.json --domain my_retail

from sqllocks_spindle.inference import DataProfiler, SchemaBuilder
import pandas as pd

# Option A: profile a dict of DataFrames
profiler = DataProfiler()
profile = profiler.profile_dataset({
    "customer": pd.read_csv("customer.csv"),
    "order": pd.read_csv("order.csv"),
})

# Option B: profile a single DataFrame (alias for profile_dataset)
profile = profiler.profile({"orders": orders_df})

# Option C: convenience classmethod for CSV files
profile = DataProfiler.from_csv("orders.csv")
profile = DataProfiler.from_csv("orders.csv", sample_rows=50_000)

builder = SchemaBuilder()
schema = builder.build(profile, domain_name="my_retail")

# Build with correlation detection and an auto-suggested AnomalyRegistry
schema, suggested_registry = SchemaBuilder().build(
    profile,
    domain_name="my_retail",
    fit_threshold=0.80,          # columns with KS fit < 0.80 → empirical strategy
    correlation_threshold=0.5,   # pairs with |r| >= 0.5 → included in copula config
    include_anomaly_registry=True,
)

DataProfiler Configuration¶

DataProfiler(
    fit_threshold=0.80,      # KS fit score below this → use empirical strategy instead of parametric
    top_n_values=500,        # max cardinality tracked in value_counts for enum detection
    outlier_iqr_factor=1.5,  # IQR multiplier for outlier detection (standard = 1.5)
    sample_rows=None,        # None = full scan; int = random sample
)

What Gets Detected¶

Feature	Detection Method	Threshold
Column types	pandas dtype + heuristics	Integer, float, boolean, date, datetime, string
Null rates	Count / total rows	Exact
Primary keys	Unique + non-null + naming hints (`_id`, `pk`)	100% unique
Foreign keys	Naming convention (`*_id`) + 90% value overlap with parent PK	Cross-table
Enums	Cardinality < 50 or ratio < 5%	With weighted probabilities
Distributions	KS test (scipy): normal, uniform, exponential, lognormal	p-value > 0.05
Quantile fingerprint	P1/P5/P10/P25/P50/P75/P90/P95/P99	Empirical fallback when KS fit < threshold
Outlier rate	IQR method (1.5× IQR)	Calibrates AnomalyRegistry
String length	min/mean/max/p95	Bounds string generation
Temporal histograms	24-bin hour + 7-bin day-of-week	Drives temporal strategy profiles
Column correlations	Pearson (numeric columns only)	Gaussian copula post-pass
String patterns	Regex: email, UUID, phone, date, SSN, IP (v4+v6), MAC, IBAN, currency code, language code, postal code	90% match rate

CLI Reference¶

spindle learn INPUT_PATH [OPTIONS]

Option	Default	Description
`INPUT_PATH`	—	Directory or single file (required)
`--output, -o`	—	Output `.spindle.json` path
`--format`	`csv`	Input format: `csv`, `parquet`, `jsonl`
`--domain`	`inferred`	Domain name for the generated schema

Compare: Fidelity Report¶

Compare real data against synthetic data and produce a scored fidelity report.

Quick Start¶

spindle compare ./real_data/ ./synthetic_data/ --format csv --output report.md

from sqllocks_spindle.inference.comparator import FidelityComparator, FidelityReport

# Option A: compare two DataFrames directly
report = FidelityReport.score(real_df, synthetic_df, table_name="orders")
report.summary()                       # prints per-column score table
failing = report.failing_columns()     # list of (table, column, score) tuples below threshold
df_scores = report.to_dataframe()      # pandas DataFrame for downstream analysis
report_dict = report.to_dict()         # serializable dict

# Option B: compare full datasets (multi-table)
comp = FidelityComparator()
report = comp.compare(
    real={"customer": real_df, "order": real_order_df},
    synthetic={"customer": synth_df, "order": synth_order_df},
)

print(f"Overall fidelity: {report.overall_score:.1f}/100")
print(report.to_markdown())

Inline Fidelity During Generation¶

Pass fidelity_profile to Spindle.generate() to get a FidelityReport in the same call:

from sqllocks_spindle import Spindle
from sqllocks_spindle.inference import DataProfiler, SchemaBuilder, ProfileIO

# Profile real data once and save
profiler = DataProfiler()
profile = profiler.profile_dataset({"orders": real_df})
ProfileIO.save(profile, "orders_profile.json")

# Later: generate and score in one step
profile = ProfileIO.load("orders_profile.json")
schema = SchemaBuilder().build(profile)
result, fidelity = Spindle().generate(schema, seed=42, fidelity_profile=profile)

fidelity.summary()

Scoring Breakdown¶

Each column is scored 0–100 based on:

Metric	Points	Applies To
dtype match	10	All columns
Null rate delta	10	All columns
Cardinality ratio	10	All columns
Mean delta	20	Numeric columns
Std ratio	10	Numeric columns
KS statistic	10	Numeric columns
Value overlap	20	Categorical columns
Chi-squared	20	Categorical columns

Table scores are column averages. Overall score is a weighted average across tables.

CLI Reference¶

spindle compare REAL_PATH SYNTH_PATH [OPTIONS]

Option	Default	Description
`REAL_PATH`	—	Path to real data directory (required)
`SYNTH_PATH`	—	Path to synthetic data directory (required)
`--format`	`csv`	Input format: `csv`, `parquet`
`--output, -o`	—	Save markdown report to file

Mask: PII Anonymization¶

Mask sensitive columns in real datasets. Spindle auto-detects PII by column name and content patterns.

Quick Start¶

spindle mask ./real_data/ --output ./masked/ --format csv --seed 42

from sqllocks_spindle.inference.masker import DataMasker, MaskConfig

masker = DataMasker()
result = masker.mask(
    tables={"customer": customer_df, "order": order_df},
    config=MaskConfig(
        seed=42,
        preserve_nulls=True,
        preserve_fks=True,
        exclude_columns=["customer_id", "order_id"],
    ),
)

masked_customers = result.tables["customer"]
print(f"Masked {result.columns_masked} columns across {result.tables_masked} tables")

Auto-Detected PII Types¶

PII Type	Detection	Masking Method
`email`	Column name or regex pattern	Faker replacement
`phone`	Column name or regex pattern	Faker replacement
`name` / `first_name` / `last_name`	Column name	Faker replacement
`address` / `city` / `state` / `zip`	Column name	Faker replacement
`ssn`	Column name	Faker replacement
`credit_card`	Column name	Faker replacement
`ip_address`	Column name	Faker replacement
`username`	Column name	Faker replacement
`date_of_birth`	Column name	Random date shift

MaskConfig Options¶

MaskConfig(
    seed=42,                    # Reproducible masking
    locale="en_US",             # Faker locale
    preserve_nulls=True,        # Keep NULLs as NULL
    preserve_distributions=True,# Maintain statistical shape
    preserve_fks=True,          # Don't mask FK columns
    pii_columns={               # Override auto-detection
        "customer.secret_field": "email",
    },
    exclude_columns=["id"],     # Never mask these
)

CLI Reference¶

spindle mask INPUT_PATH [OPTIONS]

Option	Default	Description
`INPUT_PATH`	—	Directory or file to mask (required)
`--output, -o`	—	Output directory (required)
`--format`	`csv`	Input format: `csv`, `parquet`, `jsonl`
`--seed`	`42`	Random seed
`--exclude`	—	Columns to skip (repeatable)

End-to-End Example¶

# 1. Profile real data and generate a schema
spindle learn ./production_export/ --format parquet --output inferred.spindle.json

# 2. Generate synthetic data from the inferred schema
spindle generate --schema inferred.spindle.json --scale medium --output ./synthetic/

# 3. Compare fidelity
spindle compare ./production_export/ ./synthetic/ --format parquet --output fidelity.md

# 4. Mask the original data for dev environments
spindle mask ./production_export/ --output ./masked/ --format parquet --seed 42