Tutorial 18: Fidelity Reporting

Duration: ~20 minutes
Level: Intermediate
Prerequisites: Complete Tutorial 01 (Hello Spindle) and have a CSV or Parquet file of real data handy.
Extras required: pip install "sqllocks-spindle[inference]"

What You'll Build

A pipeline that:

Profiles a real CSV file with DataProfiler
Infers a Spindle schema with SchemaBuilder
Generates synthetic data with Spindle.generate()
Measures statistical fidelity with FidelityReport
Identifies failing columns and iterates on the schema

Step 1: Set Up

# Quote the package spec so shells such as zsh don't expand the brackets
pip install "sqllocks-spindle[inference]"

We'll use a sample orders CSV for this tutorial. If you don't have real data, create a synthetic one first:

from sqllocks_spindle import Spindle, RetailDomain

result = Spindle().generate(domain=RetailDomain(), scale="small", seed=42)
real_orders = result.tables["order"]
real_orders.to_csv("real_orders.csv", index=False)
print(f"Created real_orders.csv: {len(real_orders):,} rows")

Step 2: Profile the Real Data

from sqllocks_spindle.inference import DataProfiler, ProfileIO

profiler = DataProfiler(
    fit_threshold=0.80,   # columns with KS fit < 0.80 get empirical strategy
    sample_rows=None,     # full scan for small files; use an int for large files
)

profile = profiler.from_csv("real_orders.csv")  # use the configured profiler

# Save the profile for reuse
ProfileIO.save(profile, "orders_profile.json")
print("Profile saved.")
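To build intuition for what fit_threshold controls, the sketch below scores how well a normal curve fits a column and picks a strategy from that score. It is not Spindle's implementation: the score formula (1 minus the Kolmogorov-Smirnov distance), the choice of a normal distribution, the "parametric" label, and the helper names ks_statistic and pick_strategy are all assumptions made for illustration.

```python
import random
from statistics import NormalDist, fmean, stdev


def ks_statistic(sample, cdf):
    """Largest gap between the sample's empirical CDF and a reference CDF."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        d = max(d, abs(f - i / n), abs((i + 1) / n - f))
    return d


def pick_strategy(sample, fit_threshold=0.80):
    """Hypothetical rule: fit score = 1 - KS distance to a fitted normal."""
    fitted = NormalDist(fmean(sample), stdev(sample))
    fit_score = 1.0 - ks_statistic(sample, fitted.cdf)
    strategy = "parametric" if fit_score >= fit_threshold else "empirical"
    return fit_score, strategy


rng = random.Random(42)
bell_shaped = [rng.gauss(100, 15) for _ in range(2000)]
# Two well-separated clusters: a single normal curve fits this poorly.
bimodal = [rng.gauss(0, 1) for _ in range(1000)] + [rng.gauss(50, 1) for _ in range(1000)]

for name, data in [("bell_shaped", bell_shaped), ("bimodal", bimodal)]:
    score, strategy = pick_strategy(data)
    print(f"{name}: fit_score={score:.2f} -> {strategy}")
```

Under this assumed rule, the bell-shaped column scores above the threshold and keeps a fitted distribution, while the bimodal column scores below it and falls back to the empirical strategy.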

Step 3: Inspect the Profile

from sqllocks_spindle.inference import ProfileIO

profile = ProfileIO.load("orders_profile.json")

# Inspect one table
table = profile.tables["real_orders"]  # key matches CSV filename stem
for col in table.columns:
    print(f"  {col.name}: null_rate={col.null_rate:.2%}, fit_score={col.fit_score}")
    if col.quantiles:
        print(f"    → empirical strategy queued (fit < threshold)")
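If a reported null_rate looks surprising, it can help to recompute it by hand. The sketch below does that with the standard csv module on a few made-up rows. The column names and values are illustrative, and treating an empty string as null is an assumption that may not match how the profiler counts missing values.

```python
import csv
import io

# Made-up stand-in for a few rows of real_orders.csv; column names are illustrative.
sample_csv = io.StringIO(
    "order_id,order_amount,coupon_code\n"
    "1,19.99,\n"
    "2,5.00,SAVE10\n"
    "3,,\n"
    "4,42.50,\n"
)


def null_rates(handle):
    """Fraction of empty cells per column, treating '' as null."""
    reader = csv.DictReader(handle)
    counts = {name: 0 for name in reader.fieldnames}
    total = 0
    for row in reader:
        total += 1
        for name, value in row.items():
            if value == "":
                counts[name] += 1
    return {name: (missing / total if total else 0.0) for name, missing in counts.items()}


rates = null_rates(sample_csv)
for name, rate in rates.items():
    print(f"  {name}: null_rate={rate:.2%}")
```

To run the same check on your own file, pass open("real_orders.csv", newline="") in place of the in-memory sample.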

Step 4: Build a Schema

from sqllocks_spindle.inference import SchemaBuilder

schema, registry = SchemaBuilder().build(
    profile,
    domain_name="orders",
    fit_threshold=0.80,
    correlation_threshold=0.5,
    include_anomaly_registry=True,
)

print("Schema built.")
print(f"Suggested registry anomaly types: {[type(a).__name__ for a in registry.anomalies]}")
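One plausible reading of correlation_threshold is "treat column pairs whose absolute correlation is at least this value as related". The sketch below illustrates that idea with a hand-written Pearson coefficient. How SchemaBuilder actually measures or uses correlation is not specified here, so the metric, the helper names pearson and correlated_pairs, and the sample values are all assumptions.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def correlated_pairs(columns, threshold=0.5):
    """Column pairs whose absolute correlation meets the threshold."""
    pairs = []
    for a, b in combinations(columns, 2):
        r = pearson(columns[a], columns[b])
        if abs(r) >= threshold:
            pairs.append((a, b, round(r, 3)))
    return pairs


# Made-up numeric columns; names and values are illustrative only.
columns = {
    "quantity": [1, 2, 3, 4, 5, 6],
    "order_amount": [10.0, 19.5, 31.0, 39.0, 52.5, 58.0],
    "day_of_week": [3, 1, 6, 2, 5, 2],
}

pairs = correlated_pairs(columns, threshold=0.5)
for a, b, r in pairs:
    print(f"{a} <-> {b}: r={r}")
```

With these values only quantity and order_amount clear the 0.5 bar; day_of_week is nearly uncorrelated with both.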

Step 5: Generate with Fidelity Scoring

from sqllocks_spindle import Spindle

result, fidelity = Spindle().generate(
    schema,
    seed=42,
    fidelity_profile=profile,
)

print(f"Generated {result.total_rows:,} rows across {len(result.table_names)} tables")

Step 6: Read the Fidelity Report

# Print the full per-column table
fidelity.summary()

# Find columns scoring below 85 out of 100
failing = fidelity.failing_columns(threshold=85.0)
if failing:
    print("\nFailing columns:")
    for table, col, score in failing:
        print(f"  {table}.{col}: {score:.1f}/100")
else:
    print("\nAll columns above threshold.")

# Export as a DataFrame
df_scores = fidelity.to_dataframe()
print(df_scores.sort_values("score").head(10))
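The threshold filter is simple enough to reproduce on plain tuples, which is handy when you want to rank failures by how far they fall short. The sketch below assumes the (table, column, score) shape that the loop above unpacks. The scores are made up, and below_threshold is a hypothetical helper, not part of the library.

```python
# Made-up scores, in the (table, column, score) shape the loop above unpacks.
scores = [
    ("real_orders", "order_id", 99.1),
    ("real_orders", "order_amount", 72.4),
    ("real_orders", "order_date", 88.0),
    ("real_orders", "status", 61.3),
]


def below_threshold(rows, threshold=85.0):
    """Return rows scoring under the threshold, worst first."""
    return sorted((r for r in rows if r[2] < threshold), key=lambda r: r[2])


worst_first = below_threshold(scores)
for table, col, score in worst_first:
    print(f"  {table}.{col}: {score:.1f}/100 (gap {85.0 - score:.1f})")
```

Sorting worst first puts the columns most in need of attention at the top of the list.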

Step 7: Iterate on Failing Columns

If any columns fail, the most common fixes are:

A. Force empirical strategy. Columns whose fit score is below fit_threshold get the empirical strategy, so raise the threshold to move more columns onto quantile interpolation:

schema = SchemaBuilder().build(profile, domain_name="orders", fit_threshold=0.95)
result, fidelity = Spindle().generate(schema, seed=42, fidelity_profile=profile)
fidelity.summary()

B. Check the column in the profile — if it has an unusual distribution that the profiler doesn't capture well, inspect the quantiles field and ensure it's populated:

table = profile.tables["real_orders"]
col = next(c for c in table.columns if c.name == "order_amount")
print(col.quantiles)  # should show p1..p99 values

C. Manually set quantiles in the schema JSON — for full control, open the inferred .spindle.json and adjust the strategy directly.
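For intuition about what quantile interpolation means in the fixes above, here is a minimal inverse-CDF sketch: draw a uniform probability, then interpolate linearly between stored quantile knots. The knot values are made up, the helper names are hypothetical, and Spindle's empirical strategy may interpolate differently or handle the tails beyond p1 and p99 in its own way.

```python
import random
from bisect import bisect_right

# Made-up quantile knots (p1, p25, p50, p75, p99) for an amount-like column.
PROBS = [0.01, 0.25, 0.50, 0.75, 0.99]
VALUES = [2.0, 18.0, 35.0, 60.0, 240.0]


def quantile_value(u, probs=PROBS, values=VALUES):
    """Linearly interpolate the value at probability u between stored knots."""
    if not probs[0] <= u <= probs[-1]:
        raise ValueError("u must lie within the stored quantile range")
    # Index of the knot to the right of u, clamped so u == probs[-1] still works.
    i = min(bisect_right(probs, u), len(probs) - 1)
    p0, p1 = probs[i - 1], probs[i]
    v0, v1 = values[i - 1], values[i]
    return v0 + (v1 - v0) * (u - p0) / (p1 - p0)


def sample_column(n, seed=42):
    """Draw n synthetic values by pushing uniform draws through the knots."""
    rng = random.Random(seed)
    return [quantile_value(rng.uniform(PROBS[0], PROBS[-1])) for _ in range(n)]


synthetic = sample_column(10_000)
print(f"min={min(synthetic):.1f} max={max(synthetic):.1f}")
print(f"median={sorted(synthetic)[len(synthetic) // 2]:.1f}")
```

Because samples are drawn directly from the recorded quantiles, the synthetic column follows the shape of the real one even when no standard distribution fits, which is why a sparse or empty quantiles field leaves the empirical strategy with little to work from.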

Step 8: Save the Report

import json
from pathlib import Path

Path("reports").mkdir(exist_ok=True)
# Use a context manager so the file is flushed and closed
with open("reports/fidelity_report.json", "w") as f:
    json.dump(fidelity.to_dict(), f, indent=2)
fidelity.to_dataframe().to_csv("reports/fidelity_scores.csv", index=False)
print("Reports saved to reports/")

Summary

You've completed the fidelity reporting loop:

Step	Tool
Profile real data	`DataProfiler.from_csv()`
Infer schema	`SchemaBuilder().build()`
Generate + score	`Spindle().generate(..., fidelity_profile=...)`
Identify gaps	`fidelity.failing_columns()`
Iterate	Raise `fit_threshold` to force the `empirical` strategy

Next Steps

Lakehouse Profiling — run this same pipeline against a Fabric Delta table
Guide: Fidelity Scoring — deep dive on scoring dimensions
Guide: Empirical Distributions — how quantile interpolation works