Before You Start¶

New to synthetic data generation? This page covers what you need to know before diving into Spindle.

Prerequisites¶

Python 3.10+ installed (python.org or via your package manager)
pip for installing packages (included with Python)
Basic familiarity with pandas DataFrames (Spindle outputs DataFrames)
For Fabric workflows: access to a Microsoft Fabric workspace with a Lakehouse

No prior experience with synthetic data tools is required.
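
The first prerequisites can be checked from Python itself. This is a minimal sketch that uses only the standard library; the helper name `missing_prerequisites` is made up for this page and is not part of Spindle.

```python
import importlib.util
import sys


def missing_prerequisites(version=sys.version_info, modules=("pandas",)):
    """Return a list of human-readable problems; an empty list means ready."""
    problems = []
    # Compare only (major, minor) against the documented 3.10 minimum.
    if tuple(version[:2]) < (3, 10):
        problems.append(f"Python 3.10+ required, found {version[0]}.{version[1]}")
    # find_spec returns None when a top-level module cannot be located.
    for name in modules:
        if importlib.util.find_spec(name) is None:
            problems.append(f"module not installed: {name}")
    return problems


if __name__ == "__main__":
    for problem in missing_prerequisites():
        print(problem)
```

No output means the interpreter is new enough and pandas can be imported.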

What is Synthetic Data?¶

Synthetic data is artificially generated data that mimics the statistical properties and structure of real-world datasets — without containing any actual sensitive information. It's used for:

Development and testing — populate dashboards, test pipelines, train ML models
Demos and presentations — realistic data that tells a compelling story
Performance benchmarking — generate datasets at any scale (thousands to hundreds of millions of rows)
Data quality testing — intentionally inject chaos (nulls, duplicates, schema drift) to stress-test your pipeline
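
To make the last use case concrete, here is a rough sketch of chaos injection on a list of records. It is plain Python rather than Spindle's Chaos Engine: the function `inject_chaos` and its parameters are hypothetical, and it covers only nulls and duplicates, not schema drift.

```python
import random


def inject_chaos(rows, null_rate=0.1, duplicate_rate=0.05, seed=0):
    """Return a corrupted copy of rows (a list of dicts); the input is left untouched."""
    rng = random.Random(seed)
    corrupted = []
    for row in rows:
        # Null out each field independently with probability null_rate.
        dirty = {k: (None if rng.random() < null_rate else v) for k, v in row.items()}
        corrupted.append(dirty)
        # Occasionally emit the same record twice to simulate duplicate delivery.
        if rng.random() < duplicate_rate:
            corrupted.append(dict(dirty))
    return corrupted


clean = [{"order_id": i, "amount": round(10 + i * 1.5, 2)} for i in range(1000)]
dirty = inject_chaos(clean, null_rate=0.1, duplicate_rate=0.05, seed=42)

null_count = sum(v is None for row in dirty for v in row.values())
print(len(clean), len(dirty), null_count)
```

Feeding `dirty` instead of `clean` into a pipeline shows whether its null handling and deduplication steps actually work. The fixed seed keeps a failing run reproducible.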

How Spindle is Different¶

Most synthetic data tools either generate independent random values with no relationships between fields (Faker) or train ML models on existing data (SDV, MOSTLY AI). Spindle takes a third approach:

Rule-based and transparent — every generation rule is a human-readable .spindle.json schema you can inspect and tweak
Calibrated from real sources — all 13 domains sourced from published data (BLS, NAIC, NCES, NAR, FDIC, Federal Reserve, SEC, and 40+ more)
Schema-aware — generates tables in dependency order, respects FK integrity, handles composite keys
Fabric-native — targets every Microsoft Fabric data store (Lakehouse, Warehouse, SQL Database, Eventhouse, Semantic Models)
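
The schema-aware point is easiest to see in code. The sketch below is an illustration of the general idea, not Spindle's implementation. The four tables and their row counts are hypothetical, and composite keys are left out. It orders tables with the standard library's `graphlib.TopologicalSorter`, then draws every foreign key from parent rows that already exist.

```python
import random
from graphlib import TopologicalSorter

# Hypothetical schema: each table maps to the set of parent tables its foreign keys reference.
dependencies = {
    "customer": set(),
    "product": set(),
    "order": {"customer"},
    "order_line": {"order", "product"},
}

# static_order() yields every parent before any table that depends on it.
generation_order = list(TopologicalSorter(dependencies).static_order())

rng = random.Random(7)
row_counts = {"customer": 5, "product": 4, "order": 10, "order_line": 25}
tables = {}
for table in generation_order:
    rows = []
    for pk in range(1, row_counts[table] + 1):
        row = {"id": pk}
        # Draw each FK from keys that already exist, so no child row is orphaned.
        for parent in sorted(dependencies[table]):
            row[f"{parent}_id"] = rng.choice(tables[parent])["id"]
        rows.append(row)
    tables[table] = rows

print(generation_order)
```

Because parents are always generated first, referential integrity holds by construction rather than being repaired afterwards.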

Glossary¶

Term	Definition
Domain	A pre-built industry schema (e.g., Retail, Healthcare). Spindle ships 13 domains.
Strategy	A column-level generation rule (e.g., `faker_name`, `weighted_enum`, `pareto_fk`). Spindle has 22 strategies.
Scale preset	A named size configuration (`fabric_demo`, `small`, `medium`, `large`, `warehouse`, `xlarge`).
Schema	A `.spindle.json` file defining tables, columns, strategies, relationships, and constraints.
Chaos Engine	A module that intentionally corrupts generated data to test pipeline resilience.
Validation Gate	A quality check (referential integrity, null constraints, schema conformance) with automatic quarantine.
Star Schema	A dimensional model (dimension + fact tables) transformed from Spindle's normalized 3NF output.
CDM	Common Data Model — a Microsoft standard folder structure for data interchange.
GSL	Generation Spec Language — a declarative YAML format that ties together schema, chaos, and output settings.
Composite Domain	Multi-domain generation with shared entities and cross-domain FK relationships.
DataProfiler	Profiles a real DataFrame or CSV to capture distributions, null rates, patterns, quantiles, and correlations.
SchemaBuilder	Converts a profiled dataset into a `.spindle.json` schema, auto-selecting the best strategy per column.
Empirical Strategy	A generation strategy that interpolates the source data's quantile fingerprint instead of fitting a parametric distribution. Used automatically when parametric fit is poor.
FidelityReport	Per-column statistical scoring (0–100) comparing generated data to source statistics. Measures distribution shape, null rates, cardinality, and patterns.
LakehouseProfiler	Fabric-native profiler that reads Delta tables over ABFSS and returns the same `DatasetProfile` as `DataProfiler`. Requires the `[fabric-inference]` extra.
GaussianCopula	A post-generation pass that reorders column values to enforce target Pearson correlations without changing any column's marginal distribution.
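
The Empirical Strategy entry describes interpolating a quantile fingerprint. A minimal sketch of that idea follows, using only the standard library. The function names, the 20-quantile resolution, and the log-normal stand-in column are assumptions made for illustration; Spindle's own implementation may differ.

```python
import random
import statistics


def quantile_fingerprint(values, n=20):
    """Summarise a numeric column as min, n-1 interior cut points, and max."""
    cuts = statistics.quantiles(values, n=n, method="inclusive")
    return [min(values), *cuts, max(values)]


def sample_empirical(fingerprint, size, seed=0):
    """Draw values by linear interpolation between adjacent stored quantiles."""
    rng = random.Random(seed)
    segments = len(fingerprint) - 1
    samples = []
    for _ in range(size):
        position = rng.random() * segments
        i = min(int(position), segments - 1)  # guard against float rounding at the top edge
        frac = position - i
        samples.append(fingerprint[i] + frac * (fingerprint[i + 1] - fingerprint[i]))
    return samples


# Stand-in for a skewed real column; a parametric normal fit would describe it poorly.
source_rng = random.Random(1)
source = [source_rng.lognormvariate(0.0, 1.0) for _ in range(5000)]

fingerprint = quantile_fingerprint(source)
synthetic = sample_empirical(fingerprint, size=5000, seed=2)

print(round(statistics.median(source), 2), round(statistics.median(synthetic), 2))
```

Only the 21 fingerprint values are needed at generation time, so the source rows themselves never have to leave the environment where they were profiled.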

Learning Paths¶

Path 1: "I just need test data fast"¶

60-Second Overview — see it work in one command
Quickstart — generate, inspect, export
CLI Cheatsheet — all 12 commands at a glance

Path 2: "I'm building a Fabric project"¶

Installation — install with Fabric extras
Quickstart — generate your first dataset
Fabric Lakehouse Guide — write to Delta tables
Fabric Notebooks Guide — run in Fabric notebooks
Star Schema Export — build dimensional models

Path 3: "I want to understand everything"¶

Installation — full setup with all extras
Quickstart — first dataset
Generation Strategies — every column-level strategy
Custom Schemas — build your own domain
Chaos Engine — inject data quality issues
Validation Gates — quality checks and quarantine
Streaming — real-time event emission
Methodology — how distributions are calibrated

Path 4: "I want to mirror real data statistically"¶

Installation — install the `[inference]` or `[fabric-inference]` extra
Schema Learning — profile real data, infer schema
Lakehouse Profiling — profile Fabric Delta tables directly
Fidelity Scoring — measure how closely synthetic data matches the real source

Next Step¶

Ready? Start with the 60-Second Overview or jump straight to the Quickstart.