Before You Start¶
New to synthetic data generation? This page covers what you need to know before diving into Spindle.
Prerequisites¶
- Python 3.10+ installed (python.org or via your package manager)
- pip for installing packages (included with Python)
- Basic familiarity with pandas DataFrames (Spindle outputs DataFrames)
- For Fabric workflows: access to a Microsoft Fabric workspace with a Lakehouse
No prior experience with synthetic data tools is required.
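
To confirm the prerequisites above before installing anything, a small check like the following can help. It is a minimal sketch that uses only the standard library. The helper name `check_prerequisites` is hypothetical and is not part of Spindle.

```python
import importlib.util
import sys


def check_prerequisites(min_version=(3, 10), packages=("pip", "pandas")):
    """Return a dict mapping each prerequisite to True (present) or False (missing)."""
    label = f"python>={min_version[0]}.{min_version[1]}"
    results = {label: sys.version_info >= min_version}
    for name in packages:
        # find_spec returns None when a top-level package cannot be imported.
        results[name] = importlib.util.find_spec(name) is not None
    return results


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'ok' if ok else 'MISSING'}")
```

Any line reported as MISSING points to the prerequisite to install first.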
What is Synthetic Data?¶
Synthetic data is artificially generated data that mimics the statistical properties and structure of real-world datasets — without containing any actual sensitive information. It's used for:
- Development and testing — populate dashboards, test pipelines, train ML models
- Demos and presentations — realistic data that tells a compelling story
- Performance benchmarking — generate datasets at any scale (thousands to hundreds of millions of rows)
- Data quality testing — intentionally inject chaos (nulls, duplicates, schema drift) to stress-test your pipeline
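
To make the last use case concrete, the sketch below shows the general idea of chaos injection on plain Python records: some values are blanked out and some rows are repeated. It is an illustration only, not Spindle's Chaos Engine. The function `inject_chaos` and its parameters are hypothetical names chosen for this example.

```python
import random


def inject_chaos(rows, null_rate=0.1, duplicate_rate=0.05, seed=0):
    """Return a corrupted copy of rows (a list of dicts); the input is not modified."""
    rng = random.Random(seed)  # seeded so the corruption is reproducible
    corrupted = []
    for row in rows:
        # Replace each value with None with probability null_rate.
        dirty = {
            key: (None if rng.random() < null_rate else value)
            for key, value in row.items()
        }
        corrupted.append(dirty)
        # Occasionally emit the same row twice to simulate duplicates.
        if rng.random() < duplicate_rate:
            corrupted.append(dict(dirty))
    return corrupted


clean = [{"order_id": i, "amount": round(i * 1.5, 2)} for i in range(1000)]
dirty = inject_chaos(clean)
null_cells = sum(value is None for row in dirty for value in row.values())
print(f"{len(dirty) - len(clean)} duplicate rows, {null_cells} null cells")
```

A pipeline under test should either reject or repair rows like these; the same approach applies to a pandas DataFrame.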
How Spindle is Different¶
Most synthetic data tools either generate random noise (Faker) or train ML models on existing data (SDV, MOSTLY AI). Spindle takes a third approach:
- Rule-based and transparent — every generation rule is a human-readable .spindle.json schema you can inspect and tweak
- Calibrated from real sources — all 13 domains sourced from published data (BLS, NAIC, NCES, NAR, FDIC, Federal Reserve, SEC, and 40+ more)
- Schema-aware — generates tables in dependency order, respects FK integrity, handles composite keys
- Fabric-native — targets every Microsoft Fabric data store (Lakehouse, Warehouse, SQL Database, Eventhouse, Semantic Models)
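
"Dependency order" in the schema-aware point means that parent tables are generated before the tables whose foreign keys reference them. The sketch below illustrates the idea with Python's standard `graphlib` module. The table names and the `generation_order` helper are hypothetical, and this is not a description of Spindle's internals.

```python
from graphlib import TopologicalSorter

# Hypothetical retail-style tables: each table maps to the parent tables
# that its foreign keys reference.
foreign_keys = {
    "customer": set(),
    "product": set(),
    "order": {"customer"},
    "order_line": {"order", "product"},
}


def generation_order(fk_parents):
    """Return table names ordered so every parent precedes its children."""
    return list(TopologicalSorter(fk_parents).static_order())


order = generation_order(foreign_keys)
print(order)
```

Generating in this order means every foreign key value can be drawn from rows that already exist, which is what keeps FK integrity intact.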
Glossary¶
| Term | Definition |
|---|---|
| Domain | A pre-built industry schema (e.g., Retail, Healthcare). Spindle ships 13 domains. |
| Strategy | A column-level generation rule (e.g., faker_name, weighted_enum, pareto_fk). Spindle has 22 strategies. |
| Scale preset | A named size configuration (fabric_demo, small, medium, large, warehouse, xlarge). |
| Schema | A .spindle.json file defining tables, columns, strategies, relationships, and constraints. |
| Chaos Engine | A module that intentionally corrupts generated data to test pipeline resilience. |
| Validation Gate | A quality check (referential integrity, null constraints, schema conformance) with automatic quarantine. |
| Star Schema | A dimensional model (dimension + fact tables) transformed from Spindle's normalized 3NF output. |
| CDM | Common Data Model — a Microsoft standard folder structure for data interchange. |
| GSL | Generation Spec Language — a declarative YAML format that ties together schema, chaos, and output settings. |
| Composite Domain | Multi-domain generation with shared entities and cross-domain FK relationships. |
| DataProfiler | Profiles a real DataFrame or CSV to capture distributions, null rates, patterns, quantiles, and correlations. |
| SchemaBuilder | Converts a profiled dataset into a .spindle.json schema, auto-selecting the best strategy per column. |
| Empirical Strategy | A generation strategy that interpolates the source data's quantile fingerprint instead of fitting a parametric distribution. Used automatically when parametric fit is poor. |
| FidelityReport | Per-column statistical scoring (0–100) comparing generated data to source statistics. Measures distribution shape, null rates, cardinality, and patterns. |
| LakehouseProfiler | Fabric-native profiler that reads Delta tables over ABFSS and returns the same DatasetProfile as DataProfiler. Requires the [fabric-inference] extra. |
| GaussianCopula | A post-generation pass that reorders column values to enforce target Pearson correlations without changing any column's marginal distribution. |
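
As an illustration of the Empirical Strategy entry above, the following sketch draws values by interpolating between stored quantiles (inverse-transform sampling). It shows the general technique under stated assumptions and is not Spindle's implementation. The function `sample_from_quantiles` and the quantile values are hypothetical.

```python
import bisect
import random


def sample_from_quantiles(probs, values, n, seed=0):
    """Draw n samples by linear interpolation between stored quantiles.

    probs must be strictly increasing from 0.0 to 1.0; values[i] is the
    quantile of the source data at probability probs[i].
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        u = rng.random()  # uniform draw in [0, 1)
        # Locate the quantile segment [probs[i - 1], probs[i]] containing u.
        i = bisect.bisect_right(probs, u)
        i = min(max(i, 1), len(probs) - 1)
        p0, p1 = probs[i - 1], probs[i]
        v0, v1 = values[i - 1], values[i]
        samples.append(v0 + (v1 - v0) * (u - p0) / (p1 - p0))
    return samples


# Hypothetical quantile fingerprint of a right-skewed "order amount" column.
probs = [0.0, 0.25, 0.5, 0.75, 1.0]
values = [5.0, 20.0, 35.0, 80.0, 400.0]
samples = sample_from_quantiles(probs, values, n=10_000)
```

Because the samples follow the stored quantiles directly, a skewed or multi-modal column keeps its shape without needing a parametric distribution that fits it.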
Learning Paths¶
Path 1: "I just need test data fast"¶
- 60-Second Overview — see it work in one command
- Quickstart — generate, inspect, export
- CLI Cheatsheet — all 12 commands at a glance
Path 2: "I'm building a Fabric project"¶
- Installation — install with Fabric extras
- Quickstart — generate your first dataset
- Fabric Lakehouse Guide — write to Delta tables
- Fabric Notebooks Guide — run in Fabric notebooks
- Star Schema Export — build dimensional models
Path 3: "I want to mirror real data statistically"¶
- Installation — install the [inference] or [fabric-inference] extra
- Schema Learning — profile real data, infer schema
- Lakehouse Profiling — profile Fabric Delta tables directly
- Fidelity Scoring — measure how closely synthetic matches real
Path 4: "I want to understand everything"¶
- Installation — full setup with all extras
- Quickstart — first dataset
- Generation Strategies — all 22 column-level strategies
- Custom Schemas — build your own domain
- Chaos Engine — inject data quality issues
- Validation Gates — quality checks and quarantine
- Streaming — real-time event emission
- Methodology — how distributions are calibrated
Next Step¶
Ready? Start with the 60-Second Overview or jump straight to the Quickstart.