Skip to content

Generation Strategies

Spindle uses 22 column-level generation strategies. Each column in a .spindle.json schema specifies a generator with a strategy name and strategy-specific parameters.

Quick Reference

Strategy Purpose Example Use
sequence Auto-incrementing integers Primary keys
uuid UUID v4 strings Alternative primary keys
faker Realistic fake data via Faker Names, emails, phone numbers
weighted_enum Weighted random selection Status codes, categories
distribution Statistical distributions Prices, ages, quantities
empirical Quantile-fingerprint interpolation Numeric columns when parametric fit is poor
temporal Time-aware dates/timestamps Order dates with seasonality
formula Computed from other columns quantity * unit_price
derived Transformed from another column return_date = order_date + N days
correlated Mathematically related cost = unit_price * 0.30-0.70
conditional Different logic per row Discount only if promotion exists
lifecycle Phase-based status values active / introduced / discontinued
foreign_key FK references with distribution Pareto, Zipf, or uniform FK assignment
lookup Copy value from parent table Line item price from product
reference_data Pick from bundled datasets ZIP codes, ICD-10 codes
pattern Formatted strings with tokens SKU-{seq:6}, Store #{seq:04d}
computed Aggregated from child rows order_total = sum(line_totals)
self_referencing FK to same table Category hierarchy (parent_id)
self_ref_field Read hierarchy metadata Level number from self-referencing
first_per_parent Boolean flag for first child Primary address marker
record_sample Sample complete records Anchor for correlated reference data
record_field Read field from sampled record city/state/zip from sampled address

Primary Key Strategies

sequence

Auto-incrementing integer sequences.

{
  "strategy": "sequence",
  "start": 1,
  "step": 1
}
Param Type Default Description
start int 1 Starting value
step int 1 Increment per row

uuid

UUID v4 strings. No parameters.

{
  "strategy": "uuid"
}

Data Generation Strategies

faker

Generate realistic fake data using the Faker library.

{
  "strategy": "faker",
  "provider": "first_name"
}
Param Type Default Description
provider str Faker provider name (e.g., first_name, email, city, phone_number)
args dict {} Arguments passed to the Faker provider

The locale is inherited from model.locale in the schema (default en_US).

Common providers: first_name, last_name, name, email, phone_number, street_address, city, state_abbr, zipcode, company, url, ssn, sentence, text, user_name, ipv4.

weighted_enum

Pick values from a weighted set.

{
  "strategy": "weighted_enum",
  "values": {
    "completed": 0.77,
    "shipped": 0.08,
    "processing": 0.02,
    "cancelled": 0.04,
    "returned": 0.09
  }
}
Param Type Description
values dict {value: weight} — weights are normalized automatically

Note

If all keys are numeric strings (e.g., "0.0", "10.0"), the strategy returns a float64 array instead of strings.

distribution

Statistical distributions powered by NumPy.

{
  "strategy": "distribution",
  "distribution": "log_normal",
  "params": {"mean": 3.5, "sigma": 1.2, "min": 0.99, "max": 2999.99}
}
Param Type Description
distribution str Distribution name (see table below)
params dict Distribution-specific parameters

Available distributions:

Distribution Params Use Case
uniform min, max Equal probability range
normal mean, std_dev, min, max Bell curve (ages, sizes)
log_normal mean, sigma, min, max Right-skewed (prices, amounts)
pareto alpha, min, max 80/20 distributions (order frequency)
zipf alpha Power law (product popularity)
geometric p, min, max "Tries until success" (quantities)
poisson lambda Count events per interval
bernoulli probability Yes/no (returns, churn)

empirical

Quantile-fingerprint interpolation for numeric columns where no standard distribution fits well. Used automatically by SchemaBuilder when the KS test fit score falls below fit_threshold (default 0.80). You can also write it directly in a schema.

{
  "strategy": "empirical",
  "quantiles": {
    "p1": 1.25,
    "p5": 4.99,
    "p10": 9.99,
    "p25": 24.99,
    "p50": 49.99,
    "p75": 99.99,
    "p90": 199.99,
    "p95": 349.99,
    "p99": 799.99
  },
  "interpolation": "linear"
}
Param Type Default Description
quantiles dict Required. Keys must be p1, p5, p10, p25, p50, p75, p90, p95, p99
interpolation str "linear" "linear" (NumPy, no extra deps) or "cubic" (scipy — falls back to linear if scipy absent)

Values are generated by drawing uniform samples and mapping them through the quantile curve. This preserves the original distribution's shape exactly at the fingerprint points and interpolates between them.

Note

SchemaBuilder.build() automatically emits the empirical strategy for numeric columns when profile.fit_score < fit_threshold. The quantile values come from the profiler's ColumnProfile.quantiles field.

temporal

Time-aware date and timestamp generation with optional seasonality.

{
  "strategy": "temporal",
  "pattern": "uniform",
  "range_ref": "model.date_range"
}
{
  "strategy": "temporal",
  "pattern": "seasonal",
  "range_ref": "model.date_range",
  "profiles": {
    "month": {"Jan": 0.06, "Feb": 0.06, "Nov": 0.11, "Dec": 0.13},
    "day_of_week": {"Mon": 0.13, "Fri": 0.16, "Sat": 0.17},
    "hour_of_day": {"distribution": "bimodal", "peaks": [11, 20], "std_dev": 2}
  }
}
Param Type Description
pattern str uniform or seasonal
range dict {start, end} date strings
range_ref str Reference to model-level date range (e.g., model.date_range)
profiles dict Monthly, day-of-week, and hour-of-day weight profiles

lifecycle

Assign phase labels based on weighted probabilities.

{
  "strategy": "lifecycle",
  "phases": {
    "introduced": 0.10,
    "active": 0.75,
    "discontinued": 0.15
  }
}
Param Type Description
phases dict {phase_name: weight} — same as weighted_enum but semantically for lifecycle states

Relationship Strategies

foreign_key

Reference parent table primary keys with configurable distribution.

{
  "strategy": "foreign_key",
  "ref": "customer.customer_id",
  "distribution": "pareto",
  "params": {"alpha": 1.16}
}
Param Type Default Description
ref str table.column reference to parent PK
distribution str uniform uniform, pareto, or zipf
params dict {} Distribution parameters (e.g., alpha)
constrained_by str Scope FK to match another FK (e.g., address must belong to same customer)
sample_rate float Sample only a fraction of parent PKs
filter str SQL-like filter on parent rows (e.g., status = 'completed')

Tip

Use distribution: "pareto" with alpha: 1.16 for the classic 80/20 rule — 20% of customers generate 80% of orders.

lookup

Copy a value from a parent table via a foreign key join.

{
  "strategy": "lookup",
  "source_table": "product",
  "source_column": "unit_price",
  "via": "product_id"
}
Param Type Description
source_table str Parent table name
source_column str Column to copy from parent
via str FK column in current table that links to parent

self_referencing

Create a hierarchy within a single table (e.g., category tree, org chart).

{
  "strategy": "self_referencing",
  "pk_column": "category_id",
  "root_count": 8
}
Param Type Description
pk_column str Primary key column of the same table
root_count int Number of root-level rows (NULL parent)
levels int Number of hierarchy levels (from relationship def)

self_ref_field

Read metadata stashed by a self_referencing strategy (e.g., the hierarchy level).

{
  "strategy": "self_ref_field",
  "field": "level"
}

first_per_parent

Mark the first child row per parent group as True, rest as False.

{
  "strategy": "first_per_parent",
  "parent_column": "customer_id",
  "default": true
}

Computed & Derived Strategies

formula

Compute a column value from other columns using a math expression.

{
  "strategy": "formula",
  "expression": "quantity * unit_price * (1 - discount_percent / 100)"
}
Param Type Description
expression str Python math expression referencing other column names

The expression is evaluated with safe builtins only (no arbitrary code execution).

derived

Derive a value from another column with an optional transformation.

{
  "strategy": "derived",
  "source": "start_date",
  "rule": "add_days",
  "params": {"distribution": "uniform", "min": 1, "max": 30}
}
{
  "strategy": "derived",
  "source": "order.order_date",
  "via": "order_id",
  "rule": "add_days",
  "params": {"distribution": "log_normal", "mean": 2.0, "sigma": 0.8, "min": 1, "max": 90}
}
Param Type Description
source str Source column (or table.column for cross-table)
via str FK column for cross-table lookup
rule str Transformation: copy, add_days
params dict Rule-specific parameters

correlated

Generate a value mathematically related to another column.

{
  "strategy": "correlated",
  "source_column": "unit_price",
  "rule": "multiply",
  "params": {"factor_min": 0.30, "factor_max": 0.70}
}
Param Type Description
source_column str Column to correlate with
rule str multiply, add, subtract
params dict factor_min/factor_max (multiply) or offset_min/offset_max (add/subtract)

conditional

Generate different values depending on a row-level condition.

{
  "strategy": "conditional",
  "condition": "promo_id IS NOT NULL",
  "true_generator": {"strategy": "lookup", "source_table": "promotion", "source_column": "discount_value", "via": "order.promotion_id"},
  "false_generator": {"fixed": 0.00}
}
Param Type Description
condition str IS NOT NULL, IS NULL, == value, != value
true_generator dict Generator config for rows matching condition
false_generator dict Generator config for rows not matching condition

Inline generators can be a full strategy dict or {"fixed": value}.

computed

Placeholder for post-generation aggregation from child tables. Backfilled after all tables are generated.

{
  "strategy": "computed",
  "rule": "sum_children",
  "child_table": "order_line",
  "child_column": "line_total"
}
Param Type Description
rule str sum_children, count_children, avg_children, min_children, max_children, lookup_parent
child_table str Child table to aggregate from
child_column str Column to aggregate
parent_table str For lookup_parent: parent table
via str FK column linking child to parent

Reference Data Strategies

reference_data

Pick values from bundled JSON datasets shipped with each domain.

{
  "strategy": "reference_data",
  "dataset": "retail_categories"
}
Param Type Description
dataset str Name of the reference dataset (domain-specific)

record_sample

Sample complete records from a reference dataset. This is the anchor strategy — it picks one record per row and stashes all fields for use by record_field.

{
  "strategy": "record_sample",
  "dataset": "us_zip_locations",
  "field": "city"
}
Param Type Description
dataset str Name of a JSON dataset containing arrays of objects
field str Which field to use as this column's value

record_field

Read a field from records already sampled by record_sample. Must appear after the anchor column in schema order.

{
  "strategy": "record_field",
  "dataset": "us_zip_locations",
  "field": "state"
}

This is how Spindle generates correlated multi-column reference data (e.g., city + state + zip + lat + lng all from the same real US location).


String Formatting

pattern

Generate formatted strings with token substitution.

{
  "strategy": "pattern",
  "format": "SKU-{category_code}-{seq:6}"
}
Token Description
{seq:N} Zero-padded sequence number (N digits)
{random:N} Random alphanumeric string (N chars)
{column_name} Value from another column in the same row

See Also