inference

sqllocks_spindle.inference

Spindle inference engine — profile existing data and infer schemas.

Provides DataProfiler for analysing DataFrames and SchemaBuilder for converting profiles into ready-to-use SpindleSchema objects. Also includes FidelityComparator for comparing real vs synthetic data.

Classes

DataMasker

Replace PII in real data with synthetic values preserving distributions.

Methods:
mask(tables, config=None)

Mask PII columns across all tables.

Parameters

tables: Mapping of table name to DataFrame.
config: Optional masking configuration. When omitted, the default configuration is used.

Returns

MaskResult with masked DataFrames and statistics.

MaskConfig dataclass

Configuration for data masking.

MaskResult dataclass

Result of masking operation.

Methods:
summary()

Return a human-readable summary of the masking result.

ColumnFidelity dataclass

Fidelity metrics for a single column.

FidelityComparator

Compare real and synthetic datasets to produce a fidelity report.

Methods:
compare(real, synthetic)

Compare real vs synthetic data across all shared tables.

FidelityReport dataclass

Complete fidelity report comparing real vs synthetic data.

Methods:
summary()

Generate a plain-text summary.

to_markdown()

Generate markdown report.

failing_columns(threshold=85.0)

Return (table, column, score) tuples for columns below threshold.

Parameters:

threshold (float, default 85.0): Score threshold (0-100). Columns with score < threshold are included.

Returns:

list[tuple[str, str, float]]: List of (table_name, column_name, score) tuples, sorted by score (lowest first).
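The filtering rule described above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the `scores` mapping and the helper name `failing_columns_sketch` are hypothetical, and the data is made up for the example.

```python
# Hypothetical per-column scores keyed by (table, column); not real output.
scores = {
    ("orders", "amount"): 91.2,
    ("orders", "status"): 78.5,
    ("customers", "email"): 64.0,
}


def failing_columns_sketch(scores, threshold=85.0):
    """Return (table, column, score) for scores strictly below threshold."""
    failing = [(t, c, s) for (t, c), s in scores.items() if s < threshold]
    # Lowest score first, matching the documented ordering.
    return sorted(failing, key=lambda row: row[2])


result = failing_columns_sketch(scores)
print(result)
```

A column scoring exactly at the threshold is not reported, because the documented comparison is strict (score < threshold).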

to_dict()

Return a JSON-serializable dict representation.

to_dataframe()

Return a flat pandas DataFrame with one row per column.

to_html(title='Spindle Fidelity Report')

Render fidelity report as a self-contained HTML page.

Uses inline CSS — no external dependencies. Score bands: green ≥ 85, amber 70-84, red < 70.
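The score bands above amount to a simple threshold mapping. The sketch below assumes that fractional scores between 84 and 85 fall into the amber band (the reference only lists integer ranges), and the function name `score_band` is hypothetical.

```python
def score_band(score: float) -> str:
    """Map a 0-100 fidelity score to the documented colour band.

    Assumption: anything below 85 and at least 70 is amber, so 84.9 is amber.
    """
    if score >= 85:
        return "green"
    if score >= 70:
        return "amber"
    return "red"


bands = [score_band(s) for s in (92.0, 85.0, 84.9, 70.0, 69.9)]
print(bands)
```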

score(real, synthetic, table_name='table', threshold=85.0) classmethod

Compare two DataFrames and return a FidelityReport.

Convenience classmethod for single-table comparison.

Parameters:

real (pd.DataFrame, required): Real data DataFrame.
synthetic (pd.DataFrame, required): Synthetic data DataFrame to compare.
table_name (str, default 'table'): Name for the table in the report.
threshold (float, default 85.0): Score threshold for failing_columns().

Returns:

FidelityReport: A FidelityReport comparing the two DataFrames.

TableFidelity dataclass

Fidelity metrics for a table.

ColumnProfile dataclass

Statistical profile of a single column.

DataProfiler

Analyse one or more DataFrames and produce profiles.

Methods:
profile_dataframe(df, table_name='table')

Profile a single DataFrame.

profile_dataset(tables)

Profile a dict of DataFrames and detect cross-table relationships.

profile(df, table_name='table')

Alias for profile_dataframe(). Profile a single DataFrame.

from_csv(path, table_name=None, sample_rows=None, **kwargs) classmethod

Profile a CSV file.

Parameters:

path (str | Path, required): Path to the CSV file.
table_name (str | None, default None): Name for the table profile. Defaults to the filename stem.
sample_rows (int | None, default None): If set, sample this many rows before profiling.
**kwargs (default {}): Passed to the DataProfiler constructor (fit_threshold, top_n_values, etc.).

DatasetProfile dataclass

Profile of a multi-table dataset.

TableProfile dataclass

Profile of a single table (DataFrame).

ExportedProfile dataclass

A portable profile that can be imported into any domain.

Attributes:

name (str): Profile identifier (e.g. "default", "high_volume").
description (str): Human-readable description of what this profile represents.
source_domain (str): Name of the domain this profile was exported from (or "inferred" when created via ProfileIO.from_dataframe).
distributions (dict[str, dict[str, float]]): Mapping of "table.column" keys to value→weight dicts.
ratios (dict[str, float]): Mapping of ratio names to float multipliers.
metadata (dict[str, Any]): Arbitrary extra information (row counts, column types, etc.).

ProfileIO

Export, import, and list domain profiles.

All public methods are stateless — no configuration is stored on the instance. Instantiate with ProfileIO() and call methods directly.

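Example::

    from pathlib import Path
    from sqllocks_spindle.inference import ProfileIO
    # RetailDomain and HealthcareDomain are domain classes, assumed already imported.

    io = ProfileIO()
    io.export_profile(RetailDomain(), Path("retail_profile.json"))
    io.import_profile(Path("retail_profile.json"), HealthcareDomain(), save_as="from_retail")
    io.list_profiles(RetailDomain())

Methods: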
export_profile(domain, output_path, profile_name='default')

Export a domain's active profile to a standalone JSON file.

Parameters:

domain (Any, required): A sqllocks_spindle.domains.base.Domain instance whose _profile dict will be serialised.
output_path (str | Path, required): Destination file path (created if it does not exist).
profile_name (str, default 'default'): Label stored in the exported metadata.

Returns:

Path: The resolved Path the profile was written to.

import_profile(profile_path, target_domain, save_as=None)

Import an exported profile into a target domain's profiles/ directory.

The imported file is converted to the standard domain profile format (i.e. metadata is stripped; only name, description, distributions, and ratios are kept).

Parameters:

profile_path (str | Path, required): Path to an exported profile JSON file.
target_domain (Any, required): The domain instance to import into.
save_as (str | None, default None): Override the profile name (and filename). When None, the name is taken from the file's "name" field.

Returns:

str: The name the profile was saved as.

list_profiles(domain)

List all profiles available for a domain.

Parameters:

domain (Any, required): A sqllocks_spindle.domains.base.Domain instance.

Returns:

list[dict[str, str | int]]: A list of dicts with keys name, description, distributions (count), and ratios (count).

from_dataframe(df, table_name='table', name='inferred')

Create a profile by inferring distributions from a DataFrame.

Categorical columns (object dtype or low cardinality) are converted into normalised distribution weights. High-cardinality columns are skipped.

Parameters:

df (DataFrame, required): The source DataFrame.
table_name (str, default 'table'): Prefix for distribution keys ("table_name.column").
name (str, default 'inferred'): Name to assign to the resulting profile.

Returns:

ExportedProfile: An ExportedProfile ready to be serialised or imported.
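The inference step described above (count values, normalise to weights, skip high-cardinality columns) can be sketched without pandas. This is an illustrative approximation only: it works on a list of row dicts rather than a DataFrame, and `infer_distributions` and its `max_cardinality` cutoff are hypothetical names, since this reference does not state the library's actual cardinality rule.

```python
from collections import Counter


def infer_distributions(rows, table_name="table", max_cardinality=50):
    """Build {"table.column": {value: weight}} from a list of row dicts.

    max_cardinality is a hypothetical cutoff for "low cardinality".
    """
    distributions = {}
    columns = rows[0].keys() if rows else []
    for col in columns:
        counts = Counter(row[col] for row in rows if row[col] is not None)
        if not counts or len(counts) > max_cardinality:
            continue  # skip empty or high-cardinality columns
        total = sum(counts.values())
        # Normalise counts so the weights of each column sum to 1.
        distributions[f"{table_name}.{col}"] = {
            str(value): n / total for value, n in counts.items()
        }
    return distributions


rows = [
    {"status": "open", "order_id": 1},
    {"status": "open", "order_id": 2},
    {"status": "closed", "order_id": 3},
    {"status": "open", "order_id": 4},
]
dist = infer_distributions(rows, table_name="orders", max_cardinality=3)
print(dist)
```

Here `order_id` has four distinct values, which exceeds the illustrative cutoff of three, so only `orders.status` is kept.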

ProfileStore

Persist and retrieve a SafeProfile to/from a JSON file.

All methods are stateless — instantiate with ProfileStore() and call directly, or use the classmethods. This is the only supported public on-disk entrypoint for a SafeProfile (ADR-001 / ADR-007).

Methods:
save(profile, path) classmethod

Write profile to path as JSON (via to_safe_dict).

Parameters:

profile (SafeProfile, required): The SafeProfile to persist. By construction it carries no raw-bearing fields (ADR-007), so the on-disk JSON is safe-by-construction.
path (str | Path, required): Destination file path. Parent directories are created.

Returns:

Path: The resolved Path the profile was written to.

load(path) classmethod

Read a SafeProfile from a JSON file written by save().

A file whose schema_version is not the version this code writes (e.g. a legacy artifact with no/old version, or a future version) is loaded read-only with a warning — it never crashes. The returned object is degraded-but-usable: the keys present are reconstructed, and the loaded schema_version is preserved on the returned object so a caller can detect the mismatch.

Parameters:

path (str | Path, required): Path to a JSON file previously written by save().

Returns:

SafeProfile: A reconstructed SafeProfile.
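The tolerant-load behaviour described for load (warn on a schema_version mismatch, never crash, keep the loaded version visible to the caller) can be sketched as follows. Everything here is an assumption for illustration: `CURRENT_SCHEMA_VERSION`, `load_tolerant`, and the returned dict shape are hypothetical and do not reflect the real SafeProfile class.

```python
import json
import os
import tempfile
import warnings

CURRENT_SCHEMA_VERSION = "1"  # hypothetical; the real value is not documented here


def load_tolerant(path):
    """Load a profile JSON, warning instead of failing on a version mismatch."""
    with open(path, encoding="utf-8") as fh:
        data = json.load(fh)
    loaded_version = data.get("schema_version")  # None for legacy files
    mismatch = loaded_version != CURRENT_SCHEMA_VERSION
    if mismatch:
        warnings.warn(
            f"schema_version {loaded_version!r} differs from "
            f"{CURRENT_SCHEMA_VERSION!r}; loading read-only"
        )
    # Keep the version found on disk so callers can detect the mismatch.
    return {
        "schema_version": loaded_version,
        "tables": data.get("tables", {}),
        "read_only": mismatch,
    }


with tempfile.TemporaryDirectory() as tmp:
    legacy = os.path.join(tmp, "legacy.json")
    with open(legacy, "w", encoding="utf-8") as fh:
        json.dump({"tables": {"orders": {}}}, fh)  # no schema_version key
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        profile = load_tolerant(legacy)

print(profile["schema_version"], profile["read_only"], len(caught))
```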

SafeColumnProfile dataclass

Safe, persisted statistic set for a single column.

Carries ONLY non-raw-bearing statistics. Notably absent (by construction): min_value, max_value, enum_values, value_counts_ext.

Numeric extremes live in bounds (winsorized quantile bounds, ADR-002, populated by STORY-006). Categorical mass lives in categorical_weights (post-k-anon suppression, ADR-003, populated by STORY-007).

Methods:
from_column_profile(col, config=None, row_count=None) classmethod

Map a rich ColumnProfile to a SafeColumnProfile (STORY-002).

Selects ONLY the safe-and-sufficient statistic set. Reads the REAL attribute names on ColumnProfile (min_value/max_value are never read — ADR-002; bounds derive from quantiles). This fixes the B2 attribute-mismatch bug class where the legacy registry read non-existent .min/.max/.top_values.

Disclosure-control transforms are applied via hooks that are STUBS in this story and become real in their owning stories:

  • bounds — winsorized quantile bounds (STORY-006 / ADR-002). Stub here: {"lo": p1, "hi": p99} taken from quantiles if present.
  • categorical_weights — k-anon suppression (STORY-007 / ADR-003): any value with count < k folded into a single __OTHER__ bucket. count is derived from the seeded proportion x row_count (the rich enum_values / value_counts_ext carry value->proportion, not raw counts). row_count is threaded in by the table mapper.
  • value-pattern PII gate (STORY-008 / ADR-004). When a column's detected pattern is a PII class (PII_PATTERNS) OR its cardinality is approximately the row count (high-card free-text backstop), the column persists pattern + length_dist ONLY: categorical_weights are dropped and no values are carried. Detection is name-independent (catches PII in notes / c_47). This is DEFENSE-IN-DEPTH, NOT a completeness guarantee (ADR-004 / ADR-011).
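The k-anon suppression hook described in the list above can be sketched as follows: estimated counts are derived from proportion x row_count, and any value below k is folded into a single `__OTHER__` bucket. This is a minimal sketch under stated assumptions, not the library's code; `suppress_rare` is a hypothetical name and k=5 is an illustrative default.

```python
def suppress_rare(proportions, row_count, k=5, other="__OTHER__"):
    """Fold values whose estimated count is below k into one bucket.

    proportions maps value -> proportion; counts are estimated as
    proportion * row_count, mirroring the description above.
    """
    kept, folded_mass, folded = {}, 0.0, 0
    for value, p in proportions.items():
        if p * row_count < k:
            folded_mass += p  # keep the probability mass, drop the label
            folded += 1
        else:
            kept[value] = p
    if folded:
        kept[other] = kept.get(other, 0.0) + folded_mass
    return kept, folded


weights, dropped = suppress_rare(
    {"A": 0.5, "B": 0.25, "C": 0.125, "D": 0.125}, row_count=32, k=5
)
print(weights, dropped)
```

With 32 rows, C and D each correspond to about 4 rows, below k=5, so both are folded and their combined mass moves to `__OTHER__`. The total weight is unchanged.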
to_safe_dict()

Serialize to a plain dict. Deterministic key order for byte-stability.
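Why deterministic key order matters for byte-stability: two dicts with the same content but different insertion order serialise to different bytes unless the order is fixed. The sketch below uses `json.dumps(..., sort_keys=True)` as one way to get this. It is an assumption that sorting is an acceptable stand-in; the library may instead rely on a fixed insertion order. `to_stable_json` is a hypothetical helper.

```python
import json


def to_stable_json(obj) -> bytes:
    """Serialise with sorted keys and fixed separators for stable bytes."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")


a = {"name": "amount", "bounds": {"lo": 1.0, "hi": 99.0}}
b = {"bounds": {"hi": 99.0, "lo": 1.0}, "name": "amount"}

same = to_stable_json(a) == to_stable_json(b)
print(same)
```

Byte-stable output keeps diffs and content hashes of persisted profiles meaningful across runs.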

SafeProfile dataclass

The canonical, versioned, on-disk safe profile (ADR-001).

Top-level transport object. Carries schema_version and an embedded redaction_manifest (populated by STORY-009 — present but empty here).

Methods:
from_dataset_profile(dataset_profile, config=None, unsafe_full_fidelity=False) classmethod

Map a rich DatasetProfile to a SafeProfile (STORY-002 / ADR-001).

Builds one SafeTableProfile per table and one SafeColumnProfile per column, selecting ONLY the safe-and-sufficient statistic set. The rich profile is the source; the returned SafeProfile is the safe transport.

The mapper reads only REAL attribute names on the rich dataclasses (min_value/max_value are never read — bounds derive from quantiles per ADR-002), fixing the B2 attribute-mismatch bug class.

config is an optional per-profile/per-column settings dict threaded to the disclosure-control hooks (winsorization percentiles, k-anon k, PII gate).

Safe-by-default (ADR-005 / STORY-009)

The scrub — winsorized bounds (ADR-002), k-anon __OTHER__ suppression (ADR-003), and the value-pattern PII gate (ADR-004) — runs by DEFAULT. The safe path is the path of least resistance.

unsafe_full_fidelity=True is the explicit, single opt-out. It disables the disclosure-control transforms (k-anon suppression and the PII gate are turned off so full-fidelity categorical weights / values survive) and stamps unsafe=True on the returned profile. Such an artifact is rejected by validate --safe (STORY-010). It is the ONLY way to persist un-scrubbed statistics.

Every returned profile carries an accurate redaction_manifest (built from the rich source vs. the scrubbed safe columns; see build_redaction_manifest).

to_safe_dict()

Serialize to a plain dict with deterministic key order.

SafeTableProfile dataclass

Safe, persisted profile for a single table.

Methods:
from_table_profile(table, config=None) classmethod

Map a rich TableProfile to a SafeTableProfile (STORY-002).

One SafeColumnProfile per column. Carries the table-level correlation_matrix, primary_key and advisory detected_fks (names/overlap only — no raw values). Column order is preserved.

to_safe_dict()

Serialize to a plain dict. Columns serialized in declared order.

SafeProfileAdapter

Adapt a loaded SafeProfile to a generatable SpindleSchema.

Stateless; instantiate and call to_schema(), or use the module-level safe_profile_to_schema() convenience wrapper.

Methods:
to_schema(profile, domain_name='safe_inferred')

Build a SpindleSchema from a loaded SafeProfile.

The returned schema is ready to pass to Spindle().generate(schema=..., fidelity_profile=profile).

Raw fields are never consulted (there are none on the safe model); numeric clipping is driven by the winsorized bounds (ADR-002).

SafeProfileValidator

Structural, fail-closed static leak scanner over a serialized artifact.

Usage::

    import sys
    from sqllocks_spindle.inference import SafeProfileValidator
    result = SafeProfileValidator().validate_file("profile.json")
    sys.exit(result.exit_code)

Methods:
validate_file(path)

Load and scan a JSON artifact file. Fail-closed on any read error.

validate_data(data, path='<data>')

Scan an already-parsed artifact. Fail-closed on missing markers.

ValidationFinding dataclass

A single leak finding, with the JSON path that triggered it.

ValidationResult dataclass

Outcome of a scan. is_clean is True only when there are zero findings.

Attributes
exit_code property

0 only on a proven-clean artifact; 1 on any finding.

SchemaBuilder

Convert a DatasetProfile into a SpindleSchema.

Methods:
build(profile, domain_name='inferred', fit_threshold=0.8, correlation_threshold=0.5, include_anomaly_registry=False)

Build a complete SpindleSchema from a dataset profile.

LakehouseProfiler

Profile Fabric Lakehouse Delta tables and return TableProfile objects.

Parameters:

workspace_id (str, required): Fabric workspace GUID.
lakehouse_id (str, required): Fabric lakehouse GUID.
token_provider (Any | None, default None): A callable returning an Azure access token string. Defaults to DefaultAzureCredential when azure-identity is installed.
default_sample_rows (int | None, default 100000): Row limit for profiling. Pass None to scan the entire table.
Methods:
profile_table(table_name, sample_rows='default')

Profile a single Delta table.

profile_all(sample_rows='default')

Profile all tables in the lakehouse.

detect_foreign_keys(table_names=None, overlap_threshold=0.9, sample_rows='default', full_scan=False)

Sampled cross-table FK detection (advisory). ADR-009 / STORY-016.

Reads each table's columns (sampled by default) and runs the proven DataProfiler._detect_foreign_keys_advisory core (naming *_id plus value-overlap >= overlap_threshold) across every table pair. Detected FKs are advisory and reported with the measured overlap; a declared star_map / RelationshipDef remains authoritative and overrides (resolved by the caller, not here).
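The two-part rule above (a `*_id` naming match plus value overlap >= overlap_threshold) can be sketched on plain Python data. This is not the `_detect_foreign_keys_advisory` implementation. It assumes, for simplicity, that the parent column has the same name as the child column, and `detect_fks` is a hypothetical helper.

```python
def detect_fks(tables, overlap_threshold=0.9):
    """tables: {table: {column: list of values}}. Returns an advisory FK map."""
    found = {}
    for child, child_cols in tables.items():
        for col, values in child_cols.items():
            if not col.endswith("_id"):
                continue  # naming rule: only *_id columns are candidates
            child_vals = {v for v in values if v is not None}
            if not child_vals:
                continue
            for parent, parent_cols in tables.items():
                # Simplifying assumption: parent column shares the child's name.
                if parent == child or col not in parent_cols:
                    continue
                # Fraction of distinct child values that exist in the parent.
                overlap = len(child_vals & set(parent_cols[col])) / len(child_vals)
                if overlap >= overlap_threshold:
                    found.setdefault(child, {})[col] = {
                        "parent_table": parent,
                        "overlap": overlap,
                        "advisory": True,
                        "full_scan": False,
                    }
    return found


tables = {
    "customers": {"customer_id": [1, 2, 3, 4], "name": ["a", "b", "c", "d"]},
    "orders": {"order_no": [10, 11, 12, 13], "customer_id": [1, 2, 2, 3]},
}
fks = detect_fks(tables)
print(fks)
```

The overlap is measured from child to parent, so `orders.customer_id` (every value found in `customers`) qualifies, while the reverse direction reaches only 0.75 and is not reported.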

Parameters:

table_names (list[str] | None, default None): Tables to scan. Defaults to all tables in the lakehouse.
overlap_threshold (float, default 0.9): Minimum child-to-parent value overlap to report an FK (configurable per ADR-009).
sample_rows (int | None | str, default 'default'): Per-table row cap used when reading key columns. "default" uses self.default_sample_rows; None reads the full table. Ignored when full_scan=True.
full_scan (bool, default False): Read entire tables (no sampling) to confirm a sampled result (ADR-009 full-scan option).

Returns:

dict[str, dict[str, dict[str, Any]]]: {child_table: {col_name: {"parent_table": str, "overlap": float, "advisory": True, "full_scan": bool}}} for every detected FK.

reconcile_declared_foreign_keys(detected, declared) staticmethod

Declared FKs override detected advisory FKs (ADR-009 / STORY-017).

A declared star_map / RelationshipDef is AUTHORITATIVE: where a declaration exists for a (child_table, child_col) it wins over any detected FK, even a high-overlap one. Detected FKs that a declaration overrode are REPORTED (not silently dropped) for transparency.

Parameters:

detected (dict[str, dict[str, dict[str, Any]]], required): The output of detect_foreign_keys().
declared (Any, required): Iterable of (child_table, child_col, parent_table) tuples, or dicts with those keys.

Returns:

dict[str, Any]: {"foreign_keys": <resolved map>, "overridden": [<reports>]}. Resolved declared entries carry advisory=False, declared=True.

BootstrapMode

Generate synthetic data by bootstrapping (sampling with replacement) from real data.

The simplest form of synthetic generation: it reproduces the source distributions up to sampling noise, but does not generalize beyond the source data, since every output row is a copy of a real row. Useful as a baseline.

Methods:
generate(source, n_rows=None, table_name='table', seed=42)

Generate synthetic DataFrame by bootstrapping source.

Parameters:

source (DataFrame, required): Real data to bootstrap from.
n_rows (int | None, default None): Number of rows to generate. When None, matches the source row count.
table_name (str, default 'table'): Name for result metadata.
seed (int, default 42): Random seed.

Returns:

tuple[DataFrame, BootstrapResult]: (synthetic_df, BootstrapResult).
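Bootstrapping as described above is sampling rows with replacement under a fixed seed. The sketch below shows the idea on a list of row dicts using only the standard library. `bootstrap_rows` is a hypothetical helper, and it does not reproduce the BootstrapResult metadata the real method returns.

```python
import random


def bootstrap_rows(source, n_rows=None, seed=42):
    """Sample rows with replacement; defaults to the source row count."""
    rng = random.Random(seed)  # local generator keeps the result reproducible
    n = len(source) if n_rows is None else n_rows
    return [dict(rng.choice(source)) for _ in range(n)]


source = [
    {"id": 1, "status": "open"},
    {"id": 2, "status": "closed"},
    {"id": 3, "status": "open"},
]
sample = bootstrap_rows(source, n_rows=5, seed=42)
print(len(sample))
```

Every sampled row is an exact copy of a source row, which is why this mode is best treated as a baseline rather than a privacy-preserving generator.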

BootstrapResult dataclass

Result of bootstrap synthetic generation.

BayesianEdge dataclass

A directed edge in the Chow-Liu tree.

ChowLiuNetwork

Learn a Bayesian network tree structure using the Chow-Liu algorithm.

Computes pairwise mutual information between columns and finds the maximum spanning tree — the tree that best represents the joint distribution.

This is the theoretical backbone of synthetic data that preserves inter-column dependencies.

Methods:
fit(df)

Learn the Chow-Liu tree from a DataFrame.
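The two steps named above (pairwise mutual information, then a maximum spanning tree) can be sketched in pure Python for discrete columns. This is a teaching sketch, not the ChowLiuNetwork implementation: `mutual_information` and `chow_liu_edges` are hypothetical helpers, the edges are left undirected (no root is chosen), and continuous columns would need binning first.

```python
import math
from collections import Counter
from itertools import combinations


def mutual_information(xs, ys):
    """Empirical mutual information (in nats) between two discrete columns."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    # sum over observed pairs of p(x,y) * log(p(x,y) / (p(x) * p(y)))
    return sum(
        (c / n) * math.log(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def chow_liu_edges(columns):
    """Maximum spanning tree over pairwise MI, via Kruskal with union-find."""
    pairs = sorted(
        ((mutual_information(columns[a], columns[b]), a, b)
         for a, b in combinations(columns, 2)),
        reverse=True,  # strongest dependencies first
    )
    parent = {name: name for name in columns}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    edges = []
    for mi, a, b in pairs:
        ra, rb = find(a), find(b)
        if ra != rb:  # adding this edge does not create a cycle
            parent[ra] = rb
            edges.append((a, b, mi))
    return edges


columns = {
    "region": ["N", "N", "S", "S", "N", "S", "N", "S"],
    "currency": ["USD", "USD", "EUR", "EUR", "USD", "EUR", "USD", "EUR"],
    "flag": ["a", "b", "a", "b", "b", "a", "a", "b"],
}
edges = chow_liu_edges(columns)
print(edges)
```

In this toy data `currency` is fully determined by `region`, so that pair carries the highest mutual information (log 2 nats) and is selected first. `flag` is independent of both and is attached with zero weight only to complete the tree.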

ChowLiuResult dataclass

Result of Chow-Liu tree structure learning.

CTGANWrapper

Optional wrapper around CTGAN/TVAE from the sdv library.

Falls back gracefully if sdv is not installed. When available, CTGAN provides deep generative model quality for tabular data.

Install with: pip install sqllocks-spindle[deep]

Methods:
fit(df, discrete_columns=None)

Fit the CTGAN model on real data.

sample(n_rows)

Sample from the fitted CTGAN model.

DifferentialPrivacy

Apply Laplace or Gaussian noise to achieve (ε,δ)-differential privacy.

For synthetic data, this adds calibrated noise to numeric columns, with a noise scale proportional to sensitivity / ε, which bounds how much any single record can influence the released values.

Methods:
apply(df, rng=None)

Apply differential privacy noise to all numeric columns.

Returns (noised_df, DPResult).
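The Laplace variant of the mechanism above can be sketched with the standard library: noise is drawn from a Laplace distribution with scale sensitivity / ε and added to each value. This is a minimal sketch under stated assumptions. How the library derives each column's sensitivity is not described in this reference, so it is passed in explicitly, and `add_laplace_noise` is a hypothetical helper. The Gaussian variant is not shown.

```python
import random


def add_laplace_noise(values, sensitivity, epsilon, seed=None):
    """Add Laplace(0, sensitivity / epsilon) noise to each numeric value."""
    rng = random.Random(seed)
    scale = sensitivity / epsilon
    rate = 1.0 / scale
    # The difference of two i.i.d. exponentials with this rate is Laplace(0, scale).
    return [v + rng.expovariate(rate) - rng.expovariate(rate) for v in values]


noised = add_laplace_noise([10.0, 20.0, 30.0], sensitivity=1.0, epsilon=0.5, seed=0)
print(noised)
```

A smaller ε gives a larger scale and therefore noisier output. The seed here only makes the example reproducible.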

DPResult dataclass

Result of applying differential privacy noise.

DriftMonitor

Detect statistical drift between reference and current DataFrames.

Uses KS test for numeric columns, Chi-squared for categoricals, and PSI as a supplementary signal.

Methods:
compare(reference, current)

Compare reference and current DataFrames for drift.
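Of the signals listed above, PSI (Population Stability Index) is the easiest to show in isolation. The sketch below bins both samples on the reference range and sums (current - reference) * ln(current / reference) over the bins. The equal-width binning, the bin count, and the `eps` floor for empty bins are assumptions for illustration; the DriftMonitor's actual binning is not documented here, and `psi` is a hypothetical helper.

```python
import math


def psi(reference, current, bins=10, eps=1e-6):
    """Population Stability Index using equal-width bins on the reference range."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # avoid zero width for constant data

    def fractions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1  # clamp out-of-range values
        # Floor empty bins at eps so the logarithm stays defined.
        return [max(c / len(values), eps) for c in counts]

    ref, cur = fractions(reference), fractions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


reference = list(range(100))
shifted = [v + 50 for v in reference]

stable_score = psi(reference, reference)
drift_score = psi(reference, shifted)
print(stable_score, drift_score)
```

Identical samples score 0, and the score grows as probability mass moves between bins.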

DriftReport dataclass

Drift report comparing a reference and current DataFrame.

ColumnDriftResult dataclass

Drift result for a single column.

AnomalyRateResult dataclass

Result of checking whether the injected anomaly rate matches the registered anomaly fraction.

CardinalityConstraintChecker

Check that synthetic cardinality stays within tolerance of real cardinality.

CardinalityConstraintResult dataclass

Cardinality comparison for a single column.

FormatPreservationAnalyzer

Detect format patterns in real data and check that the synthetic data preserves them.

FormatPreservationResult dataclass

Format preservation metrics for a single string column.

StringSimilarityAnalyzer

Compute character n-gram cosine similarity between real and synthetic string columns.

StringSimilarityResult dataclass

Character n-gram cosine similarity between string column value distributions.
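Character n-gram cosine similarity can be sketched as follows: each column is reduced to a bag of character n-grams, and the two count vectors are compared by cosine. The choice of n=2 and the helper names `ngram_profile` and `cosine_similarity` are assumptions for illustration; the analyzer's actual n and any normalisation it applies are not documented here.

```python
import math
from collections import Counter


def ngram_profile(values, n=2):
    """Count character n-grams across all values in a column."""
    grams = Counter()
    for value in values:
        text = str(value)
        grams.update(text[i:i + n] for i in range(len(text) - n + 1))
    return grams


def cosine_similarity(a, b):
    """Cosine of the angle between two n-gram count vectors."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


real = ["AB-123", "CD-456"]
synth_close = ["AB-124", "CD-457"]
synth_far = ["zzzz"]

close = cosine_similarity(ngram_profile(real), ngram_profile(synth_close))
far = cosine_similarity(ngram_profile(real), ngram_profile(synth_far))
print(close, far)
```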

Tier2Report dataclass

Composite Tier 2 fidelity report.

Methods:
passing_rate()

Fraction of all checks that passed (0.0 - 1.0).

AdvancedProfiler

Runs Tier 1 fidelity profiling on a pair of DataFrames (real + synthetic).

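Usage::

    from sqllocks_spindle.inference import AdvancedProfiler
    # real_df and synth_df are pandas DataFrames holding the real and synthetic data.
    profiler = AdvancedProfiler()
    adv = profiler.profile_pair(real_df, synth_df, table_name="orders")
    print(f"AUC: {adv.adversarial.auc_roc:.3f}")

Methods: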
profile_pair(real, synthetic, table_name='table')

Profile real + synthetic DataFrames and return AdvancedTableProfile.

profile_single(df, table_name='table')

Profile a single DataFrame. The adversarial test is skipped because it needs both real and synthetic data.

AdvancedTableProfile dataclass

Extended profile combining base stats with Tier 1 fidelity features.

AdversarialResult dataclass

Result of the adversarial (distinguishability) test.

Attributes
distinguishability_score property

0 = perfectly indistinguishable, 100 = perfectly distinguishable.

ConditionalProfile dataclass

Conditional statistics for col_a given values of col_b.

GMMFit dataclass

Gaussian Mixture Model fit for a numeric column.

PeriodicityResult dataclass

FFT-based periodicity detection result.

TemporalProfile dataclass

Temporal / sequence analysis for a datetime or sorted numeric column.

Functions:

build_redaction_manifest(dataset_profile, safe_profile, config=None, unsafe=False)

Build the self-describing redaction manifest (ADR-005 / STORY-009).

The manifest is computed from the rich source profile and the scrubbed safe profile together, so it reports what was ACTUALLY suppressed — not what was intended. Accuracy is the AC: every figure is read off the real mapping outcome.

Shape::

{
  "unsafe": <bool>,            # mirrors SafeProfile.unsafe
  "k_default": <int>,          # profile-level k that applied by default
  "tables": {
    <table>: {
      <column>: {
        "rare_categories_dropped": <int>,  # k-anon __OTHER__ folds
        "bounds_winsorized": <bool>,       # winsorized quantile bounds set
        "pattern_only": <bool>,            # PII-gated to pattern+length only
        "k": <int>,                        # effective k for this column
        "sensitive": <bool>,               # sensitive flag raised k
      }, ...
    }, ...
  }
}

rare_categories_dropped reads the per-column suppressed_category_count the k-anon hook actually recorded (STORY-007). pattern_only re-evaluates the exact PII-gate decision the mapper used (STORY-008). In unsafe mode the effective config disabled both controls, so these report 0 / False — accurately.

safe_profile_to_schema(profile, domain_name='safe_inferred')

Convenience wrapper around SafeProfileAdapter.to_schema().

check_anomaly_rates(df, expected_fractions=None, tolerance=0.05)

Verify _spindle_is_anomaly rate in a DataFrame.

Parameters:

df (DataFrame, required): DataFrame produced by AnomalyRegistry.inject().
expected_fractions (dict[str, float] | None, default None): Optional mapping of anomaly_type -> expected fraction. If None, the overall anomaly rate is checked against an expected value of 0.0 (no anomalies).
tolerance (float, default 0.05): Acceptable deviation from the expected fraction.

Returns:

AnomalyRateResult | None: An AnomalyRateResult, or None if no anomaly columns are present.
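The tolerance check described above compares an observed anomaly fraction with an expected one. A minimal sketch on a plain list of boolean flags follows. It stands in for the `_spindle_is_anomaly` column, and both `check_rate` and the returned dict shape are hypothetical, not the real AnomalyRateResult.

```python
def check_rate(flags, expected=0.0, tolerance=0.05):
    """Compare the observed fraction of True flags with an expected fraction."""
    if not flags:
        return None  # nothing to check
    observed = sum(1 for f in flags if f) / len(flags)
    return {
        "observed": observed,
        "expected": expected,
        "passed": abs(observed - expected) <= tolerance,
    }


flags = [True] * 3 + [False] * 97  # 3% of rows flagged as anomalies

within = check_rate(flags, expected=0.05, tolerance=0.05)
outside = check_rate(flags, expected=0.20, tolerance=0.05)
print(within["passed"], outside["passed"])
```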

run_tier2(real, synthetic, expected_anomaly_fractions=None)

Run all Tier 2 checks and return a Tier2Report.