tier3_research

sqllocks_spindle.inference.tier3_research

Tier 3 research-grade fidelity features.

Experimental features that raise the synthetic data quality ceiling:

  • ChowLiuNetwork: Bayesian network structure learning via the Chow-Liu algorithm
  • DifferentialPrivacy: Laplace/Gaussian noise injection for (ε,δ)-DP
  • DriftMonitor: Statistical drift detection between two DataFrames
  • BootstrapMode: Bootstrap-based synthetic generation from real data
  • CTGANWrapper: Optional wrapper around sdv/ctgan, used when installed

All features fail gracefully when optional dependencies (sdv, sklearn) are absent.

Classes

BayesianEdge dataclass

A directed edge in the Chow-Liu tree.

ChowLiuResult dataclass

Result of Chow-Liu tree structure learning.

ChowLiuNetwork

Learn a Bayesian network tree structure using the Chow-Liu algorithm.

Computes pairwise mutual information between columns and finds the maximum spanning tree over those weights: the tree-structured model that best approximates the joint distribution.

The learned tree is the dependency structure that lets synthetic data preserve inter-column relationships.
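
The two steps described above (pairwise mutual information, then a maximum spanning tree) can be sketched in plain Python. This is an illustration of the technique, not the library's implementation: the function names `mutual_information` and `chow_liu_edges`, the dict-of-lists input, and the sample data are all assumptions, and `ChowLiuNetwork.fit` may discretize columns or orient edges differently.

```python
import math
from collections import Counter
from itertools import combinations


def mutual_information(xs, ys):
    """Empirical mutual information (in nats) between two discrete columns."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        # p_xy / (p_x * p_y), written with raw counts
        mi += p_xy * math.log(p_xy * n * n / (px[x] * py[y]))
    return mi


def chow_liu_edges(columns):
    """Maximum spanning tree over pairwise MI, as undirected (a, b, mi) edges."""
    names = list(columns)
    weighted = sorted(
        ((mutual_information(columns[a], columns[b]), a, b)
         for a, b in combinations(names, 2)),
        reverse=True,  # Kruskal on descending weights yields a maximum tree
    )
    parent = {name: name for name in names}

    def find(node):
        # Union-find lookup with path halving
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    tree = []
    for mi, a, b in weighted:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:  # skip edges that would close a cycle
            parent[root_a] = root_b
            tree.append((a, b, mi))
    return tree


# Hypothetical data: region determines currency, flag is independent of both.
columns = {
    "region": ["N", "N", "S", "S", "N", "S", "N", "S"],
    "currency": ["USD", "USD", "EUR", "EUR", "USD", "EUR", "USD", "EUR"],
    "flag": [0, 1, 0, 1, 1, 0, 0, 1],
}
tree = chow_liu_edges(columns)
```

In this sample the strongest edge links `region` and `currency`, because each fully determines the other. Choosing a root and directing edges away from it would turn the undirected tree into directed edges of the kind `BayesianEdge` describes.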

Methods:
fit(df)

Learn the Chow-Liu tree from a DataFrame.

DPResult dataclass

Result of applying differential privacy noise.

DifferentialPrivacy

Apply Laplace or Gaussian noise to achieve (ε,δ)-differential privacy.

For synthetic data, this adds noise to each numeric column with a scale calibrated to sensitivity / ε (and, for the Gaussian mechanism, to δ as well). This bounds how much any single record can influence the output, which limits re-identification risk.
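
The calibration of noise scale to sensitivity / ε can be sketched with the Laplace mechanism in plain Python. This is a sketch under stated assumptions, not the library's method: the helper names, the explicit `lower`/`upper` clipping bounds, and the choice of the clipped range as the sensitivity are all illustrative, and no privacy accounting across columns is attempted.

```python
import random


def laplace_noise(scale, rng):
    """Draw one Laplace(0, scale) sample as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def add_laplace_noise(values, epsilon, lower, upper, seed=0):
    """Clip values to [lower, upper], then add Laplace noise of scale sensitivity / epsilon."""
    rng = random.Random(seed)
    sensitivity = upper - lower  # assumption: worst-case change of one clipped value
    scale = sensitivity / epsilon
    return [min(max(v, lower), upper) + laplace_noise(scale, rng) for v in values]


# Hypothetical column values and bounds.
ages = [23.0, 35.0, 41.0, 58.0]
strong_privacy = add_laplace_noise(ages, epsilon=0.1, lower=0.0, upper=100.0)
weak_privacy = add_laplace_noise(ages, epsilon=10.0, lower=0.0, upper=100.0)
```

A smaller ε gives a larger noise scale, so `strong_privacy` deviates further from the original values than `weak_privacy`.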

Methods:
apply(df, rng=None)

Apply differential privacy noise to all numeric columns.

Returns (noised_df, DPResult).

ColumnDriftResult dataclass

Drift result for a single column.

DriftReport dataclass

Drift report comparing a reference and current DataFrame.

DriftMonitor

Detect statistical drift between reference and current DataFrames.

Uses the two-sample Kolmogorov-Smirnov (KS) test for numeric columns, the chi-squared test for categorical columns, and the Population Stability Index (PSI) as a supplementary signal.
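
Two of the signals named above can be sketched for numeric data in plain Python. The function names, the equal-width binning, and the `floor` used to avoid dividing by empty bins are assumptions for illustration. `DriftMonitor` may bin differently and also reports p-values, which this sketch omits along with the chi-squared test.

```python
import math
from bisect import bisect_right


def ks_statistic(reference, current):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    ref, cur = sorted(reference), sorted(current)
    gap = 0.0
    for x in ref + cur:
        cdf_ref = bisect_right(ref, x) / len(ref)
        cdf_cur = bisect_right(cur, x) / len(cur)
        gap = max(gap, abs(cdf_ref - cdf_cur))
    return gap


def psi(reference, current, bins=10, floor=1e-6):
    """Population Stability Index over equal-width bins spanning the reference range."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # guard against a constant reference column

    def shares(values):
        counts = [0] * bins
        for v in values:
            # Clamp values outside the reference range into the edge bins
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # Floor empty bins so the log ratio stays finite
        return [max(c / len(values), floor) for c in counts]

    ref_share, cur_share = shares(reference), shares(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_share, cur_share))


# Hypothetical samples: the current data is the reference shifted by 30.
reference = [float(i) for i in range(100)]
current = [x + 30.0 for x in reference]
ks_value = ks_statistic(reference, current)
psi_value = psi(reference, current)
```

Both values are zero when the two samples are identical and grow as the distributions move apart.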

Methods:
compare(reference, current)

Compare reference and current DataFrames for drift.

BootstrapResult dataclass

Result of bootstrap synthetic generation.

BootstrapMode

Generate synthetic data by bootstrapping (sampling with replacement) from real data.

The simplest form of synthetic generation: every output row is a copy of a source row, so the source's empirical distributions are preserved in expectation, but nothing generalizes beyond the source data. Useful as a baseline.
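
Sampling with replacement, as described above, can be sketched in plain Python on a list of row dicts. The function name `bootstrap_rows` and the sample rows are assumptions for illustration; the defaults mirror the `n_rows=None` and `seed=42` defaults documented for `generate`.

```python
import random


def bootstrap_rows(rows, n_rows=None, seed=42):
    """Sample rows with replacement; defaults to as many rows as the source."""
    if not rows:
        raise ValueError("cannot bootstrap from an empty source")
    rng = random.Random(seed)
    size = len(rows) if n_rows is None else n_rows
    # Copy each drawn row so callers can mutate the result safely
    return [dict(rng.choice(rows)) for _ in range(size)]


# Hypothetical source data.
source = [
    {"id": 1, "amount": 10.5},
    {"id": 2, "amount": 99.0},
    {"id": 3, "amount": 42.0},
]
same_size = bootstrap_rows(source)
larger = bootstrap_rows(source, n_rows=10)
```

With pandas, the equivalent operation is typically `df.sample(n=..., replace=True, random_state=seed)`. Whether `BootstrapMode` uses that call internally is not stated here.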

Methods:
generate(source, n_rows=None, table_name='table', seed=42)

Generate synthetic DataFrame by bootstrapping source.

Parameters:

Name        Type        Description                                            Default
source      DataFrame   Real data to bootstrap from.                           required
n_rows      int | None  Number of rows to generate (default: same as source).  None
table_name  str         Name for result metadata.                              'table'
seed        int         Random seed.                                           42

Returns:

Type                               Description
tuple[DataFrame, BootstrapResult]  (synthetic_df, BootstrapResult)

CTGANWrapper

Optional wrapper around CTGAN/TVAE from the sdv library.

Falls back gracefully if sdv is not installed. When sdv is available, CTGAN provides a GAN-based deep generative model for tabular data.

Install with: pip install sqllocks-spindle[deep]
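
The graceful fallback for a missing optional dependency can be sketched as follows. `has_module` and `OptionalModelWrapper` are hypothetical names. The sketch shows only the availability check, not model training, and makes no claim about how `CTGANWrapper` signals that sdv is absent.

```python
import importlib.util


def has_module(name):
    """Return True if the top-level module `name` is importable here."""
    return importlib.util.find_spec(name) is not None


class OptionalModelWrapper:
    """Hypothetical wrapper that degrades gracefully when its backend is missing."""

    def __init__(self, backend="sdv"):
        self.backend = backend
        self.available = has_module(backend)
        self.fitted = False

    def fit(self, rows):
        if not self.available:
            return False  # report "skipped" instead of raising ImportError
        # A real wrapper would train the backend model on `rows` here.
        self.fitted = True
        return True

    def sample(self, n_rows):
        if not self.fitted:
            raise RuntimeError(f"{self.backend} is unavailable or fit() was not called")
        raise NotImplementedError("sampling from the backend is outside this sketch")


# A deliberately missing backend name shows the fallback path.
wrapper = OptionalModelWrapper(backend="surely_missing_backend_xyz")
was_fitted = wrapper.fit([{"a": 1}])
```

Here `was_fitted` is `False` and no import error is raised, so callers can switch to another generator such as bootstrapping.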

Methods:
fit(df, discrete_columns=None)

Fit the CTGAN model on real data.

sample(n_rows)

Sample from the fitted CTGAN model.