correlation

sqllocks_spindle.engine.correlation

Gaussian copula post-pass — enforce column correlations without changing marginals.

Classes

GaussianCopula

Reorder column values to achieve target Pearson correlations.

Algorithm (rank-based Gaussian copula):

1. For each numeric column, map values to ranks, then to uniform [0,1].
2. Apply inverse normal CDF (probit) → correlated Gaussian space.
3. Cholesky decompose target correlation matrix → apply linear transform.
4. Map back to uniform via normal CDF → back to original values via rank lookup.

This preserves each column's marginal distribution exactly while inducing the target pairwise correlations.
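
The four steps can be sketched with NumPy and the standard library. This is a minimal illustration under stated assumptions, not the library's implementation: the function name copula_reorder, the array-based interface, and the (rank + 0.5) / n mapping to uniforms are all hypothetical choices made for the example.

```python
from statistics import NormalDist

import numpy as np


def copula_reorder(columns, target_corr):
    """Reorder each column of an (n, k) array toward a target correlation.

    Hypothetical helper for illustration; target_corr must be a
    positive-definite (k, k) correlation matrix.
    """
    x = np.asarray(columns, dtype=float)
    n = x.shape[0]

    # Step 1: values -> ranks (0..n-1) -> uniforms strictly inside (0, 1).
    ranks = x.argsort(axis=0).argsort(axis=0)
    u = (ranks + 0.5) / n

    # Step 2: inverse normal CDF (probit) gives Gaussian scores.
    z = np.vectorize(NormalDist().inv_cdf)(u)

    # Step 3: Cholesky factor of the target matrix, applied as a linear map.
    chol = np.linalg.cholesky(np.asarray(target_corr, dtype=float))
    y = z @ chol.T

    # Step 4: the normal CDF is monotonic, so the ranks of y are already
    # the ranks of the uniforms; look up the original values by rank.
    new_ranks = y.argsort(axis=0).argsort(axis=0)
    sorted_x = np.sort(x, axis=0)
    return np.take_along_axis(sorted_x, new_ranks, axis=0)


# Two independent, non-normal columns pushed toward r = 0.8.
rng = np.random.default_rng(0)
data = np.column_stack([rng.exponential(size=2000), rng.uniform(size=2000)])
out = copula_reorder(data, [[1.0, 0.8], [0.8, 1.0]])
```

Each output column is a permutation of the corresponding input column, which is why the marginals survive unchanged. In this sketch the resulting Pearson correlation of non-normal columns only approximates the target, because the target is imposed on the Gaussian scores rather than on the original values.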

Parameters:

correlation_matrix (dict[str, dict[str, float]], required)
    Dict of {col_a: {col_b: r}} pairs.

threshold (float, default 0.5)
    Skip pairs where |r| < threshold.

seed (int | None, default None)
    Random seed for reproducibility. None uses system entropy.

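
To make the nested-dict format and the threshold rule concrete, here is a small sketch of how pairs with |r| < threshold could be filtered out. The helper name significant_pairs is hypothetical, and the handling of self-pairs and mirrored entries (a→b and b→a) is an assumption; the documentation above does not say how the library treats them.

```python
def significant_pairs(correlation_matrix, threshold=0.5):
    """Return {(col_a, col_b): r} for pairs with |r| >= threshold.

    Hypothetical helper: self-pairs are dropped and mirrored entries are
    collapsed onto a sorted key, keeping the first value seen (assumption).
    """
    pairs = {}
    for col_a, row in correlation_matrix.items():
        for col_b, r in row.items():
            if col_a == col_b or abs(r) < threshold:
                continue  # skip self-pairs and weak correlations
            key = tuple(sorted((col_a, col_b)))
            pairs.setdefault(key, r)
    return pairs


spec = {
    "price": {"quantity": -0.7, "rating": 0.2},
    "quantity": {"price": -0.7},
}
kept = significant_pairs(spec)
```

With the default threshold, only the price/quantity pair survives here; the price/rating entry (r = 0.2) falls below 0.5 and is skipped.
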
Methods:
apply(df)

Apply the copula reordering to a DataFrame. Returns a new DataFrame.

Note: the copula reorders each participating column independently, which breaks any row-aligned key relationship. PK and FK columns are therefore EXCLUDED from the copula by convention (3.0.0 audit fix): any column whose name ends with _id, _pk, _fk, or is literally id / pk is skipped.
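
The naming convention in the note can be expressed as a simple predicate. This is a sketch of the stated rule only: the function name is_key_column is hypothetical, and case-sensitive matching is an assumption, since the note does not say how differently cased names such as Customer_ID are handled.

```python
def is_key_column(name):
    """True if a column name matches the PK/FK naming convention.

    Hypothetical helper; matching is case-sensitive by assumption.
    """
    return name in ("id", "pk") or name.endswith(("_id", "_pk", "_fk"))


columns = ["id", "customer_id", "order_fk", "price", "paid", "quantity"]
eligible = [c for c in columns if not is_key_column(c)]
```

Note that a name like paid stays eligible: it ends in "id" but not in "_id", so the suffix rule does not apply.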