correlation
sqllocks_spindle.engine.correlation
Gaussian copula post-pass — enforce column correlations without changing marginals.
Classes
GaussianCopula
Reorder column values to achieve target Pearson correlations.
Algorithm (rank-based Gaussian copula):

1. For each numeric column, map values to ranks, then to uniform [0,1].
2. Apply the inverse normal CDF (probit) to move each column into Gaussian space.
3. Cholesky-decompose the target correlation matrix and apply the resulting linear transform, which induces the correlations.
4. Map back to uniform via the normal CDF, then back to the original values via rank lookup.

This preserves each column's marginal distribution exactly while inducing the target pairwise correlations.
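
The four steps above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the helper names (`cholesky`, `ranks`, `copula_reorder`, `pearson`) are hypothetical, the `(rank + 0.5) / n` mapping to the open interval (0, 1) is an assumption, and the sketch assumes the input columns start out roughly independent. Because the normal CDF is monotonic, ranking the transformed scores directly gives the same ordering as ranking their CDF values, so step 4 only needs the ranks.

```python
import math
import random
from statistics import NormalDist


def cholesky(matrix):
    """Return lower-triangular L with L @ L^T == matrix (positive definite input)."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - partial)
            else:
                lower[i][j] = (matrix[i][j] - partial) / lower[j][j]
    return lower


def ranks(values):
    """0-based rank of each value; ties are broken by position (stable sort)."""
    order = sorted(range(len(values)), key=lambda idx: values[idx])
    result = [0] * len(values)
    for rank, idx in enumerate(order):
        result[idx] = rank
    return result


def copula_reorder(columns, target):
    """Reorder each column (list of numbers) toward the target correlation matrix."""
    n = len(columns[0])
    normal = NormalDist()
    # Steps 1-2: rank -> uniform in (0, 1) -> normal score via the probit.
    # The +0.5 offset (an assumption) avoids 0 and 1, where inv_cdf is undefined.
    scores = [[normal.inv_cdf((r + 0.5) / n) for r in ranks(col)] for col in columns]
    # Step 3: mix the scores with the Cholesky factor of the target matrix.
    lower = cholesky(target)
    mixed = [
        [sum(lower[i][k] * scores[k][row] for k in range(i + 1)) for row in range(n)]
        for i in range(len(columns))
    ]
    # Step 4: place each column's own sorted values at the ranks of the mixed scores.
    reordered = []
    for col, z in zip(columns, mixed):
        sorted_values = sorted(col)
        reordered.append([sorted_values[r] for r in ranks(z)])
    return reordered


def pearson(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    var_y = sum((yi - mean_y) ** 2 for yi in y)
    return cov / math.sqrt(var_x * var_y)


# Two independent columns, reordered toward a target correlation of 0.8.
rng = random.Random(0)
a = [rng.gauss(0, 1) for _ in range(2000)]
b = [rng.gauss(0, 1) for _ in range(2000)]
out_a, out_b = copula_reorder([a, b], [[1.0, 0.8], [0.8, 1.0]])
print("before:", round(pearson(a, b), 3), "after:", round(pearson(out_a, out_b), 3))
```

In this sketch each output column is a permutation of its input, so the marginals are untouched, and the first column keeps its original order because the first row of the Cholesky factor leaves its scores unchanged. The achieved correlation is close to the target but not exactly equal to it.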
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `correlation_matrix` | `dict[str, dict[str, float]]` | dict of `{col_a: {col_b: r}}` pairs. | required |
| `threshold` | `float` | Skip pairs where \|r\| < threshold (default 0.5). | `0.5` |
| `seed` | `int \| None` | Random seed for reproducibility (default None, uses system entropy). | `None` |
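
To illustrate how the `correlation_matrix` shape and `threshold` interact, here is a small sketch of the filtering rule described in the table. The helper `pairs_above_threshold` and the column names are hypothetical, and two details are assumptions rather than documented behavior: a pair listed in both directions is collapsed into one entry, and a column paired with itself is ignored.

```python
def pairs_above_threshold(correlation_matrix, threshold=0.5):
    """Keep only pairs with |r| >= threshold, keyed by a sorted column-name tuple."""
    kept = {}
    for col_a, row in correlation_matrix.items():
        for col_b, r in row.items():
            # Skip self-pairs (assumption) and pairs weaker than the threshold.
            if col_a == col_b or abs(r) < threshold:
                continue
            # Sorting the names collapses (a, b) and (b, a) into one key (assumption).
            kept[tuple(sorted((col_a, col_b)))] = r
    return kept


# Hypothetical input in the {col_a: {col_b: r}} shape.
matrix = {
    "price": {"quantity": 0.8, "discount": -0.3},
    "quantity": {"price": 0.8},
}
strong_pairs = pairs_above_threshold(matrix, threshold=0.5)
print(strong_pairs)
```

With the default threshold, the `price`/`discount` pair would be skipped because its absolute correlation of 0.3 is below 0.5.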
Methods:
apply(df)
Apply the copula reordering to a DataFrame. Returns a new DataFrame.
Note: the copula reorders each participating column independently,
which breaks any row-aligned key relationship. PK and FK columns are
therefore excluded from the copula by convention (3.0.0 audit fix):
any column whose name ends with `_id`, `_pk`, or `_fk`, or whose name is
exactly `id` or `pk`, is skipped.
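
The naming convention in the note can be expressed as a small predicate. This is a sketch only: `is_key_column` is a hypothetical helper, not part of the documented API, and the case-sensitive comparison is an assumption.

```python
KEY_SUFFIXES = ("_id", "_pk", "_fk")
KEY_NAMES = {"id", "pk"}


def is_key_column(name):
    """True if the column name matches the PK/FK naming convention (case-sensitive)."""
    return name in KEY_NAMES or name.endswith(KEY_SUFFIXES)


# Hypothetical column names: only the non-key columns would take part in the copula.
columns = ["id", "customer_id", "order_fk", "price", "paid", "quantity"]
copula_columns = [c for c in columns if not is_key_column(c)]
print(copula_columns)
```

Note that a name such as `paid` is not treated as a key under this rule, because the suffix match requires the leading underscore.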