inference
sqllocks_spindle.inference
¶
Spindle inference engine — profile existing data and infer schemas.
Provides DataProfiler for analysing DataFrames and SchemaBuilder for converting profiles into ready-to-use SpindleSchema objects. Also includes FidelityComparator for comparing real vs synthetic data.
Classes¶
DataMasker
¶
Replace PII in real data with synthetic values preserving distributions.
MaskConfig
dataclass
¶
Configuration for data masking.
MaskResult
dataclass
¶
ColumnFidelity
dataclass
¶
Fidelity metrics for a single column.
FidelityComparator
¶
FidelityReport
dataclass
¶
Complete fidelity report comparing real vs synthetic data.
Methods:¶
summary()
¶
Generate a plain-text summary.
to_markdown()
¶
Generate markdown report.
failing_columns(threshold=85.0)
¶
Return (table, column, score) tuples for columns below threshold.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| threshold | float | Score threshold (0-100). Columns with score < threshold are included. | 85.0 |

Returns:
| Type | Description |
|---|---|
| list[tuple[str, str, float]] | List of (table_name, column_name, score) tuples, sorted by score (lowest first). |
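The selection rule for failing_columns() can be sketched in plain Python. This is an illustration only: the helper name `failing` and the sample scores are hypothetical, and the library's own implementation may differ.

```python
def failing(scores, threshold=85.0):
    """Return (table, column, score) tuples with score < threshold, lowest first.

    `scores` is a hypothetical list of (table_name, column_name, score) tuples.
    """
    below = [row for row in scores if row[2] < threshold]
    return sorted(below, key=lambda row: row[2])


scores = [
    ("orders", "amount", 91.2),
    ("orders", "status", 62.0),
    ("customers", "email", 84.9),
]
result = failing(scores)
```

A score exactly equal to the threshold is not reported, because the comparison is strict.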
to_dict()
¶
Return a JSON-serializable dict representation.
to_dataframe()
¶
Return a flat pandas DataFrame with one row per column.
to_html(title='Spindle Fidelity Report')
¶
Render fidelity report as a self-contained HTML page.
Uses inline CSS — no external dependencies. Score bands: green ≥ 85, amber 70-84, red < 70.
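The score bands can be expressed as a small lookup. This sketch assumes fractional scores between 84 and 85 fall in the amber band (the source lists only integer ranges); `score_band` is a hypothetical helper, not a library function.

```python
def score_band(score: float) -> str:
    """Map a 0-100 fidelity score to a colour band.

    Assumption: amber covers 70 <= score < 85, so 84.5 is amber.
    """
    if score >= 85:
        return "green"
    if score >= 70:
        return "amber"
    return "red"


bands = [score_band(s) for s in (92.0, 85.0, 84.5, 70.0, 69.9)]
```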
score(real, synthetic, table_name='table', threshold=85.0)
classmethod
¶
Compare two DataFrames and return a FidelityReport.
Convenience classmethod for single-table comparison.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| real | 'pd.DataFrame' | Real data DataFrame. | required |
| synthetic | 'pd.DataFrame' | Synthetic data DataFrame to compare. | required |
| table_name | str | Name for the table in the report (default: "table"). | 'table' |
| threshold | float | Score threshold for failing_columns() (default: 85.0). | 85.0 |

Returns:
| Type | Description |
|---|---|
| 'FidelityReport' | FidelityReport comparing the two DataFrames. |
TableFidelity
dataclass
¶
Fidelity metrics for a table.
ColumnProfile
dataclass
¶
Statistical profile of a single column.
DataProfiler
¶
Analyse one or more DataFrames and produce profiles.
Methods:¶
profile_dataframe(df, table_name='table')
¶
Profile a single DataFrame.
profile_dataset(tables)
¶
Profile a dict of DataFrames and detect cross-table relationships.
profile(df, table_name='table')
¶
Alias for profile_dataframe(). Profile a single DataFrame.
from_csv(path, table_name=None, sample_rows=None, **kwargs)
classmethod
¶
Profile a CSV file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | str \| Path | Path to the CSV file. | required |
| table_name | str \| None | Name for the table profile. Defaults to the filename stem. | None |
| sample_rows | int \| None | If set, sample this many rows before profiling. | None |
| **kwargs |  | Passed to DataProfiler constructor (fit_threshold, top_n_values, etc.). | {} |
DatasetProfile
dataclass
¶
Profile of a multi-table dataset.
TableProfile
dataclass
¶
Profile of a single table (DataFrame).
ExportedProfile
dataclass
¶
A portable profile that can be imported into any domain.
Attributes:
| Name | Type | Description |
|---|---|---|
| name | str | Profile identifier (e.g. |
| description | str | Human-readable description of what this profile represents. |
| source_domain | str | Name of the domain this profile was exported from (or |
| distributions | dict[str, dict[str, float]] | Mapping of |
| ratios | dict[str, float] | Mapping of ratio names to float multipliers. |
| metadata | dict[str, Any] | Arbitrary extra information (row counts, column types, etc.). |
ProfileIO
¶
Export, import, and list domain profiles.
All public methods are stateless — no configuration is stored on the
instance. Instantiate with ProfileIO() and call methods directly.
Example::

    io = ProfileIO()
    io.export_profile(RetailDomain(), Path("retail_profile.json"))
    io.import_profile(Path("retail_profile.json"), HealthcareDomain(), save_as="from_retail")
    io.list_profiles(RetailDomain())
Methods:¶
export_profile(domain, output_path, profile_name='default')
¶
Export a domain's active profile to a standalone JSON file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| domain | Any | A :class: | required |
| output_path | str \| Path | Destination file path (created if it does not exist). | required |
| profile_name | str | Label stored in the exported metadata. | 'default' |

Returns:
| Type | Description |
|---|---|
| Path | The resolved :class: |
import_profile(profile_path, target_domain, save_as=None)
¶
Import an exported profile into a target domain's profiles/ directory.
The imported file is converted to the standard domain profile format
(i.e. metadata is stripped; only name, description,
distributions, and ratios are kept).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| profile_path | str \| Path | Path to an exported profile JSON file. | required |
| target_domain | Any | The domain instance to import into. | required |
| save_as | str \| None | Override the profile name (and filename). When None the name is taken from the file's | None |

Returns:
| Type | Description |
|---|---|
| str | The name the profile was saved as. |
list_profiles(domain)
¶
List all profiles available for a domain.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| domain | Any | A :class: | required |

Returns:
| Type | Description |
|---|---|
| list[dict[str, str \| int]] | A list of dicts with keys |
from_dataframe(df, table_name='table', name='inferred')
¶
Create a profile by inferring distributions from a DataFrame.
Categorical columns (object dtype or low cardinality) are converted into normalised distribution weights. High-cardinality columns are skipped.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | The source DataFrame. | required |
| table_name | str | Prefix for distribution keys ( | 'table' |
| name | str | Name to assign to the resulting profile. | 'inferred' |

Returns:
| Type | Description |
|---|---|
| ExportedProfile | An :class:ExportedProfile |
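As a rough illustration of how a categorical column becomes normalised distribution weights, the sketch below counts values and divides by the total. The helper name and the cardinality cut-off of 20 are assumptions made for the example; the source does not state the threshold that from_dataframe() uses.

```python
from collections import Counter


def categorical_weights(values, max_cardinality=20):
    """Return {value: weight} with weights summing to 1.0.

    Returns None when the column has more distinct values than
    `max_cardinality` (a hypothetical cut-off for "high cardinality").
    """
    counts = Counter(v for v in values if v is not None)
    if not counts or len(counts) > max_cardinality:
        return None
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


weights = categorical_weights(["web", "store", "web", "web"])
skipped = categorical_weights(range(100))
```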
ProfileStore
¶
Persist and retrieve a :class:SafeProfile to/from a JSON file.
All methods are stateless — instantiate with ProfileStore() and call
directly, or use the classmethods. This is the only supported public
on-disk entrypoint for a SafeProfile (ADR-001 / ADR-007).
Methods:¶
save(profile, path)
classmethod
¶
Write profile to path as JSON (via to_safe_dict).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| profile | SafeProfile | The :class: | required |
| path | str \| Path | Destination file path. Parent directories are created. | required |

Returns:
| Type | Description |
|---|---|
| Path | The resolved :class: |
load(path)
classmethod
¶
Read a SafeProfile from a JSON file written by :meth:save.
A file whose schema_version is not the version this code writes
(e.g. a legacy artifact with no/old version, or a future version) is
loaded read-only with a warning — it never crashes. The returned
object is degraded-but-usable: the keys present are reconstructed, and
the loaded schema_version is preserved on the returned object so a
caller can detect the mismatch.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | str \| Path | Path to a JSON file previously written by :meth: | required |

Returns:
| Type | Description |
|---|---|
| SafeProfile | A reconstructed :class: |
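The tolerant loading behaviour described for load() (warn on a schema_version mismatch, never crash, preserve the loaded version) can be sketched as follows. The function name, the version string "1", and the returned dict layout are all hypothetical; the real method returns a SafeProfile.

```python
import json
import warnings

CURRENT_SCHEMA_VERSION = "1"  # hypothetical value, for illustration only


def load_tolerant(text):
    """Parse a serialized profile, warning (not raising) on a version mismatch."""
    data = json.loads(text)
    version = data.get("schema_version")  # None for legacy files with no version
    mismatch = version != CURRENT_SCHEMA_VERSION
    if mismatch:
        warnings.warn(
            f"schema_version {version!r} differs from {CURRENT_SCHEMA_VERSION!r}; "
            "loading read-only",
            stacklevel=2,
        )
    # Keep whatever keys are present and preserve the loaded version so a
    # caller can detect the mismatch afterwards.
    return {
        "schema_version": version,
        "tables": data.get("tables", {}),
        "read_only": mismatch,
    }


current = load_tolerant('{"schema_version": "1", "tables": {}}')
```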
SafeColumnProfile
dataclass
¶
Safe, persisted statistic set for a single column.
Carries ONLY non-raw-bearing statistics. Notably absent (by construction):
min_value, max_value, enum_values, value_counts_ext.
Numeric extremes live in bounds (winsorized quantile bounds, ADR-002,
populated by STORY-006). Categorical mass lives in categorical_weights
(post-k-anon suppression, ADR-003, populated by STORY-007).
Methods:¶
from_column_profile(col, config=None, row_count=None)
classmethod
¶
Map a rich ColumnProfile to a SafeColumnProfile (STORY-002).
Selects ONLY the safe-and-sufficient statistic set. Reads the REAL
attribute names on ColumnProfile (min_value/max_value are
never read — ADR-002; bounds derive from quantiles). This fixes
the B2 attribute-mismatch bug class where the legacy registry read
non-existent .min/.max/.top_values.
Disclosure-control transforms are applied via hooks that are STUBS in this story and become real in their owning stories:
- bounds: winsorized quantile bounds (STORY-006 / ADR-002). Stub here: {"lo": p1, "hi": p99} taken from quantiles if present.
- categorical_weights: k-anon suppression (STORY-007 / ADR-003): any value with count < k is folded into a single __OTHER__ bucket. count is derived from the seeded proportion x row_count (the rich enum_values / value_counts_ext carry value->proportion, not raw counts). row_count is threaded in by the table mapper.
- value-pattern PII gate (STORY-008 / ADR-004). When a column's detected pattern is a PII class (:py:data:PII_PATTERNS) OR its cardinality is approximately the row count (high-card free-text backstop), the column persists pattern + length_dist ONLY; categorical_weights are dropped and no values are carried. Detection is name-independent (catches PII in notes / c_47). This is DEFENSE-IN-DEPTH, NOT a completeness guarantee (ADR-004 / ADR-011).
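The k-anon suppression step can be sketched as below: each proportion is multiplied by row_count to estimate a count, and values whose estimated count is below k are folded into __OTHER__. The function name, the default k=5, and the return shape are assumptions for illustration, not the library's hook.

```python
def fold_rare(proportions, row_count, k=5):
    """Fold values with estimated count < k into a single __OTHER__ bucket.

    `proportions` maps value -> proportion (not raw counts), so the count is
    estimated as proportion * row_count. Returns (weights, dropped_count).
    """
    kept = {}
    other_mass = 0.0
    dropped = 0
    for value, proportion in proportions.items():
        if proportion * row_count < k:
            other_mass += proportion
            dropped += 1
        else:
            kept[value] = proportion
    if dropped:
        kept["__OTHER__"] = other_mass
    return kept, dropped


weights, dropped = fold_rare(
    {"A": 0.90, "B": 0.07, "C": 0.02, "D": 0.01}, row_count=100, k=5
)
```

In this example C and D (estimated counts 2 and 1) are merged, so the persisted weights name only A, B, and __OTHER__.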
to_safe_dict()
¶
Serialize to a plain dict. Deterministic key order for byte-stability.
SafeProfile
dataclass
¶
The canonical, versioned, on-disk safe profile (ADR-001).
Top-level transport object. Carries schema_version and an embedded
redaction_manifest (populated by STORY-009 — present but empty here).
Methods:¶
from_dataset_profile(dataset_profile, config=None, unsafe_full_fidelity=False)
classmethod
¶
Map a rich DatasetProfile to a SafeProfile (STORY-002 / ADR-001).
Builds one SafeTableProfile per table and one SafeColumnProfile
per column, selecting ONLY the safe-and-sufficient statistic set. The
rich profile is the source; the returned SafeProfile is the safe
transport.
The mapper reads only REAL attribute names on the rich dataclasses
(min_value/max_value are never read — bounds derive from
quantiles per ADR-002), fixing the B2 attribute-mismatch bug class.
config is an optional per-profile/per-column settings dict threaded
to the disclosure-control hooks (winsorization percentiles, k-anon k,
PII gate).
Safe-by-default (ADR-005 / STORY-009)¶
The scrub — winsorized bounds (ADR-002), k-anon __OTHER__
suppression (ADR-003), and the value-pattern PII gate (ADR-004) — runs
by DEFAULT. The safe path is the path of least resistance.
unsafe_full_fidelity=True is the explicit, single opt-out. It
disables the disclosure-control transforms (k-anon suppression and the
PII gate are turned off so full-fidelity categorical weights / values
survive) and stamps unsafe=True on the returned profile. Such an
artifact is rejected by validate --safe (STORY-010). It is the ONLY
way to persist un-scrubbed statistics.
Every returned profile carries an accurate redaction_manifest (built
from the rich source vs. the scrubbed safe columns — see
:func:build_redaction_manifest).
to_safe_dict()
¶
Serialize to a plain dict with deterministic key order.
SafeTableProfile
dataclass
¶
Safe, persisted profile for a single table.
Methods:¶
from_table_profile(table, config=None)
classmethod
¶
Map a rich TableProfile to a SafeTableProfile (STORY-002).
One SafeColumnProfile per column. Carries the table-level
correlation_matrix, primary_key and advisory detected_fks
(names/overlap only — no raw values). Column order is preserved.
to_safe_dict()
¶
Serialize to a plain dict. Columns serialized in declared order.
SafeProfileAdapter
¶
Adapt a loaded :class:SafeProfile to a generatable :class:SpindleSchema.
Stateless; instantiate and call :meth:to_schema, or use the module-level
:func:safe_profile_to_schema convenience wrapper.
Methods:¶
to_schema(profile, domain_name='safe_inferred')
¶
Build a :class:SpindleSchema from a loaded :class:SafeProfile.
The returned schema is ready to pass to
Spindle().generate(schema=..., fidelity_profile=profile).
Raw fields are never consulted (there are none on the safe model);
numeric clipping is driven by the winsorized bounds (ADR-002).
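Clipping driven by winsorized bounds amounts to clamping each generated value into the stored interval. A minimal sketch, assuming bounds are stored as a {"lo": ..., "hi": ...} dict as in the stub described for SafeColumnProfile; `clip_to_bounds` is a hypothetical helper.

```python
def clip_to_bounds(values, bounds):
    """Clamp each value into [bounds["lo"], bounds["hi"]]."""
    lo, hi = bounds["lo"], bounds["hi"]
    return [min(max(v, lo), hi) for v in values]


clipped = clip_to_bounds([-50.0, 3.2, 18.0, 9999.0], {"lo": 0.5, "hi": 120.0})
```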
SafeProfileValidator
¶
Structural, fail-closed static leak scanner over a serialized artifact.
Usage::

    result = SafeProfileValidator().validate_file("profile.json")
    sys.exit(result.exit_code)
ValidationFinding
dataclass
¶
A single leak finding, with the JSON path that triggered it.
ValidationResult
dataclass
¶
SchemaBuilder
¶
LakehouseProfiler
¶
Profile Fabric Lakehouse Delta tables and return TableProfile objects.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| workspace_id | str | Fabric workspace GUID. | required |
| lakehouse_id | str | Fabric lakehouse GUID. | required |
| token_provider | Any \| None | A callable returning an Azure access token string. Defaults to DefaultAzureCredential when azure-identity is installed. | None |
| default_sample_rows | int \| None | Row limit for profiling. Pass None to scan entire table. | 100000 |
Methods:¶
profile_table(table_name, sample_rows='default')
¶
Profile a single Delta table.
profile_all(sample_rows='default')
¶
Profile all tables in the lakehouse.
detect_foreign_keys(table_names=None, overlap_threshold=0.9, sample_rows='default', full_scan=False)
¶
Sampled cross-table FK detection (advisory). ADR-009 / STORY-016.
Reads each table's columns (sampled by default) and runs the proven
DataProfiler._detect_foreign_keys_advisory core (naming *_id plus
value-overlap >= overlap_threshold) across every table pair. Detected
FKs are advisory and reported with the measured overlap; a declared
star_map / RelationshipDef remains authoritative and overrides
(resolved by the caller, not here).
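The two signals named above (an *_id column name plus value overlap at or above the threshold) can be sketched with sets. This is not the DataProfiler core: the function name is hypothetical, and overlap is assumed here to mean the fraction of distinct child values that appear among the parent's key values.

```python
def detect_fk(child_values, parents, column_name, overlap_threshold=0.9):
    """Return an advisory FK candidate for one child column, or None.

    `parents` maps a parent table name to the set of its key values.
    """
    if not column_name.endswith("_id"):
        return None  # naming signal not met
    distinct = {v for v in child_values if v is not None}
    if not distinct:
        return None
    best = None
    for table, keys in parents.items():
        overlap = len(distinct & keys) / len(distinct)
        if overlap >= overlap_threshold and (best is None or overlap > best["overlap"]):
            best = {"parent_table": table, "overlap": overlap, "advisory": True}
    return best


parents = {"customers": {1, 2, 3, 4, 5}, "products": {10, 11}}
fk = detect_fk([1, 2, 2, 3, 5, 5, 4], parents, "customer_id")
```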
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| table_names | list[str] \| None | Tables to scan. Defaults to all tables in the lakehouse. | None |
| overlap_threshold | float | Minimum child-to-parent value overlap to report a FK (default 0.9, configurable per ADR-009). | 0.9 |
| sample_rows | int \| None \| str | Per-table row cap used when reading key columns. | 'default' |
| full_scan | bool | Read entire tables (no sampling) to confirm a sampled result (ADR-009 full-scan option). | False |

Returns:
| Type | Description |
|---|---|
| dict[str, dict[str, dict[str, Any]]] | ``{child_table: {col_name: {"parent_table": str, "overlap": float, "advisory": True, "full_scan": bool}}}`` for every detected FK. |
reconcile_declared_foreign_keys(detected, declared)
staticmethod
¶
Declared FKs override detected advisory FKs (ADR-009 / STORY-017).
A declared star_map / RelationshipDef is AUTHORITATIVE: where a
declaration exists for a (child_table, child_col) it wins over any
detected FK, even a high-overlap one. Detected FKs that a declaration
overrode are REPORTED (not silently dropped) for transparency.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| detected | dict[str, dict[str, dict[str, Any]]] | the output of :meth: | required |
| declared | Any | iterable of | required |

Returns:
| Type | Description |
|---|---|
| dict[str, Any] | Resolved declared entries carry |
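The precedence rule (a declaration wins, and the detected FK it displaced is reported rather than dropped) can be sketched as below. The input and output shapes here are hypothetical: declarations are assumed to be (child_table, child_col, parent_table) tuples and the result keys "resolved" and "overridden" are invented for the example, since the source does not show the actual shapes.

```python
def reconcile(detected, declared):
    """Apply declared FKs over detected ones and report what was overridden."""
    # Copy the per-table dicts so the caller's `detected` is left untouched.
    resolved = {table: dict(cols) for table, cols in detected.items()}
    overridden = []
    for child_table, child_col, parent_table in declared:
        previous = resolved.get(child_table, {}).get(child_col)
        if previous is not None and previous["parent_table"] != parent_table:
            overridden.append((child_table, child_col, previous["parent_table"]))
        resolved.setdefault(child_table, {})[child_col] = {
            "parent_table": parent_table,
            "advisory": False,  # declared, therefore authoritative
        }
    return {"resolved": resolved, "overridden": overridden}


detected = {
    "orders": {
        "customer_id": {"parent_table": "leads", "overlap": 0.97, "advisory": True}
    }
}
out = reconcile(detected, [("orders", "customer_id", "customers")])
```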
BootstrapMode
¶
Generate synthetic data by bootstrapping (sampling with replacement) from real data.
The simplest form of synthetic generation — preserves all real distributions exactly, but does not generalize beyond the source data. Useful as a baseline.
Methods:¶
generate(source, n_rows=None, table_name='table', seed=42)
¶
Generate synthetic DataFrame by bootstrapping source.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| source | DataFrame | Real data to bootstrap from. | required |
| n_rows | int \| None | Number of rows to generate (default: same as source). | None |
| table_name | str | Name for result metadata. | 'table' |
| seed | int | Random seed. | 42 |

Returns:
| Type | Description |
|---|---|
| tuple[DataFrame, BootstrapResult] | (synthetic_df, BootstrapResult) |
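Bootstrapping is sampling rows with replacement. The sketch below shows the idea on a plain list of row tuples using only the standard library; it is not the BootstrapMode implementation, which operates on DataFrames, and `bootstrap_rows` is a hypothetical name.

```python
import random


def bootstrap_rows(rows, n_rows=None, seed=42):
    """Sample rows with replacement; defaults to as many rows as the source."""
    rng = random.Random(seed)  # local generator, so the global RNG is untouched
    n = len(rows) if n_rows is None else n_rows
    return rng.choices(rows, k=n)


source_rows = [("a", 1), ("b", 2), ("c", 3), ("d", 4), ("e", 5)]
sample = bootstrap_rows(source_rows)
```

Every sampled row is a copy of a real row, which is why this approach preserves distributions but cannot produce values that are absent from the source.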
BootstrapResult
dataclass
¶
Result of bootstrap synthetic generation.
BayesianEdge
dataclass
¶
A directed edge in the Chow-Liu tree.
ChowLiuNetwork
¶
Learn a Bayesian network tree structure using the Chow-Liu algorithm.
Computes pairwise mutual information between columns and finds the maximum spanning tree — the tree that best represents the joint distribution.
This is the theoretical backbone of synthetic data that preserves inter-column dependencies.
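A compact sketch of the two steps, assuming discrete columns held as plain lists: estimate pairwise mutual information from empirical counts, then build a maximum spanning tree with Kruskal's method. The function names are hypothetical and the edges returned here are undirected; choosing a root and orienting the edges is left out.

```python
from collections import Counter
from itertools import combinations
from math import log


def mutual_information(xs, ys):
    """Empirical mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    # sum over observed pairs of p(x,y) * log(p(x,y) / (p(x) * p(y)))
    return sum(
        (c / n) * log((c * n) / (px[x] * py[y])) for (x, y), c in pxy.items()
    )


def chow_liu_edges(columns):
    """Return maximum-spanning-tree edges (a, b, weight) over column names."""
    names = list(columns)
    weighted = sorted(
        (
            (mutual_information(columns[a], columns[b]), a, b)
            for a, b in combinations(names, 2)
        ),
        reverse=True,  # highest mutual information first
    )
    parent = {name: name for name in names}  # union-find forest

    def find(node):
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    edges = []
    for weight, a, b in weighted:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:  # adding this edge does not create a cycle
            parent[root_a] = root_b
            edges.append((a, b, weight))
    return edges


columns = {
    "region": ["N", "N", "S", "S", "N", "S", "N", "S"],
    "currency": ["USD", "USD", "EUR", "EUR", "USD", "EUR", "USD", "EUR"],
    "flag": [0, 1, 0, 1, 1, 0, 0, 1],
}
edges = chow_liu_edges(columns)
```

In this toy data currency is fully determined by region, so that pair carries the highest mutual information and is selected first.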
ChowLiuResult
dataclass
¶
Result of Chow-Liu tree structure learning.
CTGANWrapper
¶
Optional wrapper around CTGAN/TVAE from the sdv library.
Falls back gracefully if sdv is not installed. When available, CTGAN provides deep generative model quality for tabular data.
Install with: pip install sqllocks-spindle[deep]
DifferentialPrivacy
¶
Apply Laplace or Gaussian noise to achieve (ε,δ)-differential privacy.
For synthetic data, this adds calibrated noise to numeric columns, with a noise scale proportional to sensitivity / ε, so that individual records are harder to re-identify.
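The Laplace case can be sketched with the standard library. The noise scale is sensitivity / ε, and a Laplace(0, 1) draw is produced as the difference of two independent Exp(1) draws. The function name and signature are hypothetical and are not the DifferentialPrivacy API.

```python
import random


def add_laplace_noise(values, sensitivity, epsilon, seed=0):
    """Add Laplace(0, sensitivity / epsilon) noise to each value."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    scale = sensitivity / epsilon
    rng = random.Random(seed)
    # The difference of two independent Exp(1) variables is Laplace(0, 1).
    return [
        v + scale * (rng.expovariate(1.0) - rng.expovariate(1.0)) for v in values
    ]


noisy = add_laplace_noise([10.0, 20.0, 30.0], sensitivity=1.0, epsilon=0.5, seed=1)
```

A smaller ε gives a larger scale and therefore more noise, which is the privacy and accuracy trade-off the parameter controls.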
DPResult
dataclass
¶
Result of applying differential privacy noise.
DriftMonitor
¶
DriftReport
dataclass
¶
Drift report comparing a reference and current DataFrame.
ColumnDriftResult
dataclass
¶
Drift result for a single column.
AnomalyRateResult
dataclass
¶
Checks whether the injected anomaly rate matches the registered anomaly fraction.
CardinalityConstraintChecker
¶
Check that synthetic cardinality stays within tolerance of real cardinality.
CardinalityConstraintResult
dataclass
¶
Cardinality comparison for a single column.
FormatPreservationAnalyzer
¶
Detect format patterns in real data and check synth preserves them.
FormatPreservationResult
dataclass
¶
Format preservation metrics for a single string column.
StringSimilarityAnalyzer
¶
Compute character n-gram cosine similarity between real and synth string columns.
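Character n-gram cosine similarity can be sketched as follows: count every n-gram across a column's values, then take the cosine of the two count vectors. The choice of n=3 and the function names are assumptions for the example, not the analyzer's documented defaults.

```python
from collections import Counter
from math import sqrt


def ngram_counts(strings, n=3):
    """Count character n-grams over all strings (strings shorter than n add none)."""
    counts = Counter()
    for s in strings:
        counts.update(s[i:i + n] for i in range(len(s) - n + 1))
    return counts


def cosine_similarity(a, b):
    """Cosine of two sparse count vectors; 0.0 if either vector is empty."""
    dot = sum(a[gram] * b[gram] for gram in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


real = ngram_counts(["AB-1001", "AB-1002"])
synth = ngram_counts(["AB-1003", "AB-1004"])
similarity = cosine_similarity(real, synth)
same = cosine_similarity(real, real)
disjoint = cosine_similarity(real, ngram_counts(["zzzz"]))
```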
StringSimilarityResult
dataclass
¶
Character n-gram cosine similarity between string column value distributions.
Tier2Report
dataclass
¶
AdvancedProfiler
¶
Runs Tier 1 fidelity profiling on a pair of DataFrames (real + synthetic).
Usage::

    profiler = AdvancedProfiler()
    adv = profiler.profile_pair(real_df, synth_df, table_name="orders")
    print(f"AUC: {adv.adversarial.auc_roc:.3f}")
AdvancedTableProfile
dataclass
¶
Extended profile combining base stats with Tier 1 fidelity features.
AdversarialResult
dataclass
¶
ConditionalProfile
dataclass
¶
Conditional statistics for col_a given values of col_b.
GMMFit
dataclass
¶
Gaussian Mixture Model fit for a numeric column.
PeriodicityResult
dataclass
¶
FFT-based periodicity detection result.
TemporalProfile
dataclass
¶
Temporal / sequence analysis for a datetime or sorted numeric column.
Functions:¶
build_redaction_manifest(dataset_profile, safe_profile, config=None, unsafe=False)
¶
Build the self-describing redaction manifest (ADR-005 / STORY-009).
The manifest is computed from the rich source profile and the scrubbed safe profile together, so it reports what was ACTUALLY suppressed — not what was intended. Accuracy is the AC: every figure is read off the real mapping outcome.
Shape::

    {
        "unsafe": <bool>, # mirrors SafeProfile.unsafe
        "k_default": <int>, # profile-level k that applied by default
        "tables": {
            <table>: {
                <column>: {
                    "categories_dropped": <int>, # k-anon __OTHER__ folds (rare)
                    "bounds_winsorized": <bool>, # winsorized quantile bounds set
                    "pattern_only": <bool>, # PII-gated to pattern+length only
                    "k": <int>, # effective k for this column
                    "sensitive": <bool>, # sensitive flag raised k
                }, ...
            }, ...
        }
    }
rare_categories_dropped reads the per-column
suppressed_category_count the k-anon hook actually recorded (STORY-007).
pattern_only re-evaluates the exact PII-gate decision the mapper used
(STORY-008). In unsafe mode the effective config disabled both controls,
so these report 0 / False — accurately.
safe_profile_to_schema(profile, domain_name='safe_inferred')
¶
Convenience wrapper around :meth:SafeProfileAdapter.to_schema.
check_anomaly_rates(df, expected_fractions=None, tolerance=0.05)
¶
Verify the _spindle_is_anomaly rate in a DataFrame.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | DataFrame produced by AnomalyRegistry.inject(). | required |
| expected_fractions | dict[str, float] \| None | Optional mapping of anomaly_type -> expected fraction. If None, uses overall anomaly rate with expected = 0.0 (no anomalies). | None |
| tolerance | float | Acceptable deviation from expected fraction. | 0.05 |

Returns:
| Type | Description |
|---|---|
| AnomalyRateResult \| None | AnomalyRateResult or None if no anomaly columns present. |
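The tolerance check reduces to comparing an observed fraction with an expected one. This sketch works on a plain list of booleans standing in for the _spindle_is_anomaly column; the helper name and the returned dict are hypothetical and do not mirror AnomalyRateResult.

```python
def rate_within_tolerance(flags, expected=0.0, tolerance=0.05):
    """Compare the observed anomaly fraction with the expected fraction.

    Returns None when there are no flags to evaluate.
    """
    if not flags:
        return None
    observed = sum(1 for flag in flags if flag) / len(flags)
    return {
        "observed": observed,
        "expected": expected,
        "passed": abs(observed - expected) <= tolerance,
    }


flags = [True] * 3 + [False] * 47  # 3 anomalies in 50 rows
with_expectation = rate_within_tolerance(flags, expected=0.05)
without_expectation = rate_within_tolerance(flags)
```

With no expected fraction supplied the expectation is 0.0, so the 6% observed rate above falls outside the default 0.05 tolerance.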
run_tier2(real, synthetic, expected_anomaly_fractions=None)
¶
Run all Tier 2 checks and return a Tier2Report.