
safe_profile

sqllocks_spindle.inference.safe_profile

SafeProfile — the versioned, persisted, safe-by-construction profile model.

This is the canonical on-disk transport model (ADR-001 / ADR-007). It is decoupled from the rich in-memory DatasetProfile / ColumnProfile (profiler.py): the rich profile is the source, SafeProfile is the transport.

It carries ONLY the safe-and-sufficient statistic set. By construction it has NO raw-bearing fields — there is no min_value, max_value, enum_values or value_counts_ext anywhere on this model. Raw extremes are replaced by winsorized bounds (populated by STORY-006); rare categories are suppressed into categorical_weights (populated by STORY-007).

STORY-001 scope: the dataclasses, schema_version, and the byte-stable to_safe_dict / from_safe_dict round-trip. The mapping from a rich profile (STORY-002), the ProfileStore save/load path (STORY-003), and the serialization guard on the rich dataclasses (STORY-004) are delivered by their own stories.

STORY-009 (ADR-005) adds safe-by-default behaviour: the scrub (winsorize + k-anon + PII gate) runs on the mapping path by DEFAULT; an opt-out unsafe_full_fidelity flag disables the disclosure-control transforms, persists full-fidelity statistics, and stamps unsafe=true on the artifact. Every artifact embeds a self-describing redaction_manifest reporting, per column, what was actually suppressed (rare categories dropped, bounds winsorized, pattern-only columns, the k used, the sensitive flag).

Classes

SafeColumnProfile dataclass

Safe, persisted statistic set for a single column.

Carries ONLY non-raw-bearing statistics. Notably absent (by construction): min_value, max_value, enum_values, value_counts_ext.

Numeric extremes live in bounds (winsorized quantile bounds, ADR-002, populated by STORY-006). Categorical mass lives in categorical_weights (post-k-anon suppression, ADR-003, populated by STORY-007).
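As a rough illustration of how bounds can come from quantiles rather than from raw extremes, here is a minimal sketch. The function name stub_bounds and the "p1" / "p99" quantile keys are assumptions made for this example; the real attribute layout on ColumnProfile is not shown in this page.

```python
from typing import Mapping, Optional


def stub_bounds(quantiles: Optional[Mapping[str, float]]) -> Optional[dict]:
    """Return {"lo": p1, "hi": p99} when both quantiles exist, else None.

    Hypothetical helper: the "p1" / "p99" key names are assumed, and the
    raw min/max of the column are deliberately never consulted.
    """
    if not quantiles:
        return None
    lo = quantiles.get("p1")
    hi = quantiles.get("p99")
    if lo is None or hi is None:
        return None
    return {"lo": lo, "hi": hi}


bounds = stub_bounds({"p1": 2.5, "p50": 40.0, "p99": 97.0})
missing = stub_bounds({"p50": 40.0})
```

Returning None when a quantile is missing keeps the sketch from inventing a bound; the actual hook may handle that case differently.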

Methods:
from_column_profile(col, config=None, row_count=None) classmethod

Map a rich ColumnProfile to a SafeColumnProfile (STORY-002).

Selects ONLY the safe-and-sufficient statistic set. Reads the REAL attribute names on ColumnProfile (min_value/max_value are never read — ADR-002; bounds derive from quantiles). This fixes the B2 attribute-mismatch bug class where the legacy registry read non-existent .min/.max/.top_values.

Disclosure-control transforms are applied via hooks that are STUBS in this story and become real in their owning stories:

  • bounds — winsorized quantile bounds (STORY-006 / ADR-002). Stub here: {"lo": p1, "hi": p99} taken from quantiles if present.
  • categorical_weights: k-anon suppression (STORY-007 / ADR-003). Any value with count < k is folded into a single __OTHER__ bucket. count is derived from the seeded proportion × row_count (the rich enum_values / value_counts_ext carry value->proportion, not raw counts). row_count is threaded in by the table mapper.
  • value-pattern PII gate (STORY-008 / ADR-004). When a column's detected pattern is a PII class (PII_PATTERNS) OR its cardinality is approximately the row count (high-card free-text backstop), the column persists pattern + length_dist ONLY; categorical_weights are dropped and no values are carried. Detection is name-independent (catches PII in notes / c_47). This is DEFENSE-IN-DEPTH, NOT a completeness guarantee (ADR-004 / ADR-011).
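The last two hooks in the list above can be sketched as follows. This is an illustration under stated assumptions, not the module's implementation: fold_rare, is_pattern_only, PII_CLASSES, the rounding of the estimated count, and the 0.95 cardinality ratio are all hypothetical choices made for the example.

```python
from typing import Dict, Optional

OTHER = "__OTHER__"

# Hypothetical stand-in for PII_PATTERNS; the real set is not listed here.
PII_CLASSES = frozenset({"email", "phone", "ssn"})


def fold_rare(weights: Dict[str, float], row_count: int, k: int) -> Dict[str, float]:
    """Fold every value whose estimated count is below k into one bucket."""
    kept: Dict[str, float] = {}
    other = 0.0
    for value, proportion in weights.items():
        # The rich profile carries proportions, so the count is estimated.
        # Rounding to the nearest integer is an assumption of this sketch.
        estimated_count = round(proportion * row_count)
        if estimated_count < k:
            other += proportion
        else:
            kept[value] = proportion
    if other > 0:
        kept[OTHER] = kept.get(OTHER, 0.0) + other
    return kept


def is_pattern_only(
    pattern: Optional[str],
    cardinality: int,
    row_count: int,
    high_card_ratio: float = 0.95,  # assumed meaning of "approximately"
) -> bool:
    """True when the column should persist pattern + length_dist only."""
    if pattern in PII_CLASSES:
        return True
    return row_count > 0 and cardinality >= high_card_ratio * row_count


folded = fold_rare({"A": 0.5, "B": 0.25, "C": 0.125, "D": 0.125}, row_count=16, k=3)
```

With 16 rows and k = 3, values C and D each have an estimated count of 2, so their combined mass of 0.25 moves into __OTHER__ while A and B survive unchanged.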
to_safe_dict()

Serialize to a plain dict. Deterministic key order for byte-stability.
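To make the byte-stability property concrete, the sketch below round-trips a toy dataclass through to_safe_dict / from_safe_dict and compares the encoded bytes. TinyColumn and dump are hypothetical stand-ins written for this example, and encoding with json.dumps(sort_keys=True) is an assumption; this page does not say how the dict is written to disk.

```python
import json
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class TinyColumn:
    """Toy stand-in for a safe column; the field set is illustrative only."""

    name: str
    dtype: str
    bounds: Optional[Dict[str, float]] = None
    categorical_weights: Dict[str, float] = field(default_factory=dict)

    def to_safe_dict(self) -> dict:
        # Keys are emitted in a fixed order and nested weights are sorted,
        # so equal objects always yield the same dict layout.
        return {
            "name": self.name,
            "dtype": self.dtype,
            "bounds": self.bounds,
            "categorical_weights": dict(sorted(self.categorical_weights.items())),
        }

    @classmethod
    def from_safe_dict(cls, data: dict) -> "TinyColumn":
        return cls(
            name=data["name"],
            dtype=data["dtype"],
            bounds=data.get("bounds"),
            categorical_weights=dict(data.get("categorical_weights", {})),
        )


def dump(col: TinyColumn) -> bytes:
    """Encode deterministically: sorted keys, fixed separators, UTF-8."""
    return json.dumps(col.to_safe_dict(), sort_keys=True, separators=(",", ":")).encode("utf-8")


original = TinyColumn("status", "string", None, {"open": 0.75, "closed": 0.25})
first = dump(original)
second = dump(TinyColumn.from_safe_dict(json.loads(first)))
```

If first and second are equal byte strings, a save/load/save cycle produces no diff, which is the property a versioned on-disk artifact needs.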

SafeTableProfile dataclass

Safe, persisted profile for a single table.

Methods:
from_table_profile(table, config=None) classmethod

Map a rich TableProfile to a SafeTableProfile (STORY-002).

One SafeColumnProfile per column. Carries the table-level correlation_matrix, primary_key and advisory detected_fks (names/overlap only — no raw values). Column order is preserved.

to_safe_dict()

Serialize to a plain dict. Columns serialized in declared order.

SafeProfile dataclass

The canonical, versioned, on-disk safe profile (ADR-001).

Top-level transport object. Carries schema_version and an embedded redaction_manifest (populated by STORY-009 — present but empty here).

Methods:
from_dataset_profile(dataset_profile, config=None, unsafe_full_fidelity=False) classmethod

Map a rich DatasetProfile to a SafeProfile (STORY-002 / ADR-001).

Builds one SafeTableProfile per table and one SafeColumnProfile per column, selecting ONLY the safe-and-sufficient statistic set. The rich profile is the source; the returned SafeProfile is the safe transport.

The mapper reads only REAL attribute names on the rich dataclasses (min_value/max_value are never read — bounds derive from quantiles per ADR-002), fixing the B2 attribute-mismatch bug class.

config is an optional per-profile/per-column settings dict threaded to the disclosure-control hooks (winsorization percentiles, k-anon k, PII gate).

Safe-by-default (ADR-005 / STORY-009)

The scrub — winsorized bounds (ADR-002), k-anon __OTHER__ suppression (ADR-003), and the value-pattern PII gate (ADR-004) — runs by DEFAULT. The safe path is the path of least resistance.

unsafe_full_fidelity=True is the explicit, single opt-out. It disables the disclosure-control transforms (k-anon suppression and the PII gate are turned off so full-fidelity categorical weights / values survive) and stamps unsafe=True on the returned profile. Such an artifact is rejected by validate --safe (STORY-010). It is the ONLY way to persist un-scrubbed statistics.
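One way to picture the opt-out is as a single switch that rewrites the effective config before the hooks run. The sketch below is an assumption-laden illustration: effective_config, the default k of 5, and the k_anon_enabled / pii_gate_enabled keys are hypothetical names, not the module's actual settings.

```python
from typing import Optional


def effective_config(config: Optional[dict], unsafe_full_fidelity: bool = False) -> dict:
    """Merge caller settings over safe defaults, then apply the opt-out.

    All key names and the default k are hypothetical; the input dict is
    copied, never mutated.
    """
    merged = {"k": 5, "k_anon_enabled": True, "pii_gate_enabled": True}
    merged.update(config or {})
    if unsafe_full_fidelity:
        # The single opt-out turns off both disclosure controls at once.
        merged["k_anon_enabled"] = False
        merged["pii_gate_enabled"] = False
    return merged


cfg = {"k": 10}
safe_cfg = effective_config(cfg)
unsafe_cfg = effective_config(cfg, unsafe_full_fidelity=True)
```

Keeping the opt-out in one place means no combination of ordinary config values can silently disable the scrub, which matches the "ONLY way" wording above.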

Every returned profile carries an accurate redaction_manifest (built from the rich source vs. the scrubbed safe columns; see build_redaction_manifest).

to_safe_dict()

Serialize to a plain dict with deterministic key order.

Functions:

build_redaction_manifest(dataset_profile, safe_profile, config=None, unsafe=False)

Build the self-describing redaction manifest (ADR-005 / STORY-009).

The manifest is computed from the rich source profile and the scrubbed safe profile together, so it reports what was ACTUALLY suppressed — not what was intended. Accuracy is the AC: every figure is read off the real mapping outcome.

Shape:

{
  "unsafe": <bool>,            # mirrors SafeProfile.unsafe
  "k_default": <int>,          # profile-level k that applied by default
  "tables": {
    <table>: {
      <column>: {
        "categories_dropped": <int>,       # k-anon __OTHER__ folds (rare)
        "bounds_winsorized": <bool>,       # winsorized quantile bounds set
        "pattern_only": <bool>,            # PII-gated to pattern+length only
        "k": <int>,                        # effective k for this column
        "sensitive": <bool>,               # sensitive flag raised k
      }, ...
    }, ...
  }
}

categories_dropped reads the per-column suppressed_category_count that the k-anon hook actually recorded (STORY-007). pattern_only re-evaluates the exact PII-gate decision the mapper used (STORY-008). In unsafe mode the effective config disabled both controls, so these report 0 and False respectively, which is accurate.
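The manifest shape can be assembled from recorded per-column outcomes roughly as below. This is a sketch, not the real build_redaction_manifest: the function build_manifest_sketch and the outcomes input structure (including its has_bounds key) are hypothetical, while the output keys follow the shape shown above.

```python
from typing import Dict


def build_manifest_sketch(
    outcomes: Dict[str, Dict[str, dict]],
    k_default: int,
    unsafe: bool = False,
) -> dict:
    """Assemble a manifest from what the mapping actually recorded.

    `outcomes` maps table -> column -> recorded facts; its layout is an
    assumption of this example. Nothing is recomputed or overridden here,
    so the manifest mirrors the real outcome.
    """
    tables: Dict[str, Dict[str, dict]] = {}
    for table, columns in outcomes.items():
        tables[table] = {}
        for column, seen in columns.items():
            tables[table][column] = {
                "categories_dropped": int(seen["suppressed_category_count"]),
                "bounds_winsorized": bool(seen["has_bounds"]),
                "pattern_only": bool(seen["pattern_only"]),
                "k": int(seen["k"]),
                "sensitive": bool(seen["sensitive"]),
            }
    return {"unsafe": unsafe, "k_default": k_default, "tables": tables}


manifest = build_manifest_sketch(
    {
        "orders": {
            "status": {
                "suppressed_category_count": 2,
                "has_bounds": False,
                "pattern_only": False,
                "k": 5,
                "sensitive": False,
            }
        }
    },
    k_default=5,
)
```

Because the builder only copies recorded facts, an unsafe run whose hooks suppressed nothing naturally reports 0 and False without any special-casing.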