safe_profile
sqllocks_spindle.inference.safe_profile
SafeProfile — the versioned, persisted, safe-by-construction profile model.
This is the canonical on-disk transport model (ADR-001 / ADR-007). It is
decoupled from the rich in-memory DatasetProfile / ColumnProfile
(profiler.py): the rich profile is the source, SafeProfile is the
transport.
It carries ONLY the safe-and-sufficient statistic set. By construction it has
NO raw-bearing fields — there is no min_value, max_value,
enum_values or value_counts_ext anywhere on this model. Raw extremes
are replaced by winsorized bounds (populated by STORY-006); rare categories
are suppressed into categorical_weights (populated by STORY-007).
STORY-001 scope: the dataclasses, schema_version, and the byte-stable
to_safe_dict / from_safe_dict round-trip. The mapping from a rich
profile (STORY-002), the ProfileStore save/load path (STORY-003), and the
serialization guard on the rich dataclasses (STORY-004) are delivered by
their own stories.
STORY-009 (ADR-005) adds safe-by-default behaviour: the scrub (winsorize +
k-anon + PII gate) runs on the mapping path by DEFAULT; an opt-out
unsafe_full_fidelity flag disables the disclosure-control transforms,
persists full-fidelity statistics, and stamps unsafe=true on the artifact.
Every artifact embeds a self-describing redaction_manifest reporting, per
column, what was actually suppressed (rare categories dropped, bounds
winsorized, pattern-only columns, the k used, the sensitive flag).
Classes
SafeColumnProfile
dataclass
Safe, persisted statistic set for a single column.
Carries ONLY non-raw-bearing statistics. Notably absent (by construction):
min_value, max_value, enum_values, value_counts_ext.
Numeric extremes live in bounds (winsorized quantile bounds, ADR-002,
populated by STORY-006). Categorical mass lives in categorical_weights
(post-k-anon suppression, ADR-003, populated by STORY-007).
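The "absent by construction" property can be pictured as a dataclass that simply has no raw-bearing fields. The sketch below is illustrative only: SafeColumnSketch, its dtype field, and the shape of length_dist are assumptions, not the real SafeColumnProfile definition. Only the names bounds, categorical_weights, pattern and length_dist come from this page.

```python
from dataclasses import dataclass, field, fields
from typing import Dict, Optional

# Raw-bearing names this page says must never appear on the safe model.
RAW_BEARING = {"min_value", "max_value", "enum_values", "value_counts_ext"}


@dataclass
class SafeColumnSketch:
    """Illustrative stand-in; NOT the real SafeColumnProfile."""

    name: str
    dtype: str  # hypothetical field, for illustration only
    bounds: Optional[Dict[str, float]] = None  # winsorized lo/hi
    categorical_weights: Dict[str, float] = field(default_factory=dict)
    pattern: Optional[str] = None
    length_dist: Dict[str, float] = field(default_factory=dict)  # assumed shape


def raw_bearing_fields(cls) -> set:
    """Return any declared field names that are on the raw-bearing list."""
    return {f.name for f in fields(cls)} & RAW_BEARING
```

A check like raw_bearing_fields(SafeColumnSketch) returning an empty set is one way a test suite could pin the property down.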
Methods:
from_column_profile(col, config=None, row_count=None)
classmethod
Map a rich ColumnProfile to a SafeColumnProfile (STORY-002).
Selects ONLY the safe-and-sufficient statistic set. Reads the REAL
attribute names on ColumnProfile (min_value/max_value are
never read — ADR-002; bounds derive from quantiles). This fixes
the B2 attribute-mismatch bug class where the legacy registry read
non-existent .min/.max/.top_values.
Disclosure-control transforms are applied via hooks that are STUBS in this story and become real in their owning stories:
- bounds: winsorized quantile bounds (STORY-006 / ADR-002). Stub here: {"lo": p1, "hi": p99} taken from quantiles if present.
- categorical_weights: k-anon suppression (STORY-007 / ADR-003). Any value with count < k is folded into a single __OTHER__ bucket. count is derived from the seeded proportion × row_count (the rich enum_values / value_counts_ext carry value -> proportion, not raw counts). row_count is threaded in by the table mapper.
- value-pattern PII gate (STORY-008 / ADR-004). When a column's detected pattern is a PII class (PII_PATTERNS) OR its cardinality is approximately the row count (high-card free-text backstop), the column persists pattern + length_dist ONLY: categorical_weights are dropped and no values are carried. Detection is name-independent (it catches PII in columns named notes or c_47). This is DEFENSE-IN-DEPTH, NOT a completeness guarantee (ADR-004 / ADR-011).
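The three hooks described above can be sketched as small pure functions. This is a minimal sketch under stated assumptions, not the module's code: the function names, the "p1"/"p99" quantile keys, the contents of PII_PATTERNS, and the 0.9 cardinality ratio standing in for "approximately the row count" are all hypothetical.

```python
from typing import Dict, Mapping, Optional, Tuple

OTHER = "__OTHER__"
# Placeholder pattern classes; the real PII_PATTERNS contents are not shown here.
PII_PATTERNS = frozenset({"email", "phone", "ssn"})


def stub_bounds(quantiles: Optional[Mapping[str, float]]) -> Optional[Dict[str, float]]:
    """Return {"lo": p1, "hi": p99} when both quantiles exist, else None."""
    if not quantiles or "p1" not in quantiles or "p99" not in quantiles:
        return None
    return {"lo": quantiles["p1"], "hi": quantiles["p99"]}


def fold_rare(
    proportions: Mapping[str, float], row_count: int, k: int
) -> Tuple[Dict[str, float], int]:
    """Fold values whose estimated count is below k into one __OTHER__ bucket.

    The count is estimated as proportion * row_count, because the input
    carries proportions rather than raw counts.
    """
    kept: Dict[str, float] = {}
    other_mass = 0.0
    dropped = 0
    for value, proportion in proportions.items():
        if round(proportion * row_count) < k:
            other_mass += proportion
            dropped += 1
        else:
            kept[value] = proportion
    if dropped:
        kept[OTHER] = kept.get(OTHER, 0.0) + other_mass
    return kept, dropped


def is_pattern_only(
    pattern: Optional[str], cardinality: int, row_count: int, ratio: float = 0.9
) -> bool:
    """PII gate: a PII pattern class, or cardinality close to the row count."""
    if pattern in PII_PATTERNS:
        return True
    return row_count > 0 and cardinality >= ratio * row_count
```

Note that the gate never looks at the column name, which matches the name-independent detection described above. The second return value of fold_rare plays the role of a suppressed-category count.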
to_safe_dict()
Serialize to a plain dict. Deterministic key order for byte-stability.
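To illustrate what byte-stability means in practice, here is a minimal sketch of a to_safe_dict / from_safe_dict round-trip. ColumnSketch and the dump helper are hypothetical, and the use of json.dumps with sort_keys is an assumption about one way to get stable bytes, not a statement of how the module encodes its output.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class ColumnSketch:
    """Illustrative stand-in; NOT the real SafeColumnProfile."""

    name: str
    bounds: Optional[Dict[str, float]] = None
    categorical_weights: Dict[str, float] = field(default_factory=dict)

    def to_safe_dict(self) -> Dict[str, Any]:
        # Keys are emitted in a fixed order; nested weights are sorted so
        # insertion order in the source never leaks into the output.
        return {
            "name": self.name,
            "bounds": self.bounds,
            "categorical_weights": dict(sorted(self.categorical_weights.items())),
        }

    @classmethod
    def from_safe_dict(cls, data: Dict[str, Any]) -> "ColumnSketch":
        return cls(
            name=data["name"],
            bounds=data["bounds"],
            categorical_weights=dict(data["categorical_weights"]),
        )


def dump(col: ColumnSketch) -> bytes:
    """Encode with sorted keys and fixed separators for stable bytes."""
    text = json.dumps(col.to_safe_dict(), sort_keys=True, separators=(",", ":"))
    return text.encode("utf-8")
```

With sort_keys=True the fixed order inside to_safe_dict is redundant for the encoded bytes; it is kept here so the plain dict itself is also deterministic.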
SafeTableProfile
dataclass
Safe, persisted profile for a single table.
Methods:
from_table_profile(table, config=None)
classmethod
Map a rich TableProfile to a SafeTableProfile (STORY-002).
One SafeColumnProfile per column. Carries the table-level
correlation_matrix, primary_key and advisory detected_fks
(names/overlap only — no raw values). Column order is preserved.
to_safe_dict()
Serialize to a plain dict. Columns serialized in declared order.
SafeProfile
dataclass
The canonical, versioned, on-disk safe profile (ADR-001).
Top-level transport object. Carries schema_version and an embedded
redaction_manifest (populated by STORY-009 — present but empty here).
Methods:
from_dataset_profile(dataset_profile, config=None, unsafe_full_fidelity=False)
classmethod
Map a rich DatasetProfile to a SafeProfile (STORY-002 / ADR-001).
Builds one SafeTableProfile per table and one SafeColumnProfile
per column, selecting ONLY the safe-and-sufficient statistic set. The
rich profile is the source; the returned SafeProfile is the safe
transport.
The mapper reads only REAL attribute names on the rich dataclasses
(min_value/max_value are never read — bounds derive from
quantiles per ADR-002), fixing the B2 attribute-mismatch bug class.
config is an optional per-profile/per-column settings dict threaded
to the disclosure-control hooks (winsorization percentiles, k-anon k,
PII gate).
Safe-by-default (ADR-005 / STORY-009)
The scrub — winsorized bounds (ADR-002), k-anon __OTHER__
suppression (ADR-003), and the value-pattern PII gate (ADR-004) — runs
by DEFAULT. The safe path is the path of least resistance.
unsafe_full_fidelity=True is the explicit, single opt-out. It
disables the disclosure-control transforms (k-anon suppression and the
PII gate are turned off so full-fidelity categorical weights / values
survive) and stamps unsafe=True on the returned profile. Such an
artifact is rejected by validate --safe (STORY-010). It is the ONLY
way to persist un-scrubbed statistics.
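One way to picture the single opt-out is as a function that derives an effective config before any hook runs. This is a sketch under assumptions: effective_config and the "k" and "pii_gate" keys are hypothetical names chosen for illustration, and the real config schema is not shown on this page.

```python
from typing import Any, Dict, Optional


def effective_config(
    config: Optional[Dict[str, Any]], unsafe_full_fidelity: bool = False
) -> Dict[str, Any]:
    """Return the config the hooks would see (hypothetical keys).

    On the safe path the caller's settings pass through unchanged. On the
    opt-out path both disclosure controls are switched off.
    """
    cfg = dict(config or {})
    if unsafe_full_fidelity:
        cfg["k"] = 0  # a count is never below 0, so nothing is folded
        cfg["pii_gate"] = False  # the pattern-only gate is skipped
    return cfg
```

Keeping the switch in one place means the safe path needs no arguments at all, which is what makes it the path of least resistance.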
Every returned profile carries an accurate redaction_manifest (built
from the rich source vs. the scrubbed safe columns; see
build_redaction_manifest).
to_safe_dict()
Serialize to a plain dict with deterministic key order.
Functions:
build_redaction_manifest(dataset_profile, safe_profile, config=None, unsafe=False)
Build the self-describing redaction manifest (ADR-005 / STORY-009).
The manifest is computed from the rich source profile and the scrubbed safe profile together, so it reports what was ACTUALLY suppressed, not what was intended. Accuracy is the acceptance criterion (AC): every figure is read off the real mapping outcome.
Shape::

    {
      "unsafe": <bool>,                      # mirrors SafeProfile.unsafe
      "k_default": <int>,                    # profile-level k that applied by default
      "tables": {
        <table>: {
          <column>: {
            "categories_dropped": <int>,     # k-anon __OTHER__ folds (rare)
            "bounds_winsorized": <bool>,     # winsorized quantile bounds set
            "pattern_only": <bool>,          # PII-gated to pattern+length only
            "k": <int>,                      # effective k for this column
            "sensitive": <bool>,             # sensitive flag raised k
          }, ...
        }, ...
      }
    }

categories_dropped (the rare-category count) reads the per-column
suppressed_category_count that the k-anon hook actually recorded (STORY-007).
pattern_only re-evaluates the exact PII-gate decision the mapper used
(STORY-008). In unsafe mode the effective config disables both controls,
so these fields accurately report 0 and False respectively.
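The manifest shape above can be assembled with a couple of small helpers. The sketch below is illustrative: column_entry and build_manifest_sketch are hypothetical names, and the per-column inputs (a dict with suppressed_category_count, bounds, pattern_only, k and sensitive) are an assumed intermediate form, not the real arguments of build_redaction_manifest.

```python
from typing import Any, Dict, Mapping


def column_entry(col: Mapping[str, Any]) -> Dict[str, Any]:
    """Build one per-column manifest entry from recorded outcomes."""
    return {
        "categories_dropped": int(col.get("suppressed_category_count", 0)),
        "bounds_winsorized": col.get("bounds") is not None,
        "pattern_only": bool(col.get("pattern_only", False)),
        "k": int(col["k"]),
        "sensitive": bool(col.get("sensitive", False)),
    }


def build_manifest_sketch(
    tables: Mapping[str, Mapping[str, Mapping[str, Any]]],
    k_default: int,
    unsafe: bool = False,
) -> Dict[str, Any]:
    """Assemble the top-level manifest: unsafe flag, k_default, tables."""
    return {
        "unsafe": unsafe,
        "k_default": k_default,
        "tables": {
            table: {name: column_entry(col) for name, col in columns.items()}
            for table, columns in tables.items()
        },
    }
```

Because every figure is read from what the hooks recorded, a column that had nothing folded reports 0 rather than an intended threshold.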