Entity: EditRuleConfig¶
Definition¶
EditRuleConfig is a Pydantic discriminated union of four detection rule types that together implement a minimal Fellegi-Holt edit-and-imputation cascade for NASS-style commodity survey data. Each rule pairs a detection predicate (ratio out of tolerance, range check, null check) with a corrective EditOperation. Rules are declared in YAML under YieldsBuilder.edits and applied sequentially by apply_edits before the feature pivot in the features-assembly stage.
The union is defined at edit.py:361:
```python
EditRuleConfig = Annotated[
    RatioEditRule | RangeEditRule | NullImputeRule | PanelNullImputeRule,
    Field(discriminator="kind"),
]
```
There is a parallel discriminated union for corrective operations (EditOperation at edit.py:242), which has six members documented below.
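To make the two-level discrimination concrete, a rules list in YAML might look like the following sketch. This is illustrative only: the rule names, thresholds, and the exact nesting under `builders.yields.edits` are hypothetical, not copied from a real config; only the field names (`kind`, `name`, `target`, `on_fail`, `operation`, etc.) come from the classes documented below.

```yaml
builders:
  yields:
    edits:
      - kind: ratio_edit            # selects RatioEditRule
        name: yield_balance
        target: yield_kg_ha
        derive: production_kg / area_harvested_ha
        tolerance: 1.25
        on_fail:
          operation: deductive_impute   # selects DeductiveImpute
          source: production_kg / area_harvested_ha
      - kind: range_edit            # selects RangeEditRule
        name: area_positive
        target: area_harvested_ha
        min: 0.0
        on_fail:
          operation: drop           # selects Drop
```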
Kind¶
Pydantic discriminated union (detection rules, discriminator kind). Each member carries its own on_fail: EditOperation field that selects the corrective action.
Source of truth¶
market_insights_models/src/commodity_hindcast/lib/edit_and_imputation/edit.py
Consumed by config.py:190 (YieldsBuilder.edits: list[EditRuleConfig]) and applied at features build time via apply_edits (edit.py:383).
Detection rule members¶
All four rule classes inherit from _EditRuleBase (edit.py:258), which carries:
| Field | Type | Description |
|---|---|---|
| `name` | `str` | Unique label used in `EditReport` and log messages |
| `target` | `str` | Column name on which detection and corrective action act |
| `on_fail` | `EditOperation` | Corrective action to apply on firing rows |
RatioEditRule¶
File:LINE: edit.py:269
Purpose: Fires when target / eval(derive) falls outside the symmetric tolerance band [1/tolerance, tolerance] — the canonical cross-column balance edit (e.g. yield vs production/area).
Pydantic fields:
| Field | Type | Notes |
|---|---|---|
| `kind` | `Literal["ratio_edit"]` | Discriminator |
| `derive` | `str` | `pandas.DataFrame.eval`-compatible expression |
| `tolerance` | `float` | Must be > 1; the band is [1/tol, tol] around 1 |
The _tolerance_gt_one validator (edit.py:281) rejects tolerance <= 1.0 at config load time.
When used: Any YAML rule with kind: ratio_edit under YieldsBuilder.edits. Typical application: checking that reported yield_kg_ha ≈ production_kg / area_harvested_ha to within a configurable threshold.
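The detection predicate can be sketched in plain Python (the real implementation is vectorised over a pandas frame; the function name here is illustrative):

```python
def ratio_fires(target: float, derived: float, tolerance: float) -> bool:
    """Sketch of the RatioEditRule predicate: fire when target/derived
    leaves the symmetric band [1/tolerance, tolerance] around 1."""
    if tolerance <= 1.0:
        # mirrors the _tolerance_gt_one validator's config-time rejection
        raise ValueError("tolerance must be > 1")
    ratio = target / derived
    return not (1.0 / tolerance <= ratio <= tolerance)

# yield 3.0 vs production/area = 2.9: ratio ~= 1.03, inside a 1.1 band
print(ratio_fires(3.0, 2.9, 1.1))  # False: within band, does not fire
print(ratio_fires(3.0, 2.0, 1.1))  # True: ratio 1.5 is above the band
```

Note the band is symmetric in the multiplicative sense: a ratio of 0.8 and a ratio of 1.25 are treated as equally severe under `tolerance: 1.25`.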
RangeEditRule¶
File:LINE: edit.py:306
Purpose: Fires when target falls outside a declared [min, max] interval — the canonical hard-edit constraint (e.g. reject yield values that are physically impossible).
Pydantic fields:
| Field | Type | Notes |
|---|---|---|
| `kind` | `Literal["range_edit"]` | Discriminator |
| `min` | `float \| None` | Lower bound; `None` skips lower check |
| `max` | `float \| None` | Upper bound; `None` skips upper check |
At least one of min / max must be set (_at_least_one_bound validator, edit.py:313). NaN target values do not fire (masked by fillna(False), edit.py:326).
When used: Yield plausibility gates; e.g. reject area_harvested_ha <= 0.
NullImputeRule¶
File:LINE: edit.py:329
Purpose: Fires when target is null. The canonical pairing is on_fail: DeductiveImpute — the first step of the Fellegi-Holt imputation cascade.
Pydantic fields:
| Field | Type | Notes |
|---|---|---|
| `kind` | `Literal["null_impute"]` | Discriminator |
No additional fields beyond _EditRuleBase. The detect_fires implementation is df[self.target].isna() (edit.py:342).
When used: Filling yield_kg_ha when it is null but production_kg and area_harvested_ha are known.
PanelNullImputeRule¶
File:LINE: edit.py:345
Purpose: Fires when target is null; unlike NullImputeRule, the corrective operation may reference other rows in the same panel group (typically geo_identifier × year). Required by PanelTrailingMedian — that operation type-checks for this rule class and raises TypeError if it receives a plain NullImputeRule.
Pydantic fields:
| Field | Type | Default | Notes |
|---|---|---|---|
| `kind` | `Literal["panel_null_impute"]` | — | Discriminator |
| `group_by` | `str` | `"geo_identifier"` | Panel grouping column |
| `order_by` | `str` | `"year"` | Time-ordering column within the group |
When used: Imputing area_harvested_ha from trailing county-level history using PanelTrailingMedian.
Discriminated union members — EditOperation¶
EditOperation (edit.py:242) is the second discriminated union in this module, selecting the corrective action applied to rows where a detection rule fires. It is discriminated on the operation field.
```python
EditOperation = Annotated[
    DeductiveImpute | Clip | Flag | Drop | Fail | PanelTrailingMedian,
    Field(discriminator="operation"),
]
```
Clip¶
File:LINE: edit.py:113
Purpose: Winsorises target to [min, max] on firing rows only; non-firing rows are left unchanged.
Pydantic fields:
| Field | Type | Default | Notes |
|---|---|---|---|
| `operation` | `Literal["clip"]` | `"clip"` | Discriminator |
| `min` | `float \| None` | `None` | Lower clip bound |
| `max` | `float \| None` | `None` | Upper clip bound |
At least one of min / max required (_at_least_one_bound, edit.py:126). The apply method uses pd.Series.clip(lower, upper) on the fires index slice only (edit.py:143).
When used: Capping implausible ratio outliers to a plausible range rather than dropping the row.
Flag¶
File:LINE: edit.py:149
Purpose: Records the fire in EditReport.flags and leaves the value untouched. A diagnostic-only action: no data is altered.
Pydantic fields:
| Field | Type | Notes |
|---|---|---|
| `operation` | `Literal["flag"]` | Discriminator |
apply returns df unchanged (edit.py:162).
When used: Surfacing suspicious values for downstream inspection without altering the pipeline output.
Drop¶
File:LINE: edit.py:165
Purpose: Removes firing rows from the working frame via df.loc[~fires].copy() (edit.py:178). Dropped rows appear as False in later rules' flag columns (because flags is always reindexed to the original input, edit.py:405).
Pydantic fields:
| Field | Type | Notes |
|---|---|---|
| `operation` | `Literal["drop"]` | Discriminator |
When used: Deleting records whose values are so implausible that no imputation is appropriate.
Fail¶
File:LINE: edit.py:181
Purpose: Raises ValueError if any row fires. A hard gate: the pipeline cannot continue if this rule fires.
Pydantic fields:
| Field | Type | Notes |
|---|---|---|
| `operation` | `Literal["fail"]` | Discriminator |
apply calls fires.any() and raises with a message quoting rule.name and the fire count (edit.py:195).
When used: Asserting invariants that must hold before training begins (e.g. a pre-condition that area_harvested_ha is never negative after earlier edits).
PanelTrailingMedian¶
File:LINE: edit.py:201
Purpose: Imputes firing rows with the per-geo trailing median by delegating to impute_missing_area in imputation.py. This is the only operation that crosses row boundaries and the only one that requires a PanelNullImputeRule as its detection rule.
Pydantic fields:
| Field | Type | Default | Notes |
|---|---|---|---|
| `operation` | `Literal["panel_trailing_median"]` | `"panel_trailing_median"` | Discriminator |
| `lookback_years` | `int` | `3` | Number of trailing years for the median window |
| `strictly_causal` | `bool` | `True` | When `True`, firing rows are excluded from the history used to compute the median |
apply uses a local import of imputation.impute_missing_area to avoid a circular import (edit.py:224). When strictly_causal=True, only non-firing rows form the historical panel (edit.py:231).
When used: Imputing missing area_harvested_ha from a county's own trailing history.
DeductiveImpute¶
File:LINE: edit.py:91
Purpose: Replaces target on firing rows with a value derived from pandas.DataFrame.eval(source) — the canonical algebraic imputation step (e.g. yield_kg_ha = production_kg / area_harvested_ha).
Pydantic fields:
| Field | Type | Notes |
|---|---|---|
| `operation` | `Literal["deductive_impute"]` | Discriminator |
| `source` | `str` | `DataFrame.eval`-compatible expression |
apply calls cast(pd.Series, df.eval(self.source)) and assigns the result to out.loc[fires, rule.target] (edit.py:107–109).
When used: Filling null yield_kg_ha from known production_kg and area_harvested_ha; the first step in the Fellegi-Holt cascade.
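The pairing of `NullImputeRule` detection with deductive correction can be sketched in plain Python (a stand-in for the `df.eval(source)` assignment; the function name and row representation are illustrative):

```python
def deductive_impute(rows: list[dict], target: str, compute) -> list[dict]:
    """Sketch of DeductiveImpute driven by a null-impute rule:
    fill `target` on firing (null) rows from an algebraic expression
    over other columns, leaving observed values untouched."""
    for row in rows:
        if row[target] is None:          # detection: fires on null
            row[target] = compute(row)   # stands in for df.eval(source)
    return rows

rows = [
    {"production_kg": 5000.0, "area_harvested_ha": 2.0, "yield_kg_ha": None},
    {"production_kg": 3000.0, "area_harvested_ha": 1.0, "yield_kg_ha": 2900.0},
]
rows = deductive_impute(
    rows, "yield_kg_ha",
    lambda r: r["production_kg"] / r["area_harvested_ha"],
)
print(rows[0]["yield_kg_ha"])  # 2500.0; the observed 2900.0 is untouched
```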
Why these members exist¶
The module follows the Fellegi-Holt edit-and-imputation paradigm (Fellegi & Holt, 1976), adopted by USDA NASS, Statistics Canada, and UNECE for official survey data editing. Under this paradigm, raw survey records first pass through edit rules that detect inconsistencies, then through an imputation cascade that resolves them with minimal alteration — deductive imputation first (algebra from known fields), then donor or regression imputation. The six EditOperation variants in this module implement a practical subset sufficient for commodity-hindcast preprocessing: deductive imputation and panel-trailing-median cover the two imputation tiers; clip and drop handle out-of-range values; flag supports diagnostic inspection without data alteration; fail enforces hard pre-conditions that must hold before any model is trained. This design mirrors the validate / deductive / simputation R packages (Statistics Netherlands, van der Loo and de Jonge) that are the de-facto reference implementations of the paradigm.
Order of evaluation¶
Rules fire in YAML declaration order. apply_edits (edit.py:383) iterates the rules sequence, calling rule.detect_fires(out) against the current working frame out and then rule.on_fail.apply(out, rule, fires). Because each rule receives the output of the previous one, a Drop in position N shrinks the frame seen by rules N+1, N+2, …. The EditReport.flags DataFrame is always reindexed to the original input (edit.py:405), so dropped rows appear as False in all subsequent flag columns and callers can recover the full firing picture against the unchanged input index.
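The sequential semantics — each rule sees the previous rule's output, a drop shrinks the working set, and flags are always recorded against the original index — can be sketched in plain Python (illustrative names; the real `apply_edits` operates on a pandas frame and dispatches through the rule objects):

```python
def apply_edits_sketch(rows: list[dict], rules) -> tuple[list[dict], dict]:
    """Minimal sketch of the cascade: rules run in declaration order;
    'drop' removes firing rows from the working set; flags for every
    rule are reindexed to the original input, so dropped rows show
    False in all later flag columns."""
    working = dict(enumerate(rows))      # original index -> row
    flags: dict[str, list[bool]] = {}
    for name, detect, action in rules:
        fires = {i for i, r in working.items() if detect(r)}
        flags[name] = [i in fires for i in range(len(rows))]
        if action == "drop":
            working = {i: r for i, r in working.items() if i not in fires}
    return list(working.values()), flags

rows = [{"x": -1.0}, {"x": 5.0}, {"x": 500.0}]
rules = [
    ("nonneg", lambda r: r["x"] < 0, "drop"),    # row 0 fires and is dropped
    ("huge",   lambda r: r["x"] > 100, "flag"),  # sees only rows 1 and 2
]
edited, flags = apply_edits_sketch(rows, rules)
print(flags)  # {'nonneg': [True, False, False], 'huge': [False, False, True]}
```

Note how the dropped row 0 appears as `False` in the `huge` column even though it was never evaluated by that rule, matching the reindexing behaviour described above.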
Note on imputation.py¶
edit.py handles row-level edits on raw survey data. Its companion imputation.py provides feature-level NaN fills for the detrenders and regressors. The two modules are complementary and non-overlapping in responsibility: edit.py operates on raw survey tables before the feature pivot; imputation.py (MedianImputer, impute_missing_panel_columns) operates on the assembled feature matrices during FIT and PREDICT. PanelTrailingMedian.apply is the sole coupling point — it calls impute_missing_area from imputation.py via a local import to avoid a circular dependency (edit.py:224). See source: lib for full imputation.py coverage.
Lifecycle¶
- `ExperimentConfig` is loaded from YAML. `YieldsBuilder.edits` deserialises each rule entry via the `EditRuleConfig` discriminated union; the `kind` field selects the concrete rule class.
- During the features-build stage (`cli run features`), the yields builder calls `apply_edits(raw_df, cfg.builders["yields"].edits)`.
- `apply_edits` returns `(edited_df, EditReport)`. The edited frame proceeds to the feature pivot; the report is logged but not persisted to disk.
- The assembled feature parquets (`fit.parquet`, `pred.parquet`) are the artefacts that downstream FIT / PREDICT stages consume.
Relationships¶
- Owned by `FeatureBuilderConfig` via `YieldsBuilder.edits: list[EditRuleConfig]` (config.py:190).
- `PanelTrailingMedian` delegates to `impute_missing_area` in imputation.py (edit.py:232).
- `EditReport` (edit.py:370) is the companion output dataclass produced by `apply_edits` alongside the edited frame.
Concepts and pipelines¶
- Fellegi-Holt edit-and-imputation paradigm (concept, to be written).
- Features pipeline (pipeline, to be written) — the step where `apply_edits` is invoked.
PRs and commits¶
No dedicated PR found for the introduction of EditRuleConfig; the module predates the tracked PR window. PanelTrailingMedian and PanelNullImputeRule were added to support area imputation in the forecast path.
Open questions¶
- Operational thresholds (e.g. ratio tolerance per commodity) are visible in YAML configs but not audited here; a separate page could document the per-commodity rule declarations.
- The `strictly_causal` default of `True` on `PanelTrailingMedian` means firing rows do not contribute to their own imputation history. Whether this is the intended semantics for panel rows in the same year (as opposed to prior years) has not been formally documented.
- `EditReport` is logged but not persisted. If rules fire frequently, a parquet artefact under `run_dir/` would make post-hoc auditing easier.