Entity: EditRuleConfig

Definition

EditRuleConfig is a Pydantic discriminated union of four detection rule types that together implement a minimal Fellegi-Holt edit-and-imputation cascade for NASS-style commodity survey data. Each rule pairs a detection predicate (ratio out of tolerance, range check, null check) with a corrective EditOperation. Rules are declared in YAML under YieldsBuilder.edits and applied sequentially by apply_edits before the feature pivot in the features-assembly stage.

The union is defined at edit.py:361:

EditRuleConfig = Annotated[
    RatioEditRule | RangeEditRule | NullImputeRule | PanelNullImputeRule,
    Field(discriminator="kind"),
]

There is a parallel discriminated union for corrective operations (EditOperation at edit.py:242), which has six members documented below.

Kind

Pydantic discriminated union (detection rules, discriminator kind). Each member carries its own on_fail: EditOperation field that selects the corrective action.

Source of truth

market_insights_models/src/commodity_hindcast/lib/edit_and_imputation/edit.py

Consumed by config.py:190 (YieldsBuilder.edits: list[EditRuleConfig]) and applied at features build time via apply_edits (edit.py:383).

Detection rule members

All four rule classes inherit from _EditRuleBase (edit.py:258), which carries:

| Field | Type | Description |
| --- | --- | --- |
| name | str | Unique label used in EditReport and log messages |
| target | str | Column name on which detection and corrective action act |
| on_fail | EditOperation | Corrective action to apply on firing rows |

RatioEditRule

File: edit.py:269

Purpose: Fires when target / eval(derive) falls outside the symmetric tolerance band [1/tolerance, tolerance] — the canonical cross-column balance edit (e.g. yield vs production/area).

Pydantic fields:

| Field | Type | Notes |
| --- | --- | --- |
| kind | Literal["ratio_edit"] | Discriminator |
| derive | str | pandas.DataFrame.eval-compatible expression |
| tolerance | float | Must be > 1; the band is [1/tol, tol] around 1 |

The _tolerance_gt_one validator (edit.py:281) rejects tolerance <= 1.0 at config load time.

When used: Any YAML rule with kind: ratio_edit under YieldsBuilder.edits. Typical application: checking that reported yield_kg_ha ≈ production_kg / area_harvested_ha to within a configurable threshold.
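
The detection predicate described above can be re-created in a few lines of pandas. This is an illustrative sketch, not the library's detect_fires; the column names follow the doc's yield example.

```python
import pandas as pd

def ratio_fires(df: pd.DataFrame, target: str, derive: str, tolerance: float) -> pd.Series:
    # Fires where target / eval(derive) leaves the band [1/tolerance, tolerance].
    ratio = df[target] / df.eval(derive)
    return ~ratio.between(1.0 / tolerance, tolerance) & ratio.notna()

df = pd.DataFrame({
    "yield_kg_ha": [2.0, 5.0],
    "production_kg": [200.0, 200.0],
    "area_harvested_ha": [100.0, 100.0],
})
fires = ratio_fires(df, "yield_kg_ha", "production_kg / area_harvested_ha", 1.1)
# row 0 has ratio 1.0 (in band); row 1 has ratio 2.5 and fires
```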

RangeEditRule

File: edit.py:306

Purpose: Fires when target falls outside a declared [min, max] interval — the canonical hard-edit constraint (e.g. reject yield values that are physically impossible).

Pydantic fields:

| Field | Type | Notes |
| --- | --- | --- |
| kind | Literal["range_edit"] | Discriminator |
| min | float \| None | Lower bound; None skips lower check |
| max | float \| None | Upper bound; None skips upper check |

At least one of min / max must be set (_at_least_one_bound validator, edit.py:313). NaN target values do not fire (masked by fillna(False), edit.py:326).

When used: Yield plausibility gates; e.g. reject area_harvested_ha <= 0.

NullImputeRule

File: edit.py:329

Purpose: Fires when target is null. The canonical pairing is on_fail: DeductiveImpute — the first step of the Fellegi-Holt imputation cascade.

Pydantic fields:

| Field | Type | Notes |
| --- | --- | --- |
| kind | Literal["null_impute"] | Discriminator |

No additional fields beyond _EditRuleBase. The detect_fires implementation is df[self.target].isna() (edit.py:342).

When used: Filling yield_kg_ha when it is null but production_kg and area_harvested_ha are known.
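
A hypothetical YAML declaration of the canonical pairing (null detection plus deductive imputation); the rule name and expression are invented for illustration, not taken from a real config.

```yaml
- kind: null_impute
  name: fill_yield_from_components
  target: yield_kg_ha
  on_fail:
    operation: deductive_impute
    source: production_kg / area_harvested_ha
```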

PanelNullImputeRule

File: edit.py:345

Purpose: Fires when target is null; unlike NullImputeRule, the corrective operation may reference other rows in the same panel group (typically geo_identifier × year). Required by PanelTrailingMedian — that operation type-checks for this rule class and raises TypeError if it receives a plain NullImputeRule.

Pydantic fields:

| Field | Type | Default | Notes |
| --- | --- | --- | --- |
| kind | Literal["panel_null_impute"] | | Discriminator |
| group_by | str | "geo_identifier" | Panel grouping column |
| order_by | str | "year" | Time-ordering column within the group |

When used: Imputing area_harvested_ha from trailing county-level history using PanelTrailingMedian.
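
A hypothetical declaration pairing this rule with its required operation; the rule name is invented, the defaults mirror the fields documented above.

```yaml
- kind: panel_null_impute
  name: fill_area_from_history
  target: area_harvested_ha
  group_by: geo_identifier
  order_by: year
  on_fail:
    operation: panel_trailing_median
    lookback_years: 3
    strictly_causal: true
```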

Discriminated union members — EditOperation

EditOperation (edit.py:242) is the second discriminated union in this module, selecting the corrective action applied to rows where a detection rule fires. It is discriminated on the operation field.

EditOperation = Annotated[
    DeductiveImpute | Clip | Flag | Drop | Fail | PanelTrailingMedian,
    Field(discriminator="operation"),
]

Clip

File: edit.py:113

Purpose: Winsorises target to [min, max] on firing rows only; non-firing rows are left unchanged.

Pydantic fields:

| Field | Type | Default | Notes |
| --- | --- | --- | --- |
| operation | Literal["clip"] | "clip" | Discriminator |
| min | float \| None | None | Lower clip bound |
| max | float \| None | None | Upper clip bound |

At least one of min / max required (_at_least_one_bound, edit.py:126). The apply method uses pd.Series.clip(lower, upper) on the fires index slice only (edit.py:143).

When used: Capping implausible ratio outliers to a plausible range rather than dropping the row.
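
The firing-rows-only behaviour can be sketched as follows (illustrative bounds and values; not the library's apply implementation):

```python
import pandas as pd

df = pd.DataFrame({"yield_kg_ha": [0.5, 3.0, 12.0]})
fires = pd.Series([True, False, True])

out = df.copy()
# Clip only the rows the detection rule flagged; row 1 is untouched.
out.loc[fires, "yield_kg_ha"] = out.loc[fires, "yield_kg_ha"].clip(lower=1.0, upper=10.0)
# firing rows become 1.0 and 10.0; the non-firing row keeps 3.0
```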

Flag

File: edit.py:149

Purpose: Records the fire in EditReport.flags and leaves the value untouched. A diagnostic-only action: no data is altered.

Pydantic fields:

| Field | Type | Notes |
| --- | --- | --- |
| operation | Literal["flag"] | Discriminator |

apply returns df unchanged (edit.py:162).

When used: Surfacing suspicious values for downstream inspection without altering the pipeline output.

Drop

File: edit.py:165

Purpose: Removes firing rows from the working frame via df.loc[~fires].copy() (edit.py:178). Dropped rows appear as False in later rules' flag columns (because flags is always reindexed to the original input, edit.py:405).

Pydantic fields:

| Field | Type | Notes |
| --- | --- | --- |
| operation | Literal["drop"] | Discriminator |

When used: Deleting records whose values are so implausible that no imputation is appropriate.

Fail

File: edit.py:181

Purpose: Raises ValueError if any row fires. A hard gate: the pipeline halts rather than continue with a violated invariant.

Pydantic fields:

| Field | Type | Notes |
| --- | --- | --- |
| operation | Literal["fail"] | Discriminator |

apply calls fires.any() and raises with a message quoting rule.name and the fire count (edit.py:195).

When used: Asserting invariants that must hold before training begins (e.g. a pre-condition that area_harvested_ha is never negative after earlier edits).
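
The gate reduces to a few lines; this sketch invents the function name and message wording, which are not the library's.

```python
import pandas as pd

def fail_if_fires(fires: pd.Series, rule_name: str) -> None:
    # Hard gate: any firing row aborts the pipeline with a descriptive error.
    if fires.any():
        raise ValueError(f"edit rule {rule_name!r} fired on {int(fires.sum())} rows")
```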

PanelTrailingMedian

File: edit.py:201

Purpose: Imputes firing rows with the per-geo trailing median by delegating to impute_missing_area in imputation.py. This is the only operation that crosses row boundaries and the only one that requires a PanelNullImputeRule as its detection rule.

Pydantic fields:

| Field | Type | Default | Notes |
| --- | --- | --- | --- |
| operation | Literal["panel_trailing_median"] | "panel_trailing_median" | Discriminator |
| lookback_years | int | 3 | Number of trailing years for the median window |
| strictly_causal | bool | True | When True, firing rows are excluded from the history used to compute the median |

apply uses a local import of imputation.impute_missing_area to avoid a circular import (edit.py:224). When strictly_causal=True, only non-firing rows form the historical panel (edit.py:231).

When used: Imputing missing area_harvested_ha from a county's own trailing history.
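
A stand-in for the trailing-median fill, assuming one row per (geo, year) and the 3-year default; this is not impute_missing_area itself, and the shift(1) below approximates strict causality by excluding each row from its own window.

```python
import pandas as pd

def trailing_median_fill(df: pd.DataFrame, target: str = "area_harvested_ha",
                         group_by: str = "geo_identifier", order_by: str = "year",
                         lookback_years: int = 3) -> pd.DataFrame:
    out = df.sort_values([group_by, order_by]).copy()
    # Per-group trailing median over the preceding lookback_years rows.
    hist = (
        out.groupby(group_by)[target]
        .transform(lambda s: s.shift(1).rolling(lookback_years, min_periods=1).median())
    )
    out[target] = out[target].fillna(hist)
    return out
```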

DeductiveImpute

File: edit.py:91

Purpose: Replaces target on firing rows with a value derived from pandas.DataFrame.eval(source) — the canonical algebraic imputation step (e.g. yield_kg_ha = production_kg / area_harvested_ha).

Pydantic fields:

| Field | Type | Notes |
| --- | --- | --- |
| operation | Literal["deductive_impute"] | Discriminator |
| source | str | DataFrame.eval-compatible expression |

apply calls cast(pd.Series, df.eval(self.source)) and assigns the result to out.loc[fires, rule.target] (edit.py:107–109).

When used: Filling null yield_kg_ha from known production_kg and area_harvested_ha; the first step in the Fellegi-Holt cascade.
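
The step the doc describes (evaluate an expression, assign to firing rows only) can be sketched directly; the data values are invented.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "yield_kg_ha": [2.0, np.nan],
    "production_kg": [200.0, 450.0],
    "area_harvested_ha": [100.0, 150.0],
})
fires = df["yield_kg_ha"].isna()
derived = df.eval("production_kg / area_harvested_ha")
# Only firing (null) rows receive the derived value.
df.loc[fires, "yield_kg_ha"] = derived[fires]
# row 1 is filled with 450 / 150 = 3.0; row 0 keeps its reported 2.0
```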

Why these members exist

The module follows the Fellegi-Holt edit-and-imputation paradigm (Fellegi & Holt, 1976), adopted by USDA NASS, Statistics Canada, and UNECE for official survey data editing. Under this paradigm, raw survey records first pass through edit rules that detect inconsistencies, then through an imputation cascade that resolves them with minimal alteration: deductive imputation first (algebra from known fields), then donor or regression imputation. The six EditOperation variants in this module implement a practical subset sufficient for commodity-hindcast preprocessing: deductive imputation and panel-trailing-median cover the two imputation tiers; clip and drop handle out-of-range values; flag supports diagnostic inspection without data alteration; fail enforces hard pre-conditions that must hold before any model is trained. This design mirrors the validate / deductive / simputation R packages (Statistics Netherlands; van der Loo and de Jonge), the de facto reference implementations of the paradigm.

Order of evaluation

Rules fire in YAML declaration order. apply_edits (edit.py:383) iterates the rules sequence, calling rule.detect_fires(out) against the current working frame out and then rule.on_fail.apply(out, rule, fires). Because each rule receives the output of the previous one, a Drop in position N shrinks the frame seen by rules N+1, N+2, …. The EditReport.flags DataFrame is always reindexed to the original input (edit.py:405), so dropped rows appear as False in all subsequent flag columns and callers can recover the full firing picture against the unchanged input index.
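
The loop can be sketched schematically with hypothetical rule/operation objects (this is a re-creation of the behaviour described above, not the library's apply_edits):

```python
import pandas as pd

def apply_edits_sketch(df: pd.DataFrame, rules) -> tuple[pd.DataFrame, pd.DataFrame]:
    out = df.copy()
    flags = {}
    for rule in rules:
        # Each rule sees the output of the previous one.
        fires = rule.detect_fires(out)
        # Flags are reindexed to the original input, so rows dropped by an
        # earlier rule read False in later flag columns.
        flags[rule.name] = fires.reindex(df.index, fill_value=False)
        out = rule.on_fail.apply(out, rule, fires)
    return out, pd.DataFrame(flags, index=df.index)
```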

Note on imputation.py

edit.py handles row-level edits on raw survey data. Its companion imputation.py provides feature-level NaN fills for the detrenders and regressors. The two modules are complementary and non-overlapping in responsibility: edit.py operates on raw survey tables before the feature pivot; imputation.py (MedianImputer, impute_missing_panel_columns) operates on the assembled feature matrices during FIT and PREDICT. PanelTrailingMedian.apply is the sole coupling point — it calls impute_missing_area from imputation.py via a local import to avoid a circular dependency (edit.py:224). See source: lib for full imputation.py coverage.

Lifecycle

  1. ExperimentConfig is loaded from YAML. YieldsBuilder.edits deserialises each rule entry via the EditRuleConfig discriminated union; the kind field selects the concrete rule class.
  2. During the features-build stage (cli run features), the yields builder calls apply_edits(raw_df, cfg.builders["yields"].edits).
  3. apply_edits returns (edited_df, EditReport). The edited frame proceeds to the feature pivot; the report is logged but not persisted to disk.
  4. The assembled feature parquets (fit.parquet, pred.parquet) are the artefacts that downstream FIT / PREDICT stages consume.

Relationships

  • Owned by FeatureBuilderConfig via YieldsBuilder.edits: list[EditRuleConfig] (config.py:190).
  • PanelTrailingMedian delegates to impute_missing_area in imputation.py (edit.py:232).
  • EditReport (edit.py:370) is the companion output dataclass produced by apply_edits alongside the edited frame.

Concepts and pipelines

  • Fellegi-Holt edit-and-imputation paradigm (concept, to be written).
  • Features pipeline (pipeline, to be written) — step where apply_edits is invoked.

PRs and commits

No dedicated PR found for the introduction of EditRuleConfig; the module predates the tracked PR window. PanelTrailingMedian and PanelNullImputeRule were added to support area imputation in the forecast path.

Open questions

  • Operational thresholds (e.g. ratio tolerance per commodity) are visible in YAML configs but not audited here; a separate page could document the per-commodity rule declarations.
  • The strictly_causal default of True on PanelTrailingMedian means firing rows do not contribute to their own imputation history. Whether this is the intended semantics for panel rows in the same year (as opposed to prior years) has not been formally documented.
  • EditReport is logged but not persisted. If rules fire frequently, a parquet artefact under run_dir/ would make post-hoc auditing easier.