
Detrending Subsystem

Overview

The detrending subsystem removes long-run yield trends from county-level feature tables before regression, then re-applies the trend to bring predictions back to absolute scale. All three concrete detrenders inherit from a single abstract base (AbstractDetrend, base.py:21) and share one trend-axis convention defined in time_axis.py. A factory function (build.py:54) instantiates the correct detrender from the config.model.detrend string key.

The regression pipeline calls fit_transform on training data and serialises the fitted detrender to detrender.pkl. It then calls transform on any held-out fold, and models/regression/runtime.py:176 calls inverse_transform to retrend model predictions. As a result, every yield figure the regressor sees is trend-adjusted, and every output it produces is converted back to absolute kg/ha before delivery.

Modules

time_axis.py: TrendAxis and TREND_AXIS

TrendAxis (time_axis.py:48) is a frozen dataclass owning the x-axis convention for every yield-trend fit: an epoch (default 1980-01-01) and a unit ('year' or 'day'). The project-wide singleton TREND_AXIS (time_axis.py:91) uses unit='year' with epoch 1980-01-01. All detrenders import this singleton so that slopes reported in configs and logs are always in yield-per-calendar-year units, matching the QUBE-sprint TrendFitter convention.

Key methods:

  • to_x(years) — converts integer calendar years to x-axis values using pd.to_datetime for leap-year-exact day counts, then divides by 365.25 when unit='year' (time_axis.py:69).
  • slope_per_year_to_native(slope) / slope_native_to_per_year(slope) — exact inverses for converting between config units and the internal fit unit (time_axis.py:79–85).
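
The axis convention above can be illustrated with a minimal sketch. It is not the project's TrendAxis: the class name TrendAxisSketch is hypothetical, the standard-library datetime module stands in for pd.to_datetime, and the slope conversions assume that the only difference between units is the 365.25 divisor quoted above.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, List

DAYS_PER_YEAR = 365.25  # divisor quoted in the text for unit='year'


@dataclass(frozen=True)
class TrendAxisSketch:
    """Hypothetical stand-in for TrendAxis; not the project's class."""

    epoch: date = date(1980, 1, 1)
    unit: str = "year"  # 'year' or 'day'

    def to_x(self, years: Iterable[int]) -> List[float]:
        # Exact day count from the epoch to 1 January of each year,
        # so leap years contribute 366 days.
        days = [(date(int(y), 1, 1) - self.epoch).days for y in years]
        if self.unit == "year":
            return [d / DAYS_PER_YEAR for d in days]
        return [float(d) for d in days]

    def slope_per_year_to_native(self, slope: float) -> float:
        # Assumption: per-year slopes are already native when unit='year'.
        return slope if self.unit == "year" else slope / DAYS_PER_YEAR

    def slope_native_to_per_year(self, slope: float) -> float:
        return slope if self.unit == "year" else slope * DAYS_PER_YEAR


axis = TrendAxisSketch()
x = axis.to_x([1980, 1981, 2000])
```

Because 1980 is a leap year, the 1981 value is 366 / 365.25, slightly more than 1, while the epoch itself maps to 0.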

base.py: AbstractDetrend

AbstractDetrend (base.py:21) is the ABC that every detrender implements. It stores a single config: ExperimentConfig and mandates:

Abstract member | Purpose
fit_transform(features) | Fit on training data and return a copy with target_detrended_col added.
transform(features) | Apply already-fitted trend; raises RuntimeError if not yet fitted.
inverse_transform(features, y_detrended) | Add trend back to detrended predictions.
fitted_yield_series(features) | Return model-implied trend level per row (for diagnostics).
is_fitted (property) | Boolean; guards transform and save.
target_yield_column (property) | Name of the raw yield column in feature tables.
save(path) / load(cls, path, config) | Serialise / deserialise fitted state; path supports s3:// via vfs.local_context.

target_detrended_column (base.py:44) is non-abstract and delegates to config.commodity.target_detrended_col. state_column (base.py:48) returns the fixed string "state_name".

A free function validate_detrender (base.py:77) checks that an object is an AbstractDetrend instance and raises TypeError otherwise.
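
The fitted-state guard and the type check can be sketched as follows. This is a reduced illustration, not the real base class: AbstractDetrendSketch, MeanDetrend, and validate_detrender_sketch are hypothetical names, the contract is cut down to three members, and plain lists replace the feature DataFrames.

```python
from abc import ABC, abstractmethod


class AbstractDetrendSketch(ABC):
    """Hypothetical reduced contract; the real ABC has more members."""

    @property
    @abstractmethod
    def is_fitted(self) -> bool: ...

    @abstractmethod
    def fit_transform(self, yields): ...

    @abstractmethod
    def transform(self, yields): ...


class MeanDetrend(AbstractDetrendSketch):
    """Toy detrender: the 'trend' is just the training mean."""

    def __init__(self):
        self._mean = None

    @property
    def is_fitted(self):
        return self._mean is not None

    def fit_transform(self, yields):
        self._mean = sum(yields) / len(yields)
        return self.transform(yields)

    def transform(self, yields):
        # Mirrors the documented behaviour: transform before fit is an error.
        if not self.is_fitted:
            raise RuntimeError("transform called before fit_transform")
        return [y - self._mean for y in yields]


def validate_detrender_sketch(obj):
    # Reject anything that does not implement the detrender contract.
    if not isinstance(obj, AbstractDetrendSketch):
        raise TypeError(f"expected a detrender, got {type(obj).__name__}")
    return obj
```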

build.py: build_detrender

build_detrender(config) (build.py:54) reads config.model.detrend and dispatches:

Key | Class
"linear_state" | LinearStateDetrend
"gaussian_state" | GaussianWindowStateDetrend
"partial_pooling" | PartialPoolingDetrend

Only "partial_pooling" accepts detrend_params from the config (build.py:18–51): allowlisted keys are min_obs, detrend_fixed_slope_bu_ac, and mad_national_modified_z. Unknown keys emit a warning. Any other detrend value raises ValueError (build.py:70).
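
A minimal sketch of this dispatch-plus-allowlist pattern is shown below. The function resolve_detrender is hypothetical and returns a class name and keyword arguments rather than instantiating anything. Two behaviours are assumptions not stated above: unknown keys are dropped after the warning, and detrend_params supplied for the other two detrenders are ignored silently.

```python
import warnings
from typing import Any, Dict, Optional, Tuple

# Dispatch table and allowlist as described in the text.
DETRENDER_CLASS_NAMES = {
    "linear_state": "LinearStateDetrend",
    "gaussian_state": "GaussianWindowStateDetrend",
    "partial_pooling": "PartialPoolingDetrend",
}
ALLOWED_PARTIAL_POOLING_KEYS = {
    "min_obs",
    "detrend_fixed_slope_bu_ac",
    "mad_national_modified_z",
}


def resolve_detrender(
    key: str, detrend_params: Optional[Dict[str, Any]] = None
) -> Tuple[str, Dict[str, Any]]:
    """Hypothetical helper: map a config key to (class name, kwargs)."""
    if key not in DETRENDER_CLASS_NAMES:
        raise ValueError(f"unknown detrend key: {key!r}")
    kwargs: Dict[str, Any] = {}
    if key == "partial_pooling":
        for name, value in (detrend_params or {}).items():
            if name in ALLOWED_PARTIAL_POOLING_KEYS:
                kwargs[name] = value
            else:
                # Assumption: unknown keys are dropped after warning.
                warnings.warn(f"ignoring unknown detrend_params key: {name!r}")
    return DETRENDER_CLASS_NAMES[key], kwargs


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    name, kwargs = resolve_detrender(
        "partial_pooling", {"min_obs": 5, "typo_key": 1}
    )
```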

linear_state_detrend.py: LinearStateDetrend

State-space formulation. For each state, an area-weighted mean yield is computed across all counties per calendar year. OLS (numpy.polyfit) fits a straight line ŷ(t) = m·x(t) + b where x(t) = TREND_AXIS.to_x(t). The per-state (slope, intercept) pair is stored in _state_trends. A national fallback trend is also fitted on the full dataset and used via NationalFallbackTrendImputer for rows whose state was not seen during training (linear_state_detrend.py:159–169).
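
The per-state fit can be sketched for a single state as below. This is an illustration under simplifying assumptions, not the class's code: fit_state_line is a hypothetical name, plain arrays replace the feature table, np.average stands in for the project's area_weighted_mean, and the x-axis is whole years since 1980 rather than TREND_AXIS.to_x.

```python
import numpy as np


def fit_state_line(years, yields, areas, epoch_year=1980):
    """Hypothetical helper: OLS line through area-weighted annual means."""
    years = np.asarray(years)
    yields = np.asarray(yields, dtype=float)
    areas = np.asarray(areas, dtype=float)
    unique_years = np.unique(years)
    # Area-weighted mean yield across counties for each calendar year.
    annual = np.array([
        np.average(yields[years == y], weights=areas[years == y])
        for y in unique_years
    ])
    # Simplified x-axis: whole years since the epoch (assumption).
    x = unique_years - epoch_year
    slope, intercept = np.polyfit(x, annual, deg=1)
    return float(slope), float(intercept)


# Two counties per year; the larger county (area 3) sits 10 kg/ha above
# the state trend and the smaller one (area 1) sits 30 kg/ha below it,
# so the weighted mean lies exactly on 4000 + 50 * (year - 1980).
years = [2000, 2000, 2001, 2001, 2002, 2002]
yields = [5010.0, 4970.0, 5060.0, 5020.0, 5110.0, 5070.0]
areas = [3.0, 1.0, 3.0, 1.0, 3.0, 1.0]
slope, intercept = fit_state_line(years, yields, areas)
```

An unweighted mean would place the annual level 10 kg/ha too low in every year of this example, which shifts the intercept but not the slope.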

Hyperparameters. None that are config-driven; trend_imputer defaults to NationalFallbackTrendImputer() and can be overridden in the constructor (linear_state_detrend.py:35).

Key signatures.

# linear_state_detrend.py:173
def fit_transform(self, features: pd.DataFrame) -> pd.DataFrame: ...

# linear_state_detrend.py:209
def transform(self, features: pd.DataFrame) -> pd.DataFrame: ...

# linear_state_detrend.py:239
def inverse_transform(self, features: pd.DataFrame, y_detrended: np.ndarray) -> pd.Series: ...

Persistence. Serialised state contains _state_trends and _national_trend (linear_state_detrend.py:263). Written and read via joblib under vfs.local_context so S3 paths are handled transparently.

When preferred. Fast training baseline; appropriate for commodities or geographies where the trend is well-approximated by a straight line over the training window and where county-level noise is not a concern.

gaussian_window_state_detrend.py: GaussianWindowStateDetrend

State-space formulation. Rather than fitting a parametric line, per-state trend levels are read from a Gaussian-smoothed series of area-weighted annual mean yields. The pipeline is: (1) IQR trim to the P25–P75 band per year to remove outlier counties; (2) area-weighted annual mean; (3) fill a dense integer-year grid by linear interpolation then forward/back fill; (4) apply scipy.ndimage.gaussian_filter1d (symmetric) or a causal half-Gaussian (_half_gaussian_causal_smooth_1d) in the year dimension. The resulting pd.Series indexed by integer year is stored per state in _state_smooth (gaussian_window_state_detrend.py:153). Rows in unseen states fall back to _national_smooth via NationalFallbackTrendImputer. This design is intentionally aligned with the legacy WASDE FittedYieldDetrendStateMA (gaussian_window_state_detrend.py:117–131).
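
Steps (3) and (4) of the pipeline above can be sketched in NumPy. The function below is a guess at what a causal half-Gaussian smoother might look like, not the project's _half_gaussian_causal_smooth_1d: the name half_gaussian_causal_smooth is hypothetical, and renormalising the weights over the available past years at the start of the series is an assumption.

```python
import numpy as np


def half_gaussian_causal_smooth(values, sigma=4.0, truncate=8.0):
    """Hypothetical causal smoother: weights only the current and past years."""
    values = np.asarray(values, dtype=float)
    radius = int(truncate * sigma + 0.5)   # kernel reach in years
    lags = np.arange(radius + 1)           # 0 = current year, 1 = previous year
    weights = np.exp(-0.5 * (lags / sigma) ** 2)
    out = np.empty_like(values)
    for t in range(len(values)):
        k = min(t, radius)                 # number of past years available
        w = weights[: k + 1]
        window = values[t - k : t + 1][::-1]  # window[0] is year t
        # Assumption: renormalise over the lags that actually exist.
        out[t] = np.dot(w, window) / w.sum()
    return out


# Step (3): fill a dense integer-year grid by linear interpolation.
obs_years = np.array([2000, 2001, 2003, 2004])
obs_means = np.array([5000.0, 5050.0, 5150.0, 5200.0])
grid = np.arange(obs_years.min(), obs_years.max() + 1)
dense = np.interp(grid, obs_years, obs_means)

# Step (4): causal smoothing in the year dimension.
smooth = half_gaussian_causal_smooth(dense, sigma=4.0)
```

Because each output year uses only that year and earlier ones, changing the final observation leaves every earlier smoothed value untouched, which is the look-ahead property the causal kernel is meant to provide.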

Hyperparameters (class-level attributes, overridable before fit_transform):

Attribute Default Meaning
gaussian_sigma 4.0 Smoothing width in years
gaussian_truncate 8.0 Kernel truncation radius (multiples of sigma)
gaussian_mode "nearest" Boundary handling for gaussian_filter1d
gaussian_kernel "symmetric" "symmetric" (two-sided) or "half_left" (causal)
mean_yield_q_lo 0.25 Lower IQR trim quantile
mean_yield_q_hi 0.75 Upper IQR trim quantile

Key signatures.

# gaussian_window_state_detrend.py:231
def fit_transform(self, features: pd.DataFrame) -> pd.DataFrame: ...

# gaussian_window_state_detrend.py:280
def transform(self, features: pd.DataFrame) -> pd.DataFrame: ...

# gaussian_window_state_detrend.py:309
def inverse_transform(self, features: pd.DataFrame, y_detrended: np.ndarray) -> pd.Series: ...

Persistence. Serialised state includes _national_smooth, _state_smooth, and all six hyperparameter attributes (gaussian_window_state_detrend.py:327–345). Loaded values restore all attributes so the same smooth is reproducible without re-fitting.

When preferred. Smoother, more local trend estimate than a global OLS line; tolerates non-linear technology-yield trajectories. The causal (half_left) kernel variant avoids look-ahead when the test year is near the training boundary.

partial_pooling_detrend.py: PartialPoolingDetrend

State-space formulation. This is an Empirical-Bayes (James-Stein) hierarchical linear model operating at county rather than state granularity. The four-stage fit (partial_pooling_detrend.py:206–361) is:

  1. National slope (EB prior) — area-weighted national annual means → OLS slope, optionally MAD-filtered (mad_national_modified_z) to remove extreme years, or replaced entirely by a fixed config value (detrend_fixed_slope_bu_ac, converted to kg/ha on construction).
  2. Per-county OLS — vectorised demeaned regression for counties with at least min_obs observations: slope = Σ(dx·dy)/Σ(dx²), SE(slope) = sqrt(RSS / (n−2) / Sxx). Counties below min_obs receive the national slope directly.
  3. James-Stein shrinkage — between-county variance τ² = max(0, Var(slopes) − mean(SE²)); shrinkage weight λᵢ = τ²/(τ² + SEᵢ²); EB slope = λᵢ·county_slope + (1−λᵢ)·national_slope. Counties with high SE (few data points or noisy series) are pulled strongly toward the national prior.
  4. Per-county intercepts: mean(y − eb_slope·x) per county, ensuring the trend line passes through each county's mean.
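
Stages 2 to 4 can be sketched with the formulas above. The helper names county_ols and eb_shrink are hypothetical, the national slope is taken as given rather than fitted, the sample variance uses ddof=1 (the text only says Var(slopes)), and setting every λ to 0 when τ² is 0 is an assumption made to avoid a 0/0 division.

```python
import numpy as np


def county_ols(x, y):
    """Stage 2: demeaned OLS slope and its standard error for one county."""
    dx, dy = x - x.mean(), y - y.mean()
    sxx = np.sum(dx ** 2)
    slope = np.sum(dx * dy) / sxx
    rss = np.sum((dy - slope * dx) ** 2)
    se = np.sqrt(rss / (len(x) - 2) / sxx)  # needs at least 3 observations
    return slope, se


def eb_shrink(slopes, ses, national_slope):
    """Stage 3: shrink county slopes toward the national slope."""
    # ddof=1 is an assumption; the text only says Var(slopes).
    tau2 = max(0.0, float(np.var(slopes, ddof=1) - np.mean(ses ** 2)))
    if tau2 == 0.0:
        lam = np.zeros_like(slopes)  # no between-county signal: full pooling
    else:
        lam = tau2 / (tau2 + ses ** 2)
    return lam * slopes + (1.0 - lam) * national_slope, lam, tau2


# Three synthetic counties; the third has a much noisier series.
x = np.arange(10, dtype=float)
wiggle = np.where(np.arange(10) % 2 == 0, 1.0, -1.0)
true_slopes = [40.0, 55.0, 70.0]
amplitudes = [5.0, 5.0, 100.0]
series = [4000.0 + m * x + a * wiggle for m, a in zip(true_slopes, amplitudes)]

fits = np.array([county_ols(x, y) for y in series])
slopes, ses = fits[:, 0], fits[:, 1]
national_slope = 55.0  # assumed given; stage 1 is not reproduced here
eb_slopes, lam, tau2 = eb_shrink(slopes, ses, national_slope)

# Stage 4: intercepts so each trend line passes through the county mean.
intercepts = np.array([np.mean(y - m * x) for y, m in zip(series, eb_slopes)])
```

In this example the noisy third county has the largest standard error and therefore the smallest λ, so its slope moves furthest toward the national value.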

Unseen counties at prediction time fall back to the national trend slope via NationalFallbackTrendImputer (partial_pooling_detrend.py:418–425).

Hyperparameters (constructor keyword arguments, forwarded from build.py):

Parameter | Default | Meaning
min_obs | 3 | Minimum county observations for OLS; below this → national slope.
detrend_fixed_slope_bu_ac | None | Override the EB prior with a fixed slope (bu/ac/yr).
mad_national_modified_z | None | Modified-Z threshold for MAD filtering of national annual means before OLS.

Key signatures.

# partial_pooling_detrend.py:183
def fit_transform(self, features: pd.DataFrame) -> pd.DataFrame: ...

# partial_pooling_detrend.py:196
def transform(self, features: pd.DataFrame) -> pd.DataFrame: ...

# partial_pooling_detrend.py:429
def inverse_transform(self, features: pd.DataFrame, y_detrended: np.ndarray) -> pd.Series: ...

Diagnostics. The .parameters property (partial_pooling_detrend.py:159) returns a lazily-built, cached pd.DataFrame with one row per county: slope_per_year, intercept, national_slope_per_year, raw_slope_per_year, eb_lambda, train_years, n_train_years, train_year_min, train_year_max. All slopes are expressed in yield-per-year via TREND_AXIS.slope_native_to_per_year.

Persistence. Serialised state stores all fitted dicts (_county_slopes, _county_intercepts, _eb_raw_slopes, _eb_lambdas, _county_train_years), scalar fit statistics (_national_slope, _tau2), and the constructor hyperparameters (partial_pooling_detrend.py:447–472). The fixed slope is stored already converted to kg/ha.

When preferred. Empirical-Bayes pooling across counties; the recommended choice when county sample sizes vary widely (e.g. sparse rural counties alongside dense agricultural belts), because counties with few or noisy observations borrow strength from the national prior rather than producing noisy OLS slopes.

Integration with Regression

The regression pipeline (run/experiment_protocol.py:38–65) calls train(...) which fits and serialises the detrender; then fold.load_detrender(config) rehydrates it. detrender.transform(train_data) adds target_detrended_col to the feature table, which the regressor trains on. At prediction time, models/regression/runtime.predict (runtime.py:140–184) calls detrender.inverse_transform(data_regression, sim_detrended) to convert the model's detrended output back to absolute yield. This two-step bracket — fit_transform before fitting the regressor, inverse_transform after scoring — means the regressor never sees raw trend-contaminated yields and every output is in physically meaningful kg/ha units.
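
The detrend/retrend bracket can be sketched end to end on toy data. The functions trend_level, detrend, and retrend are hypothetical, a single linear trend stands in for a fitted detrender, and detrending is assumed to be subtraction of the trend level, inferred from inverse_transform being described as adding the trend back.

```python
import numpy as np


def trend_level(years, slope, intercept, epoch_year=1980):
    """Trend level in kg/ha for each year (simplified whole-year axis)."""
    return slope * (np.asarray(years, dtype=float) - epoch_year) + intercept


def detrend(years, yields, slope, intercept):
    # Assumed form of transform: subtract the trend level.
    return np.asarray(yields, dtype=float) - trend_level(years, slope, intercept)


def retrend(years, y_detrended, slope, intercept):
    # Documented form of inverse_transform: add the trend back.
    return np.asarray(y_detrended, dtype=float) + trend_level(years, slope, intercept)


# Fit the trend on training years and detrend the training target.
train_years = np.array([2015, 2016, 2017, 2018, 2019])
train_yields = np.array([5400.0, 5480.0, 5420.0, 5560.0, 5590.0])
slope, intercept = np.polyfit(train_years - 1980, train_yields, 1)
train_detrended = detrend(train_years, train_yields, slope, intercept)

# Stand-in regressor: predicts the mean detrended anomaly for 2021.
predicted_anomaly = np.array([train_detrended.mean()])

# Convert the detrended prediction back to absolute yield.
forecast = retrend([2021], predicted_anomaly, slope, intercept)
```

Retrending the detrended training target reproduces the original yields, which is the invariant the pipeline relies on when it reports predictions in absolute units.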

Cross-references

  • Config source: config.model.detrend and config.model.detrend_params control dispatch.
  • run_fit stage: calls train(...) which invokes fit_transform and persists detrender.pkl.
  • regression runtime: the predict function calls inverse_transform (runtime.py:176).
  • lib/edit_and_imputation/imputation.py: NationalFallbackTrendImputer, TrendImputer, partition_groups_by_valid_obs used by all three detrenders.
  • lib/geo/aggregation.py: area_weighted_mean used to aggregate county yields to state/national level before trend fitting.

Relationships

  • build_detrender(config) → one of LinearStateDetrend | GaussianWindowStateDetrend | PartialPoolingDetrend
  • All three → AbstractDetrend (inheritance)
  • LinearStateDetrend and PartialPoolingDetrend → TREND_AXIS.to_x for x-axis construction
  • GaussianWindowStateDetrend → integer-year index directly (no to_x needed; indexed by year)
  • All three → NationalFallbackTrendImputer for unseen-state/county fallback
  • All three → joblib + vfs.local_context for S3-safe persistence
  • PartialPoolingDetrend additionally → lib/unit_utils.bu_acre_to_kg_ha for fixed-slope unit conversion