Detrending Subsystem
Overview
The detrending subsystem removes long-run yield trends from county-level feature tables before regression, then re-applies the trend to bring predictions back to absolute scale. All three concrete detrenders inherit from a single abstract base (AbstractDetrend, base.py:21) and share one trend-axis convention defined in time_axis.py. A factory function (build.py:54) instantiates the correct detrender from the config.model.detrend string key.
The regression pipeline calls fit_transform on training data, serialises the fitted detrender to detrender.pkl, then calls transform on any held-out fold and inverse_transform inside models/regression/runtime.py:176 to retrend model predictions. This means every yield figure the regressor sees is trend-adjusted, and every output it produces is converted back to absolute kg/ha before delivery.
Modules
time_axis.py — TrendAxis and TREND_AXIS
TrendAxis (time_axis.py:48) is a frozen dataclass that owns the x-axis convention for every yield-trend fit: an epoch (default 1980-01-01) and a unit ('year' or 'day'). The project-wide singleton TREND_AXIS (time_axis.py:91) uses unit='year' with epoch 1980-01-01. LinearStateDetrend and PartialPoolingDetrend build their x-axis from this singleton, so slopes reported in configs and logs are always in yield-per-calendar-year units, matching the QUBE-sprint TrendFitter convention. GaussianWindowStateDetrend fits no slope and indexes its smoothed series by integer year instead.
Key methods:
- to_x(years): converts integer calendar years to x-axis values, using pd.to_datetime for leap-year-exact day counts and dividing by 365.25 when unit='year' (time_axis.py:69).
- slope_per_year_to_native(slope) / slope_native_to_per_year(slope): exact inverses for converting between config units and the internal fit unit (time_axis.py:79–85).
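The axis convention can be illustrated with a minimal stdlib sketch. TrendAxisSketch is a hypothetical stand-in, not the project's TrendAxis: it uses datetime.date in place of pd.to_datetime, and it assumes each integer year maps to 1 January of that year and that the 'day' unit converts slopes with the same 365.25 factor.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class TrendAxisSketch:
    """Hypothetical stand-in for TrendAxis (illustration only)."""

    epoch: date = date(1980, 1, 1)
    unit: str = "year"  # "year" or "day"

    def to_x(self, years):
        # Exact day count from the epoch to 1 January of each year,
        # so leap days are counted (assumption: years map to 1 January).
        days = [(date(int(y), 1, 1) - self.epoch).days for y in years]
        if self.unit == "year":
            return [d / 365.25 for d in days]
        return [float(d) for d in days]

    def slope_per_year_to_native(self, slope):
        # yield per year -> yield per native x unit
        return slope if self.unit == "year" else slope / 365.25

    def slope_native_to_per_year(self, slope):
        # yield per native x unit -> yield per year
        return slope if self.unit == "year" else slope * 365.25


axis = TrendAxisSketch()
x = axis.to_x([1980, 1981, 1984])
```

Because 1980 is a leap year, the x value for 1981 is 366/365.25 rather than exactly 1, which is the kind of detail a shared axis object keeps consistent across detrenders.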
base.py — AbstractDetrend
AbstractDetrend (base.py:21) is the ABC that every detrender implements. It stores a single config: ExperimentConfig and mandates:
| Abstract member | Purpose |
|---|---|
| fit_transform(features) | Fit on training data and return a copy with target_detrended_col added. |
| transform(features) | Apply already-fitted trend; raises RuntimeError if not yet fitted. |
| inverse_transform(features, y_detrended) | Add trend back to detrended predictions. |
| fitted_yield_series(features) | Return model-implied trend level per row (for diagnostics). |
| is_fitted (property) | Boolean; guards transform and save. |
| target_yield_column (property) | Name of the raw yield column in feature tables. |
| save(path) / load(cls, path, config) | Serialise / deserialise fitted state; path supports s3:// via vfs.local_context. |
target_detrended_column (base.py:44) is non-abstract and delegates to config.commodity.target_detrended_col. state_column (base.py:48) returns the fixed string "state_name".
A free function validate_detrender (base.py:77) checks that an object is an AbstractDetrend instance and raises TypeError otherwise.
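The contract above can be sketched with a trimmed-down ABC. Everything here is hypothetical (AbstractDetrendSketch, MeanDetrendSketch, validate_detrender_sketch); the sketch works on plain lists rather than DataFrames, and it places the is_fitted guard in the base class for brevity, whereas the source declares transform as abstract.

```python
from abc import ABC, abstractmethod


class AbstractDetrendSketch(ABC):
    """Hypothetical, reduced analogue of AbstractDetrend."""

    @property
    @abstractmethod
    def is_fitted(self) -> bool: ...

    @abstractmethod
    def fit_transform(self, yields): ...

    @abstractmethod
    def _apply(self, yields): ...

    def transform(self, yields):
        # Guard from the table: transform is not allowed before fitting.
        if not self.is_fitted:
            raise RuntimeError("detrender is not fitted")
        return self._apply(yields)


class MeanDetrendSketch(AbstractDetrendSketch):
    """Toy detrender whose 'trend' is just the training mean."""

    def __init__(self):
        self._mean = None

    @property
    def is_fitted(self):
        return self._mean is not None

    def fit_transform(self, yields):
        self._mean = sum(yields) / len(yields)
        return self._apply(yields)

    def _apply(self, yields):
        return [y - self._mean for y in yields]


def validate_detrender_sketch(obj):
    # Mirrors the described behaviour: TypeError for non-detrenders.
    if not isinstance(obj, AbstractDetrendSketch):
        raise TypeError(f"expected a detrender, got {type(obj).__name__}")
    return obj
```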
build.py — build_detrender
build_detrender(config) (build.py:54) reads config.model.detrend and dispatches:
| Key | Class |
|---|---|
| "linear_state" | LinearStateDetrend |
| "gaussian_state" | GaussianWindowStateDetrend |
| "partial_pooling" | PartialPoolingDetrend |
Only "partial_pooling" accepts detrend_params from the config (build.py:18–51). The allowlisted keys are min_obs, detrend_fixed_slope_bu_ac, and mad_national_modified_z; unknown keys emit a warning. A config.model.detrend value outside the three keys in the table raises ValueError (build.py:70).
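A rough sketch of the dispatch rules follows. resolve_detrender is a hypothetical helper that returns the key and the filtered keyword arguments instead of constructing a class; it also assumes that unknown detrend_params keys are dropped after the warning, which the source does not state.

```python
import warnings

KNOWN_DETREND_KEYS = ("linear_state", "gaussian_state", "partial_pooling")
ALLOWED_PARTIAL_POOLING_KEYS = {
    "min_obs",
    "detrend_fixed_slope_bu_ac",
    "mad_national_modified_z",
}


def resolve_detrender(detrend, detrend_params=None):
    """Return (key, kwargs) for a detrend key; hypothetical helper."""
    if detrend not in KNOWN_DETREND_KEYS:
        raise ValueError(f"unknown detrend key: {detrend!r}")
    kwargs = {}
    if detrend == "partial_pooling":
        for name, value in (detrend_params or {}).items():
            if name in ALLOWED_PARTIAL_POOLING_KEYS:
                kwargs[name] = value
            else:
                # Assumption: unknown keys are warned about and ignored.
                warnings.warn(f"ignoring unknown detrend_params key: {name!r}")
    return detrend, kwargs
```

Under these assumptions, detrend_params supplied alongside "linear_state" or "gaussian_state" never reach the constructor.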
linear_state_detrend.py — LinearStateDetrend
State-space formulation. For each state, an area-weighted mean yield is computed across all counties per calendar year. OLS (numpy.polyfit) fits a straight line ŷ(t) = m·x(t) + b where x(t) = TREND_AXIS.to_x(t). The per-state (slope, intercept) pair is stored in _state_trends. A national fallback trend is also fitted on the full dataset and used via NationalFallbackTrendImputer for rows whose state was not seen during training (linear_state_detrend.py:159–169).
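The fit described above can be sketched in plain Python. This is an illustration under stated assumptions, not the module's code: rows are (state, year, yield, area) tuples rather than a DataFrame, x is simplified to year minus 1980 instead of TREND_AXIS.to_x, a hand-written OLS replaces numpy.polyfit, and all function names are hypothetical.

```python
from collections import defaultdict


def area_weighted_annual_means(rows):
    """rows: (state, year, yield, area) -> {state: {year: weighted mean}}."""
    num, den = defaultdict(float), defaultdict(float)
    for state, year, y, area in rows:
        num[(state, year)] += y * area
        den[(state, year)] += area
    out = defaultdict(dict)
    for (state, year), total in num.items():
        out[state][year] = total / den[(state, year)]
    return out


def ols_line(xs, ys):
    """Least-squares (slope, intercept) for y = m*x + b."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    m = sxy / sxx
    return m, y_bar - m * x_bar


def fit_state_trends(rows, epoch_year=1980):
    trends = {}
    for state, by_year in area_weighted_annual_means(rows).items():
        years = sorted(by_year)
        xs = [yr - epoch_year for yr in years]  # simplified x-axis
        trends[state] = ols_line(xs, [by_year[yr] for yr in years])
    return trends


def trend_level(trends, national, state, year, epoch_year=1980):
    m, b = trends.get(state, national)  # unseen state -> national trend
    return m * (year - epoch_year) + b


rows = [
    ("IA", 2000, 100.0, 1.0), ("IA", 2000, 200.0, 3.0),
    ("IA", 2001, 180.0, 1.0), ("IA", 2002, 185.0, 2.0),
]
state_trends = fit_state_trends(rows)
# National fallback: refit with every row assigned to one pseudo-state.
national = fit_state_trends([("US", yr, y, a) for _, yr, y, a in rows])["US"]
```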
Hyperparameters. None that are config-driven; trend_imputer defaults to NationalFallbackTrendImputer() and can be overridden in the constructor (linear_state_detrend.py:35).
Key signatures.
# linear_state_detrend.py:173
def fit_transform(self, features: pd.DataFrame) -> pd.DataFrame: ...
# linear_state_detrend.py:209
def transform(self, features: pd.DataFrame) -> pd.DataFrame: ...
# linear_state_detrend.py:239
def inverse_transform(self, features: pd.DataFrame, y_detrended: np.ndarray) -> pd.Series: ...
Persistence. Serialised state contains _state_trends and _national_trend (linear_state_detrend.py:263). Written and read via joblib under vfs.local_context so S3 paths are handled transparently.
When preferred. Fast training baseline; appropriate for commodities or geographies where the trend is well-approximated by a straight line over the training window and where county-level noise is not a concern.
gaussian_window_state_detrend.py — GaussianWindowStateDetrend
State-space formulation. Rather than fitting a parametric line, per-state trend levels are read from a Gaussian-smoothed series of area-weighted annual mean yields. The pipeline is: (1) IQR trim to the P25–P75 band per year to remove outlier counties; (2) area-weighted annual mean; (3) fill a dense integer-year grid by linear interpolation then forward/back fill; (4) apply scipy.ndimage.gaussian_filter1d (symmetric) or a causal half-Gaussian (_half_gaussian_causal_smooth_1d) in the year dimension. The resulting pd.Series indexed by integer year is stored per state in _state_smooth (gaussian_window_state_detrend.py:153). Rows in unseen states fall back to _national_smooth via NationalFallbackTrendImputer. This design is intentionally aligned with the legacy WASDE FittedYieldDetrendStateMA (gaussian_window_state_detrend.py:117–131).
Hyperparameters (class-level attributes, overridable before fit_transform):
| Attribute | Default | Meaning |
|---|---|---|
| gaussian_sigma | 4.0 | Smoothing width in years |
| gaussian_truncate | 8.0 | Kernel truncation radius (multiples of sigma) |
| gaussian_mode | "nearest" | Boundary handling for gaussian_filter1d |
| gaussian_kernel | "symmetric" | "symmetric" (two-sided) or "half_left" (causal) |
| mean_yield_q_lo | 0.25 | Lower IQR trim quantile |
| mean_yield_q_hi | 0.75 | Upper IQR trim quantile |
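To make the difference between the two kernels concrete, here is a plain-Python sketch of the smoothing step only (step 4 of the pipeline). It is not the module's code: the real implementation uses scipy.ndimage.gaussian_filter1d and _half_gaussian_causal_smooth_1d, while smooth_years below is a hypothetical re-implementation that assumes "nearest" boundary handling and renormalises the causal weights.

```python
import math


def smooth_years(values, sigma=4.0, truncate=8.0, kernel="symmetric"):
    """Gaussian smoothing over a dense yearly series (illustrative only)."""
    radius = int(truncate * sigma + 0.5)  # kernel half-width in years
    weights = [math.exp(-0.5 * (k / sigma) ** 2) for k in range(-radius, radius + 1)]
    n = len(values)
    out = []
    for i in range(n):
        num = den = 0.0
        for k in range(-radius, radius + 1):
            if kernel == "half_left" and k > 0:
                continue  # causal variant: never read years after i
            j = min(max(i + k, 0), n - 1)  # "nearest": clamp at the edges
            w = weights[k + radius]
            num += w * values[j]
            den += w
        out.append(num / den)  # normalise by the weights actually used
    return out


series = [150.0, 152.0, 149.0, 155.0, 158.0, 157.0, 161.0, 140.0, 165.0, 168.0]
two_sided = smooth_years(series)
causal = smooth_years(series, kernel="half_left")
```

In the causal variant, changing the final year leaves every earlier smoothed value unchanged, which is the look-ahead property mentioned for the half_left kernel.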
Key signatures.
# gaussian_window_state_detrend.py:231
def fit_transform(self, features: pd.DataFrame) -> pd.DataFrame: ...
# gaussian_window_state_detrend.py:280
def transform(self, features: pd.DataFrame) -> pd.DataFrame: ...
# gaussian_window_state_detrend.py:309
def inverse_transform(self, features: pd.DataFrame, y_detrended: np.ndarray) -> pd.Series: ...
Persistence. Serialised state includes _national_smooth, _state_smooth, and all six hyperparameter attributes (gaussian_window_state_detrend.py:327–345). Loaded values restore all attributes so the same smooth is reproducible without re-fitting.
When preferred. Smoother, more local trend estimate than a global OLS line; tolerates non-linear technology-yield trajectories. The causal (half_left) kernel variant avoids look-ahead when the test year is near the training boundary.
partial_pooling_detrend.py — PartialPoolingDetrend
State-space formulation. This is an Empirical-Bayes (James-Stein) hierarchical linear model operating at county rather than state granularity. The four-stage fit (partial_pooling_detrend.py:206–361) is:
- National slope (EB prior): area-weighted national annual means → OLS slope, optionally MAD-filtered (mad_national_modified_z) to remove extreme years, or replaced entirely by a fixed config value (detrend_fixed_slope_bu_ac, converted to kg/ha on construction).
- Per-county OLS: vectorised demeaned regression for counties with at least min_obs observations: slope = Σ(dx·dy)/Σ(dx²), SE(slope) = sqrt(RSS / (n−2) / Sxx). Counties below min_obs receive the national slope directly.
- James-Stein shrinkage: between-county variance τ² = max(0, Var(slopes) − mean(SE²)); shrinkage weight λᵢ = τ²/(τ² + SEᵢ²); EB slope = λᵢ·county_slope + (1−λᵢ)·national_slope. Counties with high SE (few data points or noisy series) are pulled strongly toward the national prior.
- Per-county intercepts: mean(y − eb_slope·x) per county, ensuring the trend line passes through each county's mean.
Unseen counties at prediction time fall back to the national trend slope via NationalFallbackTrendImputer (partial_pooling_detrend.py:418–425).
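The per-county OLS, shrinkage, and intercept stages can be sketched with scalar Python in place of the vectorised implementation. The function names are hypothetical, and the use of the sample variance (n − 1 denominator) for Var(slopes) is an assumption, since the source does not say which estimator is used.

```python
import math


def county_ols(xs, ys):
    """Demeaned OLS slope and its standard error (needs n >= 3)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    dx = [x - x_bar for x in xs]
    dy = [y - y_bar for y in ys]
    sxx = sum(d * d for d in dx)
    slope = sum(a * b for a, b in zip(dx, dy)) / sxx
    rss = sum((b - slope * a) ** 2 for a, b in zip(dx, dy))
    return slope, math.sqrt(rss / (n - 2) / sxx)


def eb_shrink(county_slopes, county_ses, national_slope):
    """Return (tau2, {county: (lambda, eb_slope)})."""
    slopes = list(county_slopes.values())
    n = len(slopes)
    mean_slope = sum(slopes) / n
    # Assumption: sample variance; the source only writes Var(slopes).
    var = sum((s - mean_slope) ** 2 for s in slopes) / (n - 1)
    mean_se2 = sum(se ** 2 for se in county_ses.values()) / n
    tau2 = max(0.0, var - mean_se2)
    shrunk = {}
    for county, slope in county_slopes.items():
        se2 = county_ses[county] ** 2
        lam = tau2 / (tau2 + se2) if tau2 + se2 > 0 else 0.0
        shrunk[county] = (lam, lam * slope + (1 - lam) * national_slope)
    return tau2, shrunk


def county_intercept(xs, ys, eb_slope):
    # mean(y - eb_slope * x): the line passes through (mean x, mean y).
    return sum(y - eb_slope * x for x, y in zip(xs, ys)) / len(xs)


slopes = {"A": 1.0, "B": 3.0, "C": 2.0}
ses = {"A": 0.5, "B": 0.5, "C": 1.0}
tau2, shrunk = eb_shrink(slopes, ses, national_slope=2.0)
```

In this toy input, county C has twice the standard error of A and B, so its weight λ is smaller and its slope leans more on the national value.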
Hyperparameters (constructor keyword arguments, forwarded from build.py):
| Parameter | Default | Meaning |
|---|---|---|
| min_obs | 3 | Minimum county observations for OLS; below this → national slope. |
| detrend_fixed_slope_bu_ac | None | Override the EB prior with a fixed slope (bu/ac/yr). |
| mad_national_modified_z | None | Modified-Z threshold for MAD filtering of national annual means before OLS. |
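For readers unfamiliar with modified-Z filtering, the following sketch shows one common form of it. It is an assumption-heavy illustration: mad_keep_mask is a hypothetical name, the 0.6745 scaling constant is the conventional choice rather than something the source confirms, and the score is applied to the values as given, whereas the module may score residuals or some other transformed series.

```python
from statistics import median

MODIFIED_Z_SCALE = 0.6745  # conventional constant; assumed, not from the source


def mad_keep_mask(values, threshold):
    """True where |modified z| <= threshold, i.e. the value is kept."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # Degenerate spread: nothing can be scored, keep everything.
        return [True] * len(values)
    return [abs(MODIFIED_Z_SCALE * (v - med) / mad) <= threshold for v in values]


annual_means = [10.0, 11.0, 12.0, 11.0, 10.0, 40.0]
keep = mad_keep_mask(annual_means, threshold=3.5)
filtered = [v for v, k in zip(annual_means, keep) if k]
```

With a threshold of 3.5, only the last value in this toy series is excluded; the remaining years would then feed the national OLS fit.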
Key signatures.
# partial_pooling_detrend.py:183
def fit_transform(self, features: pd.DataFrame) -> pd.DataFrame: ...
# partial_pooling_detrend.py:196
def transform(self, features: pd.DataFrame) -> pd.DataFrame: ...
# partial_pooling_detrend.py:429
def inverse_transform(self, features: pd.DataFrame, y_detrended: np.ndarray) -> pd.Series: ...
Diagnostics. The .parameters property (partial_pooling_detrend.py:159) returns a lazily-built, cached pd.DataFrame with one row per county: slope_per_year, intercept, national_slope_per_year, raw_slope_per_year, eb_lambda, train_years, n_train_years, train_year_min, train_year_max. All slopes are expressed in yield-per-year via TREND_AXIS.slope_native_to_per_year.
Persistence. Serialised state stores all fitted dicts (_county_slopes, _county_intercepts, _eb_raw_slopes, _eb_lambdas, _county_train_years), scalar fit statistics (_national_slope, _tau2), and the constructor hyperparameters (partial_pooling_detrend.py:447–472). The fixed slope is stored already converted to kg/ha.
When preferred. Empirical-Bayes partial pooling across counties; the recommended choice when county sample sizes vary widely (e.g. sparse rural counties alongside dense agricultural belts), because counties with few or noisy observations borrow strength from the national prior rather than producing noisy OLS slopes.
Integration with Regression
The regression pipeline (run/experiment_protocol.py:38–65) calls train(...) which fits and serialises the detrender; then fold.load_detrender(config) rehydrates it. detrender.transform(train_data) adds target_detrended_col to the feature table, which the regressor trains on. At prediction time, models/regression/runtime.predict (runtime.py:140–184) calls detrender.inverse_transform(data_regression, sim_detrended) to convert the model's detrended output back to absolute yield. This two-step bracket — fit_transform before fitting the regressor, inverse_transform after scoring — means the regressor never sees raw trend-contaminated yields and every output is in physically meaningful kg/ha units.
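The bracket around the regressor can be shown with a toy example. Nothing here comes from the pipeline: the trend is a hard-coded line standing in for a fitted detrender, the "regressor" simply predicts the mean training residual, and all names are hypothetical.

```python
def trend_level(year, slope=2.0, intercept=100.0, epoch_year=1980):
    """Toy fitted trend; stands in for a fitted detrender's state."""
    return slope * (year - epoch_year) + intercept


def to_detrended(years, yields):
    # Role of fit_transform / transform: remove the trend level.
    return [y - trend_level(t) for t, y in zip(years, yields)]


def to_absolute(years, y_detrended):
    # Role of inverse_transform: add the trend level back.
    return [d + trend_level(t) for t, d in zip(years, y_detrended)]


train_years = [2000, 2001, 2002]
train_yields = [141.0, 139.5, 146.0]
residuals = to_detrended(train_years, train_yields)

# Stand-in "regressor": predicts the mean training residual.
predicted_residual = sum(residuals) / len(residuals)
forecast = to_absolute([2003], [predicted_residual])
```

The stand-in regressor only ever handles residuals; the absolute level for 2003 comes entirely from the trend added back in the last step.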
Cross-references
- Config source: config.model.detrend and config.model.detrend_params control dispatch.
- run_fit stage: calls train(...) which invokes fit_transform and persists detrender.pkl.
- regression runtime: predict function calls inverse_transform (runtime.py:176).
- lib/edit_and_imputation/imputation.py: NationalFallbackTrendImputer, TrendImputer, partition_groups_by_valid_obs used by all three detrenders.
- lib/geo/aggregation.py: area_weighted_mean used to aggregate county yields to state/national level before trend fitting.
Relationships
- build_detrender(config) → one of LinearStateDetrend | GaussianWindowStateDetrend | PartialPoolingDetrend
- All three → AbstractDetrend (inheritance)
- LinearStateDetrend and PartialPoolingDetrend → TREND_AXIS.to_x for x-axis construction
- GaussianWindowStateDetrend → integer-year index directly (no to_x needed; indexed by year)
- All three → NationalFallbackTrendImputer for unseen-state/county fallback
- All three → joblib + vfs.local_context for S3-safe persistence
- PartialPoolingDetrend additionally → lib/unit_utils.bu_acre_to_kg_ha for fixed-slope unit conversion