Regression Models

Overview

The regression subsystem lives under market_insights_models/src/commodity_hindcast/models/regression/ and provides three concrete estimators — RidgeRegressor, PcaRidgeRegressor, and XGBRegressor — all sharing the AbstractRegressionImpl abstract base class. The public entry point is build_regressor(config) in __init__.py, which reads config.model.regression to select the implementation and plumbs feature_columns and nan_policy through before construction.

Detrending and any meta-model aggregation happen outside the estimator; the regressors receive already-detrended feature matrices and return detrended predictions. The runtime.py module provides the higher-level predict() function that chains model inference, weather-correction post-processing, and inverse-detrend into the final sim_yield_kg_ha column.

Modules

base.py

Defines AbstractRegressionImpl (ABC) at base.py:9, the contract that every estimator must implement:

class AbstractRegressionImpl(ABC):
    def fit(self, X: pd.DataFrame, y: pd.Series, sample_weight: pd.Series | None = None) -> Self: ...
    def predict(self, X: pd.DataFrame) -> pd.Series: ...
    def save_model(self, path: Path) -> None: ...
    @classmethod
    def load_model(cls, path: Path) -> Self: ...

Three shared helper functions are also defined here, used by all concrete classes:

  • require_feature_columns (base.py:41) — pops feature_columns from a params dict and raises ValueError if absent; enforces that all regressors are built via build_regressor().
  • require_nan_policy (base.py:55) — enforces nan_policy='raise'; feature imputation must precede the regressor stage.
  • select_feature_frame (base.py:69) — selects and copies the declared feature columns from X, raising KeyError on any mismatch.
  • assert_no_nan_features (base.py:81) — raises ValueError listing offending columns if any NaNs are present at fit or predict time.
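
A minimal sketch of how these helpers might behave (names follow base.py; the bodies are illustrative reconstructions from the descriptions above, not the actual implementation):

```python
import pandas as pd


def require_feature_columns(params: dict) -> tuple[str, ...]:
    # Pop feature_columns from params; constructing an estimator directly
    # (i.e. bypassing build_regressor) fails loudly.
    try:
        return tuple(params.pop("feature_columns"))
    except KeyError:
        raise ValueError("feature_columns is required; construct via build_regressor()")


def select_feature_frame(X: pd.DataFrame, feature_columns: tuple[str, ...]) -> pd.DataFrame:
    # Select and copy the declared feature columns, raising on any mismatch.
    missing = [c for c in feature_columns if c not in X.columns]
    if missing:
        raise KeyError(f"missing feature columns: {missing}")
    return X.loc[:, list(feature_columns)].copy()


def assert_no_nan_features(X: pd.DataFrame) -> None:
    # List the offending columns so the error is actionable.
    bad = X.columns[X.isna().any()].tolist()
    if bad:
        raise ValueError(f"NaN values in feature columns: {bad}")
```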

ridge_regressor.py

RidgeRegressor wraps sklearn.linear_model.Ridge (ridge_regressor.py:19).

Hyperparameters (passed via params dict, default shown):

Parameter  Default
alpha      1.0

Any other kwarg is forwarded directly to Ridge(...).

feature_columns and nan_policy are consumed by the base helpers before the remainder reaches Ridge.

Fit (ridge_regressor.py:38): drops rows where y is NaN, then calls Ridge.fit. Returns Self.

Predict (ridge_regressor.py:58): selects feature frame, asserts no NaNs, delegates to Ridge.predict, wraps result in a pd.Series preserving the input index.
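
The fit/predict flow described above can be sketched as follows (MiniRidge is a hypothetical stand-in, not the real RidgeRegressor; the base-helper calls are inlined):

```python
import pandas as pd
from sklearn.linear_model import Ridge


class MiniRidge:
    """Illustrative sketch of the fit/predict contract described above."""

    def __init__(self, feature_columns, alpha: float = 1.0):
        self.feature_columns = tuple(feature_columns)
        self.model = Ridge(alpha=alpha)

    def fit(self, X: pd.DataFrame, y: pd.Series) -> "MiniRidge":
        mask = y.notna()  # drop rows where the target is NaN
        self.model.fit(X.loc[mask, list(self.feature_columns)], y[mask])
        return self

    def predict(self, X: pd.DataFrame) -> pd.Series:
        Xf = X.loc[:, list(self.feature_columns)]
        # Wrap the ndarray output in a Series so the caller's index survives.
        return pd.Series(self.model.predict(Xf), index=X.index)
```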

Persistence format: single file ridge_model.pkl written via joblib.dump (ridge_regressor.py:79). Payload: {"model": Ridge, "feature_columns": tuple, "params": dict}. S3 targets are handled transparently via local_context_for_folder.

pca_ridge_regressor.py

PcaRidgeRegressor wraps an sklearn Pipeline(StandardScaler → PCA → Ridge) (pca_ridge_regressor.py:65). Ridge is constructed with fit_intercept=False because the StandardScaler centres the data.
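
The wrapped pipeline can be sketched as a small factory (step names are illustrative; hyperparameter names follow the table below):

```python
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def make_pca_ridge_pipeline(n_components: int = 2, alpha: float = 1.0) -> Pipeline:
    return Pipeline([
        ("scaler", StandardScaler()),            # centres and scales each feature
        ("pca", PCA(n_components=n_components)),
        # fit_intercept=False: the StandardScaler already centres the data,
        # so the intercept is fixed at zero.
        ("ridge", Ridge(alpha=alpha, fit_intercept=False)),
    ])
```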

Hyperparameters (defaults shown):

Parameter     Default
n_components  2
alpha         1.0

Fit (pca_ridge_regressor.py:100): drops NaN-target rows, fits the sklearn pipeline, then immediately calls _orient_components_by_ridge_sign.

PCA component-sign canonicalisation (pca_ridge_regressor.py:37): sklearn's svd_flip chooses eigenvector signs by a sample-space convention, so the same physical direction can flip between walk-forward CV folds. After fitting, the helper iterates over each PC i: if ridge.coef_[i] < 0, it negates both pca.components_[i] and ridge.coef_[i]. Because only the product components_.T @ coef_ affects predictions, the double flip is algebraically invariant — predict() output is bit-for-bit identical — but the PC axes now consistently point in the direction of increasing predicted yield across folds.
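
The invariance can be demonstrated on a throwaway pipeline (a sketch of the canonicalisation, not the actual helper):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def orient_components_by_ridge_sign(pipeline: Pipeline) -> None:
    """Flip each PC (and its coefficient) so all ridge coefficients are >= 0."""
    pca = pipeline.named_steps["pca"]
    ridge = pipeline.named_steps["ridge"]
    for i, coef in enumerate(ridge.coef_):
        if coef < 0:
            pca.components_[i] *= -1.0   # flip the axis...
            ridge.coef_[i] *= -1.0       # ...and the weight: the product is unchanged


rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
y = rng.normal(size=50)
pipe = Pipeline([("scaler", StandardScaler()),
                 ("pca", PCA(n_components=2)),
                 ("ridge", Ridge(alpha=1.0, fit_intercept=False))]).fit(X, y)

before = pipe.predict(X)
orient_components_by_ridge_sign(pipe)
after = pipe.predict(X)
assert np.allclose(before, after)                       # predictions invariant
assert (pipe.named_steps["ridge"].coef_ >= 0).all()     # axes canonicalised
```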

Predict (pca_ridge_regressor.py:127): logs three timing checkpoints via loguru at DEBUG level (feature-frame prep, pipeline.predict, total).

Persistence format: single file pca_ridge_model.pkl via joblib.dump (pca_ridge_regressor.py:156). Payload: {"pipeline": Pipeline, "params": dict, "feature_columns": tuple}.

xgb_regressor.py

XGBRegressor wraps xgboost.XGBRegressor (xgb_regressor.py:41).

Default hyperparameters (xgb_regressor.py:31):

Parameter              Default
n_estimators           20
max_depth              12
learning_rate          0.04
min_child_weight       1
subsample              0.85
colsample_bytree       0.85
early_stopping_rounds  10
eval_metric            "mae"

Additional constructor flags:

  • run_hyper_params (bool, default False) — when True, triggers an exhaustive grid search before fitting.
  • val_split_column (str, default "year") — column used to split the validation set; the latest year's rows become the holdout.

Hyperparameter grid (xgb_regressor.py:22), used by sklearn.model_selection.ParameterGrid:

XGB_PARAM_GRID = {
    "n_estimators":      (5, 10, 15, 20),
    "max_depth":         (4, 6, 8, 10, 12),
    "learning_rate":     (0.03, 0.04, 0.05, 0.06),
    "min_child_weight":  (1, 3, 5),
    "subsample":         (0.85,),
    "colsample_bytree":  (0.85,),
}

Grid size: 4 × 5 × 4 × 3 × 1 × 1 = 240 trials. Selection criterion is validation MAE (xgb_regressor.py:118). This mirrors the original usda_wasde_experiment/train_feats_v2.py single-holdout path (xgb_regressor.py:21).
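The trial count can be verified directly with sklearn's ParameterGrid:

```python
from sklearn.model_selection import ParameterGrid

XGB_PARAM_GRID = {
    "n_estimators":     (5, 10, 15, 20),
    "max_depth":        (4, 6, 8, 10, 12),
    "learning_rate":    (0.03, 0.04, 0.05, 0.06),
    "min_child_weight": (1, 3, 5),
    "subsample":        (0.85,),
    "colsample_bytree": (0.85,),
}

# Cartesian product of all value tuples: 4 * 5 * 4 * 3 * 1 * 1 trials.
assert len(list(ParameterGrid(XGB_PARAM_GRID))) == 240
```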

Fit (xgb_regressor.py:204): splits train/val by the latest year in val_split_column (falling back to a random 80/20 split), applies early stopping, and optionally runs the full grid search.
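
The latest-year holdout described above might look like this (an illustrative sketch; the random 80/20 fallback is omitted):

```python
import pandas as pd


def split_latest_year(X: pd.DataFrame, y: pd.Series, val_split_column: str = "year"):
    """Hold out the latest year's rows for validation; the rest train."""
    latest = X[val_split_column].max()
    val_mask = X[val_split_column] == latest
    return X[~val_mask], y[~val_mask], X[val_mask], y[val_mask]
```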

Persistence format: two files (xgb_regressor.py:267):

  • model.json — XGBoost native JSON serialiser via xgb.XGBRegressor.save_model.
  • metadata.pkl — joblib.dump payload {"feature_columns": tuple, "params": dict}.

This split is required because neither the XGBoost native serialiser nor joblib accepts S3 URIs; both files are staged in a temp directory and then mirrored via local_context_for_folder.
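
The two-file layout can be sketched as follows (save_xgb_artifacts is hypothetical; `model` is any object exposing XGBoost's save_model(fname), and the S3 mirroring via local_context_for_folder is omitted):

```python
from pathlib import Path

import joblib


def save_xgb_artifacts(model, feature_columns, params, folder: Path) -> None:
    """Write the booster as native JSON plus a joblib metadata sidecar."""
    folder.mkdir(parents=True, exist_ok=True)
    model.save_model(str(folder / "model.json"))  # XGBoost native JSON serialiser
    joblib.dump(
        {"feature_columns": tuple(feature_columns), "params": dict(params)},
        folder / "metadata.pkl",
    )
```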

runtime.py

Provides the pipeline-level orchestration layer on top of the raw estimators. Key exports:

predict() (runtime.py:140) — the single entry point used by stages/run_predict. Chain:

  1. resolve_runtime_feature_columns — falls back from config to model._feature_columns.
  2. prepare_model_input — optional MedianImputer.transform then select_feature_frame.
  3. model.predict(X_features) — calls the AbstractRegressionImpl implementation.
  4. apply_weather_correction_postprocess — scales by weather_correction_weight, applies per-DOY weights from season_doy_weather_weight, and clips by max_abs_weather_correction_bu_ac.
  5. detrender.inverse_transform — retrends to absolute kg/ha.

Returns a copy of the input data DataFrame with two columns appended: sim_yield_kg_ha_detrended and sim_yield_kg_ha.

prepare_model_input() (runtime.py:114) — shared by predict() and the PDP plot-prep path; single source of truth for the feature-matrix shape contract.

apply_weather_correction_postprocess() (runtime.py:40) — standalone post-processor; weather_correction_weight defaults to 1.0, season_doy_weather_weight is an optional DOY→weight interpolation map, and the symmetric clip is controlled by max_abs_weather_correction_bu_ac + clip_reference_bu_ac.

__init__.py

Exports AbstractRegressionImpl, RidgeRegressor, PcaRidgeRegressor, XGBRegressor, and build_regressor.

build_regressor(config: ExperimentConfig) (__init__.py:58): reads config.model.regression key ("ridge" | "xgboost" | "pca_ridge"), validates feature_cols against commodity.feature_cols universe, enforces nan_policy='raise', then constructs the appropriate class. Raises ValueError on any unknown key.
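
The dispatch-and-validate shape can be sketched as below (build_regressor_sketch is hypothetical; sklearn's Ridge stands in for the registry entries, which in the real module map "ridge" | "pca_ridge" | "xgboost" to the three concrete classes):

```python
from sklearn.linear_model import Ridge

# Stand-in registry keyed by config.model.regression.
REGRESSOR_REGISTRY = {"ridge": Ridge}


def build_regressor_sketch(regression_key: str, params: dict):
    """Resolve the configured key to a class, raising ValueError on unknowns."""
    try:
        cls = REGRESSOR_REGISTRY[regression_key]
    except KeyError as exc:
        raise ValueError(f"unknown regression key: {regression_key!r}") from exc
    return cls(**params)
```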

Cross-references

  • models/detrend/AbstractDetrend.inverse_transform is called by runtime.predict.
  • stages/run_fit.py — calls build_regressor(config) to obtain an estimator then calls .fit().
  • stages/run_predict.py — calls runtime.predict() for the full predict→weather-correct→retrend chain.
  • MedianImputer (lib/edit_and_imputation/imputation.py) — optional imputer injected into runtime.prepare_model_input.
  • ExperimentConfig (config.py) — carries model.regression, model.regression_params, and commodity.feature_cols.

Relationships

ExperimentConfig
    └─ build_regressor()  [__init__.py:58]
           ├─ RidgeRegressor        ──► joblib  (ridge_model.pkl)
           ├─ PcaRidgeRegressor     ──► joblib  (pca_ridge_model.pkl)
           └─ XGBRegressor          ──► xgb JSON (model.json) + joblib (metadata.pkl)

runtime.predict()
    ├─ AbstractRegressionImpl.predict()
    ├─ apply_weather_correction_postprocess()
    └─ AbstractDetrend.inverse_transform()

All three concrete classes satisfy AbstractRegressionImpl (base.py:9). The runtime.py predict() function accepts any AbstractRegressionImpl instance, keeping stages decoupled from the concrete estimator choice.