Regression Models¶
Overview¶
The regression subsystem lives under market_insights_models/src/commodity_hindcast/models/regression/ and provides three concrete estimators — RidgeRegressor, PcaRidgeRegressor, and XGBRegressor — all sharing the AbstractRegressionImpl abstract base class. The public entry point is build_regressor(config) in __init__.py, which reads config.model.regression to select the implementation and plumbs feature_columns and nan_policy through before construction.
Detrending and any meta-model aggregation happen outside the estimator; the regressors receive already-detrended feature matrices and return detrended predictions. The runtime.py module provides the higher-level predict() function that chains model inference, weather-correction post-processing, and inverse-detrend into the final sim_yield_kg_ha column.
Modules¶
base.py¶
Defines AbstractRegressionImpl (ABC) at base.py:9, the contract that every estimator must implement:
class AbstractRegressionImpl(ABC):
    def fit(self, X: pd.DataFrame, y: pd.Series, sample_weight: pd.Series | None = None) -> Self: ...
    def predict(self, X: pd.DataFrame) -> pd.Series: ...
    def save_model(self, path: Path) -> None: ...
    @classmethod
    def load_model(cls, path: Path) -> Self: ...
Four shared helper functions are also defined here, used by all concrete classes:
- require_feature_columns (base.py:41) — pops feature_columns from a params dict and raises ValueError if absent; enforces that all regressors are built via build_regressor().
- require_nan_policy (base.py:55) — enforces nan_policy='raise'; feature imputation must precede the regressor stage.
- select_feature_frame (base.py:69) — selects and copies the declared feature columns from X, raising KeyError on any mismatch.
- assert_no_nan_features (base.py:81) — raises ValueError listing the offending columns if any NaNs are present at fit or predict time.
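A minimal sketch of how the first two helpers compose (function bodies here are illustrative stand-ins for the contract described above, not the actual base.py implementation):

```python
import pandas as pd


def require_feature_columns(params: dict) -> tuple[str, ...]:
    # Pop feature_columns from the params dict, or fail loudly: this is
    # what forces construction to go through build_regressor().
    try:
        return tuple(params.pop("feature_columns"))
    except KeyError:
        raise ValueError("feature_columns is required; build via build_regressor()")


def select_feature_frame(X: pd.DataFrame, feature_columns: tuple[str, ...]) -> pd.DataFrame:
    # Select and copy the declared columns, raising KeyError on any mismatch.
    missing = [c for c in feature_columns if c not in X.columns]
    if missing:
        raise KeyError(f"missing feature columns: {missing}")
    return X.loc[:, list(feature_columns)].copy()


params = {"alpha": 1.0, "feature_columns": ("ndvi", "precip")}
cols = require_feature_columns(params)  # params now holds only estimator kwargs
X = pd.DataFrame({"ndvi": [0.1, 0.2], "precip": [5.0, 7.0], "extra": [0, 1]})
features = select_feature_frame(X, cols)
```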
ridge_regressor.py¶
RidgeRegressor wraps sklearn.linear_model.Ridge (ridge_regressor.py:19).
Hyperparameters (passed via params dict, default shown):
| Parameter | Default |
|---|---|
| alpha | 1.0 |
| Any other kwarg | forwarded directly to Ridge(...) |
feature_columns and nan_policy are consumed by the base helpers before the remainder reaches Ridge.
Fit (ridge_regressor.py:38): drops rows where y is NaN, then calls Ridge.fit. Returns Self.
Predict (ridge_regressor.py:58): selects feature frame, asserts no NaNs, delegates to Ridge.predict, wraps result in a pd.Series preserving the input index.
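The two behaviours that matter for callers, dropping NaN-target rows at fit time and preserving the input index at predict time, can be sketched as follows (a simplified skeleton, not the actual RidgeRegressor code):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge

X = pd.DataFrame({"f1": [1.0, 2.0, 3.0, 4.0]}, index=[10, 11, 12, 13])
y = pd.Series([2.0, np.nan, 6.0, 8.0], index=X.index)

# Fit: drop rows where the target is NaN before calling Ridge.fit.
mask = y.notna()
model = Ridge(alpha=1.0).fit(X.loc[mask], y.loc[mask])

# Predict: wrap the ndarray in a Series that keeps X's original index.
pred = pd.Series(model.predict(X), index=X.index)
```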
Persistence format: single file ridge_model.pkl written via joblib.dump (ridge_regressor.py:79). Payload: {"model": Ridge, "feature_columns": tuple, "params": dict}. S3 targets are handled transparently via local_context_for_folder.
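The documented payload round-trips through joblib straightforwardly; a local-path sketch (S3 staging via local_context_for_folder is omitted here):

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.linear_model import Ridge

# Build a tiny fitted model and the documented payload shape.
model = Ridge(alpha=1.0).fit([[0.0], [1.0]], [0.0, 1.0])
payload = {"model": model, "feature_columns": ("ndvi",), "params": {"alpha": 1.0}}

with tempfile.TemporaryDirectory() as d:
    path = Path(d) / "ridge_model.pkl"
    joblib.dump(payload, path)
    restored = joblib.load(path)
```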
pca_ridge_regressor.py¶
PcaRidgeRegressor wraps an sklearn Pipeline(StandardScaler → PCA → Ridge) (pca_ridge_regressor.py:65). Ridge is constructed with fit_intercept=False because the StandardScaler centres the data.
Hyperparameters (defaults shown):
| Parameter | Default |
|---|---|
| n_components | 2 |
| alpha | 1.0 |
Fit (pca_ridge_regressor.py:100): drops NaN-target rows, fits the sklearn pipeline, then immediately calls _orient_components_by_ridge_sign.
PCA component-sign canonicalisation (pca_ridge_regressor.py:37): sklearn's svd_flip chooses eigenvector signs by a sample-space convention, so the same physical direction can flip between walk-forward CV folds. After fitting, the helper iterates over each PC i; if ridge.coef_[i] < 0 it negates both pca.components_[i] and ridge.coef_[i] together. Because only the product components_.T @ coef_ affects predictions, this double-flip is algebraically invariant — predict() output is bit-exact identical, but PC axes now consistently point in the direction of increasing predicted yield across folds.
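The algebraic invariance is easy to verify in isolation: because predictions depend only on the product components_.T @ coef_, negating a component row and its coefficient together cancels out. A stand-alone numpy demonstration (ignoring the scaler and intercept, which the flip does not touch):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 5))           # samples x features, assumed centred
components = rng.normal(size=(3, 5))  # stands in for pca.components_
coef = np.array([0.7, -0.4, 1.2])     # stands in for ridge.coef_ in PC space

pred_before = X @ components.T @ coef

# Canonicalise: flip every PC whose ridge coefficient is negative,
# negating the component row and the coefficient together.
flip = np.where(coef < 0, -1.0, 1.0)
components = components * flip[:, None]
coef = coef * flip

pred_after = X @ components.T @ coef
```

After the flip every coefficient is non-negative, so each PC axis points in the direction of increasing predicted yield, while the predictions themselves are unchanged.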
Predict (pca_ridge_regressor.py:127): logs three timing checkpoints via loguru at DEBUG level (feature-frame prep, pipeline.predict, total).
Persistence format: single file pca_ridge_model.pkl via joblib.dump (pca_ridge_regressor.py:156). Payload: {"pipeline": Pipeline, "params": dict, "feature_columns": tuple}.
xgb_regressor.py¶
XGBRegressor wraps xgboost.XGBRegressor (xgb_regressor.py:41).
Default hyperparameters (xgb_regressor.py:31):
| Parameter | Default |
|---|---|
| n_estimators | 20 |
| max_depth | 12 |
| learning_rate | 0.04 |
| min_child_weight | 1 |
| subsample | 0.85 |
| colsample_bytree | 0.85 |
| early_stopping_rounds | 10 |
| eval_metric | "mae" |
Additional constructor flags:
- run_hyper_params (bool, default False) — when True, triggers an exhaustive grid search before fitting.
- val_split_column (str, default "year") — column used to split the validation set; the latest year's rows become the holdout.
Hyperparameter grid (xgb_regressor.py:22), used by sklearn.model_selection.ParameterGrid:
XGB_PARAM_GRID = {
    "n_estimators": (5, 10, 15, 20),
    "max_depth": (4, 6, 8, 10, 12),
    "learning_rate": (0.03, 0.04, 0.05, 0.06),
    "min_child_weight": (1, 3, 5),
    "subsample": (0.85,),
    "colsample_bytree": (0.85,),
}
Grid size: 4 × 5 × 4 × 3 × 1 × 1 = 240 trials. Selection criterion is validation MAE (xgb_regressor.py:118). This mirrors the original usda_wasde_experiment/train_feats_v2.py single-holdout path (xgb_regressor.py:21).
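The enumerate-and-select loop can be sketched with ParameterGrid directly; the scoring function below is a cheap stand-in for fitting XGBoost and measuring validation MAE, used only to show the selection mechanics:

```python
from sklearn.model_selection import ParameterGrid

XGB_PARAM_GRID = {
    "n_estimators": (5, 10, 15, 20),
    "max_depth": (4, 6, 8, 10, 12),
    "learning_rate": (0.03, 0.04, 0.05, 0.06),
    "min_child_weight": (1, 3, 5),
    "subsample": (0.85,),
    "colsample_bytree": (0.85,),
}

trials = list(ParameterGrid(XGB_PARAM_GRID))  # 4*5*4*3*1*1 = 240 combinations


def val_mae(params: dict) -> float:
    # Stand-in for: fit XGBoost with `params`, score MAE on the holdout.
    return abs(params["learning_rate"] - 0.04) + params["max_depth"] / 100


best = min(trials, key=val_mae)  # lowest validation MAE wins
```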
Fit (xgb_regressor.py:204): splits train/val by the latest year in val_split_column (falling back to a random 80/20 split), applies early stopping, and optionally runs the full grid search.
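The latest-year holdout split reduces to a two-mask partition; a minimal pandas sketch (assuming val_split_column is "year"):

```python
import pandas as pd

df = pd.DataFrame({
    "year": [2019, 2019, 2020, 2021, 2021],
    "f1": [1.0, 2.0, 3.0, 4.0, 5.0],
})

# The latest year's rows become the validation holdout;
# everything earlier is training data.
latest = df["year"].max()
train = df[df["year"] < latest]
val = df[df["year"] == latest]
```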
Persistence format: two files (xgb_regressor.py:267):
- model.json — XGBoost native JSON serialiser via xgb.XGBRegressor.save_model.
- metadata.pkl — joblib.dump payload {"feature_columns": tuple, "params": dict}.
This split is required because neither the XGBoost native serialiser nor joblib accept S3 URIs; both are staged in a temp directory then mirrored via local_context_for_folder.
runtime.py¶
Provides the pipeline-level orchestration layer on top of the raw estimators. Key exports:
predict() (runtime.py:140) — the single entry point used by stages/run_predict. Chain:
1. resolve_runtime_feature_columns — falls back from config to model._feature_columns.
2. prepare_model_input — optional MedianImputer.transform then select_feature_frame.
3. model.predict(X_features) — calls the AbstractRegressionImpl implementation.
4. apply_weather_correction_postprocess — scales by weather_correction_weight, applies per-DOY weights from season_doy_weather_weight, and clips by max_abs_weather_correction_bu_ac.
5. detrender.inverse_transform — retrend to absolute kg/ha.
Returns a copy of the input data DataFrame with two columns appended: sim_yield_kg_ha_detrended and sim_yield_kg_ha.
prepare_model_input() (runtime.py:114) — shared by predict() and the PDP plot-prep path; single source of truth for the feature-matrix shape contract.
apply_weather_correction_postprocess() (runtime.py:40) — standalone post-processor; weather_correction_weight defaults to 1.0, season_doy_weather_weight is an optional DOY→weight interpolation map, and the symmetric clip is controlled by max_abs_weather_correction_bu_ac + clip_reference_bu_ac.
__init__.py¶
Exports AbstractRegressionImpl, RidgeRegressor, PcaRidgeRegressor, XGBRegressor, and build_regressor.
build_regressor(config: ExperimentConfig) (__init__.py:58): reads config.model.regression key ("ridge" | "xgboost" | "pca_ridge"), validates feature_cols against commodity.feature_cols universe, enforces nan_policy='raise', then constructs the appropriate class. Raises ValueError on any unknown key.
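The dispatch itself is a key-to-class lookup with a loud failure mode; a sketch with placeholder classes standing in for the real estimators (the config validation steps are omitted):

```python
# Placeholder estimator classes, standing in for the real implementations.
class RidgeRegressor: ...
class PcaRidgeRegressor: ...
class XGBRegressor: ...


_REGRESSORS = {
    "ridge": RidgeRegressor,
    "pca_ridge": PcaRidgeRegressor,
    "xgboost": XGBRegressor,
}


def build_regressor(regression_key: str):
    # Unknown keys fail with ValueError rather than falling through silently.
    try:
        cls = _REGRESSORS[regression_key]
    except KeyError:
        raise ValueError(f"unknown regression key: {regression_key!r}")
    return cls()
```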
Cross-references¶
- models/detrend/ — AbstractDetrend.inverse_transform is called by runtime.predict.
- stages/run_fit.py — calls build_regressor(config) to obtain an estimator, then calls .fit().
- stages/run_predict.py — calls runtime.predict() for the full predict → weather-correct → retrend chain.
- MedianImputer (lib/edit_and_imputation/imputation.py) — optional imputer injected into runtime.prepare_model_input.
- ExperimentConfig (config.py) — carries model.regression, model.regression_params, and commodity.feature_cols.
Relationships¶
ExperimentConfig
└─ build_regressor() [__init__.py:58]
├─ RidgeRegressor ──► joblib (ridge_model.pkl)
├─ PcaRidgeRegressor ──► joblib (pca_ridge_model.pkl)
└─ XGBRegressor ──► xgb JSON (model.json) + joblib (metadata.pkl)
runtime.predict()
├─ AbstractRegressionImpl.predict()
├─ apply_weather_correction_postprocess()
└─ AbstractDetrend.inverse_transform()
All three concrete classes satisfy AbstractRegressionImpl (base.py:9). The runtime.py predict() function accepts any AbstractRegressionImpl instance, keeping stages decoupled from the concrete estimator choice.