
Regressor

Definition

A Regressor fits a statistical model mapping detrended feature columns to detrended yield residuals, then produces predictions during inference. All three concrete implementations inherit from the AbstractRegressionImpl ABC (models/regression/base.py:9). The regressor receives an already-detrended, fully-imputed feature matrix: trend removal and NaN imputation both happen upstream. The regressor itself enforces nan_policy='raise' and rejects any NaN input.

Detrending, weather correction, and re-trending are performed by the surrounding runtime.predict function (models/regression/runtime.py:140), not by the regressor. The regressor's responsibility is purely (X_detrended) → y_detrended.

Kind

ABC (AbstractRegressionImpl at models/regression/base.py:9).

Source of truth

market_insights_models/src/commodity_hindcast/models/regression/base.py:9

Required interface

from abc import ABC
from pathlib import Path
from typing import Self  # Python 3.11+; typing_extensions.Self on older interpreters

import pandas as pd


class AbstractRegressionImpl(ABC):
    def fit(
        self,
        X: pd.DataFrame,
        y: pd.Series,
        sample_weight: pd.Series | None = None,
    ) -> Self: ...

    def predict(self, X: pd.DataFrame) -> pd.Series: ...

    def save_model(self, path: Path) -> None: ...

    @classmethod
    def load_model(cls, path: Path) -> Self: ...

Shared helper functions defined at module level in base.py:

  • require_feature_columns (base.py:41): pops feature_columns from the params dict; raises if absent, which enforces construction via build_regressor().
  • require_nan_policy (base.py:55): enforces nan_policy='raise'; imputation must precede the regressor.
  • select_feature_frame (base.py:69): selects and copies the declared feature columns from X; raises KeyError on any mismatch.
  • assert_no_nan_features (base.py:81): raises ValueError listing the offending columns if any NaNs are present at fit or predict time.
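A minimal sketch of what the NaN guard's documented contract implies (a hypothetical re-implementation for illustration, not the actual code at base.py:81):

```python
import pandas as pd


def assert_no_nan_features(X: pd.DataFrame, stage: str = "fit") -> None:
    """Raise ValueError naming the offending columns if X contains any NaN.

    Illustrative sketch of the documented contract; the real helper lives
    in models/regression/base.py.
    """
    nan_cols = [col for col in X.columns if X[col].isna().any()]
    if nan_cols:
        raise ValueError(
            f"NaN values found at {stage} time in columns: {nan_cols}; "
            "imputation must run before the regressor."
        )
```

Calling this at both fit and predict time is what makes nan_policy='raise' an enforced boundary rather than a convention.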

Concrete implementations

  • RidgeRegressor (config key ridge; ridge_regressor.py:19). Hyperparameters: alpha=1.0, with any Ridge kwarg forwarded. Persistence: ridge_model.pkl, a joblib payload {"model": Ridge, "feature_columns": tuple, "params": dict}. When to use: fast, interpretable baseline; recommended for initial experiments and ablations.
  • PcaRidgeRegressor (config key pca_ridge; pca_ridge_regressor.py:65). Hyperparameters: n_components=2, alpha=1.0; Ridge(fit_intercept=False) because the StandardScaler already centres the data. Persistence: pca_ridge_model.pkl, a joblib payload {"pipeline": Pipeline, "params": dict, "feature_columns": tuple}. When to use: high-dimensional feature spaces where collinearity or overfitting is a concern; PCA components are sign-canonicalised across CV folds.
  • XGBRegressor (config key xgboost; xgb_regressor.py:41). Hyperparameters: n_estimators=20, max_depth=12, learning_rate=0.04, min_child_weight=1, subsample=0.85, colsample_bytree=0.85, early_stopping_rounds=10, eval_metric="mae". Persistence: model.json (XGBoost native format) plus metadata.pkl (joblib {"feature_columns": tuple, "params": dict}). When to use: non-linear interactions; grid search available via run_hyper_params=True.

Factory: build_regressor(config) at models/regression/__init__.py:58 reads config.model.regression, validates feature columns against commodity.feature_cols, enforces nan_policy='raise', then constructs the appropriate class. Raises ValueError on any unknown key.
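The dispatch shape can be sketched as a key-to-class registry (the registry literal and helper name below are illustrative assumptions, not the actual build_regressor() source):

```python
# Hypothetical registry mirroring the documented dispatch keys. The real
# factory also validates feature columns and nan_policy before construction.
REGRESSOR_REGISTRY = {
    "ridge": "RidgeRegressor",
    "pca_ridge": "PcaRidgeRegressor",
    "xgboost": "XGBRegressor",
}


def resolve_regressor_key(key: str) -> str:
    """Map a config.model.regression key to its implementation class name."""
    try:
        return REGRESSOR_REGISTRY[key]
    except KeyError:
        # Mirrors the documented behaviour: unknown keys fail loudly.
        raise ValueError(
            f"Unknown regression key {key!r}; expected one of "
            f"{sorted(REGRESSOR_REGISTRY)}"
        ) from None
```

Failing on unknown keys at construction time keeps misconfigured experiments from silently falling back to a default model.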

XGB hyperparameter grid

When run_hyper_params=True, an exhaustive grid search is run over XGB_PARAM_GRID (xgb_regressor.py:22):

XGB_PARAM_GRID = {
    "n_estimators":     (5, 10, 15, 20),
    "max_depth":        (4, 6, 8, 10, 12),
    "learning_rate":    (0.03, 0.04, 0.05, 0.06),
    "min_child_weight": (1, 3, 5),
    "subsample":        (0.85,),
    "colsample_bytree": (0.85,),
}

Grid size: 4 × 5 × 4 × 3 × 1 × 1 = 240 trials. The selection criterion is validation MAE (xgb_regressor.py:118). The validation split holds out the latest year in val_split_column (default "year"), falling back to a random 80/20 split when that holdout yields no data. This mirrors the single-holdout path of the original usda_wasde_experiment/train_feats_v2.py.
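The 240-trial count is just the Cartesian product over the grid's axes; as a self-contained check (re-stating the grid so the snippet runs on its own):

```python
from itertools import product

XGB_PARAM_GRID = {
    "n_estimators":     (5, 10, 15, 20),
    "max_depth":        (4, 6, 8, 10, 12),
    "learning_rate":    (0.03, 0.04, 0.05, 0.06),
    "min_child_weight": (1, 3, 5),
    "subsample":        (0.85,),
    "colsample_bytree": (0.85,),
}

# One candidate dict per point in the Cartesian product:
# 4 * 5 * 4 * 3 * 1 * 1 = 240.
trials = [dict(zip(XGB_PARAM_GRID, values))
          for values in product(*XGB_PARAM_GRID.values())]
print(len(trials))  # 240
```

Each trial dict is a complete hyperparameter set, so an exhaustive search fits 240 boosters per fold, which is why the open question below flags the cost in the walk-forward setting.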

PCA component-sign canonicalisation

PcaRidgeRegressor._orient_components_by_ridge_sign (pca_ridge_regressor.py:37) resolves the sign ambiguity introduced by sklearn's svd_flip. After fitting, for each PC i where ridge.coef_[i] < 0, both pca.components_[i] and ridge.coef_[i] are negated together. Because only the product components_.T @ coef_ affects predictions, this double flip is algebraically invariant: predict() output is unchanged bit for bit, while the PC axes consistently point in the direction of increasing predicted yield across CV folds.
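The invariance can be checked numerically with stand-in arrays (a toy sketch; the variable names mimic sklearn attributes but nothing here touches the real pipeline):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))          # toy feature matrix
components = rng.normal(size=(2, 4))  # stand-in for pca.components_
coef = np.array([0.8, -0.3])          # stand-in for ridge.coef_; PC 2 is negative


def linear_predict(components: np.ndarray, coef: np.ndarray) -> np.ndarray:
    # The linear part of a PCA -> ridge pipeline: (X @ components.T) @ coef.
    return X @ components.T @ coef


before = linear_predict(components, coef)

# Double flip: negate each component row whose ridge coefficient is negative,
# together with that coefficient. Multiplying by +/-1.0 is exact in floating
# point, so the product components.T @ coef is bit-for-bit unchanged.
flip = np.where(coef < 0, -1.0, 1.0)
components_flipped = components * flip[:, None]
coef_flipped = coef * flip

after = linear_predict(components_flipped, coef_flipped)
assert np.array_equal(before, after)  # bit-exact, not merely np.allclose
assert (coef_flipped >= 0).all()      # every PC now points toward higher predictions
```

Because sign flips are exact float operations, the assertion holds with array_equal rather than a tolerance-based comparison, matching the "bit-exact" claim.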

Persistence notes

  • Ridge and PCA-Ridge use a single joblib.dump file; S3 targets are handled transparently via local_context_for_folder.
  • XGBoost requires two files (model.json + metadata.pkl) because neither the XGBoost native serialiser nor joblib accepts S3 URIs directly; both files are staged to a temp directory and then mirrored via local_context_for_folder (xgb_regressor.py:267).
  • HindcastSlice.load_model probes all three registered classes in order (PcaRidge → Ridge → XGB) to handle heterogeneous run directories (results_slice.py:182).

Lifecycle

Instantiation: build_regressor(config) at models/regression/__init__.py:58, called from ExperimentConfig.build_regressor() or directly by FIT stage code.

Fit: run_fit calls model.fit(X_detrended, y_detrended, sample_weight=...) on the training fold. Rows where y is NaN are dropped inside fit before training. Returns Self.

Persist: model.save_model(slice.model_path) writes the fitted state to models/{commodity}/{fold_label}/.

Rehydrate: HindcastSlice.load_model(config) probes the three classes in order.

Predict: runtime.predict() calls model.predict(X_features) as step 3 of the chain: resolve feature columns → impute → predict → weather correction → inverse detrend.
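The five-step chain can be sketched as a plain function (a hypothetical outline of the orchestration order; every name except model.predict is illustrative, and the real signatures live in models/regression/runtime.py):

```python
import pandas as pd


def runtime_predict_sketch(model, detrender, imputer, weather_correct,
                           raw_frame: pd.DataFrame,
                           feature_cols: tuple) -> pd.Series:
    """Illustrative ordering of runtime.predict; not the actual implementation."""
    X = raw_frame[list(feature_cols)]           # 1. resolve declared feature columns
    X = imputer(X)                              # 2. impute -- the regressor rejects NaNs
    y_detrended = model.predict(X)              # 3. regressor inference (detrended space)
    y_corrected = weather_correct(y_detrended)  # 4. weather-correction post-processing
    return detrender(y_corrected)               # 5. inverse detrend back to yield units
```

The key point the sketch encodes is ordering: imputation must precede step 3 (the regressor enforces nan_policy='raise'), and the inverse detrend must come last so the weather correction operates on residuals.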

Tear-down: None.

Relationships

  • AbstractDetrend (Detrender) — fit_transform produces the detrended feature table the regressor trains on; inverse_transform re-trends the regressor's output.
  • runtime.predict (models/regression/runtime.py:140) — the pipeline-level orchestration wrapper; calls model.predict and chains weather correction and inverse detrend.
  • ExperimentConfig — carries model.regression (dispatch key), model.regression_params, and commodity.feature_cols.
  • HindcastSlice — owns model_path; provides load_model(config).
  • MedianImputer (lib/edit_and_imputation/imputation.py:73) — optional imputer injected into runtime.prepare_model_input; runs before the regressor boundary.
  • FitAggregationPolicy (lib/geo/aggregation.py:147) — optional ADM-level aggregation of weather-correction residuals, applied before or after regressor inference depending on weather_correction_fit_level.

Concepts and pipelines

  • source_regression — detailed module-by-module breakdown.
  • Walk-forward cross-validation — the regressor is re-fitted on each fold's training data and a new model artefact is written per fold.
  • Weather correction post-processing — apply_weather_correction_postprocess (runtime.py:40) scales the regressor's output by weather_correction_weight, applies season_doy_weather_weight interpolation, and clips by max_abs_weather_correction_bu_ac.
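The scale-then-clip shape of that post-processing step can be sketched as follows (an assumption-laden outline: the real apply_weather_correction_postprocess also interpolates season_doy_weather_weight, which is omitted here):

```python
import numpy as np


def weather_correction_sketch(residual: np.ndarray,
                              weather_correction_weight: float,
                              max_abs_weather_correction_bu_ac: float) -> np.ndarray:
    """Scale the regressor's residual output, then clamp its magnitude.

    Illustrative sketch only; the DOY-weight interpolation step from
    runtime.py:40 is not reproduced.
    """
    scaled = residual * weather_correction_weight
    # Clamp the correction to +/- max_abs_weather_correction_bu_ac bu/ac so a
    # single extreme residual cannot dominate the final yield estimate.
    return np.clip(scaled,
                   -max_abs_weather_correction_bu_ac,
                   max_abs_weather_correction_bu_ac)
```

The clip bound is expressed in bu/ac because it applies after re-trending context is known, bounding the correction in the same units as the yield itself.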

PRs and commits

No regressor-specific PRs identified in the recent log. The XGBoost hyperparameter grid traces to the original usda_wasde_experiment/train_feats_v2.py lineage.

Open questions

  • run_hyper_params=True triggers 240 full XGBoost fits per fold, which is expensive in the walk-forward setting. There is no budget or time limit on the grid search.
  • PcaRidgeRegressor logs timing checkpoints at DEBUG level (pca_ridge_regressor.py:127); these are not surfaced in MLflow metrics.
  • The val_split_column fallback to random 80/20 in XGBRegressor.fit means early-stopping results are not reproducible across runs when the latest year has no data.