
Regressor

Definition

A Regressor fits a statistical model mapping detrended feature columns to detrended yield residuals, then produces predictions during inference. All three concrete implementations inherit from the AbstractRegressionImpl ABC (models/regression/base.py:9). The regressor receives an already-detrended, fully-imputed feature matrix: trend removal and NaN imputation both happen upstream. The regressor itself enforces nan_policy='raise' and rejects any NaN input.

Detrending, weather correction, and re-trending are performed by the surrounding runtime.predict function (models/regression/runtime.py:140), not by the regressor. The regressor's responsibility is purely (X_detrended) → y_detrended.

Kind

ABC (AbstractRegressionImpl at models/regression/base.py:9).

Source of truth

market_insights_models/src/commodity_hindcast/models/regression/base.py:9

Required interface

from abc import ABC
from pathlib import Path
from typing import Self  # Python 3.11+; typing_extensions.Self on older interpreters

import pandas as pd


class AbstractRegressionImpl(ABC):
    def fit(
        self,
        X: pd.DataFrame,
        y: pd.Series,
        sample_weight: pd.Series | None = None,
    ) -> Self: ...

    def predict(self, X: pd.DataFrame) -> pd.Series: ...

    def save_model(self, path: Path) -> None: ...

    @classmethod
    def load_model(cls, path: Path) -> Self: ...

Shared helper functions defined at module level in base.py:

  • require_feature_columns (base.py:41): pops feature_columns from the params dict; raises if absent, which enforces construction via build_regressor().
  • require_nan_policy (base.py:55): enforces nan_policy='raise'; imputation must precede the regressor.
  • select_feature_frame (base.py:69): selects and copies the declared feature columns from X; raises KeyError on any mismatch.
  • assert_no_nan_features (base.py:81): raises ValueError listing the offending columns if any NaNs are present at fit or predict time.
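A minimal sketch of what the NaN guard's documented contract implies (a hypothetical re-implementation for illustration, not the actual code at base.py:81):

```python
import pandas as pd


def assert_no_nan_features(X: pd.DataFrame, stage: str = "fit") -> None:
    """Raise ValueError naming the offending columns if X contains any NaN.

    Illustrative sketch of the documented contract; the real helper lives
    in models/regression/base.py.
    """
    nan_cols = [col for col in X.columns if X[col].isna().any()]
    if nan_cols:
        raise ValueError(
            f"NaN values found at {stage} time in columns: {nan_cols}; "
            "imputation must run before the regressor."
        )
```

Calling this at both fit and predict time is what makes nan_policy='raise' an enforced boundary rather than a convention.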

Concrete implementations

  • RidgeRegressor (config key ridge; ridge_regressor.py:19). Hyperparameters: alpha=1.0, with any Ridge kwarg forwarded. Persistence: ridge_model.pkl, a joblib payload {"model": Ridge, "feature_columns": tuple, "params": dict}. When to use: fast, interpretable baseline; recommended for initial experiments and ablations.
  • PcaRidgeRegressor (config key pca_ridge; pca_ridge_regressor.py:65). Hyperparameters: n_components=2, alpha=1.0; Ridge(fit_intercept=False) because the StandardScaler already centres the data. Persistence: pca_ridge_model.pkl, a joblib payload {"pipeline": Pipeline, "params": dict, "feature_columns": tuple}. When to use: high-dimensional feature spaces where collinearity or overfitting is a concern; PCA components are sign-canonicalised across CV folds.
  • XGBRegressor (config key xgboost; xgb_regressor.py:41). Hyperparameters: n_estimators=20, max_depth=12, learning_rate=0.04, min_child_weight=1, subsample=0.85, colsample_bytree=0.85, early_stopping_rounds=10, eval_metric="mae". Persistence: model.json (XGBoost native format) plus metadata.pkl (joblib {"feature_columns": tuple, "params": dict}). When to use: non-linear interactions; grid search available via run_hyper_params=True.

Factory: build_regressor(config) at models/regression/__init__.py:58 reads config.model.regression, validates feature columns against commodity.feature_cols, enforces nan_policy='raise', then constructs the appropriate class. Raises ValueError on any unknown key.
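The dispatch shape can be sketched as a key-to-class registry (the registry literal and helper name below are illustrative assumptions, not the actual build_regressor() source):

```python
# Hypothetical registry mirroring the documented dispatch keys. The real
# factory also validates feature columns and nan_policy before construction.
REGRESSOR_REGISTRY = {
    "ridge": "RidgeRegressor",
    "pca_ridge": "PcaRidgeRegressor",
    "xgboost": "XGBRegressor",
}


def resolve_regressor_key(key: str) -> str:
    """Map a config.model.regression key to its implementation class name."""
    try:
        return REGRESSOR_REGISTRY[key]
    except KeyError:
        # Mirrors the documented behaviour: unknown keys fail loudly.
        raise ValueError(
            f"Unknown regression key {key!r}; expected one of "
            f"{sorted(REGRESSOR_REGISTRY)}"
        ) from None
```

Failing on unknown keys at construction time keeps misconfigured experiments from silently falling back to a default model.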

XGB hyperparameter grid

When run_hyper_params=True, an exhaustive grid search is run over XGB_PARAM_GRID (xgb_regressor.py:22):

XGB_PARAM_GRID = {
    "n_estimators":     (5, 10, 15, 20),
    "max_depth":        (4, 6, 8, 10, 12),
    "learning_rate":    (0.03, 0.04, 0.05, 0.06),
    "min_child_weight": (1, 3, 5),
    "subsample":        (0.85,),
    "colsample_bytree": (0.85,),
}

Grid size: 4 × 5 × 4 × 3 × 1 × 1 = 240 trials. The selection criterion is validation MAE (xgb_regressor.py:118). The validation split holds out the latest year in val_split_column (default "year"), falling back to a random 80/20 split when that holdout yields no data. This mirrors the single-holdout path of the original usda_wasde_experiment/train_feats_v2.py.
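The 240-trial count is just the Cartesian product over the grid's axes; as a self-contained check (re-stating the grid so the snippet runs on its own):

```python
from itertools import product

XGB_PARAM_GRID = {
    "n_estimators":     (5, 10, 15, 20),
    "max_depth":        (4, 6, 8, 10, 12),
    "learning_rate":    (0.03, 0.04, 0.05, 0.06),
    "min_child_weight": (1, 3, 5),
    "subsample":        (0.85,),
    "colsample_bytree": (0.85,),
}

# One candidate dict per point in the Cartesian product:
# 4 * 5 * 4 * 3 * 1 * 1 = 240.
trials = [dict(zip(XGB_PARAM_GRID, values))
          for values in product(*XGB_PARAM_GRID.values())]
print(len(trials))  # 240
```

Each trial dict is a complete hyperparameter set, so an exhaustive search fits 240 boosters per fold, which is why the open question below flags the cost in the walk-forward setting.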

PCA component-sign canonicalisation

PcaRidgeRegressor._orient_components_by_ridge_sign (pca_ridge_regressor.py:37) resolves the sign ambiguity introduced by sklearn's svd_flip. After fitting, for each PC i where ridge.coef_[i] < 0, both pca.components_[i] and ridge.coef_[i] are negated together. Because only the product components_.T @ coef_ affects predictions, this double flip is algebraically invariant: predict() output is unchanged bit for bit, while the PC axes consistently point in the direction of increasing predicted yield across CV folds.
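The invariance can be checked numerically with stand-in arrays (a toy sketch; the variable names mimic sklearn attributes but nothing here touches the real pipeline):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))          # toy feature matrix
components = rng.normal(size=(2, 4))  # stand-in for pca.components_
coef = np.array([0.8, -0.3])          # stand-in for ridge.coef_; PC 2 is negative


def linear_predict(components: np.ndarray, coef: np.ndarray) -> np.ndarray:
    # The linear part of a PCA -> ridge pipeline: (X @ components.T) @ coef.
    return X @ components.T @ coef


before = linear_predict(components, coef)

# Double flip: negate each component row whose ridge coefficient is negative,
# together with that coefficient. Multiplying by +/-1.0 is exact in floating
# point, so the product components.T @ coef is bit-for-bit unchanged.
flip = np.where(coef < 0, -1.0, 1.0)
components_flipped = components * flip[:, None]
coef_flipped = coef * flip

after = linear_predict(components_flipped, coef_flipped)
assert np.array_equal(before, after)  # bit-exact, not merely np.allclose
assert (coef_flipped >= 0).all()      # every PC now points toward higher predictions
```

Because sign flips are exact float operations, the assertion holds with array_equal rather than a tolerance-based comparison, matching the "bit-exact" claim.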

Persistence notes

  • Ridge and PCA-Ridge use a single joblib.dump file; S3 targets are handled transparently via local_context_for_folder.
  • XGBoost requires two files (model.json + metadata.pkl) because neither the XGBoost native serialiser nor joblib accepts S3 URIs directly; both files are staged to a temp directory and then mirrored via local_context_for_folder (xgb_regressor.py:267).
  • HindcastSlice.load_model probes all three registered classes in order (PcaRidge → Ridge → XGB) to handle heterogeneous run directories (results_slice.py:182).

Lifecycle

Instantiation: build_regressor(config) at models/regression/__init__.py:58, called from ExperimentConfig.build_regressor() or directly by FIT stage code.

Fit: run_fit calls model.fit(X_detrended, y_detrended, sample_weight=...) on the training fold. Rows where y is NaN are dropped inside fit before training. Returns Self.

Persist: model.save_model(slice.model_path) writes the fitted state to models/{commodity}/{fold_label}/.

Rehydrate: HindcastSlice.load_model(config) probes the three classes in order.

Predict: runtime.predict() calls model.predict(X_features) as step 3 of the chain: resolve feature columns → impute → predict → weather correction → inverse detrend.
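The five-step chain can be sketched as a plain function (a hypothetical outline of the orchestration order; every name except model.predict is illustrative, and the real signatures live in models/regression/runtime.py):

```python
import pandas as pd


def runtime_predict_sketch(model, detrender, imputer, weather_correct,
                           raw_frame: pd.DataFrame,
                           feature_cols: tuple) -> pd.Series:
    """Illustrative ordering of runtime.predict; not the actual implementation."""
    X = raw_frame[list(feature_cols)]           # 1. resolve declared feature columns
    X = imputer(X)                              # 2. impute -- the regressor rejects NaNs
    y_detrended = model.predict(X)              # 3. regressor inference (detrended space)
    y_corrected = weather_correct(y_detrended)  # 4. weather-correction post-processing
    return detrender(y_corrected)               # 5. inverse detrend back to yield units
```

The key point the sketch encodes is ordering: imputation must precede step 3 (the regressor enforces nan_policy='raise'), and the inverse detrend must come last so the weather correction operates on residuals.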

Tear-down: None.

Relationships

  • AbstractDetrend (Detrender) — fit_transform produces the detrended feature table the regressor trains on; inverse_transform re-trends the regressor's output.
  • runtime.predict (models/regression/runtime.py:140) — the pipeline-level orchestration wrapper; calls model.predict and chains weather correction and inverse detrend.
  • ExperimentConfig — carries model.regression (dispatch key), model.regression_params, and commodity.feature_cols.
  • HindcastSlice — owns model_path; provides load_model(config).
  • MedianImputer (lib/edit_and_imputation/imputation.py:73) — optional imputer injected into runtime.prepare_model_input; runs before the regressor boundary.
  • FitAggregationPolicy (lib/geo/aggregation.py:147) — optional ADM-level aggregation of weather-correction residuals, applied before or after regressor inference depending on weather_correction_fit_level.

Concepts and pipelines

  • source_regression — detailed module-by-module breakdown.
  • Walk-forward cross-validation — the regressor is re-fitted on each fold's training data and a new model artefact is written per fold.
  • Weather correction post-processing — apply_weather_correction_postprocess (runtime.py:40) scales the regressor's output by weather_correction_weight, applies season_doy_weather_weight interpolation, and clips by max_abs_weather_correction_bu_ac.
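The scale-then-clip shape of that post-processing step can be sketched as follows (an assumption-laden outline: the real apply_weather_correction_postprocess also interpolates season_doy_weather_weight, which is omitted here):

```python
import numpy as np


def weather_correction_sketch(residual: np.ndarray,
                              weather_correction_weight: float,
                              max_abs_weather_correction_bu_ac: float) -> np.ndarray:
    """Scale the regressor's residual output, then clamp its magnitude.

    Illustrative sketch only; the DOY-weight interpolation step from
    runtime.py:40 is not reproduced.
    """
    scaled = residual * weather_correction_weight
    # Clamp the correction to +/- max_abs_weather_correction_bu_ac bu/ac so a
    # single extreme residual cannot dominate the final yield estimate.
    return np.clip(scaled,
                   -max_abs_weather_correction_bu_ac,
                   max_abs_weather_correction_bu_ac)
```

The clip bound is expressed in bu/ac because it applies after re-trending context is known, bounding the correction in the same units as the yield itself.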

PRs and commits

No regressor-specific PRs identified in the recent log. The XGBoost hyperparameter grid traces to the original usda_wasde_experiment/train_feats_v2.py lineage.

Open questions

  • run_hyper_params=True triggers 240 full XGBoost fits per fold, which is expensive in the walk-forward setting. There is no budget or time limit on the grid search.
  • PcaRidgeRegressor logs timing checkpoints at DEBUG level (pca_ridge_regressor.py:127); these are not surfaced in MLflow metrics.
  • The val_split_column fallback to random 80/20 in XGBRegressor.fit means early-stopping results are not reproducible across runs when the latest year has no data.