Regressor¶
Definition¶
A Regressor fits a statistical model mapping detrended feature columns to detrended yield residuals, then produces predictions during inference. All three concrete implementations share the AbstractRegressionImpl ABC (models/regression/base.py:9). The regressor receives an already-detrended, fully-imputed feature matrix — trend removal and NaN imputation both happen upstream. The regressor itself enforces nan_policy='raise' and rejects any NaN input.
Detrending, weather correction, and re-trending are performed by the surrounding runtime.predict function (models/regression/runtime.py:140), not by the regressor. The regressor's responsibility is purely (X_detrended) → y_detrended.
Kind¶
ABC (AbstractRegressionImpl at models/regression/base.py:9).
Source of truth¶
market_insights_models/src/commodity_hindcast/models/regression/base.py:9
Required interface¶
class AbstractRegressionImpl(ABC):
    def fit(
        self,
        X: pd.DataFrame,
        y: pd.Series,
        sample_weight: pd.Series | None = None,
    ) -> Self: ...

    def predict(self, X: pd.DataFrame) -> pd.Series: ...

    def save_model(self, path: Path) -> None: ...

    @classmethod
    def load_model(cls, path: Path) -> Self: ...
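To make the contract concrete, the following is a toy implementation of the four required methods. `MeanRegressor` is hypothetical (it is not one of the three real implementations) and uses `pickle` for brevity where the real classes use joblib; it exists only to show the fit/predict/save/load round-trip and the NaN-rejection boundary.

```python
from __future__ import annotations

import pickle
from pathlib import Path

import pandas as pd


class MeanRegressor:
    """Toy regressor satisfying the AbstractRegressionImpl contract:
    predicts the training mean of the detrended target. Illustrative only."""

    def __init__(self, feature_columns: tuple[str, ...]):
        self.feature_columns = feature_columns
        self._mean = None

    def fit(self, X: pd.DataFrame, y: pd.Series,
            sample_weight: pd.Series | None = None) -> "MeanRegressor":
        frame = X[list(self.feature_columns)]
        if frame.isna().any().any():  # analogue of nan_policy='raise'
            raise ValueError("NaN features must be imputed upstream")
        self._mean = float(y.mean())
        return self

    def predict(self, X: pd.DataFrame) -> pd.Series:
        if self._mean is None:
            raise RuntimeError("fit() must be called before predict()")
        return pd.Series(self._mean, index=X.index)

    def save_model(self, path: Path) -> None:
        path.write_bytes(pickle.dumps(self.__dict__))

    @classmethod
    def load_model(cls, path: Path) -> "MeanRegressor":
        obj = cls.__new__(cls)
        obj.__dict__.update(pickle.loads(path.read_bytes()))
        return obj
```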
Shared helper functions defined at module level in base.py:
| Function | Location | Purpose |
|---|---|---|
| `require_feature_columns` | base.py:41 | Pops `feature_columns` from the params dict; raises if absent — enforces construction via `build_regressor()`. |
| `require_nan_policy` | base.py:55 | Enforces `nan_policy='raise'`; imputation must precede the regressor. |
| `select_feature_frame` | base.py:69 | Selects and copies the declared feature columns from X; raises KeyError on any mismatch. |
| `assert_no_nan_features` | base.py:81 | Raises ValueError listing the offending columns if any NaNs are present at fit or predict time. |
Concrete implementations¶
| Class | Config key | File | Hyperparameters (defaults) | Persistence format | When to use |
|---|---|---|---|---|---|
| `RidgeRegressor` | `ridge` | ridge_regressor.py:19 | `alpha=1.0` (any Ridge kwarg forwarded) | `ridge_model.pkl` — joblib payload `{"model": Ridge, "feature_columns": tuple, "params": dict}` | Fast, interpretable baseline; recommended for initial experiments and ablations |
| `PcaRidgeRegressor` | `pca_ridge` | pca_ridge_regressor.py:65 | `n_components=2`, `alpha=1.0`; `Ridge(fit_intercept=False)` because StandardScaler centres | `pca_ridge_model.pkl` — joblib payload `{"pipeline": Pipeline, "params": dict, "feature_columns": tuple}` | High-dimensional feature spaces where collinearity or overfitting is a concern; PCA components are sign-canonicalised across CV folds |
| `XGBRegressor` | `xgboost` | xgb_regressor.py:41 | `n_estimators=20`, `max_depth=12`, `learning_rate=0.04`, `min_child_weight=1`, `subsample=0.85`, `colsample_bytree=0.85`, `early_stopping_rounds=10`, `eval_metric="mae"` | `model.json` (XGBoost native) + `metadata.pkl` (joblib `{"feature_columns": tuple, "params": dict}`) | Non-linear interactions; grid search available via `run_hyper_params=True` |
Factory: build_regressor(config) at models/regression/__init__.py:58 reads config.model.regression, validates feature columns against commodity.feature_cols, enforces nan_policy='raise', then constructs the appropriate class. Raises ValueError on any unknown key.
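A hedged sketch of the factory's validate-then-dispatch flow is below. The real `build_regressor` receives an `ExperimentConfig` object and constructs the class itself; here the config is a plain nested dict and only the class name is returned, so every structural detail beyond the three dispatch keys and the validation steps named above is an assumption.

```python
# Dispatch keys are the real config keys from the table above; the mapping
# to class names is what models/regression/__init__.py is described as doing.
REGISTRY = {
    "ridge": "RidgeRegressor",
    "pca_ridge": "PcaRidgeRegressor",
    "xgboost": "XGBRegressor",
}


def build_regressor_sketch(config: dict) -> tuple[str, dict]:
    key = config["model"]["regression"]
    params = dict(config["model"].get("regression_params", {}))
    allowed = set(config["commodity"]["feature_cols"])
    requested = set(params.get("feature_columns", allowed))
    bad = requested - allowed
    if bad:  # feature columns must come from commodity.feature_cols
        raise ValueError(f"Unknown feature columns: {sorted(bad)}")
    if params.get("nan_policy", "raise") != "raise":
        raise ValueError("nan_policy must be 'raise'; impute upstream")
    if key not in REGISTRY:  # unknown dispatch key fails loudly
        raise ValueError(f"Unknown regression key {key!r}")
    return REGISTRY[key], params
```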
XGB hyperparameter grid¶
When run_hyper_params=True, an exhaustive grid search is run over XGB_PARAM_GRID (xgb_regressor.py:22):
XGB_PARAM_GRID = {
    "n_estimators": (5, 10, 15, 20),
    "max_depth": (4, 6, 8, 10, 12),
    "learning_rate": (0.03, 0.04, 0.05, 0.06),
    "min_child_weight": (1, 3, 5),
    "subsample": (0.85,),
    "colsample_bytree": (0.85,),
}
Grid size: 4 × 5 × 4 × 3 × 1 × 1 = 240 trials. The selection criterion is validation MAE (xgb_regressor.py:118). The validation split holds out the latest year in val_split_column (default "year"), falling back to a random 80/20 split when that year has no data. This mirrors the single-holdout path of the original usda_wasde_experiment/train_feats_v2.py.
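The trial enumeration is a plain Cartesian product of the grid's axes, which reproduces the 240-trial count:

```python
from itertools import product

# Cartesian product of the XGB_PARAM_GRID axes above:
# 4 * 5 * 4 * 3 * 1 * 1 = 240 candidate hyperparameter dicts.
XGB_PARAM_GRID = {
    "n_estimators": (5, 10, 15, 20),
    "max_depth": (4, 6, 8, 10, 12),
    "learning_rate": (0.03, 0.04, 0.05, 0.06),
    "min_child_weight": (1, 3, 5),
    "subsample": (0.85,),
    "colsample_bytree": (0.85,),
}

trials = [dict(zip(XGB_PARAM_GRID, combo))
          for combo in product(*XGB_PARAM_GRID.values())]
```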
PCA component-sign canonicalisation¶
PcaRidgeRegressor._orient_components_by_ridge_sign (pca_ridge_regressor.py:37) resolves the sign ambiguity introduced by sklearn's svd_flip. After fitting, for each PC i where ridge.coef_[i] < 0, both pca.components_[i] and ridge.coef_[i] are negated together. Because only the product components_.T @ coef_ affects predictions, this double-flip is algebraically invariant — predict() output is bit-exact unchanged, but PC axes consistently point in the direction of increasing predicted yield across CV folds.
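The invariance claim is easy to verify numerically. Ignoring the scaler and intercept, a PCA-plus-ridge prediction reduces to `X @ components.T @ coef`; negating row `i` of `components` together with `coef[i]` leaves every term of that product unchanged. The matrices below are random stand-ins, not the real fitted objects:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))            # detrended feature rows
components = rng.normal(size=(2, 4))   # stand-in for pca.components_
coef = np.array([-0.7, 1.3])           # stand-in for ridge.coef_

before = X @ components.T @ coef

# Double-flip every component whose ridge coefficient is negative.
flip = coef < 0
components[flip] *= -1
coef = np.where(flip, -coef, coef)

after = X @ components.T @ coef
# 'before' and 'after' agree to machine precision, while every PC now has
# a non-negative ridge weight (points toward higher predicted yield).
```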
Persistence notes¶
- Ridge and PCA-Ridge use a single `joblib.dump` file; S3 targets are handled transparently via `local_context_for_folder`.
- XGBoost requires two files (`model.json` + `metadata.pkl`) because neither the XGBoost native serialiser nor `joblib` accepts S3 URIs directly; both are staged to a temp directory and then mirrored via `local_context_for_folder` (xgb_regressor.py:267).
- `HindcastSlice.load_model` probes all three registered classes in order (PcaRidge → Ridge → XGB) to handle heterogeneous run directories (results_slice.py:182).
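The probe-in-order rehydration strategy can be sketched as a try-each-loader loop. The ordering matches results_slice.py:182 as described above; the error handling and return shape are assumptions for illustration:

```python
def probe_load_sketch(loaders, path):
    """Try each registered class's loader in order (PcaRidge -> Ridge -> XGB
    in the real code) and return the first one that succeeds."""
    failures = []
    for name, load in loaders:
        try:
            return name, load(path)
        except Exception as exc:  # a loader rejecting a foreign artefact
            failures.append(f"{name}: {exc}")
    raise ValueError(f"No registered regressor could load {path}: {failures}")
```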
Lifecycle¶
Instantiation: build_regressor(config) at models/regression/__init__.py:58, called from ExperimentConfig.build_regressor() or directly by FIT stage code.
Fit: run_fit calls model.fit(X_detrended, y_detrended, sample_weight=...) on the training fold. Rows where y is NaN are dropped inside fit before training. Returns Self.
Persist: model.save_model(slice.model_path) writes the fitted state to models/{commodity}/{fold_label}/.
Rehydrate: HindcastSlice.load_model(config) probes the three classes in order.
Predict: runtime.predict() calls model.predict(X_features) as step 3 of the chain: resolve feature columns → impute → predict → weather correction → inverse detrend.
Tear-down: None.
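The five-step ordering in the Predict step can be sketched as a single function. The argument names and signature here are invented for illustration; the real orchestration lives in models/regression/runtime.py:140 and takes different inputs:

```python
import pandas as pd


def predict_chain_sketch(model, detrender, impute, X_raw, feature_cols,
                         weather_correct):
    """Sketch of the runtime.predict ordering (stub arguments)."""
    X = X_raw[feature_cols]                          # 1. resolve feature columns
    X = impute(X)                                    # 2. impute NaNs upstream of model
    y_detrended = model.predict(X)                   # 3. regressor inference
    y_detrended = weather_correct(y_detrended)       # 4. weather correction
    return detrender.inverse_transform(y_detrended)  # 5. inverse detrend
```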
Relationships¶
- `AbstractDetrend` (Detrender) — `fit_transform` produces the detrended feature table the regressor trains on; `inverse_transform` re-trends the regressor's output.
- `runtime.predict` (models/regression/runtime.py:140) — the pipeline-level orchestration wrapper; calls `model.predict` and chains weather correction and inverse detrend.
- `ExperimentConfig` — carries `model.regression` (dispatch key), `model.regression_params`, and `commodity.feature_cols`.
- `HindcastSlice` — owns `model_path`; provides `load_model(config)`.
- `MedianImputer` (lib/edit_and_imputation/imputation.py:73) — optional imputer injected into `runtime.prepare_model_input`; runs before the regressor boundary.
- `FitAggregationPolicy` (lib/geo/aggregation.py:147) — optional ADM-level aggregation of weather-correction residuals, applied before or after regressor inference depending on `weather_correction_fit_level`.
Concepts and pipelines¶
- source_regression — detailed module-by-module breakdown.
- Walk-forward cross-validation — the regressor is re-fitted on each fold's training data and a new model artefact is written per fold.
- Weather correction post-processing — `apply_weather_correction_postprocess` (runtime.py:40) scales the regressor's output by `weather_correction_weight`, applies `season_doy_weather_weight` interpolation, and clips by `max_abs_weather_correction_bu_ac`.
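The scale-then-clip part of that post-processing can be sketched as below. The `season_doy_weather_weight` interpolation step is deliberately omitted, and the function name and signature here are illustrative, not the real runtime.py:40 API:

```python
import pandas as pd


def weather_correction_sketch(correction: pd.Series,
                              weight: float,
                              max_abs_bu_ac: float) -> pd.Series:
    """Scale the regressor's weather-correction output by a global weight,
    then clip it to a symmetric bu/ac band."""
    scaled = correction * weight
    return scaled.clip(lower=-max_abs_bu_ac, upper=max_abs_bu_ac)
```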
PRs and commits¶
No regressor-specific PRs identified in the recent log. The XGBoost hyperparameter grid traces to the original usda_wasde_experiment/train_feats_v2.py lineage.
Open questions¶
- `run_hyper_params=True` triggers 240 full XGBoost fits per fold, which is expensive in the walk-forward setting. There is no budget or time limit on the grid search.
- `PcaRidgeRegressor` logs timing checkpoints at DEBUG level (pca_ridge_regressor.py:127); these are not surfaced in MLflow metrics.
- The `val_split_column` fallback to a random 80/20 split in `XGBRegressor.fit` means early-stopping results are not reproducible across runs when the latest year has no data.