
Concept: Walk-Forward Cross-Validation

What it is

Walk-forward CV is the hindcast pipeline's out-of-sample evaluation strategy. Unlike random k-fold splits, it respects temporal ordering: the fold for test year y trains on all data with season_year strictly before y, then predicts only on year y. Each fold sees a strictly growing training set (the "expanding window" property), which mirrors how the model would have been deployed in production.
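The expanding-window split can be sketched with a toy frame (illustrative only; the real splitting lives in ExpandingFoldGenerator):

```python
import pandas as pd

# Toy fit_df: one row per year (real frames have many rows per year).
df = pd.DataFrame({"year": [2016, 2017, 2018, 2019, 2020],
                   "yield_kg_ha": [3.1, 2.9, 3.4, 3.0, 3.2]})

# Expanding-window folds: training data grows by one season per fold.
for ty in [2018, 2019, 2020]:
    train = df[df["year"] < ty]    # everything strictly before the test year
    test = df[df["year"] == ty]    # the single held-out year
    print(ty, len(train), "train rows,", len(test), "test rows")
```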

The held-out years are config.experiment_protocol.test_years (e.g. [2018, 2019, 2020, 2021, 2022, 2023]). After all CV folds complete, a final production fit trains on every available year (no holdout) — the model consumed by the forecast pipeline.

Walk-forward CV serves two purposes:

  1. Honest OOS scoring. Because each fold's predictions were never seen during training, resulting metrics (RMSE, skill score) are honest estimates of real-world skill.

  2. Conformal calibration data. Per-fold OOS residuals (obs − sim) feed the three hindcast_oos_* residual modes. in_sample_pooled is the fallback for fit_production-only run_dirs. See conformal_modes.md.
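A minimal sketch of pooling per-fold OOS residuals for calibration (column names are illustrative, not the pipeline's actual schema):

```python
import pandas as pd

# One frame of held-out predictions per CV fold (columns are illustrative).
fold_preds = [
    pd.DataFrame({"fold": "2018", "obs": [3.1, 2.8], "sim": [3.0, 3.0]}),
    pd.DataFrame({"fold": "2019", "obs": [3.4, 3.0], "sim": [3.1, 3.2]}),
]

pooled = pd.concat(fold_preds, ignore_index=True)
pooled["residual"] = pooled["obs"] - pooled["sim"]   # obs − sim, as in the hindcast_oos_* modes

# A simple split-conformal half-width: the 90th percentile of |residual|.
half_width = pooled["residual"].abs().quantile(0.9)
```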

Where it lives in the code

| Component | Location | Purpose |
| --- | --- | --- |
| ExpandingFoldGenerator | run/experiment_protocol.py:110 | Yields (fold_label, train_data, test_data, year_data, references_fold) per fold |
| run_walk_forward | run/runner.py:27 | Iterates folds; accumulates rolling preds in memory; writes once per fold |
| run_experiment | run/experiment_protocol.py:22 | Fits detrender + model for one fold; writes train_preds.parquet |
| _run_walk_forward_phase | stages/run_hindcast.py | Sets up generator, calls run_walk_forward |
| _run_production_fit_phase | stages/run_hindcast.py | Trains with fold_label="production" on all data |

ExpandingFoldGenerator

ExpandingFoldGenerator.generate_folds (experiment_protocol.py:135) yields per test year:

train_data = self.fit_df[self.fit_df["year"] < ty]   # strictly expanding
test_data  = self.fit_df[self.fit_df["year"] == ty]
year_data  = self.pred_df[self.pred_df["year"] == ty]
yield str(ty), train_data, test_data, year_data, references_fold

The fold label is the string representation of the test year (e.g. "2022"). year_data contains all init_date values in hindcast_init_season_doys and is the input to the within-fold rolling prediction sweep.

AbstractFoldGenerator (experiment_protocol.py:92) is the ABC; only the expanding strategy is used in production.

Rolling prediction sweep within each fold

After fitting via run_experiment, run_walk_forward calls _predict_fold_rolling (runner.py:86) which:

  1. Loads the fold's artefacts from disk (load_model, load_detrender, load_feature_fill_values).
  2. Iterates sorted(year_data['init_date'].unique()).
  3. For each init_date, detrends the slice, runs the regression kernel, and writes predictions into an in-memory copy of year_data via positional masks.
  4. Converts to the wide walk_forward_preds schema and writes once per fold (runner.py:74).
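The steps above can be sketched as follows (predict_one is a placeholder standing in for the artefact loading, detrending, and kernel steps):

```python
import pandas as pd

def predict_fold_rolling_sketch(year_data: pd.DataFrame, predict_one) -> pd.DataFrame:
    """Accumulate predictions for every init_date in memory; return one frame.

    `predict_one` stands in for steps 1 and 3 above (artefact loading,
    detrending, and the regression kernel).
    """
    out = year_data.copy()                      # never touch the caller's frame
    for init_date in sorted(out["init_date"].unique()):
        mask = out["init_date"] == init_date    # positional mask for this slice
        out.loc[mask, "pred"] = predict_one(out[mask])
    return out                                  # written to disk once, by the caller

# Tiny usage example with a dummy kernel:
year_data = pd.DataFrame({"init_date": ["2022-04-01", "2022-03-01", "2022-04-01"],
                          "x": [1.0, 2.0, 3.0]})
preds = predict_fold_rolling_sketch(year_data, lambda s: s["x"] * 2)
```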

The runner does NOT route through stages/run_predict.run_predict() because that function has blind-overwrite semantics (run_predict.py:334): K sequential calls would destroy K-1 init_dates on disk, leaving only the last 1/K of the rows. The rationale is documented at runner.py:36–49.
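A miniature illustration of the hazard (CSV stands in for parquet to keep the sketch dependency-light; the path is made up):

```python
import os
import tempfile
import pandas as pd

# One prediction frame per init_date, as the rolling sweep produces them.
preds = [pd.DataFrame({"init_date": [f"2022-0{k}-01"], "pred": [3.0 + k / 10]})
         for k in (3, 4, 5)]
path = os.path.join(tempfile.mkdtemp(), "walk_forward_preds.csv")

# Blind-overwrite semantics: each call replaces the whole file,
# so K sequential calls leave only the last init_date on disk.
for p in preds:
    p.to_csv(path, index=False)
assert len(pd.read_csv(path)) == 1

# What the runner does instead: accumulate in memory, write once per fold.
pd.concat(preds, ignore_index=True).to_csv(path, index=False)
assert len(pd.read_csv(path)) == 3
```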

Production fit

After all CV folds, _run_production_fit_phase calls run_experiment with fold_label="production" and train_data = fit_df (all available years). The production HindcastSlice:

  • Is not in ExperimentResult.hindcast_slices (the CV tuple); accessed via ExperimentResult.production.
  • Is required before any ForecastSlice can delegate trained artefacts via ForecastSlice.training.
  • Has obs_yield_kg_ha = NaN in walk_forward_preds (no ground-truth at prediction time).
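The shape of the aggregate implied by these bullets might look like this (illustrative dataclasses, not the real entity definitions):

```python
from dataclasses import dataclass

@dataclass
class HindcastSlice:                  # stand-in for the real entity
    fold_label: str

@dataclass
class ExperimentResult:               # stand-in for the aggregate root
    hindcast_slices: tuple            # CV folds only ("2018" ... "2023")
    production: HindcastSlice         # the production fit, kept out of the CV tuple

result = ExperimentResult(
    hindcast_slices=tuple(HindcastSlice(str(y)) for y in range(2018, 2024)),
    production=HindcastSlice("production"),
)

# The production slice is never one of the CV folds:
cv_labels = {s.fold_label for s in result.hindcast_slices}
assert result.production.fold_label not in cv_labels
```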

The fit_production fast-path (run_hindcast.py:229) runs only the production fit — no walk-forward CV, no postprocess, no evaluate, no deliver. It is the entry point when only in_sample_pooled calibration is required.

Key invariants

  • Training data for fold k contains only years < k — strict filter at experiment_protocol.py:140.
  • Walk-forward rolling predictions are accumulated in memory and written to disk once per fold (runner.py:74).
  • run_experiment fits the model and writes train_preds.parquet but does NOT run the per-init-date prediction sweep — that is _predict_fold_rolling.
  • DESIGN.md line 100: "no in-memory state crosses a stage boundary." The runner loop is single-stage; cross-stage handoff happens via disk.

How it interacts with the pipeline

Walk-forward CV is entirely a hindcast concern; the forecast pipeline does not iterate folds. Artefacts from each CV fold are encapsulated in HindcastSlice. The complete set forms the ExperimentResult aggregate root. After all folds, postprocess_experiment pools CV residuals for conformal calibration (fit_and_save_all_configured, run_meta_models.py:85).

See the hindcast pipeline (forward ref: ../pipelines/hindcast.md) for the end-to-end walkthrough.

Pitfalls and historical bugs

Blind-overwrite in write_walk_forward_outputs: run_predict.write_walk_forward_outputs replaces the entire parquet on each call (run_predict.py:334). The runner avoids routing through it for walk-forward precisely to prevent K sequential calls leaving only the last init_date on disk (runner.py:36–49).

PRs and commits

No single PR introduced walk-forward CV; the mechanism predates the PR history captured here. runner.py and experiment_protocol.py were co-located in the restructure surrounding the captured PRs.

Open questions

  • ExpandingFoldGenerator is the only concrete implementation; fixed-window variants could be valuable but are not currently registered.
  • ExpandingFoldGenerator yields references_fold per fold; its schema and contract ({spec.name: DataFrame}) are not documented in the entity pages.