
Concept: Walk-Forward Cross-Validation

What it is

Walk-forward CV is the hindcast pipeline's out-of-sample evaluation strategy. Unlike random k-fold splits, it respects temporal ordering: the fold for test year y trains on all data with season_year strictly before y, then predicts only on year y. Each fold sees a strictly growing training set (the "expanding window" property), which mirrors how the model would have been deployed in production.
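The expanding-window split can be sketched with a toy frame (illustrative only; the real splitting lives in ExpandingFoldGenerator):

```python
import pandas as pd

# Toy fit_df: one row per year (real frames have many rows per year).
df = pd.DataFrame({"year": [2016, 2017, 2018, 2019, 2020],
                   "yield_kg_ha": [3.1, 2.9, 3.4, 3.0, 3.2]})

# Expanding-window folds: training data grows by one season per fold.
for ty in [2018, 2019, 2020]:
    train = df[df["year"] < ty]    # everything strictly before the test year
    test = df[df["year"] == ty]    # the single held-out year
    print(ty, len(train), "train rows,", len(test), "test rows")
```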

The held-out years are config.experiment_protocol.test_years (e.g. [2018, 2019, 2020, 2021, 2022, 2023]). After all CV folds complete, a final production fit trains on every available year (no holdout) — the model consumed by the forecast pipeline.

Walk-forward CV serves two purposes:

  1. Honest OOS scoring. Because each fold's predictions were never seen during training, resulting metrics (RMSE, skill score) are honest estimates of real-world skill.

  2. Conformal calibration data. Per-fold OOS residuals (obs − sim) feed the three hindcast_oos_* residual modes. in_sample_pooled is the fallback for fit_production-only run_dirs. See conformal_modes.md.
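A minimal sketch of pooling per-fold OOS residuals for calibration (column names are illustrative, not the pipeline's actual schema):

```python
import pandas as pd

# One frame of held-out predictions per CV fold (columns are illustrative).
fold_preds = [
    pd.DataFrame({"fold": "2018", "obs": [3.1, 2.8], "sim": [3.0, 3.0]}),
    pd.DataFrame({"fold": "2019", "obs": [3.4, 3.0], "sim": [3.1, 3.2]}),
]

pooled = pd.concat(fold_preds, ignore_index=True)
pooled["residual"] = pooled["obs"] - pooled["sim"]   # obs − sim, as in the hindcast_oos_* modes

# A simple split-conformal half-width: the 90th percentile of |residual|.
half_width = pooled["residual"].abs().quantile(0.9)
```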

Where it lives in the code

| Component | Location | Purpose |
| --- | --- | --- |
| ExpandingFoldGenerator | run/experiment_protocol.py:110 | Yields (fold_label, train_data, test_data, year_data, references_fold) per fold |
| run_walk_forward | run/runner.py:27 | Iterates folds; accumulates rolling preds in memory; writes once per fold |
| run_experiment | run/experiment_protocol.py:22 | Fits detrender + model for one fold; writes train_preds.parquet |
| _run_walk_forward_phase | stages/run_hindcast.py | Sets up generator, calls run_walk_forward |
| _run_production_fit_phase | stages/run_hindcast.py | Trains with fold_label="production" on all data |

ExpandingFoldGenerator

ExpandingFoldGenerator.generate_folds (experiment_protocol.py:135) yields per test year:

train_data = self.fit_df[self.fit_df["year"] < ty]   # strictly expanding
test_data  = self.fit_df[self.fit_df["year"] == ty]
year_data  = self.pred_df[self.pred_df["year"] == ty]
yield str(ty), train_data, test_data, year_data, references_fold

The fold label is the string representation of the test year (e.g. "2022"). year_data contains all init_date values in hindcast_init_season_doys and is the input to the within-fold rolling prediction sweep.

AbstractFoldGenerator (experiment_protocol.py:92) is the ABC; only the expanding strategy is used in production.

Rolling prediction sweep within each fold

After fitting via run_experiment, run_walk_forward calls _predict_fold_rolling (runner.py:86) which:

  1. Loads the fold's artefacts from disk (load_model, load_detrender, load_feature_fill_values).
  2. Iterates sorted(year_data['init_date'].unique()).
  3. For each init_date, detrends the slice, runs the regression kernel, and writes predictions into an in-memory copy of year_data via positional masks.
  4. Converts to the wide walk_forward_preds schema and writes once per fold (runner.py:74).
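The steps above can be sketched as follows (predict_one is a placeholder standing in for the artefact loading, detrending, and kernel steps):

```python
import pandas as pd

def predict_fold_rolling_sketch(year_data: pd.DataFrame, predict_one) -> pd.DataFrame:
    """Accumulate predictions for every init_date in memory; return one frame.

    `predict_one` stands in for steps 1 and 3 above (artefact loading,
    detrending, and the regression kernel).
    """
    out = year_data.copy()                      # never touch the caller's frame
    for init_date in sorted(out["init_date"].unique()):
        mask = out["init_date"] == init_date    # positional mask for this slice
        out.loc[mask, "pred"] = predict_one(out[mask])
    return out                                  # written to disk once, by the caller

# Tiny usage example with a dummy kernel:
year_data = pd.DataFrame({"init_date": ["2022-04-01", "2022-03-01", "2022-04-01"],
                          "x": [1.0, 2.0, 3.0]})
preds = predict_fold_rolling_sketch(year_data, lambda s: s["x"] * 2)
```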

The runner does NOT route through stages/run_predict.run_predict() because that function has blind-overwrite semantics (run_predict.py:334): K sequential calls would destroy K-1 init_dates on disk, leaving only the last 1/K of the rows. The rationale is documented at runner.py:36–49.
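A miniature illustration of the hazard (CSV stands in for parquet to keep the sketch dependency-light; the path is made up):

```python
import os
import tempfile
import pandas as pd

# One prediction frame per init_date, as the rolling sweep produces them.
preds = [pd.DataFrame({"init_date": [f"2022-0{k}-01"], "pred": [3.0 + k / 10]})
         for k in (3, 4, 5)]
path = os.path.join(tempfile.mkdtemp(), "walk_forward_preds.csv")

# Blind-overwrite semantics: each call replaces the whole file,
# so K sequential calls leave only the last init_date on disk.
for p in preds:
    p.to_csv(path, index=False)
assert len(pd.read_csv(path)) == 1

# What the runner does instead: accumulate in memory, write once per fold.
pd.concat(preds, ignore_index=True).to_csv(path, index=False)
assert len(pd.read_csv(path)) == 3
```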

Production fit

After all CV folds, _run_production_fit_phase calls run_experiment with fold_label="production" and train_data = fit_df (all available years). The production HindcastSlice:

  • Is not in ExperimentResult.hindcast_slices (the CV tuple); accessed via ExperimentResult.production.
  • Is required before any ForecastSlice can delegate trained artefacts via ForecastSlice.training.
  • Has obs_yield_kg_ha = NaN in walk_forward_preds (no ground-truth at prediction time).
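The shape of the aggregate implied by these bullets might look like this (illustrative dataclasses, not the real entity definitions):

```python
from dataclasses import dataclass

@dataclass
class HindcastSlice:                  # stand-in for the real entity
    fold_label: str

@dataclass
class ExperimentResult:               # stand-in for the aggregate root
    hindcast_slices: tuple            # CV folds only ("2018" ... "2023")
    production: HindcastSlice         # the production fit, kept out of the CV tuple

result = ExperimentResult(
    hindcast_slices=tuple(HindcastSlice(str(y)) for y in range(2018, 2024)),
    production=HindcastSlice("production"),
)

# The production slice is never one of the CV folds:
cv_labels = {s.fold_label for s in result.hindcast_slices}
assert result.production.fold_label not in cv_labels
```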

The fit_production fast-path (run_hindcast.py:229) runs only the production fit — no walk-forward CV, no postprocess, no evaluate, no deliver. It is the entry point when only in_sample_pooled calibration is required.

Key invariants

  • Training data for fold k contains only years < k — strict filter at experiment_protocol.py:140.
  • Walk-forward rolling predictions are accumulated in memory and written to disk once per fold (runner.py:74).
  • run_experiment fits the model and writes train_preds.parquet but does NOT run the per-init-date prediction sweep — that is _predict_fold_rolling.
  • DESIGN.md line 100: "no in-memory state crosses a stage boundary." The runner loop is single-stage; cross-stage handoff happens via disk.

How it interacts with the pipeline

Walk-forward CV is entirely a hindcast concern; the forecast pipeline does not iterate folds. Artefacts from each CV fold are encapsulated in HindcastSlice. The complete set forms the ExperimentResult aggregate root. After all folds, postprocess_experiment pools CV residuals for conformal calibration (fit_and_save_all_configured, run_meta_models.py:85).

See the hindcast pipeline (forward ref: ../pipelines/hindcast.md) for the end-to-end walkthrough.

Pitfalls and historical bugs

Blind-overwrite in write_walk_forward_outputs: run_predict.write_walk_forward_outputs replaces the entire parquet on each call (run_predict.py:334). The runner avoids routing through it for walk-forward precisely to prevent K sequential calls leaving only the last init_date on disk (runner.py:36–49).

PRs and commits

No single PR introduced walk-forward CV; the mechanism predates the PR history captured here. runner.py and experiment_protocol.py were co-located in the restructure surrounding the captured PRs.

Open questions

  • ExpandingFoldGenerator is the only concrete implementation; fixed-window variants could be valuable but are not currently registered.
  • ExpandingFoldGenerator yields references_fold per fold; its schema and contract ({spec.name: DataFrame}) are not documented in the entity pages.