
Concept: Experiment Protocol

What it is

The experiment protocol governs how historical data is partitioned into folds for walk-forward evaluation. It is the bridge between the raw feature parquets (fit.parquet / pred.parquet) and the per-fold train/test splits that drive the hindcast pipeline. The package uses an ExpandingFoldGenerator (currently the only concrete implementation), which, for each entry in test_years, constructs a training set from all rows with year < test_year and a test set from rows with year == test_year. Because the training set grows by one year at each step, the window is "expanding": the model for fold 2022 has seen more data than the model for fold 2018, mirroring how the model would have been deployed in production.
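The expanding window can be sketched in a few lines. This is a minimal stand-in using plain dicts; the real ExpandingFoldGenerator in run/experiment_protocol.py operates on the fit/pred DataFrames and yields richer fold objects:

```python
# Minimal sketch of the expanding-window split described above (hypothetical
# helper, not the real ExpandingFoldGenerator).
def expanding_folds(rows, test_years):
    """Yield (fold_label, train_rows, test_rows) with a growing training window."""
    for ty in sorted(test_years):
        train = [r for r in rows if r["year"] < ty]   # strictly before the test year
        test = [r for r in rows if r["year"] == ty]   # the held-out season
        yield str(ty), train, test

rows = [{"year": y} for y in (2016, 2017, 2018, 2019)]
folds = {label: (len(train), len(test))
         for label, train, test in expanding_folds(rows, [2018, 2019])}
# fold "2019" has seen one more training year than fold "2018"
```

Note that sorted(test_years) matches the invariant below that folds are iterated in ascending order.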

After all numeric folds, a final production fold trains on the entire available dataset (no holdout). The production fold is not an evaluation fold — it exists solely to produce the model artefacts consumed by the forecast pipeline.

Where it lives in the code

  • run/experiment_protocol.py: ExpandingFoldGenerator (line 110), run_experiment (line 22), and AbstractFoldGenerator (line 92).
  • config.py:ExperimentProtocolConfig (line 483) — the Pydantic configuration aggregate carrying test_years, cv_strategy, production_cumulative_threshold, and production_recent_years.

Fold types

Numeric folds — one per entry in config.experiment_protocol.test_years (e.g. [2018, 2019, 2020, 2021, 2022]). The fold label is str(test_year), e.g. "2022". Each fold produces:

  • A fitted model under models/{experiment_key}/{fold_label}/.
  • train_preds.parquet — in-sample predictions on the fold's training rows, used by in_sample_pooled conformal calibration.
  • walk_forward_preds.parquet — rolling out-of-sample predictions across every init_date in the held-out season_year, used by all three hindcast_oos_* residual modes.

Production fold — always named "production"; trains on all available data including the most recent season. It has no held-out evaluation year, so obs_yield_kg_ha is NaN in walk_forward_preds.parquet. It is accessed via ExperimentResult.production, not via ExperimentResult.hindcast_slices.
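The production fold's contract is simple enough to state as code (a hypothetical stand-in; the real fold is produced by the generator in run/experiment_protocol.py):

```python
# Sketch of the production-fold contract described above: train on every row,
# hold nothing out. Hypothetical helper, not the package's implementation.
def production_fold(rows):
    """The production fold trains on all available data; there is no evaluation split."""
    return "production", list(rows), []   # label, train rows, empty test set

label, train, test = production_fold([{"year": 2021}, {"year": 2022}])
# train contains every row, including the most recent season; test is empty
```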

Walk-forward semantics

At fold k, ExpandingFoldGenerator.generate_folds applies a strict filter:

train_data = self.fit_df[self.fit_df["year"] < ty]     # strictly before test year
test_data  = self.fit_df[self.fit_df["year"] == ty]    # the held-out season itself
year_data  = self.pred_df[self.pred_df["year"] == ty]  # all init_dates for that season

year_data contains all configured init_date values for the held-out year and feeds the within-fold rolling prediction sweep in run/runner._predict_fold_rolling. The fold yields a references_fold dict ({spec.name: DataFrame}) scoped to marketing_year == test_year so diagnostic consumers can look up reference benchmarks without re-loading the source file.
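The references_fold scoping can be illustrated with a small sketch (build_references_fold and the spec data are hypothetical; the real dict is assembled inside ExpandingFoldGenerator.generate_folds):

```python
# Hypothetical sketch of the {spec.name: DataFrame} bundle described above,
# using lists of dicts in place of DataFrames.
def build_references_fold(reference_data, test_year):
    """Map each reference spec name to its rows for the held-out marketing year."""
    return {
        name: [r for r in rows if r["marketing_year"] == test_year]
        for name, rows in reference_data.items()
    }

reference_data = {"benchmark_a": [{"marketing_year": 2021, "yield": 3.1},
                                  {"marketing_year": 2022, "yield": 3.4}]}
refs = build_references_fold(reference_data, 2022)
# only the rows for the fold's test year survive the scoping
```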

Key invariants

  • The production fold is always present in an ExperimentResult after a full hindcast run: _run_walk_forward_phase is always followed by _run_production_fit_phase.
  • Numeric folds are ordered by season_year ascending; test_years is iterated in sorted order.
  • Each fold has a cutoff property on HindcastSlice: date(int(fold_label), 1, 1) for numeric folds; date(feature_end_year + 1, 1, 1) for the production fold (lib/results/results_slice.py:151).
  • run_experiment (experiment_protocol.py:22) fits the model and persists train_preds.parquet but does not run the per-init_date rolling sweep — that is run/runner._predict_fold_rolling.
  • County selection (via production_cumulative_threshold and production_recent_years) is applied once in _load_and_preprocess before the fold loop, not per fold.
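The cutoff rule from the invariants above can be written out directly (a sketch mirroring the property on HindcastSlice in lib/results/results_slice.py, not the property itself):

```python
# Sketch of the per-fold cutoff rule quoted in the invariants above.
from datetime import date

def fold_cutoff(fold_label: str, feature_end_year: int) -> date:
    if fold_label == "production":
        # first day after the last year of available features
        return date(feature_end_year + 1, 1, 1)
    # numeric fold: first day of the held-out year
    return date(int(fold_label), 1, 1)

assert fold_cutoff("2022", 2023) == date(2022, 1, 1)
assert fold_cutoff("production", 2023) == date(2024, 1, 1)
```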

How it interacts with the pipeline

run_hindcast._run_walk_forward_phase constructs an ExpandingFoldGenerator and passes it to run_walk_forward (run/runner.py:27). For each fold yielded:

  1. run_experiment(fold_label, train_data, config) fits the detrender + regressor and writes train_preds.parquet.
  2. _predict_fold_rolling sweeps every init_date in the fold's year_data in memory and writes walk_forward_preds.parquet once per fold.
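The two steps above reduce to a fit-once, sweep-many loop. In this stand-in sketch, fit_model and the prediction rule are placeholders, not the real detrender + regressor:

```python
# Stand-in sketch of the per-fold loop described above; the "model" here is
# just the mean of the training target.
def fit_model(train_rows):
    ys = [r["y"] for r in train_rows]
    return sum(ys) / len(ys)

def run_fold(train_rows, year_rows):
    model = fit_model(train_rows)                        # step 1: fit once per fold
    return [{"init_date": r["init_date"], "pred": model}
            for r in year_rows]                          # step 2: sweep every init_date in memory

preds = run_fold([{"y": 2.0}, {"y": 4.0}],
                 [{"init_date": "2022-04-01"}, {"init_date": "2022-05-01"}])
# one prediction row per init_date, all written together once per fold
```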

The runner does NOT route through stages/run_predict.run_predict() because that function has blind-overwrite semantics: K sequential calls would destroy K-1 init_date rows on disk, leaving only the final row. See walk_forward_cv for detail.
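A toy illustration of that hazard (write_overwrite and the in-memory store are hypothetical stand-ins for run_predict's on-disk behaviour):

```python
# Toy model of blind-overwrite semantics: each write replaces the file wholesale,
# so per-init_date writes keep only the last row, while one accumulated write
# per fold keeps all of them.
store = {}

def write_overwrite(path, rows):
    store[path] = list(rows)   # replaces whatever was there before

init_dates = ["2022-04-01", "2022-05-01", "2022-06-01"]

# K sequential per-init_date writes: only the final row survives
for d in init_dates:
    write_overwrite("walk_forward_preds.parquet", [{"init_date": d}])

# accumulate in memory, write once per fold: all K rows survive
all_rows = [{"init_date": d} for d in init_dates]
write_overwrite("walk_forward_preds.parquet", all_rows)
```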

After all folds, postprocess_experiment pools the OOS residuals from all numeric folds to fit the hindcast_oos_* conformal calibrations (run_meta_models.fit_and_save_all_configured).
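The pooling step amounts to concatenating obs − pred across every numeric fold while skipping production (a hedged sketch; the real pooling lives in postprocess_experiment and run_meta_models):

```python
# Sketch of pooling out-of-sample residuals across numeric folds, as described
# above. Hypothetical helper and data, not the package's implementation.
def pool_oos_residuals(folds):
    """Collect obs - pred from every numeric fold's walk-forward predictions."""
    residuals = []
    for fold_label, rows in folds.items():
        if fold_label == "production":
            continue  # production has no held-out observations (obs is NaN)
        residuals.extend(r["obs"] - r["pred"] for r in rows)
    return residuals

folds = {"2021": [{"obs": 3.0, "pred": 2.8}],
         "2022": [{"obs": 3.5, "pred": 3.7}],
         "production": []}
res = pool_oos_residuals(folds)
```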

Pitfalls

  • MLflow SQLite locking: concurrent hindcast runs for the same commodity share an MLflow tracking DB. Parallel runs cause OperationalError. Run same-commodity pipelines sequentially (project MEMORY.md).
  • Wrong test_years config: if test_years omits a season year that has data in fit.parquet, that year is silently used for training in all folds and never held out for OOS evaluation. Preflight does not catch this.
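A preflight check for this pitfall would be straightforward to add (hypothetical helper; no such check exists in the package today). A year is silently unevaluated when it is at or after the first test year but absent from test_years:

```python
# Hypothetical preflight check for the pitfall above: flag season years that
# have data but are never held out for OOS evaluation.
def unevaluated_years(data_years, test_years):
    first_test = min(test_years)
    held_out = set(test_years)
    return sorted(y for y in set(data_years)
                  if y >= first_test and y not in held_out)

years_in_fit = [2016, 2017, 2018, 2019, 2020]
missing = unevaluated_years(years_in_fit, [2018, 2020])
# 2019 has data, trains the 2020 fold, but is never held out
```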
  • fit_production-only run_dir: the fit_production fast-path (run_hindcast.py:229) skips the walk-forward loop entirely. The resulting run_dir has a production fold but no numeric folds; ExperimentResult.hindcast_slices is empty, conformal calibration can only use in_sample_pooled mode, and evaluate_experiment is a no-op.

Entities:

  • ExperimentProtocolConfig — Pydantic model carrying test_years, cv_strategy, and production-fold county-selection thresholds
  • HindcastSlice — per-fold artefact handle with cutoff property
  • ExperimentResult — aggregate root; discovers CV slices from disk on load
  • Fold — lightweight fold descriptor used by the dashboard layer

Concepts:

  • walk_forward_cv — expanding-window design, rolling prediction sweep, and production-fit semantics
  • conformal_modes — how OOS residuals from numeric folds feed the four calibration recipes

Pipelines:

  • hindcast — the orchestrator that drives the fold loop
  • fit — called per fold via run_experiment
  • predict — called per (fold, init_date) inside _predict_fold_rolling

Open questions

  • cv_strategy is declared as a plain str rather than Literal["expanding"]. A sliding-window or fixed-window strategy would add a new AbstractFoldGenerator subclass here, but no implementation exists yet.
  • Should the production fold's CalibrationResult differ from the OOS-pooled one used by numeric folds? Currently the production fold reuses the same calibration sidecars as the CV folds, which were fitted on OOS residuals from held-out years that the production model did not hold out.
  • production_cumulative_threshold defaults to 1.0 at the class level but all production YAML configs override it to 0.90 or 0.95. The class default silently keeps all counties, which may inflate tail uncertainty; the intent was for every config to be explicit.