Concept: Experiment Protocol
What it is
The experiment protocol governs how historical data is partitioned into folds for
walk-forward evaluation. It is the bridge between the raw feature parquets
(fit.parquet / pred.parquet) and the per-fold train/test splits that drive
the hindcast pipeline. The package uses an ExpandingFoldGenerator (the only
concrete implementation today) which, for each entry in test_years, constructs
a training set from all rows with year < test_year and a test set from rows with
year == test_year. Because the training set grows by one year at each step the
window is "expanding" — the model for fold 2022 has seen more data than the model
for fold 2018, mirroring how the model would have been deployed in production.
After all numeric folds, a final production fold trains on the entire available dataset (no holdout). The production fold is not an evaluation fold — it exists solely to produce the model artefacts consumed by the forecast pipeline.
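The expanding split plus terminal production fold can be sketched in a few lines. This is an illustrative stand-in, not the package API: `make_folds` and the bare `year` column are assumptions.

```python
import pandas as pd

def make_folds(fit_df: pd.DataFrame, test_years: list[int]):
    """Yield (label, train, test); train is strictly before the test year."""
    for ty in sorted(test_years):
        # Expanding window: the train mask grows by one year per fold.
        yield str(ty), fit_df[fit_df["year"] < ty], fit_df[fit_df["year"] == ty]
    # Terminal production fold: train on everything, no held-out rows.
    yield "production", fit_df, fit_df.iloc[0:0]
```

Each numeric fold's training set is a strict superset of the previous fold's, which is what makes the window "expanding" rather than sliding.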
Where it lives in the code
- run/experiment_protocol.py — ExpandingFoldGenerator (line 110), run_experiment (line 22), and AbstractFoldGenerator (line 92).
- config.py — ExperimentProtocolConfig (line 483), the Pydantic configuration aggregate carrying test_years, cv_strategy, production_cumulative_threshold, and production_recent_years.
Fold types
Numeric folds — one per entry in config.experiment_protocol.test_years (e.g.
[2018, 2019, 2020, 2021, 2022]). The fold label is str(test_year), e.g. "2022".
Each fold produces:
- A fitted model under models/{experiment_key}/{fold_label}/.
- train_preds.parquet — in-sample predictions on the fold's training rows, used by
in_sample_pooled conformal calibration.
- walk_forward_preds.parquet — rolling out-of-sample predictions across every
init_date in the held-out season_year, used by all three hindcast_oos_*
residual modes.
Production fold — always named "production"; trains on all available data
including the most recent season. It has no held-out evaluation year, so
obs_yield_kg_ha is NaN in walk_forward_preds.parquet. It is accessed via
ExperimentResult.production, not via ExperimentResult.hindcast_slices.
Walk-forward semantics
At fold k, ExpandingFoldGenerator.generate_folds applies a strict filter:
```python
train_data = self.fit_df[self.fit_df["year"] < ty]    # strictly before test year
test_data = self.fit_df[self.fit_df["year"] == ty]    # the held-out season
year_data = self.pred_df[self.pred_df["year"] == ty]  # all init_dates for that year
```
year_data contains all configured init_date values for the held-out year and
feeds the within-fold rolling prediction sweep in run/runner._predict_fold_rolling.
The fold yields a references_fold dict ({spec.name: DataFrame}) scoped to
marketing_year == test_year so diagnostic consumers can look up reference benchmarks
without re-loading the source file.
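The reference scoping described above can be pictured like this; the function name, the input `references` mapping, and the benchmark name in the example are illustrative, not the package's actual construction code.

```python
import pandas as pd

def build_references_fold(references: dict[str, pd.DataFrame], test_year: int):
    """Scope each reference benchmark frame to the fold's marketing year."""
    return {
        name: df[df["marketing_year"] == test_year]
        for name, df in references.items()
    }
```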
Key invariants
- The production fold is always present in an ExperimentResult after a full hindcast run (it is the final step: _run_walk_forward_phase followed by _run_production_fit_phase).
- Numeric folds are ordered by season_year ascending; test_years is iterated in sorted order.
- Each fold has a cutoff property on HindcastSlice: date(int(fold_label), 1, 1) for numeric folds; date(feature_end_year + 1, 1, 1) for the production fold (lib/results/results_slice.py:151).
- run_experiment (experiment_protocol.py:22) fits the model and persists train_preds.parquet but does not run the per-init_date rolling sweep — that is run/runner._predict_fold_rolling.
- County selection (via production_cumulative_threshold and production_recent_years) is applied once in _load_and_preprocess before the fold loop, not per fold.
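The cutoff rule from the invariants above, restated as a standalone function (a sketch of the HindcastSlice property, assuming the fold label and feature end year are available):

```python
from datetime import date

def fold_cutoff(fold_label: str, feature_end_year: int) -> date:
    """Jan 1 of the held-out year; for production, Jan 1 after the last feature year."""
    if fold_label == "production":
        return date(feature_end_year + 1, 1, 1)
    return date(int(fold_label), 1, 1)
```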
How it interacts with the pipeline
run_hindcast._run_walk_forward_phase constructs an ExpandingFoldGenerator and passes
it to run_walk_forward (run/runner.py:27). For each fold yielded:
- run_experiment(fold_label, train_data, config) fits the detrender + regressor and writes train_preds.parquet.
- _predict_fold_rolling sweeps every init_date in the fold's year_data in memory and writes walk_forward_preds.parquet once per fold.
The runner does NOT route through stages/run_predict.run_predict() because that
function has blind-overwrite semantics: K sequential calls would destroy K-1
init_date rows on disk, leaving only the final row. See
walk_forward_cv for detail.
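The accumulate-then-write-once pattern that avoids the overwrite hazard can be sketched as follows; the function name and `predict_one` callback are illustrative stand-ins for the runner's internals.

```python
import pandas as pd

def sweep_in_memory(year_data: pd.DataFrame, predict_one) -> pd.DataFrame:
    """Predict each init_date slice in memory, then pool into one frame.

    A single materialisation at the end preserves all K init_date rows,
    unlike K sequential blind-overwrite writes that keep only the last.
    """
    parts = [predict_one(chunk) for _, chunk in year_data.groupby("init_date")]
    return pd.concat(parts, ignore_index=True)
```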
After all folds, postprocess_experiment pools the OOS residuals from all numeric folds
to fit the hindcast_oos_* conformal calibrations
(run_meta_models.fit_and_save_all_configured).
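A simplified sketch of pooling OOS residuals for a split-conformal interval; the real calibration recipes live in run_meta_models, and the prediction column name here is a hypothetical counterpart to the documented obs_yield_kg_ha.

```python
import numpy as np
import pandas as pd

def pooled_conformal_halfwidth(fold_preds: list[pd.DataFrame], alpha: float = 0.1) -> float:
    """Pool OOS residuals from every numeric fold; take the conformal quantile."""
    pooled = pd.concat(fold_preds, ignore_index=True)
    # pred_yield_kg_ha is an assumed column name for illustration.
    scores = (pooled["obs_yield_kg_ha"] - pooled["pred_yield_kg_ha"]).abs().dropna()
    n = len(scores)
    # Finite-sample split-conformal level: ceil((n+1)(1-alpha)) / n, capped at 1.
    q = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    return float(scores.quantile(q))
```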
Pitfalls
- MLflow SQLite locking: concurrent hindcast runs for the same commodity share an MLflow tracking DB. Parallel runs cause OperationalError. Run same-commodity pipelines sequentially (project MEMORY.md).
- Wrong test_years config: if test_years omits a season year that has data in fit.parquet, that year is silently used for training in all folds and never held out for OOS evaluation. Preflight does not catch this.
- fit_production-only run_dir: the fit_production fast-path (run_hindcast.py:229) skips the walk-forward loop entirely. The resulting run_dir has a production fold but no numeric folds; ExperimentResult.hindcast_slices is empty, conformal calibration can only use in_sample_pooled mode, and evaluate_experiment is a no-op.
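A defensive check for the silent-holdout pitfall could look like the sketch below. This is not part of the package's preflight today; the function name and the convention that years before the first test year are legitimately training-only are assumptions.

```python
import pandas as pd

def check_test_years_cover_data(fit_df: pd.DataFrame, test_years: list[int]) -> None:
    """Fail fast if fit.parquet holds season years never held out for OOS eval."""
    first_test = min(test_years)
    # Years before the first test year are training-only by design.
    candidate = {int(y) for y in fit_df["year"].unique() if y >= first_test}
    missing = sorted(candidate - set(test_years))
    if missing:
        raise ValueError(
            f"Years {missing} have data but are never held out; "
            "they would be used for training only."
        )
```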
Related entities and concepts
Entities:
- ExperimentProtocolConfig — Pydantic model carrying test_years, cv_strategy, and production-fold county-selection thresholds
- HindcastSlice — per-fold artefact handle with cutoff property
- ExperimentResult — aggregate root; discovers CV slices from disk on load
- Fold — lightweight fold descriptor used by the dashboard layer

Concepts:
- walk_forward_cv — expanding-window design, rolling prediction sweep, and production-fit semantics
- conformal_modes — how OOS residuals from numeric folds feed the four calibration recipes
Pipelines:
- hindcast — the orchestrator that drives the fold loop
- fit — called per fold via run_experiment
- predict — called per (fold, init_date) inside
_predict_fold_rolling
Open questions
- cv_strategy is declared as a plain str rather than Literal["expanding"]. A sliding-window or fixed-window strategy would add a new AbstractFoldGenerator subclass here, but no implementation exists yet.
- Should the production fold's CalibrationResult differ from the OOS-pooled one used by numeric folds? Currently the production fold reuses the same calibration sidecars as the CV folds, which were fitted on OOS residuals from held-out years that the production model did not hold out.
- production_cumulative_threshold defaults to 1.0 at the class level but all production YAML configs override it to 0.90 or 0.95. The class default silently keeps all counties, which may inflate tail uncertainty; the intent was for every config to be explicit.