Pipeline: hindcast
Purpose
The hindcast pipeline is the end-to-end orchestrator that drives a full historical
evaluation of a commodity model. It consumes pre-built feature parquets (fit.parquet
and pred.parquet), then executes the full CV-plus-production sequence: each past
season in config.experiment_protocol.test_years is held out in turn, the model is
trained on all prior seasons, and rolling predictions are generated for every configured
init_date in the held-out year. Once all CV folds are complete, a second fit trains on
the entire available dataset (the "production" fold) to produce the model consumed by the
forecast pipeline. The accumulated walk-forward predictions, together with the production
model, are packaged into an ExperimentResult on
disk, which is then post-processed (conformal CIs and bias correction), evaluated
(metrics and diagnostic plots), and delivered as client-facing CSVs at ADM0, ADM1, and
ADM2 resolution. The pipeline is invoked via cli run hindcast or the make hindcast
Make target.
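The expanding-window hold-out described above can be illustrated with a minimal, standard-library sketch. It is not the project's `ExpandingFoldGenerator`: the function name `expanding_folds`, the `test_<year>` label format, and the toy rows are assumptions made for illustration only.

```python
from typing import Iterator


def expanding_folds(
    rows: list[dict], test_years: list[int]
) -> Iterator[tuple[str, list[dict], list[dict]]]:
    """Yield (fold_label, train_rows, test_rows) for each held-out year."""
    for test_year in sorted(test_years):
        # Training data is every season strictly before the held-out year.
        train = [r for r in rows if r["year"] < test_year]
        # The held-out season is scored but never trained on in this fold.
        test = [r for r in rows if r["year"] == test_year]
        # Hypothetical label format; the real fold labels may differ.
        yield f"test_{test_year}", train, test


# Toy data: one row per season, 2018 through 2023.
rows = [{"year": y, "yield_kg_ha": 9000 + 100 * (y - 2018)} for y in range(2018, 2024)]
folds = list(expanding_folds(rows, test_years=[2021, 2022, 2023]))

for label, train, test in folds:
    print(label, [r["year"] for r in train], [r["year"] for r in test])
```

Each successive fold trains on one more season than the last, which is what makes the window "expanding". The production fit corresponds to training on all rows with no held-out year.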
Inputs
| Input | Path | Format | Producer |
|---|---|---|---|
| Training features | `{features_dir}/{experiment_key}/fit.parquet` | Parquet | `cli run features` |
| Scoring features | `{features_dir}/{experiment_key}/pred.parquet` | Parquet | `cli run features` |
| Resolved config | Set via `COMMODITY_HINDCAST_CONFIG` env var | YAML | Caller / Makefile |
| Reference yield series (WASDE, CONAB, …) | Paths declared in `config.reference_data[*].filepath` | Parquet / CSV | External data pipeline |
| Raw input data (NASS, weather, climo) | Paths declared as `ResolvablePath` fields on config | Parquet / Zarr / CSV | External sources |
Outputs
All outputs land under a fresh timestamped run_dir:
{config.run_dir_base}/{YYYYMMDD_HHMMSS}_{experiment_key}/
| Output | Path | Format | Consumer |
|---|---|---|---|
| Resolved config snapshot | `run_dir/config_resolved.yaml` | YAML | Audit; `ExperimentResult.from_run_dir` |
| Fold model artefacts | `run_dir/models/{experiment_key}/{fold_label}/` | Pickle / Parquet / JSON | run_predict, forecast pipeline |
| Production model artefacts | `run_dir/models/{experiment_key}/production/` | Pickle / Parquet / JSON | Forecast pipeline |
| CV fold predictions | `run_dir/preds/{experiment_key}/{fold_label}/walk_forward_preds.parquet` | Parquet | postprocess, deliver, evaluate |
| Consolidated train predictions | `run_dir/preds/{experiment_key}/train_preds.parquet` | Parquet | postprocess (in-sample conformal) |
| Included county set | `run_dir/included_geo_identifiers.txt` | Text | run_predict, postprocess |
| Conformal calibration sidecars | `run_dir/conformal/{mode}.parquet` | Parquet | Forecast pipeline, delivery |
| Per-fold bias correctors | `run_dir/preds/{experiment_key}/{fold_label}/bias_corrector.pkl` | Pickle | deliver, evaluate |
| National postprocessed frame | `run_dir/postprocessed/national.parquet` | Parquet | deliver, dashboard |
| Metrics artefacts | `run_dir/reports/metrics_table.csv`, `stage5_metrics*.txt` | CSV / Text | Stakeholder review |
| Diagnostic plots | `run_dir/reports/*.png`, `run_dir/reports/hindcast/*.png` | PNG | Stakeholder review |
| Delivery CSVs | `run_dir/delivery/Treefera_{key}_{ADM}_Hindcast_{YYYYMMDD}.csv` | CSV | QUBE / client |
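As a rough illustration of the naming templates above, the sketch below builds a timestamped run directory and the three delivery file names. The helper names, the `corn_us` key, and the `/data/runs` base are hypothetical. It also assumes that `{key}` is the experiment key and that the date stamp in the delivery name is the run date, neither of which this page confirms.

```python
from datetime import datetime
from pathlib import PurePosixPath


def run_dir_for(run_dir_base: str, experiment_key: str, now: datetime) -> PurePosixPath:
    """Build {run_dir_base}/{YYYYMMDD_HHMMSS}_{experiment_key}."""
    stamp = now.strftime("%Y%m%d_%H%M%S")
    return PurePosixPath(run_dir_base) / f"{stamp}_{experiment_key}"


def delivery_csv_name(key: str, adm: str, now: datetime) -> str:
    """Build Treefera_{key}_{ADM}_Hindcast_{YYYYMMDD}.csv (date source is an assumption)."""
    return f"Treefera_{key}_{adm}_Hindcast_{now:%Y%m%d}.csv"


# Fixed timestamp so the example is reproducible.
now = datetime(2025, 3, 14, 9, 26, 53)
run_dir = run_dir_for("/data/runs", "corn_us", now)
names = [delivery_csv_name("corn_us", adm, now) for adm in ("ADM0", "ADM1", "ADM2")]

print(run_dir)
for name in names:
    print(run_dir / "delivery" / name)
```

Because the timestamp has one-second resolution, two runs of the same experiment started within the same second would collide on the directory name in this sketch.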
Step-by-step flow
- Config resolution: `_prepare_config(config_path)` (run_hindcast.py:64) resolves the config path, sets `COMMODITY_HINDCAST_CONFIG` in the environment, and constructs the `ExperimentConfig` via Pydantic.
- Preflight: `run_preflight(preflight_paths_for_hindcast(config))` (run_hindcast.py:201) checks that `fit.parquet` and `pred.parquet` exist, along with every `ResolvablePath` field declared on the config. It halts with `SystemExit` on the first missing path. See pipeline: preflight for detail.
- Run root creation: `_create_run_root(config)` (run_hindcast.py:77) creates the timestamped directory, rewrites `config.models_dir` and `config.preds_dir` into it, and writes `config_resolved.yaml`.
- MLflow context: `prepare_hindcast_mlflow` and `hindcast_mlflow_run` open a tracking context that logs params and artefacts for the entire run.
- Load and preprocess: `_load_and_preprocess(config)` (run_hindcast.py:101) reads `fit.parquet` and `pred.parquet`, calls `preprocess_data` (column guard and state-name enrichment), and applies `select_by_production` to restrict both frames to the top counties by recent production. Reference yield series are loaded via `build_references_by_harvest_year`.
- Persist included counties: `_persist_included(fit_data, run_root)` (run_hindcast.py:154) derives the `frozenset[str]` of included `geo_identifier` values and writes it to `included_geo_identifiers.txt`; subsequent per-fold predict calls read this file via `ExperimentResult.load_included_geo_identifiers`.
- Walk-forward CV phase: `_run_walk_forward_phase(config, fit_data, pred_data, references_by_harvest)` (run_hindcast.py:161):
  - `ExpandingFoldGenerator` (run/experiment_protocol.py:110) yields one `(fold_label, train_data, test_data, year_data, references_fold)` tuple per entry in `config.experiment_protocol.test_years`, with `train_data = fit_df[year < test_year]`.
  - `run_walk_forward(config, data_fold_generator)` (run/runner.py:27) iterates folds.
  - For each fold, `run_experiment(fold_label, train_data, config)` trains the detrender + imputer + regressor (`run_fit.train`) and writes `train_preds.parquet`.
  - `_predict_fold_rolling` then sweeps every `init_date` in memory and writes `walk_forward_preds.parquet` once per fold (blind-overwrite safe because all `init_date` values are accumulated before the single write).
  - Per-fold `train_preds.parquet` files are concatenated into a single `preds/{key}/train_preds.parquet`.
- Production fit phase: `_run_production_fit_phase(config, fit_data)` (run_hindcast.py:187) calls `run_experiment(fold_label="production", train_data=fit_data, config=config)`, training on all available data with no holdout.
- Postprocess: `postprocess_experiment(run_root, included_geo_identifiers=included)` (stages/run_meta_models.py:138) fits conformal calibration sidecars for every mode in `config.postprocess.conformalise`, fits and persists per-fold bias correctors, and writes `postprocessed/national.parquet`. See pipeline: postprocess.
- Evaluate: `evaluate_experiment(run_root)` (stages/run_diagnostics.py:12) computes per-fold OOS metrics, writes `metrics_table.csv` and `stage5_metrics*.txt`, and generates diagnostic PNGs. See pipeline: evaluate.
- Deliver: `deliver_experiment(run_root)` (stages/run_deliver.py:40) aggregates CV fold predictions to ADM0/ADM1/ADM2, converts units (kg/ha → bu/ac), attaches CI bands, validates rows via Pydantic, and writes three delivery CSVs. See pipeline: deliver.
- Return: `run.run` returns `run_root` so the caller (Makefile or CLI) can chain `cli run forecast --run-dir {run_root}` against the same directory.
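The fail-fast behaviour of the preflight step can be sketched as follows. This is an illustration of the "halt on the first missing path" contract only: `preflight_sketch` is a hypothetical stand-in, not the project's `run_preflight`, and the error message format is an assumption.

```python
import tempfile
from pathlib import Path


def preflight_sketch(paths: list[Path]) -> None:
    """Raise SystemExit on the first path that does not exist."""
    for path in paths:
        if not path.exists():
            # Stop before any compute starts; later paths are not checked.
            raise SystemExit(f"preflight failed: missing {path}")


with tempfile.TemporaryDirectory() as tmp:
    features = Path(tmp)
    # Simulate a half-built feature directory: fit exists, pred does not.
    (features / "fit.parquet").touch()
    try:
        preflight_sketch([features / "fit.parquet", features / "pred.parquet"])
    except SystemExit as exc:
        message = str(exc)
    else:
        message = ""

print(message)
```

Checking paths up front keeps a missing input from surfacing only after the walk-forward phase has already spent compute on earlier folds.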
Mermaid flow diagram
flowchart LR
INPUT["fit.parquet\npred.parquet\n(features_dir)"]
PREFLIGHT["PREFLIGHT\npreflight_paths_for_hindcast\nrun_hindcast.py:201"]
RUN_DIR["_create_run_root\nrun_hindcast.py:77\n→ RUN_DIR"]
PREPROC["_load_and_preprocess\npreprocess_data +\nselect_by_production"]
subgraph WF["Walk-forward CV loop\n_run_walk_forward_phase"]
direction TB
FG["ExpandingFoldGenerator\nexperiment_protocol.py:110"]
FOLD_FIT["run_experiment\n→ train() + train_preds.parquet"]
FOLD_PRED["_predict_fold_rolling\n→ walk_forward_preds.parquet"]
FG --> FOLD_FIT --> FOLD_PRED
end
PROD_FIT["PRODUCTION FIT\n_run_production_fit_phase\nfold_label='production'"]
POSTPROCESS["POSTPROCESS\npostprocess_experiment\n→ conformal/ + national.parquet"]
EVALUATE["EVALUATE\nevaluate_experiment\n→ reports/"]
DELIVER["DELIVER\ndeliver_experiment\n→ delivery/Treefera_*.csv"]
INPUT --> PREFLIGHT --> RUN_DIR --> PREPROC --> WF --> PROD_FIT --> POSTPROCESS --> EVALUATE --> DELIVER
RUN_DIR -. "persistence root\nfor all artefacts" .-> POSTPROCESS
RUN_DIR -. "persistence root\nfor all artefacts" .-> EVALUATE
RUN_DIR -. "persistence root\nfor all artefacts" .-> DELIVER
Invariants and contracts
- `INPUT_DATA_DIR` must be set before invoking the CLI. The config helper raises `RuntimeError` when it is unset (DESIGN.md Clause 6). `features_dir` defaults to `{INPUT_DATA_DIR}/features/` and `run_dir_base` to `{INPUT_DATA_DIR}/runs/`.
- Features must be pre-built. Hindcast does not build features; that is the job of `cli run features`. Preflight checks that `fit.parquet` and `pred.parquet` exist before any compute starts (DESIGN.md Clause 31).
- `_create_run_root` (run_hindcast.py:77) is the single point that sets `config.models_dir` and `config.preds_dir`; all subsequent stages derive paths from those fields.
- Walk-forward folds use `ExpandingFoldGenerator` from `run/experiment_protocol.py`. Training data for fold `k` is strictly `year < k`, so no future data leaks into training.
- The production fold is always named `"production"` and trains on all available data. It is accessed via `ExperimentResult.production`, not via `ExperimentResult.hindcast_slices`.
- `_persist_included` is called before the walk-forward loop. The per-fold predict calls load `included_geo_identifiers.txt` via `ExperimentResult`; the file must exist before the first fold attempts to score.
- Postprocess writes per-mode calibration parquets at `{run_dir}/conformal/{mode}.parquet` (DESIGN.md artefact contract, Clause 34). These are the inputs to the forecast pipeline's conformal interval machinery.
- Area-weighted aggregation is mandatory throughout (DESIGN.md clause on unweighted means being forbidden). `aggregate_weighted_frame` is the only legal ADM rollup.
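A small worked example shows why the last invariant matters. The sketch below is not `aggregate_weighted_frame`; the helper `area_weighted_mean` and the two-county numbers are invented for illustration.

```python
def area_weighted_mean(values: list[float], areas: list[float]) -> float:
    """Roll up per-county yields to a parent region, weighting by area."""
    total_area = sum(areas)
    if total_area <= 0:
        raise ValueError("total area must be positive")
    return sum(v * a for v, a in zip(values, areas)) / total_area


# Two hypothetical counties: a large high-yield one and a small low-yield one.
county_yield_kg_ha = [10000.0, 6000.0]
county_area_ha = [90000.0, 10000.0]

weighted = area_weighted_mean(county_yield_kg_ha, county_area_ha)
unweighted = sum(county_yield_kg_ha) / len(county_yield_kg_ha)

print(weighted)    # 9600.0
print(unweighted)  # 8000.0
```

The unweighted mean treats both counties as equally important and understates the regional yield by 1,600 kg/ha here, because 90% of the area sits in the higher-yielding county.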
Failure modes and recovery
| Symptom | Cause | Recovery |
|---|---|---|
| `SystemExit` on `fit.parquet` / `pred.parquet` | Features not yet built | Run `make features EXPERIMENT_KEY=<key>` or `cli run features` |
| `RuntimeError: INPUT_DATA_DIR not set` | Env var missing at config load | Export `INPUT_DATA_DIR` before invoking the CLI |
| `OperationalError` from MLflow | Two hindcast runs for the same commodity in parallel write to the same SQLite DB | Run same-commodity hindcast pipelines sequentially (project MEMORY.md) |
| Run fails mid-fold | Transient error (disk, network) after artefacts were written | Identify the last successful fold from `preds/{key}/`, delete partial artefacts for the failed fold, and rerun `cli run hindcast` |
| `KeyError` in `MedianImputer.transform` at predict time | `feature_fill_values.parquet` was written with a `RangeIndex` (pre-Clause-30 artefact) | Delete the affected fold directory and rerun |
| `FileNotFoundError` at postprocess on `walk_forward_preds.parquet` | Walk-forward phase did not complete | Rerun the full hindcast; individual fold recovery requires deleting partial fold outputs |
| CI widths unexpectedly narrow | `in_sample_pooled` calibration mode active | Switch `forecast.residual_mode` to an OOS mode and re-run `cli run postprocess` |
Cross-references
Entities:
- ExperimentConfig — carries all path, model, and protocol config
- RunDir — the on-disk persistence root; sole cross-stage contract
- ExperimentResult — aggregate root loaded from run_dir
- HindcastSlice — per-fold artefact handle
- CalibrationResult — conformal calibration sidecar
- HindcastDelivery — validated delivery document
Constituent pipelines:
- preflight, feature_build, fit, predict, postprocess, evaluate, deliver
Concepts:
- walk_forward_cv — fold scheduling and expanding-window semantics
- experiment_protocol — ExpandingFoldGenerator and fold types
- conformal_modes — four residual-mode variants
- hindcast_vs_forecast — why hindcast and forecast are separate pipelines
- input_data_dir_contract — env var contract
Source pages:
- stages — run_hindcast.py role in the stage DAG
- orchestration — run/runner.py and run/experiment_protocol.py
- lib — ExperimentResult, HindcastSlice loading helpers
PRs that materially changed this stage
- PR #339: nine-phase restructure that established the current `run_hindcast.run` skeleton with `_run_walk_forward_phase` and `_run_production_fit_phase` as separate private helpers.
- PR #345 (`tl/fix-path-issues`): added `AnyPath`-based path resolution so S3 URIs work correctly throughout the orchestrator (DESIGN.md Clause 27).
- PR #361: introduced four `ResidualMode` variants and per-mode conformal sidecar parquets; updated the `postprocess_experiment` call in `run_hindcast.run`.
- PR #369 (`f5399b96`): multi-year forecast support; restructured forecast paths from `forecast/{init_date}/` to `forecast/{season_year}/{init_date}/`. This touched the `RunDir` layout and `ForecastSlice` path semantics, which `run_hindcast` consumes indirectly via `ExperimentResult`.