
Pipeline: hindcast

Purpose

The hindcast pipeline is the end-to-end orchestrator that drives a full historical evaluation of a commodity model. It consumes pre-built feature parquets (fit.parquet and pred.parquet), then executes the full CV-plus-production sequence: each past season in config.experiment_protocol.test_years is held out in turn, the model is trained on all prior seasons, and rolling predictions are generated for every configured init_date in the held-out year. Once all CV folds are complete, a second fit trains on the entire available dataset (the "production" fold) to produce the model consumed by the forecast pipeline. The accumulated walk-forward predictions, together with the production model, are packaged into an ExperimentResult on disk, which is then post-processed (conformal CIs and bias correction), evaluated (metrics and diagnostic plots), and delivered as client-facing CSVs at ADM0, ADM1, and ADM2 resolution. The pipeline is invoked via cli run hindcast or the make hindcast Make target.
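The conformal-CI step mentioned above can be illustrated with a minimal split-conformal sketch. This is an assumption-laden stand-in, not the pipeline's actual postprocess code: the function name, residual source, and symmetric-band shape are all illustrative (the real machinery supports several residual modes).

```python
import numpy as np

def conformal_interval(cal_residuals, point_pred, alpha=0.1):
    """Split-conformal sketch: the (1 - alpha) quantile of absolute
    calibration residuals gives a symmetric band around the point forecast.
    Illustrative only -- the pipeline's conformal modes live in postprocess."""
    q = np.quantile(np.abs(cal_residuals), 1 - alpha)
    return point_pred - q, point_pred + q

# Held-out residuals accumulated from past walk-forward folds (made-up values)
residuals = np.array([-0.4, 0.2, 0.1, -0.3, 0.5, -0.1, 0.3, -0.2])
lo, hi = conformal_interval(residuals, point_pred=10.0, alpha=0.25)
```

The key property the hindcast provides is that these residuals come from genuinely out-of-sample (walk-forward) predictions, which is why the CV phase must complete before calibration.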

Inputs

| Input | Path | Format | Producer |
| --- | --- | --- | --- |
| Training features | {features_dir}/{experiment_key}/fit.parquet | Parquet | cli run features |
| Scoring features | {features_dir}/{experiment_key}/pred.parquet | Parquet | cli run features |
| Resolved config | Set via COMMODITY_HINDCAST_CONFIG env var | YAML | Caller / Makefile |
| Reference yield series (WASDE, CONAB, …) | Paths declared in config.reference_data[*].filepath | Parquet / CSV | External data pipeline |
| Raw input data (NASS, weather, climo) | Paths declared as ResolvablePath fields on config | Parquet / Zarr / CSV | External sources |

Outputs

All outputs land under a fresh timestamped run_dir: {config.run_dir_base}/{YYYYMMDD_HHMMSS}_{experiment_key}/
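The run-root layout above can be sketched as follows. The helper name and signature are hypothetical (the real logic lives in _create_run_root); only the {YYYYMMDD_HHMMSS}_{experiment_key} naming comes from the source.

```python
from datetime import datetime
from pathlib import Path

def make_run_dir(run_dir_base: str, experiment_key: str, now=None) -> Path:
    """Build a timestamped run root: {base}/{YYYYMMDD_HHMMSS}_{key}.
    Hypothetical helper mirroring _create_run_root's naming scheme."""
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    return Path(run_dir_base) / f"{stamp}_{experiment_key}"

run_dir = make_run_dir("/data/runs", "us_corn", now=datetime(2024, 5, 1, 13, 30, 0))
# e.g. /data/runs/20240501_133000_us_corn
```

Because the stamp is taken once at run start, every downstream stage that receives run_dir writes into the same immutable root.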

| Output | Path | Format | Consumer |
| --- | --- | --- | --- |
| Resolved config snapshot | run_dir/config_resolved.yaml | YAML | Audit; ExperimentResult.from_run_dir |
| Fold model artefacts | run_dir/models/{experiment_key}/{fold_label}/ | Pickle / Parquet / JSON | run_predict, forecast pipeline |
| Production model artefacts | run_dir/models/{experiment_key}/production/ | Pickle / Parquet / JSON | Forecast pipeline |
| CV fold predictions | run_dir/preds/{experiment_key}/{fold_label}/walk_forward_preds.parquet | Parquet | postprocess, deliver, evaluate |
| Consolidated train predictions | run_dir/preds/{experiment_key}/train_preds.parquet | Parquet | postprocess (in-sample conformal) |
| Included county set | run_dir/included_geo_identifiers.txt | Text | run_predict, postprocess |
| Conformal calibration sidecars | run_dir/conformal/{mode}.parquet | Parquet | Forecast pipeline, delivery |
| Per-fold bias correctors | run_dir/preds/{experiment_key}/{fold_label}/bias_corrector.pkl | Pickle | deliver, evaluate |
| National postprocessed frame | run_dir/postprocessed/national.parquet | Parquet | deliver, dashboard |
| Metrics artefacts | run_dir/reports/metrics_table.csv, stage5_metrics*.txt | CSV / Text | Stakeholder review |
| Diagnostic plots | run_dir/reports/*.png, run_dir/reports/hindcast/*.png | PNG | Stakeholder review |
| Delivery CSVs | run_dir/delivery/Treefera_{key}_{ADM}_Hindcast_{YYYYMMDD}.csv | CSV | QUBE / client |

Step-by-step flow

  1. Config resolution: _prepare_config(config_path) (run_hindcast.py:64) resolves the config path, sets COMMODITY_HINDCAST_CONFIG in the environment, and constructs the ExperimentConfig via Pydantic.

  2. Preflight: run_preflight(preflight_paths_for_hindcast(config)) (run_hindcast.py:201) checks that fit.parquet and pred.parquet exist, plus all ResolvablePath fields declared on the config. Halts with SystemExit on the first missing path. See pipeline: preflight for detail.

  3. Run root creation: _create_run_root(config) (run_hindcast.py:77) creates the timestamped directory, rewrites config.models_dir and config.preds_dir into it, and writes config_resolved.yaml.

  4. MLflow context: prepare_hindcast_mlflow and hindcast_mlflow_run open a tracking context that logs params and artefacts for the entire run.

  5. Load and preprocess: _load_and_preprocess(config) (run_hindcast.py:101) reads fit.parquet and pred.parquet, calls preprocess_data (column guard and state-name enrichment), and applies select_by_production to restrict both frames to the top counties by recent production. Reference yield series are loaded via build_references_by_harvest_year.

  6. Persist included counties: _persist_included(fit_data, run_root) (run_hindcast.py:154) derives the frozenset[str] of included geo_identifier values and writes it to included_geo_identifiers.txt; subsequent per-fold predict calls read this file via ExperimentResult.load_included_geo_identifiers.

  7. Walk-forward CV phase: _run_walk_forward_phase(config, fit_data, pred_data, references_by_harvest) (run_hindcast.py:161) runs steps 8-10:

  8. ExpandingFoldGenerator (run/experiment_protocol.py:110) yields one (fold_label, train_data, test_data, year_data, references_fold) tuple per entry in config.experiment_protocol.test_years, with train_data = fit_df[year < test_year].
  9. run_walk_forward(config, data_fold_generator) (run/runner.py:27) iterates folds. For each fold, run_experiment(fold_label, train_data, config) trains the detrender + imputer + regressor (run_fit.train) and writes train_preds.parquet. _predict_fold_rolling then sweeps every init_date in memory and writes walk_forward_preds.parquet once per fold (blind-overwrite safe because all init_date values are accumulated before the single write).
  10. Per-fold train_preds.parquet files are concatenated into a single preds/{key}/train_preds.parquet.

  11. Production fit phase: _run_production_fit_phase(config, fit_data) (run_hindcast.py:187) calls run_experiment(fold_label="production", train_data=fit_data, config=config), training on all available data with no holdout.

  12. Postprocess: postprocess_experiment(run_root, included_geo_identifiers=included) (stages/run_meta_models.py:138) fits conformal calibration sidecars for every mode in config.postprocess.conformalise, fits and persists per-fold bias correctors, and writes postprocessed/national.parquet. See pipeline: postprocess.

  13. Evaluate: evaluate_experiment(run_root) (stages/run_diagnostics.py:12) computes per-fold OOS metrics, writes metrics_table.csv and stage5_metrics*.txt, and generates diagnostic PNGs. See pipeline: evaluate.

  14. Deliver: deliver_experiment(run_root) (stages/run_deliver.py:40) aggregates CV fold predictions to ADM0/ADM1/ADM2, converts units (kg/ha → bu/ac), attaches CI bands, validates rows via Pydantic, and writes three delivery CSVs. See pipeline: deliver.

  15. Return: run.run returns run_root so the caller (Makefile or CLI) can chain cli run forecast --run-dir {run_root} against the same directory.
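The expanding-window schedule at the heart of steps 8-10 can be sketched in a few lines. This is a simplification under stated assumptions: the real ExpandingFoldGenerator yields a richer tuple (including year_data and references_fold), and the column name "year" and fold-label format here are illustrative.

```python
import pandas as pd

def expanding_folds(fit_df: pd.DataFrame, test_years: list[int]):
    """Sketch of the expanding-window fold schedule: for each held-out
    season, train on all strictly earlier years (train = year < test_year).
    Simplified stand-in for ExpandingFoldGenerator; tuple shape assumed."""
    for test_year in test_years:
        train = fit_df[fit_df["year"] < test_year]
        test = fit_df[fit_df["year"] == test_year]
        yield f"hindcast_{test_year}", train, test

# Toy feature frame: one row per season
df = pd.DataFrame({"year": [2019, 2020, 2021, 2022], "y": [1.0, 2.0, 3.0, 4.0]})
folds = list(expanding_folds(df, test_years=[2021, 2022]))
```

The strict inequality is what guarantees the no-future-leakage invariant: each fold's training window ends before its held-out year, and the window only ever expands.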

Mermaid flow diagram

flowchart LR
    INPUT["fit.parquet\npred.parquet\n(features_dir)"]
    PREFLIGHT["PREFLIGHT\npreflight_paths_for_hindcast\nrun_hindcast.py:201"]
    RUN_DIR["_create_run_root\nrun_hindcast.py:77\n→ RUN_DIR"]
    PREPROC["_load_and_preprocess\npreprocess_data +\nselect_by_production"]

    subgraph WF["Walk-forward CV loop\n_run_walk_forward_phase"]
        direction TB
        FG["ExpandingFoldGenerator\nexperiment_protocol.py:110"]
        FOLD_FIT["run_experiment\n→ train()  +  train_preds.parquet"]
        FOLD_PRED["_predict_fold_rolling\n→ walk_forward_preds.parquet"]
        FG --> FOLD_FIT --> FOLD_PRED
    end

    PROD_FIT["PRODUCTION FIT\n_run_production_fit_phase\nfold_label='production'"]
    POSTPROCESS["POSTPROCESS\npostprocess_experiment\n→ conformal/ + national.parquet"]
    EVALUATE["EVALUATE\nevaluate_experiment\n→ reports/"]
    DELIVER["DELIVER\ndeliver_experiment\n→ delivery/Treefera_*.csv"]

    INPUT --> PREFLIGHT --> RUN_DIR --> PREPROC --> WF --> PROD_FIT --> POSTPROCESS --> EVALUATE --> DELIVER

    RUN_DIR -. "persistence root\nfor all artefacts" .-> POSTPROCESS
    RUN_DIR -. "persistence root\nfor all artefacts" .-> EVALUATE
    RUN_DIR -. "persistence root\nfor all artefacts" .-> DELIVER

Invariants and contracts

  • INPUT_DATA_DIR must be set before invoking the CLI. The config helper raises RuntimeError when it is unset (DESIGN.md Clause 6). features_dir defaults to {INPUT_DATA_DIR}/features/ and run_dir_base to {INPUT_DATA_DIR}/runs/.

  • Features must be pre-built. Hindcast does not build features — that is the job of cli run features. Preflight checks fit.parquet and pred.parquet exist before any compute starts (DESIGN.md Clause 31).

  • _create_run_root (run_hindcast.py:77) is the single point that sets config.models_dir and config.preds_dir; all subsequent stages derive paths from those fields.

  • Walk-forward folds use ExpandingFoldGenerator from run/experiment_protocol.py. Training data for the fold holding out test_year is strictly year < test_year, so no future data can leak into training.

  • The production fold is always named "production" and trains on all available data. It is accessed via ExperimentResult.production, not via ExperimentResult.hindcast_slices.

  • _persist_included is called before the walk-forward loop. The per-fold predict calls load included_geo_identifiers.txt via ExperimentResult; the file must exist before the first fold attempts to score.

  • Postprocess writes per-mode calibration parquets at {run_dir}/conformal/{mode}.parquet (DESIGN.md artefact contract, Clause 34). These are the inputs to the forecast pipeline's conformal interval machinery.

  • Area-weighted aggregation is mandatory throughout (DESIGN.md clause on unweighted means being forbidden). aggregate_weighted_frame is the only legal ADM rollup.
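The area-weighted rollup invariant can be sketched as a weighted group mean. This is an illustrative stand-in for aggregate_weighted_frame, with assumed column names; only the rule itself (sum of value x weight over sum of weight, never an unweighted mean) comes from the source.

```python
import pandas as pd

def aggregate_weighted(frame: pd.DataFrame, value_col: str,
                       weight_col: str, by: str) -> pd.Series:
    """Area-weighted rollup: sum(value * weight) / sum(weight) per group.
    Hypothetical sketch of the only legal ADM aggregation rule."""
    g = frame.assign(_wv=frame[value_col] * frame[weight_col]).groupby(by)
    return g["_wv"].sum() / g[weight_col].sum()

# Toy county-level predictions rolled up to ADM1 (state) level
counties = pd.DataFrame({
    "adm1": ["IA", "IA", "IL"],
    "yield_pred": [10.0, 12.0, 11.0],
    "harvested_area": [100.0, 300.0, 200.0],
})
state_yield = aggregate_weighted(counties, "yield_pred", "harvested_area", by="adm1")
# IA = (10*100 + 12*300) / 400 = 11.5
```

An unweighted mean would give IA a value of 11.0 and over-represent small counties, which is exactly what the DESIGN.md clause forbids.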

Failure modes and recovery

| Symptom | Cause | Recovery |
| --- | --- | --- |
| SystemExit on fit.parquet / pred.parquet | Features not yet built | Run make features EXPERIMENT_KEY=<key> or cli run features |
| RuntimeError: INPUT_DATA_DIR not set | Env var missing at config load | Export INPUT_DATA_DIR before invoking the CLI |
| OperationalError from MLflow | Two hindcast runs for the same commodity in parallel write to the same SQLite DB | Run same-commodity hindcast pipelines sequentially (project MEMORY.md) |
| Run fails mid-fold | Transient error (disk, network) after artefacts were written | Identify the last successful fold from preds/{key}/, delete partial artefacts for the failed fold, and rerun cli run hindcast |
| KeyError in MedianImputer.transform at predict time | feature_fill_values.parquet was written with a RangeIndex (pre-Clause-30 artefact) | Delete the affected fold directory and rerun |
| FileNotFoundError at postprocess on walk_forward_preds.parquet | Walk-forward phase did not complete | Rerun the full hindcast; individual fold recovery requires deleting partial fold outputs |
| CI widths unexpectedly narrow | in_sample_pooled calibration mode active | Switch forecast.residual_mode to an OOS mode and re-run cli run postprocess |
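The SystemExit behaviour in the first row (and in the preflight step) can be sketched as follows. The function body is a hypothetical simplification of run_preflight; only the fail-on-first-missing-path semantics come from the source.

```python
import sys
from pathlib import Path

def run_preflight(paths):
    """Halt with SystemExit on the first missing path, before any compute
    starts. Illustrative sketch of the preflight contract only."""
    for p in map(Path, paths):
        if not p.exists():
            sys.exit(f"preflight: missing required input: {p}")

# An existing path passes silently; a missing one raises SystemExit
run_preflight(["/tmp"])
```

Failing before the run root is created means a features-not-built mistake never produces a half-populated run_dir, which keeps the recovery column simple: build the features and rerun.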

Cross-references

Entities:

  • ExperimentConfig — carries all path, model, and protocol config
  • RunDir — the on-disk persistence root; sole cross-stage contract
  • ExperimentResult — aggregate root loaded from run_dir
  • HindcastSlice — per-fold artefact handle
  • CalibrationResult — conformal calibration sidecar
  • HindcastDelivery — validated delivery document

Constituent pipelines: preflight, feature_build, fit, predict, postprocess, evaluate, deliver

Concepts:

  • walk_forward_cv — fold scheduling and expanding-window semantics
  • experiment_protocol — ExpandingFoldGenerator and fold types
  • conformal_modes — four residual-mode variants
  • hindcast_vs_forecast — why hindcast and forecast are separate pipelines
  • input_data_dir_contract — env var contract

Source pages:

  • stages — run_hindcast.py role in the stage DAG
  • orchestration — run/runner.py and run/experiment_protocol.py
  • lib — ExperimentResult and HindcastSlice loading helpers

PRs that materially changed this stage

  • PR #339 — nine-phase restructure that established the current run_hindcast.run skeleton with _run_walk_forward_phase and _run_production_fit_phase as separate private helpers.
  • PR #345 (tl/fix-path-issues) — added AnyPath-based path resolution so S3 URIs work correctly throughout the orchestrator (DESIGN.md Clause 27).
  • PR #361 — introduced four ResidualMode variants and per-mode conformal sidecar parquets; updated postprocess_experiment call in run_hindcast.run.
  • PR #369 (f5399b96) — multi-year forecast support; restructured forecast paths from forecast/{init_date}/ to forecast/{season_year}/{init_date}/ — touched RunDir layout and ForecastSlice path semantics, which run_hindcast consumes indirectly via ExperimentResult.