
Pipeline: hindcast

Purpose

The hindcast pipeline is the end-to-end orchestrator that drives a full historical evaluation of a commodity model. It consumes pre-built feature parquets (fit.parquet and pred.parquet), then executes the full CV-plus-production sequence: each past season in config.experiment_protocol.test_years is held out in turn, the model is trained on all prior seasons, and rolling predictions are generated for every configured init_date in the held-out year. Once all CV folds are complete, a second fit trains on the entire available dataset (the "production" fold) to produce the model consumed by the forecast pipeline. The accumulated walk-forward predictions, together with the production model, are packaged into an ExperimentResult on disk, which is then post-processed (conformal CIs and bias correction), evaluated (metrics and diagnostic plots), and delivered as client-facing CSVs at ADM0, ADM1, and ADM2 resolution. The pipeline is invoked via cli run hindcast or the make hindcast Make target.
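The conformal-CI step mentioned above can be illustrated with a minimal split-conformal sketch. This is an assumption-laden stand-in, not the pipeline's actual postprocess code: the function name, residual source, and symmetric-band shape are all illustrative (the real machinery supports several residual modes).

```python
import numpy as np

def conformal_interval(cal_residuals, point_pred, alpha=0.1):
    """Split-conformal sketch: the (1 - alpha) quantile of absolute
    calibration residuals gives a symmetric band around the point forecast.
    Illustrative only -- the pipeline's conformal modes live in postprocess."""
    q = np.quantile(np.abs(cal_residuals), 1 - alpha)
    return point_pred - q, point_pred + q

# Held-out residuals accumulated from past walk-forward folds (made-up values)
residuals = np.array([-0.4, 0.2, 0.1, -0.3, 0.5, -0.1, 0.3, -0.2])
lo, hi = conformal_interval(residuals, point_pred=10.0, alpha=0.25)
```

The key property the hindcast provides is that these residuals come from genuinely out-of-sample (walk-forward) predictions, which is why the CV phase must complete before calibration.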

Inputs

| Input | Path | Format | Producer |
| --- | --- | --- | --- |
| Training features | {features_dir}/{experiment_key}/fit.parquet | Parquet | cli run features |
| Scoring features | {features_dir}/{experiment_key}/pred.parquet | Parquet | cli run features |
| Resolved config | Set via COMMODITY_HINDCAST_CONFIG env var | YAML | Caller / Makefile |
| Reference yield series (WASDE, CONAB, …) | Paths declared in config.reference_data[*].filepath | Parquet / CSV | External data pipeline |
| Raw input data (NASS, weather, climo) | Paths declared as ResolvablePath fields on config | Parquet / Zarr / CSV | External sources |

Outputs

All outputs land under a fresh timestamped run_dir: {config.run_dir_base}/{YYYYMMDD_HHMMSS}_{experiment_key}/
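The run-root layout above can be sketched as follows. The helper name and signature are hypothetical (the real logic lives in _create_run_root); only the {YYYYMMDD_HHMMSS}_{experiment_key} naming comes from the source.

```python
from datetime import datetime
from pathlib import Path

def make_run_dir(run_dir_base: str, experiment_key: str, now=None) -> Path:
    """Build a timestamped run root: {base}/{YYYYMMDD_HHMMSS}_{key}.
    Hypothetical helper mirroring _create_run_root's naming scheme."""
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    return Path(run_dir_base) / f"{stamp}_{experiment_key}"

run_dir = make_run_dir("/data/runs", "us_corn", now=datetime(2024, 5, 1, 13, 30, 0))
# e.g. /data/runs/20240501_133000_us_corn
```

Because the stamp is taken once at run start, every downstream stage that receives run_dir writes into the same immutable root.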

| Output | Path | Format | Consumer |
| --- | --- | --- | --- |
| Resolved config snapshot | run_dir/config_resolved.yaml | YAML | Audit; ExperimentResult.from_run_dir |
| Fold model artefacts | run_dir/models/{experiment_key}/{fold_label}/ | Pickle / Parquet / JSON | run_predict, forecast pipeline |
| Production model artefacts | run_dir/models/{experiment_key}/production/ | Pickle / Parquet / JSON | Forecast pipeline |
| CV fold predictions | run_dir/preds/{experiment_key}/{fold_label}/walk_forward_preds.parquet | Parquet | postprocess, deliver, evaluate |
| Consolidated train predictions | run_dir/preds/{experiment_key}/train_preds.parquet | Parquet | postprocess (in-sample conformal) |
| Included county set | run_dir/included_geo_identifiers.txt | Text | run_predict, postprocess |
| Conformal calibration sidecars | run_dir/conformal/{mode}.parquet | Parquet | Forecast pipeline, delivery |
| Per-fold bias correctors | run_dir/preds/{experiment_key}/{fold_label}/bias_corrector.pkl | Pickle | deliver, evaluate |
| National postprocessed frame | run_dir/postprocessed/national.parquet | Parquet | deliver, dashboard |
| Metrics artefacts | run_dir/reports/metrics_table.csv, stage5_metrics*.txt | CSV / Text | Stakeholder review |
| Diagnostic plots | run_dir/reports/*.png, run_dir/reports/hindcast/*.png | PNG | Stakeholder review |
| Delivery CSVs | run_dir/delivery/Treefera_{key}_{ADM}_Hindcast_{YYYYMMDD}.csv | CSV | QUBE / client |

Step-by-step flow

  1. Config resolution: _prepare_config(config_path) (run_hindcast.py:64) resolves the config path, sets COMMODITY_HINDCAST_CONFIG in the environment, and constructs the ExperimentConfig via Pydantic.

  2. Preflight: run_preflight(preflight_paths_for_hindcast(config)) (run_hindcast.py:201) checks that fit.parquet and pred.parquet exist, plus all ResolvablePath fields declared on the config. Halts with SystemExit on the first missing path. See pipeline: preflight for detail.

  3. Run root creation: _create_run_root(config) (run_hindcast.py:77) creates the timestamped directory, rewrites config.models_dir and config.preds_dir into it, and writes config_resolved.yaml.

  4. MLflow context: prepare_hindcast_mlflow and hindcast_mlflow_run open a tracking context that logs params and artefacts for the entire run.

  5. Load and preprocess: _load_and_preprocess(config) (run_hindcast.py:101) reads fit.parquet and pred.parquet, calls preprocess_data (column guard and state-name enrichment), and applies select_by_production to restrict both frames to the top counties by recent production. Reference yield series are loaded via build_references_by_harvest_year.

  6. Persist included counties: _persist_included(fit_data, run_root) (run_hindcast.py:154) derives the frozenset[str] of included geo_identifier values and writes it to included_geo_identifiers.txt; subsequent per-fold predict calls read this file via ExperimentResult.load_included_geo_identifiers.

  7. Walk-forward CV phase: _run_walk_forward_phase(config, fit_data, pred_data, references_by_harvest) (run_hindcast.py:161) runs steps 8-10:

  8. ExpandingFoldGenerator (run/experiment_protocol.py:110) yields one (fold_label, train_data, test_data, year_data, references_fold) tuple per entry in config.experiment_protocol.test_years, with train_data = fit_df[year < test_year].
  9. run_walk_forward(config, data_fold_generator) (run/runner.py:27) iterates folds. For each fold, run_experiment(fold_label, train_data, config) trains the detrender + imputer + regressor (run_fit.train) and writes train_preds.parquet. _predict_fold_rolling then sweeps every init_date in memory and writes walk_forward_preds.parquet once per fold (blind-overwrite safe because all init_date values are accumulated before the single write).
  10. Per-fold train_preds.parquet files are concatenated into a single preds/{key}/train_preds.parquet.

  11. Production fit phase: _run_production_fit_phase(config, fit_data) (run_hindcast.py:187) calls run_experiment(fold_label="production", train_data=fit_data, config=config), training on all available data with no holdout.

  12. Postprocess: postprocess_experiment(run_root, included_geo_identifiers=included) (stages/run_meta_models.py:138) fits conformal calibration sidecars for every mode in config.postprocess.conformalise, fits and persists per-fold bias correctors, and writes postprocessed/national.parquet. See pipeline: postprocess.

  13. Evaluate: evaluate_experiment(run_root) (stages/run_diagnostics.py:12) computes per-fold OOS metrics, writes metrics_table.csv and stage5_metrics*.txt, and generates diagnostic PNGs. See pipeline: evaluate.

  14. Deliver: deliver_experiment(run_root) (stages/run_deliver.py:40) aggregates CV fold predictions to ADM0/ADM1/ADM2, converts units (kg/ha → bu/ac), attaches CI bands, validates rows via Pydantic, and writes three delivery CSVs. See pipeline: deliver.

  15. Return: run.run returns run_root so the caller (Makefile or CLI) can chain cli run forecast --run-dir {run_root} against the same directory.
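The expanding-window schedule at the heart of steps 8-10 can be sketched in a few lines. This is a simplification under stated assumptions: the real ExpandingFoldGenerator yields a richer tuple (including year_data and references_fold), and the column name "year" and fold-label format here are illustrative.

```python
import pandas as pd

def expanding_folds(fit_df: pd.DataFrame, test_years: list[int]):
    """Sketch of the expanding-window fold schedule: for each held-out
    season, train on all strictly earlier years (train = year < test_year).
    Simplified stand-in for ExpandingFoldGenerator; tuple shape assumed."""
    for test_year in test_years:
        train = fit_df[fit_df["year"] < test_year]
        test = fit_df[fit_df["year"] == test_year]
        yield f"hindcast_{test_year}", train, test

# Toy feature frame: one row per season
df = pd.DataFrame({"year": [2019, 2020, 2021, 2022], "y": [1.0, 2.0, 3.0, 4.0]})
folds = list(expanding_folds(df, test_years=[2021, 2022]))
```

The strict inequality is what guarantees the no-future-leakage invariant: each fold's training window ends before its held-out year, and the window only ever expands.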

Mermaid flow diagram

flowchart LR
    INPUT["fit.parquet\npred.parquet\n(features_dir)"]
    PREFLIGHT["PREFLIGHT\npreflight_paths_for_hindcast\nrun_hindcast.py:201"]
    RUN_DIR["_create_run_root\nrun_hindcast.py:77\n→ RUN_DIR"]
    PREPROC["_load_and_preprocess\npreprocess_data +\nselect_by_production"]

    subgraph WF["Walk-forward CV loop\n_run_walk_forward_phase"]
        direction TB
        FG["ExpandingFoldGenerator\nexperiment_protocol.py:110"]
        FOLD_FIT["run_experiment\n→ train()  +  train_preds.parquet"]
        FOLD_PRED["_predict_fold_rolling\n→ walk_forward_preds.parquet"]
        FG --> FOLD_FIT --> FOLD_PRED
    end

    PROD_FIT["PRODUCTION FIT\n_run_production_fit_phase\nfold_label='production'"]
    POSTPROCESS["POSTPROCESS\npostprocess_experiment\n→ conformal/ + national.parquet"]
    EVALUATE["EVALUATE\nevaluate_experiment\n→ reports/"]
    DELIVER["DELIVER\ndeliver_experiment\n→ delivery/Treefera_*.csv"]

    INPUT --> PREFLIGHT --> RUN_DIR --> PREPROC --> WF --> PROD_FIT --> POSTPROCESS --> EVALUATE --> DELIVER

    RUN_DIR -. "persistence root\nfor all artefacts" .-> POSTPROCESS
    RUN_DIR -. "persistence root\nfor all artefacts" .-> EVALUATE
    RUN_DIR -. "persistence root\nfor all artefacts" .-> DELIVER

Invariants and contracts

  • INPUT_DATA_DIR must be set before invoking the CLI. The config helper raises RuntimeError when it is unset (DESIGN.md Clause 6). features_dir defaults to {INPUT_DATA_DIR}/features/ and run_dir_base to {INPUT_DATA_DIR}/runs/.

  • Features must be pre-built. Hindcast does not build features — that is the job of cli run features. Preflight checks fit.parquet and pred.parquet exist before any compute starts (DESIGN.md Clause 31).

  • _create_run_root (run_hindcast.py:77) is the single point that sets config.models_dir and config.preds_dir; all subsequent stages derive paths from those fields.

  • Walk-forward folds use ExpandingFoldGenerator from run/experiment_protocol.py. Training data for the fold holding out test_year is strictly year < test_year, so no future data can leak into training.

  • The production fold is always named "production" and trains on all available data. It is accessed via ExperimentResult.production, not via ExperimentResult.hindcast_slices.

  • _persist_included is called before the walk-forward loop. The per-fold predict calls load included_geo_identifiers.txt via ExperimentResult; the file must exist before the first fold attempts to score.

  • Postprocess writes per-mode calibration parquets at {run_dir}/conformal/{mode}.parquet (DESIGN.md artefact contract, Clause 34). These are the inputs to the forecast pipeline's conformal interval machinery.

  • Area-weighted aggregation is mandatory throughout (DESIGN.md clause on unweighted means being forbidden). aggregate_weighted_frame is the only legal ADM rollup.
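The area-weighted rollup invariant can be sketched as a weighted group mean. This is an illustrative stand-in for aggregate_weighted_frame, with assumed column names; only the rule itself (sum of value x weight over sum of weight, never an unweighted mean) comes from the source.

```python
import pandas as pd

def aggregate_weighted(frame: pd.DataFrame, value_col: str,
                       weight_col: str, by: str) -> pd.Series:
    """Area-weighted rollup: sum(value * weight) / sum(weight) per group.
    Hypothetical sketch of the only legal ADM aggregation rule."""
    g = frame.assign(_wv=frame[value_col] * frame[weight_col]).groupby(by)
    return g["_wv"].sum() / g[weight_col].sum()

# Toy county-level predictions rolled up to ADM1 (state) level
counties = pd.DataFrame({
    "adm1": ["IA", "IA", "IL"],
    "yield_pred": [10.0, 12.0, 11.0],
    "harvested_area": [100.0, 300.0, 200.0],
})
state_yield = aggregate_weighted(counties, "yield_pred", "harvested_area", by="adm1")
# IA = (10*100 + 12*300) / 400 = 11.5
```

An unweighted mean would give IA a value of 11.0 and over-represent small counties, which is exactly what the DESIGN.md clause forbids.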

Failure modes and recovery

| Symptom | Cause | Recovery |
| --- | --- | --- |
| SystemExit on fit.parquet / pred.parquet | Features not yet built | Run make features EXPERIMENT_KEY=<key> or cli run features |
| RuntimeError: INPUT_DATA_DIR not set | Env var missing at config load | Export INPUT_DATA_DIR before invoking the CLI |
| OperationalError from MLflow | Two hindcast runs for the same commodity in parallel write to the same SQLite DB | Run same-commodity hindcast pipelines sequentially (project MEMORY.md) |
| Run fails mid-fold | Transient error (disk, network) after artefacts were written | Identify the last successful fold from preds/{key}/, delete partial artefacts for the failed fold, and rerun cli run hindcast |
| KeyError in MedianImputer.transform at predict time | feature_fill_values.parquet was written with a RangeIndex (pre-Clause-30 artefact) | Delete the affected fold directory and rerun |
| FileNotFoundError at postprocess on walk_forward_preds.parquet | Walk-forward phase did not complete | Rerun the full hindcast; individual fold recovery requires deleting partial fold outputs |
| CI widths unexpectedly narrow | in_sample_pooled calibration mode active | Switch forecast.residual_mode to an OOS mode and re-run cli run postprocess |
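The SystemExit behaviour in the first row (and in the preflight step) can be sketched as follows. The function body is a hypothetical simplification of run_preflight; only the fail-on-first-missing-path semantics come from the source.

```python
import sys
from pathlib import Path

def run_preflight(paths):
    """Halt with SystemExit on the first missing path, before any compute
    starts. Illustrative sketch of the preflight contract only."""
    for p in map(Path, paths):
        if not p.exists():
            sys.exit(f"preflight: missing required input: {p}")

# An existing path passes silently; a missing one raises SystemExit
run_preflight(["/tmp"])
```

Failing before the run root is created means a features-not-built mistake never produces a half-populated run_dir, which keeps the recovery column simple: build the features and rerun.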

Cross-references

Entities:

  • ExperimentConfig — carries all path, model, and protocol config
  • RunDir — the on-disk persistence root; sole cross-stage contract
  • ExperimentResult — aggregate root loaded from run_dir
  • HindcastSlice — per-fold artefact handle
  • CalibrationResult — conformal calibration sidecar
  • HindcastDelivery — validated delivery document

Constituent pipelines: preflight, feature_build, fit, predict, postprocess, evaluate, deliver

Concepts:

  • walk_forward_cv — fold scheduling and expanding-window semantics
  • experiment_protocol — ExpandingFoldGenerator and fold types
  • conformal_modes — four residual-mode variants
  • hindcast_vs_forecast — why hindcast and forecast are separate pipelines
  • input_data_dir_contract — env var contract

Source pages:

  • stages — run_hindcast.py role in the stage DAG
  • orchestration — run/runner.py and run/experiment_protocol.py
  • lib — ExperimentResult and HindcastSlice loading helpers

PRs that materially changed this stage

  • PR #339 — nine-phase restructure that established the current run_hindcast.run skeleton with _run_walk_forward_phase and _run_production_fit_phase as separate private helpers.
  • PR #345 (tl/fix-path-issues) — added AnyPath-based path resolution so S3 URIs work correctly throughout the orchestrator (DESIGN.md Clause 27).
  • PR #361 — introduced four ResidualMode variants and per-mode conformal sidecar parquets; updated postprocess_experiment call in run_hindcast.run.
  • PR #369 (f5399b96) — multi-year forecast support; restructured forecast paths from forecast/{init_date}/ to forecast/{season_year}/{init_date}/ — touched RunDir layout and ForecastSlice path semantics, which run_hindcast consumes indirectly via ExperimentResult.