commodity_hindcast: Overview

What it does

commodity_hindcast produces within-season yield forecasts for five agricultural commodities — US corn, US soybeans, US cotton, US wheat, and Brazilian soybeans — at three geographic admin levels (ADM0 country, ADM1 state, ADM2 county). The package operates in two modes that share a single artefact tree: hindcast mode runs a walk-forward cross-validation loop across past seasons, producing the audit-grade time series delivered to clients; forecast mode issues a point-in-time prediction for a specific (season_year, init_date) pair, reusing the production model fitted by the hindcast run. Every run is reproducible: configuration is validated atomically at load time, every artefact is written to a timestamped run_dir under INPUT_DATA_DIR/runs/, and MLflow tracks params and artefacts against each invocation.

Outputs are client-facing CSVs — three per run, one per ADM level — carrying predicted mean yield, conformal prediction intervals at five coverage levels (50/68/80/90/95), and benchmark columns sourced from NASS, WASDE, and CONAB. The schema is validated row-by-row by DeliveryRow on construction so corrupt deliveries are caught before they leave the pipeline.
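The sketch below makes the row-level contract concrete. It shows a delivery row that validates itself on construction, plus the kg/ha to delivered-unit conversion that the DELIVER stage performs. Everything named here is hypothetical: `DeliveryRowSketch`, `to_delivery_units` and `LB_PER_UNIT`. The nesting rule is also an assumed check, not DeliveryRow's documented behaviour; it requires each wider coverage band to contain the narrower one. The bushel weights are the standard USDA conventions: 56 lb for corn, 60 lb for soybeans and wheat.

```python
from dataclasses import dataclass
from typing import Dict

LB_TO_KG = 0.45359237        # exact, by definition of the avoirdupois pound
ACRE_TO_HA = 0.40468564224   # exact, international acre

# Hypothetical lookup: pounds per delivered unit. Grains are delivered in
# bu/ac using standard USDA bushel weights; cotton is delivered in lbs/ac.
LB_PER_UNIT = {"corn": 56.0, "soybeans": 60.0, "wheat": 60.0, "cotton": 1.0}

CI_LEVELS = (50, 68, 80, 90, 95)  # coverage levels listed in the overview


def to_delivery_units(value_kg_ha: float, commodity: str) -> float:
    """Convert an internal kg/ha yield to bu/ac (grains) or lbs/ac (cotton)."""
    kg_per_unit = LB_PER_UNIT[commodity] * LB_TO_KG
    # kg/ha * ha/ac -> kg/ac, then divide by kg per unit -> units/ac
    return value_kg_ha * ACRE_TO_HA / kg_per_unit


@dataclass(frozen=True)
class DeliveryRowSketch:
    """Hypothetical delivery row; an invalid row raises on construction."""

    geo_identifier: str
    year: int
    mean: float
    lower: Dict[int, float]
    upper: Dict[int, float]

    def __post_init__(self) -> None:
        for bounds in (self.lower, self.upper):
            if set(bounds) != set(CI_LEVELS):
                raise ValueError(f"expected CI levels {CI_LEVELS}, got {sorted(bounds)}")
        # Walk from the narrowest band outwards: the 50% band must contain
        # the mean, and every wider band must contain the previous one.
        inner_lo = inner_hi = self.mean
        for level in CI_LEVELS:
            lo, hi = self.lower[level], self.upper[level]
            if lo > inner_lo or hi < inner_hi:
                raise ValueError(f"{level}% interval is not nested for {self.geo_identifier}")
            inner_lo, inner_hi = lo, hi
```

The checks run in `__post_init__`, so a bad row fails when it is built rather than when the CSV is read by a client.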

How it does it (10-line tour)

1. Config load: ExperimentConfig (pydantic-settings) validates the full nested config atomically. All ResolvablePath fields resolve against INPUT_DATA_DIR. See ExperimentConfig.
2. Preflight: stage-specific preflight_paths_for_<stage>() checks abort the run before any compute if required inputs are absent. See Pipeline: preflight.
3. Features: five builder functions (yields, weather, climo, NDVI, stress) produce fit.parquet and pred.parquet keyed by (year, geo_identifier, init_date). See Pipeline: feature_build.
4. FIT (hindcast walk-forward): an ExpandingFoldGenerator walks test_years; each fold fits a Detrender + Regressor and writes predictions to preds/{commodity}/{fold_label}/. The production fold is a separate no-holdout fit on all data. See Pipeline: fit.
5. PREDICT: fold models score pred.parquet to produce walk_forward_preds.parquet per fold. See Pipeline: predict.
6. POSTPROCESS: per-fold bias correctors (addressing the NASS coverage gap) and conformal calibration produce postprocessed/{commodity}_national.parquet and per-mode conformal/{mode}.parquet sidecars. See Pipeline: postprocess.
7. EVALUATE: metrics and plots are written to reports/ and logged as MLflow artefacts. See Pipeline: evaluate.
8. DELIVER: HindcastDelivery / DeliveryRow assemble the three client CSVs (one per ADM level), converting internal kg/ha to bu/ac or lbs/ac. See Pipeline: deliver.
9. FORECAST (sibling): run_forecast.run() builds weather-spliced features, scores with the production model, postprocesses, and delivers to forecast/{season_year}/{init_date}/delivery/. See Pipeline: forecast.
10. Dashboard: a Streamlit app reads run-dir CSVs directly at startup; no API layer exists between the app and the artefact tree. See Pipeline: dashboard.
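The preflight gate described above can be sketched as a list of path checks. Every check is evaluated before anything runs, and the run aborts with one error listing all missing inputs. This is a minimal illustration, not the package's `Check` / `run_preflight()` API. `InputCheck`, `run_checks`, `fit_stage_checks`, `PreflightFailed` and the `features/` layout are hypothetical. The sketch also uses local `pathlib.Path` for brevity, whereas the package routes paths through `cloudpathlib.AnyPath`.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import List


@dataclass(frozen=True)
class InputCheck:
    """Hypothetical stand-in for a preflight check: a named required path."""

    name: str
    path: Path

    def ok(self) -> bool:
        return self.path.exists()


class PreflightFailed(RuntimeError):
    """Raised before any compute when required inputs are missing."""


def run_checks(checks: List[InputCheck]) -> None:
    # Evaluate every check first so the error lists all missing inputs at once.
    missing = [c for c in checks if not c.ok()]
    if missing:
        detail = "\n".join(f"  - {c.name}: {c.path}" for c in missing)
        raise PreflightFailed(f"{len(missing)} required input(s) missing:\n{detail}")


def fit_stage_checks(run_dir: Path) -> List[InputCheck]:
    # Hypothetical layout: the FIT stage needs both feature tables.
    return [
        InputCheck("fit features", run_dir / "features" / "fit.parquet"),
        InputCheck("pred features", run_dir / "features" / "pred.parquet"),
    ]
```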

Architecture at a glance

```mermaid
flowchart LR
    CONFIG[EXPERIMENT_CONFIG\npydantic-settings root]
    PREFLIGHT[PREFLIGHT\nCheck list gate]
    FEATURES[FEATURE_BUILD\nfit + pred parquets]
    FIT[FIT\nDetrender + Regressor\nper fold]
    PREDICT[PREDICT\nwalk_forward_preds\nper fold]
    POSTPROCESS[POSTPROCESS\nbias_corrector + conformal\nnational.parquet]
    EVALUATE[EVALUATE\nmetrics + plots]
    DELIVER[DELIVER\nDeliveryRow CSVs]
    FORECAST[FORECAST\nweather splice\nper season_year init_date]
    RUNDIR[(RUN_DIR\nartefact tree)]

    CONFIG --> PREFLIGHT
    CONFIG --> FEATURES
    PREFLIGHT --> FEATURES
    FEATURES --> FIT
    FIT --> PREDICT
    PREDICT --> POSTPROCESS
    POSTPROCESS --> EVALUATE
    POSTPROCESS --> DELIVER
    RUNDIR -. read-only .-> FORECAST
    FORECAST --> RUNDIR
    FIT --> RUNDIR
    PREDICT --> RUNDIR
    POSTPROCESS --> RUNDIR
    DELIVER --> RUNDIR
```
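One way to read the flowchart is as a dependency graph. The sketch below uses the standard-library `graphlib` module (Python 3.9+). The `STAGE_DEPS` mapping is transcribed by hand from the diagram; the package is not known to define such a structure. The code derives one valid stage order. It also shows that the two RUN_DIR edges around FORECAST form a cycle if the dashed read-only edge is treated as an ordinary dependency. That is consistent with FORECAST being drawn as a sibling that reads persisted artefacts rather than as another downstream stage.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Dict, List, Set

# Stage-to-stage edges from the flowchart, written as node -> predecessors.
# RUN_DIR edges are omitted: RUN_DIR is storage rather than a stage.
STAGE_DEPS: Dict[str, Set[str]] = {
    "PREFLIGHT": {"CONFIG"},
    "FEATURES": {"CONFIG", "PREFLIGHT"},
    "FIT": {"FEATURES"},
    "PREDICT": {"FIT"},
    "POSTPROCESS": {"PREDICT"},
    "EVALUATE": {"POSTPROCESS"},
    "DELIVER": {"POSTPROCESS"},
}


def stage_order(deps: Dict[str, Set[str]]) -> List[str]:
    """Return one valid execution order (predecessors always come first)."""
    return list(TopologicalSorter(deps).static_order())


def is_acyclic(deps: Dict[str, Set[str]]) -> bool:
    try:
        stage_order(deps)
    except CycleError:
        return False
    return True


# Adding FORECAST with its two RUN_DIR edges closes a loop, so FORECAST
# cannot be scheduled as just another downstream stage of the same graph.
WITH_FORECAST = {**STAGE_DEPS, "FORECAST": {"RUNDIR"}, "RUNDIR": {"FORECAST"}}
```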

The five aggregates

| Aggregate root | Owned children | Persistence | Entity page |
|---|---|---|---|
| `ExperimentConfig` | `CommodityConfig`, `ModelConfig`, `ExperimentProtocolConfig`, `PostprocessConfig`, `DeliveryConfig`, `ForecastConfig` (optional), `list[ReferenceYieldSpec]` | `run_dir/config_resolved.yaml` | ExperimentConfig |
| `ExperimentResult` | `tuple[HindcastSlice, …]`, `tuple[ForecastSlice, …]` | Filesystem tree at `run_dir_base/<timestamp>_<name>/` | ExperimentResult |
| `HindcastSlice` | `detrender.pkl`, `model.*`, `feature_fill_values.parquet`, `train_preds.parquet`, `walk_forward_preds.parquet`, `year_data.parquet`, optional `bias_corrector.pkl` | `run_dir/models/{commodity}/{fold_label}/` and `run_dir/preds/{commodity}/{fold_label}/` | HindcastSlice |
| `ForecastSlice` | `indices.zarr`, `features/pred.parquet`, `walk_forward_preds.parquet`, `postprocessed/national.parquet`, `delivery/*.csv` | `run_dir/forecast/{season_year}/{init_date}/` | ForecastSlice |
| `HindcastDelivery` | `list[DeliveryRow]`, `generated_date` | `delivery/Treefera_{commodity}_{ADM}_Hindcast_{YYYYMMDD}.csv` (three files per run) | HindcastDelivery |

Two further aggregates exist: ExperimentProtocolConfig + Fold schedule (the walk-forward CV schedule embedded in ExperimentConfig, persisted only via config_resolved.yaml) and Check list (ephemeral per-preflight-call gate, never persisted). See AGGREGATES.md.
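A HindcastSlice is keyed by `(commodity, fold_label)` and spans two directories under the run dir. The sketch below shows how those directories might be derived. `HindcastSliceLocation` is hypothetical. `PurePosixPath` is used only because the example is local; it is not safe for `s3://` run directories, which need `cloudpathlib.AnyPath`.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import Iterable, List


@dataclass(frozen=True)
class HindcastSliceLocation:
    """Hypothetical helper: where one (commodity, fold) slice lives on disk."""

    run_dir: PurePosixPath
    commodity: str
    fold_label: str

    @property
    def models_dir(self) -> PurePosixPath:
        return self.run_dir / "models" / self.commodity / self.fold_label

    @property
    def preds_dir(self) -> PurePosixPath:
        return self.run_dir / "preds" / self.commodity / self.fold_label


def slices_for(run_dir: PurePosixPath, commodity: str, fold_labels: Iterable[str]) -> List[HindcastSliceLocation]:
    # One slice per fold, including the no-holdout "production" fold.
    return [HindcastSliceLocation(run_dir, commodity, label) for label in fold_labels]
```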

The eleven bounded contexts

| # | Context | Subpackage(s) | Public surface |
|---|---|---|---|
| 1 | Configuration & Orchestration | `config.py`, `cli.py`, `configs/*.yaml`, `lib/path_utils.py`, `lib/calendar.py` | `ExperimentConfig`, `CommodityConfig`, `cli` Click group |
| 2 | Preflight | `run/preflight.py` | `Check`, `run_preflight()`, `preflight_paths_for_*` |
| 3 | Feature Engineering | `features/`, `lib/edit_and_imputation/`, `lib/calendar.py` | `build_features()`, `Builder` protocol, `assemble()` |
| 4 | Experiment & Modelling | `run/`, `stages/run_fit.py`, `stages/run_hindcast.py`, `stages/run_predict.py`, `models/detrend/`, `models/regression/`, `lib/results/` | `ExperimentResult`, `HindcastSlice`, `train()`, `run()` |
| 5 | Post-processing | `stages/run_meta_models.py`, `models/meta_models/` | `postprocess_experiment()`, `AbstractBiasCorrector`, conformal helpers |
| 6 | Evaluation & Diagnostics | `stages/run_diagnostics.py`, `diagnostics/` | `evaluate_experiment()`, `PlotGroup`, `PLOT_REGISTRY`, `gen_metrics` |
| 7 | Delivery | `delivery/`, `stages/run_deliver.py` | `HindcastDelivery`, `DeliveryRow`, `deliver_experiment()` |
| 8 | Forecast | `stages/run_forecast.py`, `features/forecast_weather.py`, `features/forecast_long_range_stub.py` | `ForecastSlice`, `run()`, `materialise_forecast_indices()` |
| 9 | Experiment Tracking | `lib/tracking/` | MLflow run helpers, `metadata_<stage>.yaml` side-channel |
| 10a | Reference Data | `lib/reference_data/` | `load_nass_panel()`, `load_wasde()`, `load_conab()`, `BaseReferenceYieldLoader` |
| 10b | Geo & Identifiers | `lib/geo/`, `delivery/geo_normalise.py` | `GeoIdentifier` NewType, `make_geo_identifier()`, `area_weighted_mean()` |
| 11 | Dashboard | `app/` | `app.py` Streamlit entry point, `run_loader.py` |

See BOUNDED_CONTEXTS.md for the Mermaid context map and full anti-corruption layer analysis.
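The Geo & Identifiers context exposes a `GeoIdentifier` NewType and an area-weighted mean. Below is a minimal sketch of that pattern. It assumes, hypothetically, dotted identifiers whose parent is obtained by dropping the last segment. This page does not give the real identifier format or the signatures of `make_geo_identifier()` and `area_weighted_mean()`, so the helpers here use different, hypothetical names. A `NewType` only distinguishes identifiers for the static type checker; at runtime the values are plain `str`.

```python
from collections import defaultdict
from typing import Dict, NewType, Tuple

GeoIdentifier = NewType("GeoIdentifier", str)


def build_geo_id(*codes: str) -> GeoIdentifier:
    """Hypothetical identifier scheme: admin codes joined with '.'."""
    return GeoIdentifier(".".join(codes))


def parent_of(geo: GeoIdentifier) -> GeoIdentifier:
    # Drop the last admin segment, e.g. ADM2 "US.IA.001" -> ADM1 "US.IA".
    return GeoIdentifier(geo.rsplit(".", 1)[0])


def area_weighted_rollup(
    children: Dict[GeoIdentifier, Tuple[float, float]],
) -> Dict[GeoIdentifier, float]:
    """Aggregate child yields to their parents, weighting by area.

    `children` maps a child id to (yield, area); zero-area parents are dropped.
    """
    weighted: Dict[GeoIdentifier, float] = defaultdict(float)
    area: Dict[GeoIdentifier, float] = defaultdict(float)
    for geo, (value, weight) in children.items():
        parent = parent_of(geo)
        weighted[parent] += value * weight
        area[parent] += weight
    return {p: weighted[p] / area[p] for p in weighted if area[p] > 0}
```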

What is distinctive about this package

Atomic config validation at load time. ExperimentConfig inherits pydantic_settings.BaseSettings and is the single config authority for every stage. Chained pydantic model validators check every nested config block before any compute begins: commodity calendar, model selection, postprocess settings, delivery CI levels and forecast paths. A single model_validator(mode="after") resolves every ResolvablePath field anywhere in the nested tree against data_root, which is set from INPUT_DATA_DIR. A missing env var, an invalid CI level or a mis-named ReferenceYieldSpec therefore raises a RuntimeError or ValidationError at load time, never mid-run. AGGREGATES.md explains at length why the whole config is treated as one aggregate rather than seven.
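Below is a minimal sketch of the validate-everything-then-resolve pattern. It is written against plain pydantic v2 `BaseModel` rather than `pydantic_settings.BaseSettings`, so it runs without environment variables; `data_root` is passed explicitly instead of being read from INPUT_DATA_DIR. The following are hypothetical simplifications of the `ResolvablePath` mechanism described above:

- the block names;
- the `_resolve_tree` walker;
- treating every relative `Path` field as resolvable.

```python
from pathlib import Path
from typing import List

from pydantic import BaseModel, field_validator, model_validator

ALLOWED_CI_LEVELS = {50, 68, 80, 90, 95}


class FeaturesBlock(BaseModel):
    weather_path: Path
    ndvi_path: Path


class DeliveryBlock(BaseModel):
    ci_levels: List[int]

    @field_validator("ci_levels")
    @classmethod
    def known_levels(cls, value: List[int]) -> List[int]:
        unknown = set(value) - ALLOWED_CI_LEVELS
        if unknown:
            raise ValueError(f"unsupported CI levels: {sorted(unknown)}")
        return value


def _resolve_tree(model: BaseModel, root: Path) -> None:
    # Walk every nested block; anchor any relative Path at root.
    for name in type(model).model_fields:
        value = getattr(model, name)
        if isinstance(value, BaseModel):
            _resolve_tree(value, root)
        elif isinstance(value, Path) and not value.is_absolute():
            setattr(model, name, root / value)


class ConfigSketch(BaseModel):
    data_root: Path
    features: FeaturesBlock
    delivery: DeliveryBlock

    @model_validator(mode="after")
    def resolve_paths(self) -> "ConfigSketch":
        # Runs only after every nested block has validated successfully.
        if not self.data_root.is_absolute():
            raise ValueError("data_root must be absolute")
        _resolve_tree(self, self.data_root)
        return self
```

The `mode="after"` validator runs only once every nested block has validated. An invalid CI level anywhere in the tree therefore surfaces as a single `ValidationError` at construction, before any path is resolved.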

ResolvablePath + AnyPath — transparent S3/local path handling. Every data path in the config is typed as ResolvablePath (a Pydantic custom type resolved at load time) and every downstream consumer uses cloudpathlib.AnyPath so that S3 URIs and local paths are handled identically. Wrapping an S3Path in a bare pathlib.Path() collapses the URI to a local-cache path and drops S3 semantics — the root cause of the QA failure in PR #345. See Concept: s3_path_safety and Concept: input_data_dir_contract.
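A standard-library illustration of why URIs must not pass through `pathlib`: `PurePosixPath` collapses the `//` after the scheme, so `s3://bucket/runs` becomes `s3:/bucket/runs`. With cloudpathlib the mechanism differs, since converting an `S3Path` with `pathlib.Path()` yields its local-cache location as described above, but the lesson is the same. `resolve_against` and `looks_like_uri` are hypothetical. The package itself uses `ResolvablePath` and `cloudpathlib.AnyPath`, and this sketch does not reproduce their exact behaviour.

```python
from pathlib import PurePosixPath


def looks_like_uri(raw: str) -> bool:
    # Hypothetical heuristic: anything with a scheme separator is remote.
    return "://" in raw


def resolve_against(raw: str, data_root: str) -> str:
    """Resolve raw against data_root without routing URIs through pathlib."""
    if looks_like_uri(raw) or raw.startswith("/"):
        return raw  # already absolute (remote or local)
    if looks_like_uri(data_root):
        # Plain string join keeps the "scheme://bucket" prefix intact.
        return data_root.rstrip("/") + "/" + raw
    return str(PurePosixPath(data_root) / raw)


# The failure mode: pathlib treats "//" as a redundant separator.
MANGLED = str(PurePosixPath("s3://bucket/runs"))  # "s3:/bucket/runs"
```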

Walk-forward CV that mirrors production scoring exactly. The hindcast loop uses ExpandingFoldGenerator to produce one fold per test year. Within each fold the model is fit once on harvest-time data and then reused — unchanged — at every init_date in the season; only the feature matrix changes. The "production" fold is a no-holdout fit on all available data. This means OOS hindcast accuracy, conformal calibration, and production forecasting all use the identical code path and artefact schema, so the confidence intervals reported on the delivery CSVs are directly grounded in walk-forward residuals. See Concept: walk_forward_cv.
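The fold schedule can be sketched as an expanding window. This is a hypothetical stand-in for `ExpandingFoldGenerator`, whose real constructor arguments and fold-label format are not documented here. `FoldSketch`, `expanding_folds`, `min_train_years` and the `test_{year}` labels are all assumptions; only the "production" no-holdout fold is named in the text.

```python
from dataclasses import dataclass
from typing import Iterator, Optional, Sequence, Tuple


@dataclass(frozen=True)
class FoldSketch:
    label: str
    train_years: Tuple[int, ...]
    test_year: Optional[int]  # None marks the no-holdout production fold


def expanding_folds(
    years: Sequence[int], test_years: Sequence[int], min_train_years: int = 1
) -> Iterator[FoldSketch]:
    """Yield one expanding-window fold per test year, then the production fold."""
    available = sorted(set(years))
    for test_year in sorted(test_years):
        # Train strictly on seasons before the held-out year: no look-ahead.
        train = tuple(y for y in available if y < test_year)
        if len(train) < min_train_years:
            continue
        yield FoldSketch(f"test_{test_year}", train, test_year)
    yield FoldSketch("production", tuple(available), None)
```

Each yielded fold would be fitted once. Its model would then be scored at every `init_date` of the test season, with only the feature matrix differing between those scores.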

Multi-mode conformal calibration. PR #361 introduced CalibrationResult and four residual_mode values: hindcast_oos_per_init_date, hindcast_oos_per_year, hindcast_oos_fully_pooled, and in_sample_pooled. The first element of PostprocessConfig.conformalise is the primary mode used for delivery interval bands; additional elements are saved as diagnostic sidecars under run_dir/conformal/{mode}.parquet. The _OOS_MODES frozenset in run_meta_models.py controls which modes require CV-fold artefacts to be present — it is hand-maintained rather than derived from the ResidualMode Literal, which is a known drift risk. See Concept: conformal_modes and Concept: residual_modes.
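The core arithmetic of conformal calibration can be sketched as standard split-conformal prediction on absolute out-of-sample residuals. This is the textbook technique, not necessarily what `CalibrationResult` computes, and the function names are hypothetical. The mapping of modes to groupings is an interpretation of the mode names, not documented behaviour: a per-init-date mode would group residuals by `init_date`, and a fully pooled mode would use a single group.

```python
import math
from typing import Dict, Iterable, Sequence

DELIVERY_LEVELS = (50, 68, 80, 90, 95)


def conformal_half_width(residuals: Sequence[float], level: int) -> float:
    """Split-conformal half-width at `level` percent coverage.

    Uses the finite-sample rank ceil((n + 1) * level / 100) on the sorted
    absolute residuals; returns inf when there are too few residuals.
    """
    scores = sorted(abs(r) for r in residuals)
    n = len(scores)
    if n == 0:
        raise ValueError("need at least one residual")
    rank = -((-(n + 1) * level) // 100)  # integer ceiling, no float rounding
    return math.inf if rank > n else scores[rank - 1]


def half_widths_by_group(
    residuals_by_group: Dict[str, Sequence[float]],
    levels: Iterable[int] = DELIVERY_LEVELS,
) -> Dict[str, Dict[int, float]]:
    """Calibrate separately per group (e.g. per init_date); one group = pooled."""
    levels = tuple(levels)
    return {
        group: {lvl: conformal_half_width(res, lvl) for lvl in levels}
        for group, res in residuals_by_group.items()
    }
```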

Multi-year forecasting per init_date. PR #369 extended the forecast pipeline so that a single init_date can produce forecasts for multiple season_year values. To avoid path collisions, the artefact tree was restructured from forecast/{init_date}/ to forecast/{season_year}/{init_date}/. For season_year values beyond the climatology zarr's coverage, the long-range stub (forecast_long_range_stub.py) fills with trailing medians, so the forecast collapses to trend-only output. This is a documented stub, not a complete implementation. See Pipeline: multi_year_forecast.
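The path change and the stub fill can be sketched together. `forecast_slice_dir`, `climatology_or_stub` and the five-year `window` are hypothetical. "Trailing median" is interpreted here as the median of the most recent covered years; the real stub in `forecast_long_range_stub.py` may differ.

```python
import statistics
from pathlib import PurePosixPath
from typing import Dict


def forecast_slice_dir(run_dir: PurePosixPath, season_year: int, init_date: str) -> PurePosixPath:
    # season_year sits above init_date, so one init_date can serve several
    # season years without their artefacts overwriting each other.
    return run_dir / "forecast" / str(season_year) / init_date


def climatology_or_stub(by_year: Dict[int, float], season_year: int, window: int = 5) -> float:
    """Return the covered value, else the median of the last `window` covered years."""
    if season_year in by_year:
        return by_year[season_year]
    trailing = sorted(y for y in by_year if y < season_year)[-window:]
    if not trailing:
        raise ValueError(f"no coverage before {season_year}")
    return statistics.median(by_year[y] for y in trailing)
```

Under this sketch, every season year past the end of coverage receives the same filled value. Those features therefore carry no year-specific signal, which is consistent with the trend-only collapse noted above.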

Key reading paths

As a newcomer

This page
domain_model/BOUNDED_CONTEXTS.md
pipelines/forecast.md or pipelines/predict.md for depth
wiki/sources/docs/DESIGN.md
Then dig into specific entity pages

As an LLM ingesting a new source

wiki/AGENTS.md
domain_model/ENTITIES.md for the canonical name vocabulary
The relevant pipeline page
Append to index.md and log.md

As a maintainer making structural changes

domain_model/AGGREGATES.md
domain_model/BOUNDED_CONTEXTS.md
wiki/concepts/s3_path_safety.md if touching paths
wiki/concepts/input_data_dir_contract.md if touching configs

Open questions

marketing_year vs season_year: WASDE reports on a marketing-year basis, and US marketing years start in commodity-specific months (September for corn and soybeans, June for wheat, August for cotton). marketing_year is therefore not fully collapsed into season_year. The Reference Data → Post-processing seam carries an implicit translation; surfaces in domain_model/BOUNDED_CONTEXTS.md open question 1.
Conformal helpers in stages/ vs lib/: delivery/conversions.py imports the conformal half-width computation from stages/run_meta_models.py, creating an upward layer edge from Delivery into Experiment orchestration. The correct home is lib/; surfaces in domain_model/BOUNDED_CONTEXTS.md open question 2.
included_geo_identifiers ownership: computed in the FIT stage, persisted to run_dir/included_geo_identifiers.txt, and consumed as a required kwarg by three downstream contexts. Whether it should be a first-class property on ExperimentResult is unresolved; surfaces in domain_model/BOUNDED_CONTEXTS.md open question 4.
_OOS_MODES frozenset drift risk: the set of residual modes that require CV folds is hand-maintained in stages/run_meta_models.py rather than derived from the ResidualMode Literal. Adding a new mode without updating this frozenset would silently produce incorrect results.
Long-range stub completeness: beyond the climatology zarr's coverage, forecasts collapse to trend-only. PR #369 acknowledges this as a stub, but it is not tracked as a formal open issue. Surfaces in pipelines/multi_year_forecast.md.
CalibrationResult persistence vs transience: AGGREGATES.md marks it transient, while ENTITIES.md documents its save/load methods. AGGREGATES.md reflects a conservative hedge, whereas ENTITIES.md reflects what the source code actually does; surfaces in domain_model/delta_vs_existing.md.
build_detrender() / build_regressor() on ExperimentConfig: factory methods on the config class are a known misplacement, flagged by a TODO at config.py:719. Surfaces in entities/ExperimentConfig.md open questions.
Forecast vs hindcast static mode boundary: the mode is determined solely by whether ExperimentConfig.forecast is set, so a single run cannot produce both outputs except via run all. Surfaces in domain_model/BOUNDED_CONTEXTS.md open question 3.
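One way to remove the `_OOS_MODES` drift risk is to make the fold requirement an exhaustive table keyed by the Literal. An import-time check then rejects any mode added to the Literal without a table entry. The four mode strings come from the conformal-calibration paragraph earlier on this page. `_NEEDS_CV_FOLDS`, `OOS_MODES` and `requires_cv_folds` are hypothetical, and which modes need folds is inferred from their names. This sketches a possible fix, not the package's current code.

```python
from typing import Dict, FrozenSet, Literal, get_args

ResidualMode = Literal[
    "hindcast_oos_per_init_date",
    "hindcast_oos_per_year",
    "hindcast_oos_fully_pooled",
    "in_sample_pooled",
]

# Hypothetical exhaustive table: every mode must state whether it needs folds.
_NEEDS_CV_FOLDS: Dict[str, bool] = {
    "hindcast_oos_per_init_date": True,
    "hindcast_oos_per_year": True,
    "hindcast_oos_fully_pooled": True,
    "in_sample_pooled": False,
}

# Fail at import time, not mid-run, if the table and the Literal diverge.
if set(_NEEDS_CV_FOLDS) != set(get_args(ResidualMode)):
    raise RuntimeError("_NEEDS_CV_FOLDS is out of sync with ResidualMode")

OOS_MODES: FrozenSet[str] = frozenset(m for m, needs in _NEEDS_CV_FOLDS.items() if needs)


def requires_cv_folds(mode: str) -> bool:
    if mode not in _NEEDS_CV_FOLDS:
        raise ValueError(f"unknown residual mode: {mode!r}")
    return _NEEDS_CV_FOLDS[mode]
```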

Cross-references

domain_model/ — formal entity-relationship model (entities, aggregates, bounded contexts, ER diagrams)
wiki/AGENTS.md — maintenance schema; read before writing any wiki page
wiki/synthesis/thesis.md — editorial thesis on design choices and tensions
wiki/sources/ — immutable source pages (code, configs, docs, PRs, commits)
wiki/entities/ — domain entity pages
wiki/concepts/ — cross-cutting concept pages
wiki/pipelines/ — stage-by-stage pipeline walkthroughs