commodity_hindcast Wiki — Index

Auto-regenerated after P6; updated after P7 lint pass. Total pages: 109.

Plan

Schema

  • AGENTS — maintenance schema; read before reading or writing any wiki page
  • log — append-only chronological record of ingests, queries, and lint passes

Synthesis

  • overview — Top-down overview of the commodity_hindcast package — what it does, how it is structured, where the seams are
  • thesis — Editorial thesis — what commodity_hindcast is good at, its central tensions, and the architectural facts that matter most

Domain model

  • Domain model README — entry point and reading order for the commodity_hindcast formal domain model
  • ENTITIES — canonical entity catalogue for the commodity_hindcast domain (~30 entities, 5 tiers)
  • RELATIONSHIPS — inter-entity relationships in the commodity_hindcast domain (~70 relationships)
  • AGGREGATES — domain aggregates and consistency boundaries in commodity_hindcast
  • BOUNDED_CONTEXTS — DDD bounded contexts of the commodity_hindcast domain (~11 contexts + Mermaid context map)
  • ER_DIAGRAM — Mermaid entity-relationship diagrams for the commodity_hindcast domain (three diagrams)
  • delta_vs_existing — relation between commodity_hindcast_kb/domain_model and the in-package domain-modelling/

Entities

Tier 1 — Core domain

  • Commodity — root discriminator for a crop being modelled — drives calendar, feature columns, yield units, and plausibility bounds
  • SeasonYear — integer harvest-year label that anchors a crop season; primary grouping key across feature parquets, fold labels, and delivery CSVs
  • InitDate — calendar date on which a within-season forecast is anchored; features are known up to init_date minus lag_days
  • Region — geographic administrative unit at ADM0/ADM1/ADM2 level, identified by a canonical GeoIdentifier string
  • Yield — scalar crop yield measurement or prediction; internally always kg/ha, converted to delivery units only at the delivery boundary
  • Fold — walk-forward CV fold identified by fold_label string; numeric labels encode a season year test cutoff, "production" encodes the no-holdout fit
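
The Fold entry above describes two kinds of fold_label: numeric strings that encode a season-year test cutoff, and the literal "production" for the no-holdout fit. A minimal sketch of how such a label could be decoded follows. The function name `cutoff_year_from_label` is hypothetical and is not taken from the package.

```python
from typing import Optional

# Literal quoted in the Fold entry; the constant name is an assumption.
PRODUCTION_LABEL = "production"


def cutoff_year_from_label(fold_label: str) -> Optional[int]:
    """Return the season-year test cutoff, or None for the production fit.

    Hypothetical helper: the real package may decode labels differently.
    """
    if fold_label == PRODUCTION_LABEL:
        return None  # no holdout year for the production fit
    if not fold_label.isdigit():
        raise ValueError(f"unrecognised fold_label: {fold_label!r}")
    return int(fold_label)
```

Returning None for the production fold keeps the "no holdout" case explicit instead of overloading a sentinel year.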

Tier 2 — Configuration aggregates

  • ExperimentConfig — root pydantic-settings aggregate — resolves all paths, holds every subordinate config block, and is passed as the single config argument to every pipeline stage
  • CommodityConfig — commodity-specific constants — crop calendar, builder registry, feature/target column names, unit weights, and plausibility bounds
  • ModelConfig — selects the detrend strategy and regression estimator for the FIT stage, plus sample-weight and fit-aggregation settings
  • FeatureBuilderConfig — discriminated-union config for the five feature data-source readers — YieldsBuilder, WeatherBuilder, ClimoBuilder, NDVIBuilder, StressBuilder
  • ReferenceYieldSpec — discriminated union over external reference-yield loader specifications — WasdeRefSpec, ConabFinalRefSpec, ConabLevantamentoRefSpec
  • BiasCorrectorConfig — configures the national-scale residual bias corrector fitted during the POSTPROCESS stage
  • PostprocessConfig — configures the POSTPROCESS stage — bias correction and conformal calibration mode selection
  • EditRuleConfig — discriminated union of Fellegi-Holt edit rules applied to raw survey data before feature assembly — detection rules and corrective operations
  • ExperimentProtocolConfig — walk-forward CV schedule — test years, expanding-window strategy, and production-fold county-selection thresholds
  • ForecastConfig — forecast-time paths, mandatory residual_mode, and runtime-injected init_date — present only in forecast mode
  • DeliveryConfig — delivery-phase configuration controlling client-facing CSV output — model name, CI bands, and post-transform flags for CI narrowing and frozen-tail removal

Tier 3 — Pipeline artefacts

  • RunDir — the on-disk persistence root for one experiment run — timestamped directory under INPUT_DATA_DIR/runs/ holding every pipeline artefact, config snapshot, and delivery CSV
  • ExperimentResult — frozen dataclass aggregate root for all hindcast artefacts under one run_dir — lazy handle to config, HindcastSlices, and ForecastSlices
  • HindcastSlice — lazy per-fold artefact handle — frozen dataclass exposing paths and loaders for one walk-forward CV fold's detrender, regressor, fill values, and prediction parquets
  • ForecastSlice — lazy per-(season_year, init_date) artefact handle — frozen dataclass for one in-season forecast, with paths isolated under run_dir/forecast/{season_year}/{init_date}/
  • CalibrationResult — frozen, persistable dataclass holding fitted conformal half-widths — save/load via long-format parquet sidecars; exposes predict_interval for forecast and delivery consumers

Tier 4 — Behavioural roles

  • FeatureBuilder — BuilderFn Protocol — the callable contract every feature builder must satisfy; dispatched by string key from builders/registry.py
  • Detrender — AbstractDetrend ABC — three concrete yield-trend detrenders (LinearState, GaussianWindow, PartialPooling) with a shared TREND_AXIS convention (epoch 1980-01-01, unit year)
  • Regressor — AbstractRegressionImpl ABC — three concrete residual regressors (Ridge, PCA-Ridge, XGBoost) operating on detrended features; per-regressor persistence formats documented
  • MetaModel — post-processing role layer covering the BiasCorrector hierarchy and conformal calibration (CalibrationResult, MapieConformalRegressor)
  • ReferenceYieldLoader — BaseReferenceYieldLoader ABC — four concrete loaders (WasdeLoader, ConabFinalLoader, ConabLevantamentoLoader, plus NASS) dispatched via ReferenceYieldSpec discriminated union

Tier 5 — Delivery and validation

  • HindcastDelivery — top-level delivery container for one commodity × ADM level × generated date; enforces no-duplicate-key and equal-init-date-count invariants across all rows
  • DeliveryRow — single row in a hindcast/forecast delivery CSV; 17 typed fields enforcing CI ordering and init-date consistency via Pydantic validators
  • Check — frozen preflight value object; name + passed + message + critical; run_preflight() raises SystemExit on any critical failure
  • SeasonWindow — frozen dataclass on CommodityConfig defining a named temporal aggregation window in season-DOY coordinates — consumed by climo and weather builders
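
The Check entry above lists four fields (name, passed, message, critical) and says run_preflight() raises SystemExit on any critical failure. The sketch below shows one way that contract could look. The field types, the default for `critical`, and the signature of `run_preflight` are assumptions; the index does not show them.

```python
import sys
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Check:
    """Frozen preflight value object (field types are assumed)."""
    name: str
    passed: bool
    message: str
    critical: bool = True  # assumed default


def run_preflight(checks: Iterable[Check]) -> None:
    """Report failed checks and abort if any failed check is critical."""
    failed = [c for c in checks if not c.passed]
    for c in failed:
        level = "CRITICAL" if c.critical else "warning"
        print(f"[preflight] {level} {c.name}: {c.message}", file=sys.stderr)
    if any(c.critical for c in failed):
        raise SystemExit(1)
```

In this sketch a non-critical failure is only reported, so a run can proceed with warnings while still aborting on a missing required input.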

Concepts

  • adm_levels — the ADM0/ADM1/ADM2 geographic hierarchy — how geo_identifier encodes level, make_geo_identifier construction, and delivery aggregation from county to national
  • bias_correction — national-scale coverage bias correction — AbstractBiasCorrector hierarchy, CoverageBiasCorrector formula, NoBiasCorrector pass-through, and per-fold persistence
  • conformal_calibration — redirect to conformal_modes; quick-reference for CalibrationResult save/load symbols
  • resolvable_path — the ResolvablePath type alias — an Annotated AnyPath that auto-resolves relative paths against data_root and drives automatic preflight coverage
  • climo_materialisation — how climatology is materialised for forecast indices — materialise_forecast_indices, materialised_climo_filepath, and the long-range stub for years beyond zarr coverage
  • conformal_modes — the four residual modes for conformal prediction intervals — recipes, dispatch, per-mode parquet sidecar layout, and calibration-set size vs bias trade-offs
  • hindcast_vs_forecast — what hindcast and forecast are conceptually, how their pipelines differ, and where the read-only boundary between them is enforced
  • input_data_dir_contract — how INPUT_DATA_DIR is the sole data-root resolver — DESIGN.md Clause 6, require_input_data_dir(), and the per-pipeline values
  • mlflow_tracking — how the pipeline logs runs, params, and artefacts to MLflow — decorators, SQLite backend, artefact tagging, and the parallel-run DB-locking issue
  • residual_modes — the forecast.residual_mode mandatory field — four invariant strings, the validate_residual_mode gate, and the no-backwards-compat architecture decision
  • s3_path_safety — how the pipeline handles S3 URIs safely via AnyPath, ResolvablePath, and AnyPathParam — and the three-layer bug stack that PR-345 fixed
  • walk_forward_cv — walk-forward cross-validation fold strategy — expanding window, ExpandingFoldGenerator, production fit, and how CV integrates with conformal calibration
  • weather_correction — what weather_correction_bu_ac is — the detrended component of the model yield in delivery units — and why the structural identity beats a vintage delta
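
The walk_forward_cv entry above refers to an expanding-window fold strategy. As a rough illustration, an expanding window trains on every season year strictly before the test year, so the training set grows by one year per fold. The function `expanding_folds` below is a hypothetical stand-in, not the package's ExpandingFoldGenerator, and the "strictly before" rule is an assumption about how the cutoff is applied.

```python
from typing import Iterable, Iterator, List, Tuple


def expanding_folds(
    all_years: Iterable[int],
    test_years: Iterable[int],
) -> Iterator[Tuple[str, List[int], List[int]]]:
    """Yield (fold_label, train_years, test_years) for each test year.

    Assumes training uses all years strictly before the test year.
    """
    years = sorted(set(all_years))
    for test_year in sorted(set(test_years)):
        train = [y for y in years if y < test_year]
        if train:  # skip folds that would have no training data
            yield str(test_year), train, [test_year]


folds = list(expanding_folds(range(2015, 2021), [2019, 2020]))
```

Here `folds` holds two folds labelled "2019" and "2020", with the second training set containing one more year than the first.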

Pipelines

In execution order:

  • preflight — path-existence gates that abort a run at config-load time if required input artefacts are missing — five check-sets, one per pipeline stage
  • feature_build — orchestrates the full feature engineering run — iterates configured builders, writes intermediate parquets, then calls assemble to produce fit.parquet, pred.parquet, and metadata.json
  • fit — trains detrender then regressor on one fold's training DataFrame; persists detrender.pkl, feature_fill_values.parquet, and the regressor file(s); returns a HindcastSlice handle
  • predict — inference kernel for one (season_year, init_date) pair — four-step inverse pipeline (detrend → score → weather-correct → retrend) split into pure-compute predict() and pure-persistence write_walk_forward_outputs()
  • postprocess — aggregates walk-forward fold predictions to national level, fits bias corrector and conformal calibration per configured mode, and writes per-mode sidecar parquets plus postprocessed/national.parquet
  • evaluate — computes per-fold metrics, writes text reports and CSV, and generates 15 diagnostic PNG plots from a completed hindcast run_dir
  • deliver — converts CV-fold walk_forward_preds to QUBE-format ADM0/ADM1/ADM2 CSV files with unit conversion, CI bands, and Pydantic row validation
  • forecast — per-(season_year, init_date) orchestrator composing feature build → predict → postprocess → deliver against an already-fitted run_dir
  • multi_year_forecast — forecasting multiple season_years from one init_date: the long-range climo stub, panel trailing-median imputation, and why output collapses to trend-only beyond zarr coverage
  • dashboard — Streamlit hindcast dashboard — run discovery via RunDescriptor, COMMODITY_CONFIG, window-aware MAPE scoring, and six interactive chart sections; reads RunDir artefacts directly with no API layer

Sources

Code subsystems

  • index — index of all code subsystem source pages
  • dashboard — Streamlit dashboard that reads RunDir artefacts directly and renders forecast evolution, accuracy, and WASDE-comparison charts for every commodity hindcast run on disk
  • delivery — client-facing CSV delivery subsystem — Pydantic row schemas, wide-to-long conversion, ADM aggregation, unit conversion, and S3/local export pipeline
  • detrend — detrending subsystem — three state-space detrenders (linear, Gaussian, partial-pooling) sharing a common abstract base and a single trend-axis convention
  • diagnostics — metrics computation, per-fold scoring, and rolling national vs NASS/WASDE/CONAB text reports; ADM1/ADM2 error tables; bu/ac unit conversion boundary
  • features — feature engineering subsystem — orchestrator, assembler, forecast splicing, long-range stub, and all concrete builders (yields, weather, climo, NDVI, stress)
  • lib — survey of all ~25 modules under lib/ — path anchoring, calendar, unit conversion, geo utilities, reference-data loaders, edit-and-imputation, artefact handles, and MLflow tracking helpers
  • meta_models — meta-models post-processing subsystem — BiasCorrector hierarchy, CalibrationResult (conformal calibration with save/load), four residual modes, and the run_meta_models stage orchestrator
  • orchestration — orchestration and configuration subsystem — CLI entry, Pydantic config, walk-forward runner, preflight gates, fold generation
  • plots — diagnostic plots subsystem — PlotRunner I/O orchestration, PlotRegistry/PlotSpec declarative discovery, prep modules, and all 11 plot-function modules
  • regression — regression models subsystem — AbstractRegressionImpl base contract, Ridge, PCA+Ridge, and XGBoost regressors, persistence formats, and runtime helpers
  • stages — stage modules under commodity_hindcast/stages/ — function signatures, artefact contracts, inner call sequences, and orchestrator flow for hindcast and forecast pipelines

Configs

  • index — index of all config source pages for commodity_hindcast — experiment YAMLs, ty.toml, LinkML schema
  • corn_usa — Corn (USA) experiment config — Apr–Oct season, PCA-Ridge with stress features, WASDE reference
  • cotton_usa — Cotton (upland, USA) experiment config — Apr–Nov season, lbs/acre delivery, no stress builder
  • soybeans_bra — Brazil soybean (soja) experiment config — southern-hemisphere cross-year season, IBGE yields, CONAB reference
  • soybeans_usa — Soybean (USA) experiment config — May–Oct season, 60 lb bushel, no stress builder, local zarr paths
  • wheat_usa — Winter wheat (USA) experiment config — cross-year Oct–Jul season, stress builder, 4-phase ramp
  • ty_toml — ty type-checker configuration for commodity_hindcast — suppresses pydantic-settings false positives
  • domain_schema — LinkML auto-generated schema for commodity_hindcast — structural class/enum catalogue from config.py
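
The Yield entry says yields are kept in kg/ha internally and converted only at the delivery boundary, and the configs above mention a 60 lb bushel (soybeans_usa) and lbs/acre delivery (cotton_usa). The arithmetic behind such a conversion can be sketched as follows. The function names are hypothetical; the two constants are the exact international definitions of the pound and the acre.

```python
KG_PER_LB = 0.45359237        # exact definition of the avoirdupois pound
HA_PER_ACRE = 0.40468564224   # exact definition of the international acre


def kg_ha_to_lb_ac(kg_per_ha: float) -> float:
    """Convert kg/ha to lb/acre."""
    return kg_per_ha / KG_PER_LB * HA_PER_ACRE


def kg_ha_to_bu_ac(kg_per_ha: float, lb_per_bushel: float) -> float:
    """Convert kg/ha to bu/acre for a given bushel weight in pounds."""
    return kg_ha_to_lb_ac(kg_per_ha) / lb_per_bushel


# Example: 3000 kg/ha of soybeans with a 60 lb bushel is about 44.6 bu/ac.
soy_bu_ac = kg_ha_to_bu_ac(3000.0, 60.0)
```

Keeping the bushel weight as a parameter reflects that it differs by commodity, which is why the index describes unit weights as commodity-specific constants.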

Docs

  • index — catalogue of all source pages derived from in-package documentation files
  • README — primary user guide — pipeline diagram, CLI reference, data layout, Make targets, MLflow wiring, project layout
  • DESIGN — EARS-format pipeline contract — 35 design decisions covering config, paths, stages, MLflow, code style, features, predictions, delivery, and forecasting
  • TODO — backlog of open and completed refactoring tasks — cross-pipeline dependency violations, naming, structural debt, and area-imputation consolidation
  • experiments — short model experiment roadmap — five modelling ideas not yet attempted
  • features_README — feature-assembly orchestrator — build_features pipeline, assemble contract, fit/pred parquet split, metadata.json
  • features_builders_README — builder protocol contract — builders accept (path, cfg, years) and return a DataFrame keyed by (year, geo_identifier, init_date); registry dispatch
  • CLAUDE — in-package CLAUDE.md — single-line enforcement directive pointing all contributors to DESIGN.md
  • in_package_DOMAIN_MODEL — combined coverage of DOMAIN_MODEL.md, DOMAIN_MODEL2.md, schema.yaml, and gen_linkml_schema.py — ubiquitous language, entities, bounded contexts, slice abstraction

PRs (showboat documentation)

Most-recent first:

  • PR-372 — require forecast.residual_mode + gate forecast on run_dir compatibility
  • PR-369 — forecast multiple season_years per init_date
  • PR-363 — unbreak Streamlit dashboard startup after reference-data refactor
  • PR-361 — USE MAPIE Conformalise — multi-mode CalibrationResult with mode-keyed sidecars
  • PR-360 — BRAZIL SOY — replace evaluation.wasde_path with reference_data discriminated union
  • PR-353 — rewrite domain model into canonical v2 structure (9 sections, LinkML schema gen)
  • PR-345 — S3 path support across predict stage + CLI (three-layer fix)
  • PR-340 — Dashboard — window-aware metrics, configurable truth source, generic vintage subset
  • PR-339 — 9-phase restructure — flatten inner src/, break import cycle, build canonical subpackage tree
  • PR-331 — populate weather_correction_bu_ac and add P90 bands to delivery CSVs
  • index — index of ingested PR source pages for commodity_hindcast — showboat documentation

Commits

  • index — index of commit-history source pages for the commodity_hindcast package
  • timeline — curated thematic timeline of commits touching commodity_hindcast (Apr 2026 – May 2026)