commodity_hindcast Wiki — Index¶
Auto-regenerated after P6; updated after P7 lint pass. Total pages: 109.
Plan¶
- IMPLEMENTATION_PLAN — master reference for building this wiki
Schema¶
- AGENTS — maintenance schema; read before reading or writing any wiki page
- log — append-only chronological record of ingests, queries, and lint passes
Synthesis¶
- overview — Top-down overview of the commodity_hindcast package — what it does, how it is structured, where the seams are
- thesis — Editorial thesis — what commodity_hindcast is good at, its central tensions, and the architectural facts that matter most
Domain model¶
- Domain model README — entry point and reading order for the commodity_hindcast formal domain model
- ENTITIES — canonical entity catalogue for the commodity_hindcast domain (~30 entities, 5 tiers)
- RELATIONSHIPS — inter-entity relationships in the commodity_hindcast domain (~70 relationships)
- AGGREGATES — domain aggregates and consistency boundaries in commodity_hindcast
- BOUNDED_CONTEXTS — DDD bounded contexts of the commodity_hindcast domain (~11 contexts + Mermaid context map)
- ER_DIAGRAM — Mermaid entity-relationship diagrams for the commodity_hindcast domain (three diagrams)
- delta_vs_existing — relation between commodity_hindcast_kb/domain_model and the in-package domain-modelling/
Entities¶
Tier 1 — Core domain¶
- Commodity — root discriminator for a crop being modelled — drives calendar, feature columns, yield units, and plausibility bounds
- SeasonYear — integer harvest-year label that anchors a crop season; primary grouping key across feature parquets, fold labels, and delivery CSVs
- InitDate — calendar date on which a within-season forecast is anchored; features are known up to init_date minus lag_days
- Region — geographic administrative unit at ADM0/ADM1/ADM2 level, identified by a canonical GeoIdentifier string
- Yield — scalar crop yield measurement or prediction; internally always kg/ha, converted to delivery units only at the delivery boundary
- Fold — walk-forward CV fold identified by a fold_label string; numeric labels encode a season-year test cutoff, while "production" denotes the no-holdout fit
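As a rough illustration of two Tier 1 conventions listed above (fold labels and the InitDate feature cutoff), here is a minimal sketch. The helper names `parse_fold_label` and `feature_cutoff` are hypothetical and do not come from the package; the sketch also assumes a numeric label is a plain integer string and that `lag_days` is a whole number of days.

```python
from __future__ import annotations

from datetime import date, timedelta


def parse_fold_label(fold_label: str) -> int | None:
    """Return the test-cutoff season year, or None for the production fit.

    Hypothetical helper: assumes numeric labels are plain integer strings.
    """
    if fold_label == "production":
        return None  # no-holdout fit, so there is no test cutoff
    return int(fold_label)  # raises ValueError for any other label


def feature_cutoff(init_date: date, lag_days: int) -> date:
    """Last date whose observations count as known at init_date (assumed rule)."""
    return init_date - timedelta(days=lag_days)


print(parse_fold_label("2019"))        # 2019
print(parse_fold_label("production"))  # None
print(feature_cutoff(date(2024, 7, 15), lag_days=5))  # 2024-07-10
```

The pages for Fold and InitDate remain the authority on the actual label format and lag handling.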
Tier 2 — Configuration aggregates¶
- ExperimentConfig — root pydantic-settings aggregate — resolves all paths, holds every subordinate config block, and is passed as the single config argument to every pipeline stage
- CommodityConfig — commodity-specific constants — crop calendar, builder registry, feature/target column names, unit weights, and plausibility bounds
- ModelConfig — selects the detrend strategy and regression estimator for the FIT stage, plus sample-weight and fit-aggregation settings
- FeatureBuilderConfig — discriminated-union config for the five feature data-source readers — YieldsBuilder, WeatherBuilder, ClimoBuilder, NDVIBuilder, StressBuilder
- ReferenceYieldSpec — discriminated union over external reference-yield loader specifications — WasdeRefSpec, ConabFinalRefSpec, ConabLevantamentoRefSpec
- BiasCorrectorConfig — configures the national-scale residual bias corrector fitted during the POSTPROCESS stage
- PostprocessConfig — configures the POSTPROCESS stage — bias correction and conformal calibration mode selection
- EditRuleConfig — discriminated union of Fellegi-Holt edit rules applied to raw survey data before feature assembly — detection rules and corrective operations
- ExperimentProtocolConfig — walk-forward CV schedule — test years, expanding-window strategy, and production-fold county-selection thresholds
- ForecastConfig — forecast-time paths, mandatory residual_mode, and runtime-injected init_date — present only in forecast mode
- DeliveryConfig — delivery-phase configuration controlling client-facing CSV output — model name, CI bands, and post-transform flags for CI narrowing and frozen-tail removal
Tier 3 — Pipeline artefacts¶
- RunDir — the on-disk persistence root for one experiment run — timestamped directory under INPUT_DATA_DIR/runs/ holding every pipeline artefact, config snapshot, and delivery CSV
- ExperimentResult — frozen dataclass aggregate root for all hindcast artefacts under one run_dir — lazy handle to config, HindcastSlices, and ForecastSlices
- HindcastSlice — lazy per-fold artefact handle — frozen dataclass exposing paths and loaders for one walk-forward CV fold's detrender, regressor, fill values, and prediction parquets
- ForecastSlice — lazy per-(season_year, init_date) artefact handle — frozen dataclass for one in-season forecast, with paths isolated under run_dir/forecast/{season_year}/{init_date}/
- CalibrationResult — frozen, persistable dataclass holding fitted conformal half-widths — save/load via long-format parquet sidecars; exposes predict_interval for forecast and delivery consumers
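The "lazy handle" pattern shared by the Tier 3 artefacts can be sketched with a frozen dataclass that stores only keys and derives paths on demand. This is an illustrative stand-in, not the package's ForecastSlice: the class name `ForecastSliceSketch`, the ISO formatting of `init_date` in the path, and the file name `preds.parquet` are all assumptions.

```python
from dataclasses import dataclass
from datetime import date
from pathlib import Path


@dataclass(frozen=True)
class ForecastSliceSketch:
    """Hypothetical stand-in: holds identifying keys, computes paths lazily."""

    run_dir: Path
    season_year: int
    init_date: date

    @property
    def root(self) -> Path:
        # Layout taken from the index entry:
        # run_dir/forecast/{season_year}/{init_date}/
        # Rendering init_date as ISO 8601 is an assumption.
        return (
            self.run_dir
            / "forecast"
            / str(self.season_year)
            / self.init_date.isoformat()
        )

    @property
    def preds_path(self) -> Path:
        # "preds.parquet" is an invented file name, for illustration only.
        return self.root / "preds.parquet"


s = ForecastSliceSketch(Path("runs/example"), 2025, date(2025, 7, 1))
print(s.root.as_posix())  # runs/example/forecast/2025/2025-07-01
```

Because the dataclass is frozen and nothing is read at construction time, a handle like this is cheap to create and safe to share between stages.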
Tier 4 — Behavioural roles¶
- FeatureBuilder — BuilderFn Protocol — the callable contract every feature builder must satisfy; dispatched by string key from builders/registry.py
- Detrender — AbstractDetrend ABC — three concrete yield-trend detrenders (LinearState, GaussianWindow, PartialPooling) with a shared TREND_AXIS convention (epoch 1980-01-01, unit year)
- Regressor — AbstractRegressionImpl ABC — three concrete residual regressors (Ridge, PCA-Ridge, XGBoost) operating on detrended features; per-regressor persistence formats documented
- MetaModel — post-processing role layer covering the BiasCorrector hierarchy and conformal calibration (CalibrationResult, MapieConformalRegressor)
- ReferenceYieldLoader — BaseReferenceYieldLoader ABC — four concrete loaders (WasdeLoader, ConabFinalLoader, ConabLevantamentoLoader, plus NASS) dispatched via ReferenceYieldSpec discriminated union
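String-keyed dispatch of a callable contract, as described for FeatureBuilder, might look roughly like the sketch below. It assumes the `(path, cfg, years)` call shape mentioned in the features_builders_README entry; the decorator-based registration, the names `BuilderFnSketch`, `REGISTRY`, `register` and `run_builder`, and the use of plain dict rows instead of a DataFrame are illustrative choices, not the contents of builders/registry.py.

```python
from typing import Any, Callable, Dict, List, Protocol, Sequence

Row = Dict[str, Any]  # stand-in for one DataFrame row


class BuilderFnSketch(Protocol):
    """Hypothetical callable contract that every builder satisfies."""

    def __call__(
        self, path: str, cfg: Dict[str, Any], years: Sequence[int]
    ) -> List[Row]: ...


REGISTRY: Dict[str, BuilderFnSketch] = {}


def register(key: str) -> Callable[[BuilderFnSketch], BuilderFnSketch]:
    """Register a builder under a string key (assumed mechanism)."""

    def decorator(fn: BuilderFnSketch) -> BuilderFnSketch:
        REGISTRY[key] = fn
        return fn

    return decorator


@register("yields")
def build_yields(path: str, cfg: Dict[str, Any], years: Sequence[int]) -> List[Row]:
    # A real builder would read from `path`; this returns placeholder rows
    # carrying the three key columns named in the builder README entry.
    return [
        {"year": y, "geo_identifier": cfg["geo_identifier"], "init_date": None}
        for y in years
    ]


def run_builder(
    key: str, path: str, cfg: Dict[str, Any], years: Sequence[int]
) -> List[Row]:
    """Look up a builder by key and call it, failing loudly on unknown keys."""
    try:
        builder = REGISTRY[key]
    except KeyError:
        raise KeyError(
            f"unknown builder {key!r}; registered: {sorted(REGISTRY)}"
        ) from None
    return builder(path, cfg, years)


rows = run_builder("yields", "unused.parquet", {"geo_identifier": "X"}, [2020, 2021])
print(len(rows))  # 2
```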
Tier 5 — Delivery and validation¶
- HindcastDelivery — top-level delivery container for one commodity × ADM level × generated date; enforces no-duplicate-key and equal-init-date-count invariants across all rows
- DeliveryRow — single row in a hindcast/forecast delivery CSV; 17 typed fields enforcing CI ordering and init-date consistency via Pydantic validators
- Check — frozen preflight value object; name + passed + message + critical; run_preflight() raises SystemExit on any critical failure
- SeasonWindow — frozen dataclass on CommodityConfig defining a named temporal aggregation window in season-DOY coordinates — consumed by climo and weather builders
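The Check entry above describes a frozen value object with four fields and a `run_preflight()` that raises SystemExit on any critical failure. A minimal sketch of that behaviour follows. The field names come from the index entry, while the function signature, the return value, and the `path_exists_check` helper are assumptions.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Iterable, List


@dataclass(frozen=True)
class Check:
    """Outcome of one preflight gate (fields as listed in the index)."""

    name: str
    passed: bool
    message: str
    critical: bool = True


def path_exists_check(name: str, path: Path, critical: bool = True) -> Check:
    """Hypothetical helper that builds a Check from a path-existence test."""
    ok = path.exists()
    return Check(name, ok, f"{path} {'found' if ok else 'missing'}", critical)


def run_preflight(checks: Iterable[Check]) -> List[Check]:
    """Abort with SystemExit if any critical check failed (assumed signature)."""
    results = list(checks)
    failures = [c for c in results if c.critical and not c.passed]
    if failures:
        detail = "; ".join(f"{c.name}: {c.message}" for c in failures)
        raise SystemExit(f"preflight failed: {detail}")
    return results


ok = Check("features", True, "fit.parquet found")
warn = Check("plots", False, "optional directory missing", critical=False)
print(len(run_preflight([ok, warn])))  # 2, since the only failure is non-critical
```

In this sketch non-critical failures are simply returned with the passing checks so a caller can log them without stopping the run.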
Concepts¶
- adm_levels — the ADM0/ADM1/ADM2 geographic hierarchy — how geo_identifier encodes level, make_geo_identifier construction, and delivery aggregation from county to national
- bias_correction — national-scale coverage bias correction — AbstractBiasCorrector hierarchy, CoverageBiasCorrector formula, NoBiasCorrector pass-through, and per-fold persistence
- conformal_calibration — redirect to conformal_modes; quick-reference for CalibrationResult save/load symbols
- resolvable_path — the ResolvablePath type alias — an Annotated AnyPath that auto-resolves relative paths against data_root and drives automatic preflight coverage
- climo_materialisation — how climatology is materialised for forecast indices — materialise_forecast_indices, materialised_climo_filepath, and the long-range stub for years beyond zarr coverage
- conformal_modes — the four residual modes for conformal prediction intervals — recipes, dispatch, per-mode parquet sidecar layout, and calibration-set size vs bias trade-offs
- hindcast_vs_forecast — what hindcast and forecast are conceptually, how their pipelines differ, and where the read-only boundary between them is enforced
- input_data_dir_contract — how INPUT_DATA_DIR is the sole data-root resolver — DESIGN.md Clause 6, require_input_data_dir(), and the per-pipeline values
- mlflow_tracking — how the pipeline logs runs, params, and artefacts to MLflow — decorators, SQLite backend, artefact tagging, and the parallel-run DB-locking issue
- residual_modes — the forecast.residual_mode mandatory field — four invariant strings, the validate_residual_mode gate, and the no-backwards-compat architecture decision
- s3_path_safety — how the pipeline handles S3 URIs safely via AnyPath, ResolvablePath, and AnyPathParam — and the three-layer bug stack that PR-345 fixed
- walk_forward_cv — walk-forward cross-validation fold strategy — expanding window, ExpandingFoldGenerator, production fit, and how CV integrates with conformal calibration
- weather_correction — what weather_correction_bu_ac is — the detrended component of the model yield in delivery units — and why the structural identity beats a vintage delta
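To make the walk_forward_cv entry concrete, the sketch below generates expanding-window folds plus a production fold. It is not the package's ExpandingFoldGenerator: the function name `expanding_folds`, the rule "train on every year strictly before the test year", and the single-year test set are assumptions made for illustration.

```python
from typing import Dict, List, Sequence


def expanding_folds(
    all_years: Sequence[int], test_years: Sequence[int]
) -> Dict[str, Dict[str, List[int]]]:
    """Build expanding-window folds keyed by fold label (assumed scheme)."""
    years = sorted(set(all_years))
    folds: Dict[str, Dict[str, List[int]]] = {}
    for test_year in sorted(test_years):
        # Expanding window: the training set grows by one year per fold.
        train = [y for y in years if y < test_year]
        if not train:
            raise ValueError(f"no training years before {test_year}")
        folds[str(test_year)] = {"train": train, "test": [test_year]}
    # The production fold fits on everything and holds nothing out.
    folds["production"] = {"train": years, "test": []}
    return folds


folds = expanding_folds(range(2015, 2021), test_years=[2019, 2020])
print(folds["2019"]["train"])       # [2015, 2016, 2017, 2018]
print(folds["2020"]["train"])       # [2015, 2016, 2017, 2018, 2019]
print(folds["production"]["test"])  # []
```

Keeping each test year strictly after its training years is what makes the out-of-fold residuals usable for downstream calibration, since no fold sees its own target.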
Pipelines¶
In execution order:
- preflight — path-existence gates that abort a run at config-load time if required input artefacts are missing — five check-sets, one per pipeline stage
- feature_build — orchestrates the full feature engineering run — iterates configured builders, writes intermediate parquets, then calls assemble to produce fit.parquet, pred.parquet, and metadata.json
- fit — trains detrender then regressor on one fold's training DataFrame; persists detrender.pkl, feature_fill_values.parquet, and the regressor file(s); returns a HindcastSlice handle
- predict — inference kernel for one (season_year, init_date) pair — four-step inverse pipeline (detrend → score → weather-correct → retrend) split into pure-compute predict() and pure-persistence write_walk_forward_outputs()
- postprocess — aggregates walk-forward fold predictions to national level, fits bias corrector and conformal calibration per configured mode, and writes per-mode sidecar parquets plus postprocessed/national.parquet
- evaluate — computes per-fold metrics, writes text reports and CSV, and generates 15 diagnostic PNG plots from a completed hindcast run_dir
- deliver — converts CV-fold walk_forward_preds to QUBE-format ADM0/ADM1/ADM2 CSV files with unit conversion, CI bands, and Pydantic row validation
- forecast — per-(season_year, init_date) orchestrator composing feature build → predict → postprocess → deliver against an already-fitted run_dir
- multi_year_forecast — forecasting multiple season_years from one init_date: the long-range climo stub, panel trailing-median imputation, and why output collapses to trend-only beyond zarr coverage
- dashboard — Streamlit hindcast dashboard — run discovery via RunDescriptor, COMMODITY_CONFIG, window-aware MAPE scoring, and six interactive chart sections; reads RunDir artefacts directly with no API layer
Sources¶
Code subsystems¶
- index — index of all code subsystem source pages
- dashboard — Streamlit dashboard that reads RunDir artefacts directly and renders forecast evolution, accuracy, and WASDE-comparison charts for every commodity hindcast run on disk
- delivery — client-facing CSV delivery subsystem — Pydantic row schemas, wide-to-long conversion, ADM aggregation, unit conversion, and S3/local export pipeline
- detrend — detrending subsystem — three state-space detrenders (linear, Gaussian, partial-pooling) sharing a common abstract base and a single trend-axis convention
- diagnostics — metrics computation, per-fold scoring, and rolling national vs NASS/WASDE/CONAB text reports; ADM1/ADM2 error tables; bu/ac unit conversion boundary
- features — feature engineering subsystem — orchestrator, assembler, forecast splicing, long-range stub, and all concrete builders (yields, weather, climo, NDVI, stress)
- lib — survey of all ~25 modules under lib/ — path anchoring, calendar, unit conversion, geo utilities, reference-data loaders, edit-and-imputation, artefact handles, and MLflow tracking helpers
- meta_models — meta-models post-processing subsystem — BiasCorrector hierarchy, CalibrationResult (conformal calibration with save/load), four residual modes, and the run_meta_models stage orchestrator
- orchestration — orchestration and configuration subsystem — CLI entry, Pydantic config, walk-forward runner, preflight gates, fold generation
- plots — diagnostic plots subsystem — PlotRunner I/O orchestration, PlotRegistry/PlotSpec declarative discovery, prep modules, and all 11 plot-function modules
- regression — regression models subsystem — AbstractRegressionImpl base contract, Ridge, PCA+Ridge, and XGBoost regressors, persistence formats, and runtime helpers
- stages — stage modules under commodity_hindcast/stages/ — function signatures, artefact contracts, inner call sequences, and orchestrator flow for hindcast and forecast pipelines
Configs¶
- index — index of all config source pages for commodity_hindcast — experiment YAMLs, ty.toml, LinkML schema
- corn_usa — Corn (USA) experiment config — Apr–Oct season, PCA-Ridge with stress features, WASDE reference
- cotton_usa — Cotton (upland, USA) experiment config — Apr–Nov season, lbs/acre delivery, no stress builder
- soybeans_bra — Brazil soybean (soja) experiment config — southern-hemisphere cross-year season, IBGE yields, CONAB reference
- soybeans_usa — Soybean (USA) experiment config — May–Oct season, 60 lb bushel, no stress builder, local zarr paths
- wheat_usa — Winter wheat (USA) experiment config — cross-year Oct–Jul season, stress builder, 4-phase ramp
- ty_toml — ty type-checker configuration for commodity_hindcast — suppresses pydantic-settings false positives
- domain_schema — LinkML auto-generated schema for commodity_hindcast — structural class/enum catalogue from config.py
Docs¶
- index — catalogue of all source pages derived from in-package documentation files
- README — primary user guide — pipeline diagram, CLI reference, data layout, Make targets, MLflow wiring, project layout
- DESIGN — EARS-format pipeline contract — 35 design decisions covering config, paths, stages, MLflow, code style, features, predictions, delivery, and forecasting
- TODO — backlog of open and completed refactoring tasks — cross-pipeline dependency violations, naming, structural debt, and area-imputation consolidation
- experiments — short model experiment roadmap — five modelling ideas not yet attempted
- features_README — feature-assembly orchestrator — build_features pipeline, assemble contract, fit/pred parquet split, metadata.json
- features_builders_README — builder protocol contract — accept(path, cfg, years), return DataFrame keyed by (year, geo_identifier, init_date), registry dispatch
- CLAUDE — in-package CLAUDE.md — single-line enforcement directive pointing all contributors to DESIGN.md
- in_package_DOMAIN_MODEL — combined coverage of DOMAIN_MODEL.md, DOMAIN_MODEL2.md, schema.yaml, and gen_linkml_schema.py — ubiquitous language, entities, bounded contexts, slice abstraction
PRs (showboat documentation)¶
Most-recent first:
- PR-372 — require forecast.residual_mode + gate forecast on run_dir compatibility
- PR-369 — forecast multiple season_years per init_date
- PR-363 — unbreak Streamlit dashboard startup after reference-data refactor
- PR-361 — USE MAPIE Conformalise — multi-mode CalibrationResult with mode-keyed sidecars
- PR-360 — BRAZIL SOY — replace evaluation.wasde_path with reference_data discriminated union
- PR-353 — rewrite domain model into canonical v2 structure (9 sections, LinkML schema gen)
- PR-345 — S3 path support across predict stage + CLI (three-layer fix)
- PR-340 — Dashboard — window-aware metrics, configurable truth source, generic vintage subset
- PR-339 — 9-phase restructure — flatten inner src/, break import cycle, build canonical subpackage tree
- PR-331 — populate weather_correction_bu_ac and add P90 bands to delivery CSVs
- index — index of ingested PR source pages for commodity_hindcast — showboat documentation