Thesis on commodity_hindcast

The central design move

The most consequential architectural choice in commodity_hindcast is the strict separation of config and artefacts from all computation, enforced at both ends by the same mechanism: the filesystem. ExperimentConfig is fully validated before any stage starts; every stage writes output to a deterministic path under run_dir; every downstream stage reconstructs its world by reading from that path via ExperimentResult.from_run_dir(). No in-memory objects cross stage boundaries. The implication is that any stage can be re-run independently — FIT, POSTPROCESS, EVALUATE, and DELIVER are independently idempotent against the same run_dir. This design choice makes the pipeline resilient to partial failures without requiring a workflow orchestrator, and it makes the artefact tree self-describing enough that a downstream consumer (the dashboard, a QA script, the export gate) can reconstruct full context from a single run_dir path string.
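
The filesystem-as-interface idea above can be illustrated with a minimal sketch. The stage functions, file names, and JSON payloads below are hypothetical stand-ins, not the package's actual stages or artefact layout; the point is only that each stage writes to a deterministic path under run_dir and the next stage rebuilds its inputs from disk.

```python
import json
import tempfile
from pathlib import Path


def fit(run_dir: Path) -> Path:
    """Hypothetical FIT stage: writes its artefact to a deterministic path."""
    out = run_dir / "fit" / "model.json"
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps({"slope": 2.0}))
    return out


def evaluate(run_dir: Path) -> Path:
    """Hypothetical EVALUATE stage: reads only from disk, never from memory."""
    model = json.loads((run_dir / "fit" / "model.json").read_text())
    out = run_dir / "evaluate" / "metrics.json"
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps({"prediction_at_3": model["slope"] * 3}))
    return out


run_dir = Path(tempfile.mkdtemp())
fit(run_dir)
first = evaluate(run_dir).read_text()
# Re-running the stage against the same run_dir rewrites the same path
# with the same content, which is what makes the stage idempotent.
second = evaluate(run_dir).read_text()
```

Because evaluate() takes nothing but the run_dir path, it can be re-run after a partial failure without re-running fit(), provided the upstream artefact is already on disk.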

The second central choice is making INPUT_DATA_DIR the sole data-root resolver (DESIGN.md Clause 6). There is no data_root field in user-facing YAML configs, no cwd-relative fallback, and no silent default. Missing the env var raises a RuntimeError with an actionable message at config-load time, before any computation begins. Combined with the ResolvablePath + AnyPath abstraction for paths, this gives the pipeline a single, auditable, S3-transparent anchor for every data path across local dev, QA ECS tasks, and production runs.
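
A fail-fast resolver of this kind might look like the sketch below. The function name and the error message are assumptions for illustration, and plain pathlib.Path stands in for the AnyPath abstraction the package uses; only the INPUT_DATA_DIR variable name and the RuntimeError behaviour come from the description above.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_data_root(env: Optional[Mapping[str, str]] = None) -> Path:
    """Return the data root from INPUT_DATA_DIR, or fail with a clear message."""
    env = os.environ if env is None else env
    value = env.get("INPUT_DATA_DIR")
    if not value:
        # No cwd-relative fallback and no silent default: fail before any work starts.
        raise RuntimeError(
            "INPUT_DATA_DIR is not set. Export it before loading a config, "
            "for example: export INPUT_DATA_DIR=/path/to/repo_root"
        )
    return Path(value)
```

Passing the environment as an argument keeps the sketch testable without mutating os.environ.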

What this package gets right

Atomic config validation as the first line of defence. ExperimentConfig inherits from pydantic_settings.BaseSettings, and a chain of model_validator(mode="after") passes resolves every nested path, checks every cross-field constraint, and rejects invalid configs before a single row of data is read. The consequence is that runtime failures (the expensive kind that happen 45 minutes into a feature-build) are almost always caused by data issues rather than config issues. The config-as-aggregate design is documented explicitly in AGGREGATES.md with four rationale sections explaining why the config is one aggregate rather than seven.

Walk-forward CV that mirrors production scoring exactly. The hindcast loop fits a fresh model for each test year, scores it at the same init-dates used in forecast mode, and persists identical artefacts. Because the production fold uses the exact same code path as the CV folds, the OOS residuals used for conformal calibration are drawn from predictions made under conditions that genuinely mirror production — no information leakage, no held-out-data divergence. This is the correct way to build calibration curves for in-season agricultural forecasting, and the package does it by construction.
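
The fold structure of a walk-forward scheme can be sketched as follows. This is an illustration of the general technique, not the package's hindcast loop: the expanding training window and the min_train_years parameter are assumptions.

```python
from typing import Iterator, List, Sequence, Tuple


def walk_forward_folds(
    years: Sequence[int], min_train_years: int = 3
) -> Iterator[Tuple[List[int], int]]:
    """Yield (train_years, test_year) pairs; training years always precede the test year."""
    ordered = sorted(years)
    for i in range(min_train_years, len(ordered)):
        yield ordered[:i], ordered[i]


folds = list(walk_forward_folds(range(2015, 2021)))
for train_years, test_year in folds:
    # A fresh model would be fitted on train_years and scored on test_year here.
    assert max(train_years) < test_year
```

Each test year is predicted by a model that has never seen it, so the residuals collected across folds are out-of-sample by construction.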

PR documentation as first-class architecture record. PRs #339, #345, #360, #361, #369, and #372 each carry a detailed showboat that records not just what changed but why the previous design was wrong, what the data consequences were, and what was learned. The Brazil soy PR (#360) includes a before/after stage5_metrics.txt proving the fix resolved a silent factor-67 unit bug. The S3 path PR (#345) names the exact pathlib.Path(S3Path(...)) call that caused the QA failure and the 50-minute round-trip it cost. This pattern — treating the PR body as a permanent architecture record rather than a merge formality — is rare and valuable.

Discriminated unions for builders and reference-data specs. Both Builder (five variants: yields, weather, climo, ndvi, stress) and ReferenceYieldSpec (three variants: wasde, conab_final, conab_levantamento) use pydantic discriminated unions. Adding a new source requires writing a new variant and registering it; the calling code does not branch on a commodity string or a type name. The extra="forbid" policy on DeliveryRow enforces the same principle at the output boundary: an undeclared benchmark column from a mis-named ReferenceYieldSpec raises a ValidationError rather than silently vanishing.

ResolvablePath + AnyPath as a unified path abstraction. Every data path in the config resolves to an AnyPath at load time. All downstream consumers accept Path | AnyPath and pass paths to polars/pandas via str(path). The abstraction is complete enough that cli predict s3://…/<run_dir> works in production without any code path divergence, reducing the predict-stage debug loop from ~50 minutes to ~30 seconds (PR #345). See concepts/s3_path_safety.md.

Tensions and trade-offs

Two layering violations in delivery/. delivery/conversions.py imports conformal half-width helpers from stages/run_meta_models.py, and delivery/export.py imports preflight helpers from run/preflight.py. Both create upward edges from the Delivery bounded context (near the output boundary) into the Experiment orchestration layer (near the top of the import DAG), which violates the single-direction rule in DESIGN.md Clause 19. The correct home for the conformal helpers is lib/ (a new lib/conformal/ module); the correct home for the export preflight helper is either lib/ or inline in the delivery module. These are the only tracked layering violations in the package. They are not subtle (both are named explicitly in BOUNDED_CONTEXTS.md), but they have not been fixed because the helpers work correctly in place. The risk is that any test for the Delivery context transitively imports stages/, which drags the full training stack into the test fixture.
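
One way to keep such upward edges from spreading is a small import check over the delivery/ sources. The sketch below uses the standard-library ast module; the assumed import name commodity_hindcast, the forbidden-layer tuple, and the helper names in the sample source are illustrative, not taken from the codebase. Relative imports are skipped for brevity.

```python
import ast
from typing import List, Tuple

PACKAGE = "commodity_hindcast"  # assumed import name of the package
FORBIDDEN_LAYERS: Tuple[str, ...] = ("stages", "run")  # layers Delivery should not import


def layer_of(module: str) -> str:
    """Return the first path segment below the package root, e.g. 'stages'."""
    parts = module.split(".")
    if parts[0] == PACKAGE:
        parts = parts[1:]
    return parts[0] if parts else ""


def upward_imports(source: str) -> List[str]:
    """List absolutely imported modules that live in a forbidden layer."""
    hits: List[str] = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            modules = [node.module]
        else:
            continue
        hits.extend(m for m in modules if layer_of(m) in FORBIDDEN_LAYERS)
    return hits


# Hypothetical source text standing in for a delivery/ module.
sample = (
    "from commodity_hindcast.stages.run_meta_models import some_helper\n"
    "from commodity_hindcast.lib import units\n"
    "import json\n"
)
violations = upward_imports(sample)
```

Run over every file under delivery/ in a unit test, a check like this would let the two known violations be allow-listed explicitly while any new one fails the build.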

DESIGN.md vs code drift on postprocess output path. DESIGN.md Clause 34 documents the POSTPROCESS stage output as postprocessed/{experiment_key}_national.parquet. In the actual codebase the ForecastSlice.postprocessed_national_path property resolves to postprocessed/national.parquet (without the commodity key prefix) under the forecast sub-directory. The two paths coexist: the hindcast postprocessed path uses the commodity-keyed form; the forecast postprocessed path uses the bare form. A reader following DESIGN.md Clause 34 alone will not correctly predict the forecast artefact path. See domain_model/delta_vs_existing.md for the broader catalogue of design-doc vs code discrepancies.

Multi-year forecasting is a stub, not a complete implementation. PR #369 introduced forecast/{season_year}/{init_date}/ path nesting and the long-range climo/stress stubs (forecast_long_range_stub.py). For season_year values beyond the climatology zarr's temporal coverage, the stubs fill weather features with trailing medians, and the forecast output collapses to the trend model with no weather signal. This is a known, logged behaviour — three WARNING lines are emitted per stub call — but it is easy to miss that a multi-season forecast for years 2+ is trend-only. A future implementor of true long-range probabilistic weather will need to replace the stubs; in the meantime any downstream consumer that treats multi-year forecast CSVs as full-signal predictions is silently wrong. See pipelines/multi_year_forecast.md.

The _OOS_MODES frozenset is hand-maintained. stages/run_meta_models.py contains a frozenset of residual modes that require walk-forward CV fold artefacts to be present (hindcast_oos_per_init_date, hindcast_oos_per_year, hindcast_oos_fully_pooled). This set is not derived programmatically from the ResidualMode Literal defined in models/meta_models/types.py. If a new OOS mode is added to the Literal without updating the frozenset, conformal calibration will silently treat it as an in-sample mode and produce incorrect CI bands. The fix is to compute the set from the Literal at module load time, or add a test that asserts membership. See concepts/conformal_modes.md.

AGGREGATES.md has a stale claim on CalibrationResult. The aggregate page marks CalibrationResult as "transient": a value returned from stages/run_meta_models.py and not independently persisted. However, models/meta_models/conformalise.py:208 defines CalibrationResult.save(path) and CalibrationResult.load(path) methods that write to and read from run_dir/conformal/{mode}.parquet. The entity page in ENTITIES.md is correct and the aggregate page is out of date. Downstream readers should treat CalibrationResult as a persistent aggregate child.

The non-obvious facts a future contributor must know

INPUT_DATA_DIR for commodity_hindcast resolves to the repo root, not repo_root/data/. Config paths like data/nass/… and data/wasde/… work because the repo root contains a data symlink pointing to treefera-market-insights-commodity-hindcast/data. Setting INPUT_DATA_DIR to repo_root/data will cause all relative paths to resolve one level too deep and every preflight check will fail.
ConformalConfig does not exist as a Python class. The concept is real — it is documented in ENTITIES.md Tier 2 and in this wiki as a named entity — but the actual field is PostprocessConfig.conformalise: tuple[Literal[…], …], a plain tuple attribute on PostprocessConfig. There is no ConformalConfig Pydantic model; schema.yaml does not emit one.
There is one DeliveryRow class for all ADM levels. There are no ADM0Row, ADM1Row, or ADM2Row classes in delivery/schemas.py. The ADM level is inferred at runtime from the geo_identifier prefix (ADM0:…, ADM0:…/ADM1:…, ADM0:…/ADM1:…/ADM2:…). Any code or documentation that names three separate row classes is wrong.
No MetaModel ABC exists in code. The bias-corrector and conformaliser hierarchies are documented together as a conceptual "meta-model" layer in the domain model, but there is no AbstractMetaModel base class. AbstractBiasCorrector is a real ABC; apply_conformal is a module-level function. The two are unified only as a documentation construct.
The dashboard reads run-dir CSVs directly with no API layer. app/run_loader.py uses ExperimentResult.from_run_dir() and reads delivery CSVs directly from disk. Any change to the delivery CSV schema or the run_dir layout is a breaking change for the dashboard with no interface contract to catch it.
Run-dir creation lives in stages/run_hindcast._create_run_root, not in run/runner.py. A reader following the RunRunner class in run/runner.py will not find where the timestamped directory is created. The actual creation is in stages/run_hindcast.py before the runner is invoked, as documented in entities/ExperimentConfig.md lifecycle step 3.
forecast.residual_mode is mandatory since PR #372. Older configs/ YAML files that pre-date PR #372 and include a forecast: block without residual_mode: will fail Pydantic validation at load time with a ValidationError. There is no default value and no migration shim. Any config written before 2026-05-05 that uses the forecast pipeline must be updated manually.
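
The prefix rule from the DeliveryRow item above (ADM0:…, ADM0:…/ADM1:…, ADM0:…/ADM1:…/ADM2:…) can be sketched as a small function. The function name, the validation behaviour, and the example identifier values are assumptions for illustration; the real inference lives in the delivery code and may differ.

```python
def adm_level(geo_identifier: str) -> int:
    """Infer the ADM level (0, 1 or 2) from the slash-separated prefix chain."""
    segments = geo_identifier.split("/")
    if len(segments) > 3:
        raise ValueError(f"too many ADM segments: {geo_identifier!r}")
    for depth, segment in enumerate(segments):
        prefix = f"ADM{depth}:"
        # Each segment must carry the prefix for its depth and a non-empty value.
        if not segment.startswith(prefix) or len(segment) == len(prefix):
            raise ValueError(f"malformed geo_identifier: {geo_identifier!r}")
    return len(segments) - 1
```

With made-up values, adm_level("ADM0:US") gives 0 and adm_level("ADM0:US/ADM1:IA") gives 1, while an identifier that starts at ADM1 raises ValueError.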

Where the wiki and the in-package docs disagree

The following discrepancies are catalogued in detail in domain_model/delta_vs_existing.md:

CalibrationResult persistence — AGGREGATES.md labels it transient; ENTITIES.md (and the source code) document save/load methods. Treat the entity page as authoritative.
NassSpec phantom name — present in DOMAIN_MODEL.md ubiquitous-language glossary (implicitly); does not exist in code. NASS yield is a builder (YieldsBuilder), not a reference spec. The kb model corrects this.
ADM0Row/ADM1Row/ADM2Row phantom classes — suggested by an informal reading of DOMAIN_MODEL.md; do not exist in delivery/schemas.py. One DeliveryRow class handles all levels.
Eleven vs seven/eight bounded contexts — in-package models list seven (DOMAIN_MODEL.md) or eight (DOMAIN_MODEL2.md) contexts; this kb identifies eleven by splitting out Reference Data, Geo & Identifiers, and Dashboard.
DESIGN.md Clause 34 postprocess path — clause documents postprocessed/{experiment_key}_national.parquet; ForecastSlice uses postprocessed/national.parquet without the commodity key. Both forms exist in the codebase for hindcast and forecast paths respectively.

A few hypotheses worth testing

Conformal interval widths are dominated by the per_init_date residual structure. The default residual_mode is hindcast_oos_per_init_date, which calibrates separately for each day-of-season position. If residual variance is actually stable across init-dates (i.e. the per-year or pooled modes produce equally well-calibrated intervals), switching to hindcast_oos_fully_pooled would make the calibration more stable on short hindcast runs where early init-dates have only a few OOS residuals.
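
The small-sample concern can be made concrete with the standard split-conformal quantile rule, under which the half-width is the k-th smallest absolute residual with k = ceil((n + 1)(1 - alpha)), and the interval is unbounded when k exceeds n. Whether the package uses exactly this rule is an assumption; the sketch only illustrates why per-init-date calibration is fragile when a group holds few residuals.

```python
import math
from typing import Sequence


def conformal_half_width(abs_residuals: Sequence[float], alpha: float = 0.1) -> float:
    """Split-conformal half-width at coverage 1 - alpha; inf if too few residuals."""
    n = len(abs_residuals)
    # Round before ceil to guard against floating-point noise such as 19.000000000000004.
    k = math.ceil(round((n + 1) * (1 - alpha), 9))
    if k > n:
        return math.inf
    return sorted(abs_residuals)[k - 1]


# Illustrative residuals: one early init-date with 3 OOS residuals vs a pooled set of 20.
per_init_date = conformal_half_width([1.0, 2.0, 3.0])
pooled = conformal_half_width([float(i) for i in range(1, 21)])
```

Under this rule a group needs at least 9 residuals to produce a finite 90% interval, so a per-init-date mode on a short hindcast either yields unbounded bands or has to fall back to some other estimate, whereas the pooled set stays finite.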

The selection bias correction is working against the conformal calibration. The known bug documented in project memory (the SBC formula matches QUBE but produces the wrong sign and magnitude, because weather features drop core corn-belt counties before training) means that the bias corrector is fitted on a systematically truncated county distribution. If this is the case, the conformal residuals are calibrated against biased predictions, and the interval coverage guarantees from CalibrationResult do not hold unconditionally. This is worth isolating by running with BiasCorrectorConfig(kind="none") on corn and comparing calibration curves.

The long-range stub output should be flagged in delivery CSVs. Currently a forecast for a season_year beyond zarr coverage produces CSVs that look structurally identical to full-signal forecasts. Adding a boolean is_trend_only flag or a dedicated model column value ("commodity_hindcast_trend_only") would let downstream consumers filter or discount these rows without having to inspect the run_dir logs.
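
A sketch of the proposed flag follows. ForecastRow is a hypothetical stand-in rather than the real DeliveryRow schema, last_covered_year is an assumed way of expressing the zarr's temporal coverage, and the yield values are made up.

```python
from dataclasses import dataclass, replace
from typing import List, Sequence


@dataclass(frozen=True)
class ForecastRow:
    """Hypothetical stand-in for a delivery row; not the real DeliveryRow schema."""
    season_year: int
    predicted_yield: float
    is_trend_only: bool = False


def flag_trend_only(rows: Sequence[ForecastRow], last_covered_year: int) -> List[ForecastRow]:
    """Mark rows whose season_year lies beyond the weather coverage."""
    return [replace(r, is_trend_only=r.season_year > last_covered_year) for r in rows]


rows = [ForecastRow(2026, 181.0), ForecastRow(2027, 183.5)]
flagged = flag_trend_only(rows, last_covered_year=2026)
```

A consumer could then filter on is_trend_only instead of inspecting WARNING lines in the run_dir logs.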

Maintenance recommendations

Fix the AGGREGATES.md transient/persistent claim on CalibrationResult — update the "Notes on the canonical entity list" section at the bottom of AGGREGATES.md to reflect that CalibrationResult has save/load methods and writes to run_dir/conformal/{mode}.parquet. A one-line edit.
Align DESIGN.md Clause 34 with the actual forecast postprocess path — either document both forms explicitly or note that the forecast sub-directory uses postprocessed/national.parquet without the commodity key prefix.
Replace the hand-maintained _OOS_MODES frozenset — derive it from the ResidualMode Literal at module load time, or add a regression test asserting that all Literal values except "in_sample_pooled" are members of the frozenset.
Add a PR-template prompt for wiki updates — large architectural PRs (#369, #372) required updating the wiki manually post-hoc. A checklist item asking "did this PR change a bounded context boundary, an artefact path, or a mandatory config field?" would catch wiki drift earlier.
Migrate conformal helpers from stages/run_meta_models.py to lib/: this resolves the delivery/conversions.py layering violation. The helpers are pure functions and have no stage-specific dependencies, so the migration is mechanical. The second tracked violation (delivery/export.py importing from run/preflight.py) needs the same treatment for the preflight helper.