PR #353 — docs(commodity_hindcast): rewrite domain model into canonical v2 structure

At a glance

Author: ai-tommytf
Merged: 2026-04-29
Branch: tl/domain-model-update-refactor
Net effect: Replaced the v1 DOMAIN_MODEL.md (disorganised, with attribute tables duplicated across the ER diagram and the Markdown tables) with a canonical DOMAIN_MODEL2.md (9 sections, single-place-per-fact), and added an auto-generated schema.yaml (LinkML, 42 classes, 21 enums) that is regenerated via domain-modelling/gen_linkml_schema.py.
Why this matters: The v2 domain model is now the authoritative editorial reference for every entity in the pipeline; the machine-readable schema is auto-discovered from the Python codebase and byte-stable across runs.

PR body (faithful extract)

The PR body is the v2 domain model document itself, reproduced almost verbatim. Key sections are quoted below.

Document structure (table of contents)

§1 Overview
§2 Glossary (single alphabetised table)
§3 Bounded Contexts
§4 Entity Catalogue
  §4.1 ExperimentConfig
  §4.2 CommodityConfig
  §4.3 Builder (value-object union)
  §4.4 ExperimentResult
  §4.5 HindcastSlice
  §4.6 ForecastSlice
  §4.7 HindcastDelivery → DeliveryRow
  §4.8 EditRule (value-object union)
  §4.9 Check
  §4.10 Other Value Objects (compact)
§5 Relationships
§6 Aggregates & Invariants
§7 Lifecycles
  §7.1 Pipeline DAG (linear)
  §7.2 Preflight gate
  §7.3 Fold transition
  §7.4 Mode switch (hindcast vs forecast)
  §7.5 EditRule fire-and-apply
§8 Scenarios (4 end-to-end walkthroughs)
§9 Open Questions
Appendix A — Mermaid ERD
Appendix B — LinkML schema
Appendix C — Change log

§1 Overview

The commodity hindcast pipeline produces yield predictions for agricultural commodities
(corn, soybean, wheat, cotton) at admin levels ADM0/1/2 (country / state / county).
It runs in two modes:

- **Hindcast** — walk-forward cross-validation across historic harvest years; output
  is the audit-grade time series shipped to clients.
- **Forecast** — single in-season prediction at a chosen `init_date`, reusing the
  production-trained model with spliced observed-plus-climatological weather.

Both modes share a single artefact tree on disk (`run_dir`) — the **only** hand-off
contract between pipeline stages. No in-memory objects cross stage boundaries.

§2 Glossary (selected entries)

| Term | Definition |
| --- | --- |
| `fold_label` | Filesystem key for a CV fold. Numeric year (`"2020"`) for walk-forward folds; literal `"production"` for the no-holdout fit. |
| `geo_identifier` | Canonical lowercase ADM path: `ADM0:usa/ADM1:{state}/ADM2:{county}`. The single canonical key. `lib/geo/identifiers.py`. |
| `init_date` | Calendar date on which a within-season forecast is issued. ISO `YYYY-MM-DD`. |
| `season_year` | Crop year label. `(commodity, season_year, season_doy)` locates any point in a growing season. |
| `season_doy` | Day-offset from a commodity's season start; can exceed 366 for cross-year crops (winter wheat). |
| `run_dir` | On-disk root of one experiment run. Sole hand-off contract between stages. |
| `walk-forward fold` | CV strategy: train on years `< test_year`, test on `test_year`. Expanding window. |
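
To make two of the glossary entries concrete, here is a minimal sketch of `geo_identifier` and `season_doy`. The helper names, the 0-based offset, and the 1 September season start are assumptions chosen for illustration; none of them is taken from `lib/geo/identifiers.py` or the pipeline code.

```python
from datetime import date


def make_geo_identifier(state: str, county: str) -> str:
    """Build the lowercase ADM path from the glossary (hypothetical helper)."""
    # Assumption: only the state/county values are lowercased; the ADM prefixes stay as-is.
    return f"ADM0:usa/ADM1:{state.lower()}/ADM2:{county.lower()}"


def season_doy(day: date, season_start: date) -> int:
    """Day offset from the season start (assumed 0-based in this sketch)."""
    return (day - season_start).days


# Hypothetical winter-wheat season starting 1 September 2022.
start = date(2022, 9, 1)
print(make_geo_identifier("Kansas", "Sumner"))
print(season_doy(date(2023, 9, 15), start))  # offset runs past 366 for a cross-year crop
```

Because the offset counts from the season start rather than from 1 January, a crop that spans two calendar years keeps a single monotonically increasing index.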

Unit convention (canonical): internal storage is always `yield_kg_ha`, `area_harvested_ha`, `production_kg`. Conversion to delivery units (`bu_acre` for grains, `lbs_ac` for cotton) happens only at the `delivery/` boundary. See `lib/unit_utils.py`.
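
The conversion at the delivery boundary can be sketched as below. This is an illustration under stated assumptions, not the contents of `lib/unit_utils.py`: the function names are hypothetical, and the bushel weights are the standard US values (56 lb for corn, 60 lb for soybean and wheat).

```python
# Physical constants.
LB_PER_KG = 2.20462262
ACRES_PER_HA = 2.47105381

# Standard US bushel weights in pounds (assumed to be what the pipeline uses).
LB_PER_BUSHEL = {"corn": 56.0, "soybean": 60.0, "wheat": 60.0}


def kg_ha_to_lbs_ac(yield_kg_ha: float) -> float:
    """Convert kg/ha to lb/acre (the delivery unit named for cotton)."""
    return yield_kg_ha * LB_PER_KG / ACRES_PER_HA


def kg_ha_to_bu_acre(yield_kg_ha: float, commodity: str) -> float:
    """Convert kg/ha to bushels/acre for a grain (hypothetical helper)."""
    return kg_ha_to_lbs_ac(yield_kg_ha) / LB_PER_BUSHEL[commodity]


print(round(kg_ha_to_bu_acre(10_000.0, "corn"), 1))
print(round(kg_ha_to_lbs_ac(1_000.0), 1))
```

Keeping every internal value in SI-style units and converting once, at the edge, means a bushel-weight mistake can only affect the final CSV and never the model inputs.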

§3 Bounded Contexts

| Context | Purpose | Owns |
| --- | --- | --- |
| Configuration | Validate + resolve every YAML setting | `ExperimentConfig`, `CommodityConfig`, `Builder`, `ModelConfig`, `PostprocessConfig`, `DeliveryConfig`, `ForecastConfig` |
| Feature Assembly | Read sources, apply edits, merge | Per-builder modules, `EditRule` system, `fit.parquet`, `pred.parquet` |
| Experiment | Walk-forward CV + production fit | `ExperimentResult`, `HindcastSlice`, detrender + regressor pickles |
| Postprocessing | Aggregate to ADM0, fit bias corrector, compute conformal CIs | bias_corrector, conformal half-widths |
| Delivery | Convert per-fold predictions into validated client CSVs | `HindcastDelivery`, `DeliveryRow` |
| Forecast | Splice obs+climatology weather; predict single `init_date` | `ForecastSlice`, climo zarr |
| Tracking | Persist run metadata to MLflow | one MLflow run per pipeline invocation |
| Preflight | Gate every stage with declarative `Check`s | `Check`, `run_preflight()` |

§4.7 DeliveryRow fields

| Group | Fields |
| --- | --- |
| Identity | `commodity`, `year`, `init_date`, `geo_identifier`, `variable`, `model` |
| Prediction | `mean` |
| Benchmarks | `nass_actual`, `nass_actual_area_weighted_all`, `nass_actual_prod_div_area_all`, `wasde_in_season` |
| Corrections | `weather_correction_bu_ac` |
| CI bands | `lower_{50,68,80,90,95}`, `upper_{50,68,80,90,95}` |
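
The brace shorthand in the CI bands row expands to ten concrete column names. A minimal sketch of that expansion, with a check for columns missing from a row, might look like this; `ci_columns` and `missing_ci_columns` are hypothetical names, and the column ordering is an assumption.

```python
from typing import Iterable, List, Mapping

# Coverage levels listed in the DeliveryRow table.
CI_LEVELS = (50, 68, 80, 90, 95)


def ci_columns(levels: Iterable[int] = CI_LEVELS) -> List[str]:
    """Expand lower_{...}/upper_{...} into explicit column names."""
    return [f"{side}_{level}" for level in levels for side in ("lower", "upper")]


def missing_ci_columns(row: Mapping[str, float]) -> List[str]:
    """Return the CI columns that a row (as a dict) does not provide."""
    return [name for name in ci_columns() if name not in row]


print(ci_columns())
print(missing_ci_columns({"lower_50": 150.0, "upper_50": 170.0}))
```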

§7.1 Pipeline DAG

features → hindcast: walk-forward → fit-production → postprocess → evaluate → deliver
                                                                 ↖ forecast mode
                                        fit-production → forecast-features → forecast-predict → postprocess

§7.3 Fold transition

fold_label is a string filesystem key, not a state machine:

"2010", "2011", … "2023"      → numeric walk-forward folds  (cutoff = date(year, 1, 1))
"production"                  → no-holdout fit (uses all data)
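
The mapping above, together with the glossary's walk-forward rule (train on years `< test_year`, test on `test_year`), can be sketched as follows. The function names are hypothetical, and returning `None` for the production cutoff is an assumption about how "uses all data" might be represented.

```python
from datetime import date
from typing import List, Optional, Tuple


def fold_cutoff(fold_label: str) -> Optional[date]:
    """Map a fold_label to its training cutoff; None means no holdout."""
    if fold_label == "production":
        return None
    # Numeric labels are harvest years; the cutoff is 1 January of that year.
    return date(int(fold_label), 1, 1)


def walk_forward_split(years: List[int], fold_label: str) -> Tuple[List[int], List[int]]:
    """Expanding-window split: earlier years train, the labelled year tests."""
    if fold_label == "production":
        return list(years), []
    test_year = int(fold_label)
    train = [y for y in years if y < test_year]
    test = [y for y in years if y == test_year]
    return train, test


years = list(range(2005, 2024))
print(fold_cutoff("2020"))
print(walk_forward_split(years, "2010"))
```

Because the label is only a string key, nothing in this sketch tracks state between folds; each fold can be derived from its label alone.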

§9 Open Questions

| Topic | Question |
| --- | --- |
| `marketing_year` | Should collapse into `season_year` once WASDE alignment is reworked. |
| Layering tech-debt | `delivery/conversions.py` imports conformal helpers from `stages/run_meta_models.py`; both belong in `lib/`. |
| Forecast resumption | Walk-forward has no per-fold checkpoint — restart re-does all earlier folds. Worth the complexity? |
| Custom exception hierarchy | Only stdlib `ValueError`/`SystemExit` used today; typed hierarchy useful for downstream catching? |

Appendix B — LinkML schema (key facts)

Generator: domain-modelling/gen_linkml_schema.py (auto-discovery via pkgutil.walk_packages + inspect).
Regeneration: bash domain-modelling/regen_schema.sh → 42 classes, 21 enums, 0 import failures, ~5 s, byte-stable.
Literal[...] annotations are promoted to inline LinkML enums.
Sub-packages skipped at schema generation time: app/, tests/, scripts/.
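
The promotion of `Literal[...]` annotations to enums can be illustrated with the standard `typing` introspection helpers. This is a sketch of the general technique only, not the logic of `gen_linkml_schema.py`: `ExampleConfig`, `literal_enums`, and the enum naming scheme are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Literal, get_args, get_origin, get_type_hints


@dataclass
class ExampleConfig:
    """Hypothetical config class standing in for a discovered pipeline class."""
    mode: Literal["hindcast", "forecast"]
    n_folds: int


def literal_enums(cls: type) -> Dict[str, List[str]]:
    """Collect Literal-typed fields as {enum_name: permissible values}."""
    enums: Dict[str, List[str]] = {}
    for field_name, hint in get_type_hints(cls).items():
        if get_origin(hint) is Literal:
            # Assumed naming scheme; the real generator may name enums differently.
            enums[f"{cls.__name__}_{field_name}_enum"] = list(get_args(hint))
    return enums


print(literal_enums(ExampleConfig))
```

Iterating fields in declaration order and emitting values in annotation order is one way such a generator could stay byte-stable across runs.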

Files / lines touched

| Additions | Deletions | File |
| --- | --- | --- |
| +519 | -561 | `market_insights_models/src/commodity_hindcast/domain-modelling/DOMAIN_MODEL.md` |
| +728 | -0 | `market_insights_models/src/commodity_hindcast/domain-modelling/DOMAIN_MODEL2.md` |
| +372 | -0 | `market_insights_models/src/commodity_hindcast/domain-modelling/gen_linkml_schema.py` |
| +23 | -0 | `market_insights_models/src/commodity_hindcast/domain-modelling/regen_schema.sh` |
| +0 | -439 | `docs/s3_path_support_demo.md` (moved out of the showboat location) |
| +2 | -0 | `.gitignore` |
| +2 | -0 | `market_insights_models/src/commodity_hindcast/DESIGN.md` |

Cross-references

Related entity pages: ExperimentConfig, CommodityConfig, ExperimentResult, HindcastSlice, ForecastSlice, DeliveryRow, EditRuleConfig
Related concept pages: pipeline DAG, bounded contexts, unit conventions
Source document: market_insights_models/src/commodity_hindcast/domain-modelling/DOMAIN_MODEL2.md

Lessons captured

Every domain fact lives in one place in DOMAIN_MODEL2.md; other sections cross-reference rather than repeat.
The v2 doc is organised by editorial purpose (glossary, bounded contexts, entity catalogue, relationships, aggregates, lifecycles, scenarios, open questions) rather than by structure type (ER, class table, state diagram).
schema.yaml is auto-generated and gitignored by default; it is never hand-edited.
The source-of-truth split: Python source owns structure; DOMAIN_MODEL2.md owns editorial meaning (role, context, invariants, scenarios); schema.yaml is the machine-readable export.
The v2 document explicitly records §9 Open Questions — it is the canonical place to track unresolved design decisions.
The ForecastSlice.root path in §4.6 was made outdated by PR-369, which added the {season_year} level; the domain model should be updated when the entity pages are written.