PR #353 — docs(commodity_hindcast): rewrite domain model into canonical v2 structure
At a glance
- Author: ai-tommytf
- Merged: 2026-04-29
- Branch: tl/domain-model-update-refactor
- Net effect: Replaced the v1 DOMAIN_MODEL.md (disorganised, with attribute tables duplicated across ER and Markdown) with a canonical DOMAIN_MODEL2.md (9 sections, single-place-per-fact) and added an auto-generated schema.yaml (LinkML, 42 classes, 21 enums), regenerated via domain-modelling/gen_linkml_schema.py.
- Why this matters: The v2 domain model is now the authoritative editorial reference for every entity in the pipeline; the machine-readable schema is auto-discovered from the Python codebase and byte-stable across runs.
PR body (faithful extract)
The PR body is the v2 DOMAIN_MODEL.md itself — a comprehensive document reproduced almost verbatim. Key sections are quoted below.
Document structure (table of contents)
§1 Overview
§2 Glossary (single alphabetised table)
§3 Bounded Contexts
§4 Entity Catalogue
§4.1 ExperimentConfig
§4.2 CommodityConfig
§4.3 Builder (value-object union)
§4.4 ExperimentResult
§4.5 HindcastSlice
§4.6 ForecastSlice
§4.7 HindcastDelivery → DeliveryRow
§4.8 EditRule (value-object union)
§4.9 Check
§4.10 Other Value Objects (compact)
§5 Relationships
§6 Aggregates & Invariants
§7 Lifecycles
§7.1 Pipeline DAG (linear)
§7.2 Preflight gate
§7.3 Fold transition
§7.4 Mode switch (hindcast vs forecast)
§7.5 EditRule fire-and-apply
§8 Scenarios (4 end-to-end walkthroughs)
§9 Open Questions
Appendix A — Mermaid ERD
Appendix B — LinkML schema
Appendix C — Change log
§1 Overview
The commodity hindcast pipeline produces yield predictions for agricultural commodities
(corn, soybean, wheat, cotton) at admin levels ADM0/1/2 (country / state / county).
It runs in two modes:
- **Hindcast** — walk-forward cross-validation across historic harvest years; output
is the audit-grade time series shipped to clients.
- **Forecast** — single in-season prediction at a chosen `init_date`, reusing the
production-trained model with spliced observed-plus-climatological weather.
Both modes share a single artefact tree on disk (`run_dir`) — the **only** hand-off
contract between pipeline stages. No in-memory objects cross stage boundaries.
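The hand-off rule can be illustrated with a minimal sketch. The stage functions, the `features/summary.json` file, and its payload are hypothetical (the real pipeline writes artefacts such as fit.parquet); the only point is that the second stage receives a path and never an object produced by the first.

```python
import json
import tempfile
from pathlib import Path


def stage_features(run_dir: Path) -> None:
    """Hypothetical producer stage: writes its output under run_dir."""
    out = run_dir / "features"
    out.mkdir(parents=True, exist_ok=True)
    (out / "summary.json").write_text(json.dumps({"rows": 3}))


def stage_experiment(run_dir: Path) -> int:
    """Hypothetical consumer stage: reads only what is on disk."""
    payload = json.loads((run_dir / "features" / "summary.json").read_text())
    return payload["rows"]


# The two stages share nothing but the run_dir path.
with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp)
    stage_features(run_dir)
    rows_seen = stage_experiment(run_dir)
```

Because each stage depends only on files, a stage can in principle be re-run or inspected in isolation as long as its inputs exist under `run_dir`.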
§2 Glossary (selected entries)
| Term | Definition |
|---|---|
| fold_label | Filesystem key for a CV fold. Numeric year ("2020") for walk-forward folds; literal "production" for the no-holdout fit. |
| geo_identifier | Canonical lowercase ADM path: ADM0:usa/ADM1:{state}/ADM2:{county}. The single canonical key. lib/geo/identifiers.py. |
| init_date | Calendar date on which a within-season forecast is issued. ISO YYYY-MM-DD. |
| season_year | Crop year label. (commodity, season_year, season_doy) locates any point in a growing season. |
| season_doy | Day-offset from a commodity's season start; can exceed 366 for cross-year crops (winter wheat). |
| run_dir | On-disk root of one experiment run. Sole hand-off contract between stages. |
| walk-forward fold | CV strategy: train on years < test_year, test on test_year. Expanding window. |
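Two of the glossary keys are easy to show in code. The sketch below is illustrative only: the helper names are hypothetical and are not the API of lib/geo/identifiers.py, the 0-based offset for season_doy is an assumption, and the season start date is made up.

```python
from datetime import date


def make_geo_identifier(state: str, county: str) -> str:
    """Build the lowercase ADM path for a US county (hypothetical helper)."""
    return f"ADM0:usa/ADM1:{state.lower()}/ADM2:{county.lower()}"


def season_doy(day: date, season_start: date) -> int:
    """Day offset from season start; 0-based here, which is an assumption."""
    return (day - season_start).days


geo = make_geo_identifier("Iowa", "Story")

# Illustrative cross-year season: an offset measured from a start date in
# the previous calendar year can exceed 366.
offset = season_doy(date(2023, 9, 15), date(2022, 9, 1))
```

The second call shows why season_doy is not a calendar day-of-year: it keeps counting across the year boundary instead of resetting on 1 January.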
Unit convention (canonical): internal storage is always yield_kg_ha, area_harvested_ha, production_kg. Conversion to delivery units (bu_acre for grains, lbs_ac for cotton) happens only at the delivery/ boundary. lib/unit_utils.py.
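As a rough sketch of what a delivery-boundary conversion could look like: the function name and table below are hypothetical and are not taken from lib/unit_utils.py. The bushel weights are the standard US values (56 lb for corn, 60 lb for soybean and wheat), assumed here to match what the pipeline uses.

```python
# Standard conversion factors (not taken from lib/unit_utils.py).
LB_PER_KG = 2.2046226218
ACRES_PER_HA = 2.4710538147

# Standard US bushel weights in pounds; assumed to match the pipeline's choices.
LB_PER_BUSHEL = {"corn": 56.0, "soybean": 60.0, "wheat": 60.0}


def kg_ha_to_delivery(yield_kg_ha: float, commodity: str) -> float:
    """Convert internal kg/ha to bu/acre (grains) or lbs/acre (cotton)."""
    lbs_per_acre = yield_kg_ha * LB_PER_KG / ACRES_PER_HA
    if commodity == "cotton":
        return lbs_per_acre
    return lbs_per_acre / LB_PER_BUSHEL[commodity]


corn_bu_acre = kg_ha_to_delivery(10_000.0, "corn")      # about 159.3 bu/acre
cotton_lbs_ac = kg_ha_to_delivery(1_000.0, "cotton")    # about 892.2 lbs/acre
```

Keeping a single conversion function at the boundary means every upstream artefact can be compared in the same units, whatever the commodity.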
§3 Bounded Contexts
| Context | Purpose | Owns |
|---|---|---|
| Configuration | Validate + resolve every YAML setting | ExperimentConfig, CommodityConfig, Builder, ModelConfig, PostprocessConfig, DeliveryConfig, ForecastConfig |
| Feature Assembly | Read sources, apply edits, merge | Per-builder modules, EditRule system, fit.parquet, pred.parquet |
| Experiment | Walk-forward CV + production fit | ExperimentResult, HindcastSlice, detrender + regressor pickles |
| Postprocessing | Aggregate to ADM0, fit bias corrector, compute conformal CIs | bias_corrector, conformal half-widths |
| Delivery | Convert per-fold predictions into validated client CSVs | HindcastDelivery, DeliveryRow |
| Forecast | Splice obs+climatology weather; predict single init_date | ForecastSlice, climo zarr |
| Tracking | Persist run metadata to MLflow | one MLflow run per pipeline invocation |
| Preflight | Gate every stage with declarative Checks | Check, run_preflight() |
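The Preflight row describes declarative checks that gate a stage. One possible shape for that idea is sketched below. The `Check` fields, the `run_preflight_sketch` function, and the location of fit.parquet directly under run_dir are all assumptions for illustration; the real Check and run_preflight() signatures are not shown in this PR summary.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Callable


@dataclass(frozen=True)
class Check:
    """Hypothetical declarative check: a name plus a predicate on run_dir."""
    name: str
    predicate: Callable[[Path], bool]


def run_preflight_sketch(run_dir: Path, checks: list[Check]) -> list[str]:
    """Return the names of failed checks; an empty list means the gate passes."""
    return [c.name for c in checks if not c.predicate(run_dir)]


checks = [
    Check("run_dir exists", lambda d: d.is_dir()),
    # Assumed artefact location, for illustration only.
    Check("fit.parquet present", lambda d: (d / "fit.parquet").is_file()),
]

with tempfile.TemporaryDirectory() as tmp:
    failures = run_preflight_sketch(Path(tmp), checks)
```

Declaring checks as data keeps the gate logic identical for every stage; only the list of checks changes.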
§4.7 DeliveryRow fields
| Group | Fields |
|---|---|
| Identity | commodity, year, init_date, geo_identifier, variable, model |
| Prediction | mean |
| Benchmarks | nass_actual, nass_actual_area_weighted_all, nass_actual_prod_div_area_all, wasde_in_season |
| Corrections | weather_correction_bu_ac |
| CI bands | lower_{50,68,80,90,95}, upper_{50,68,80,90,95} |
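The brace shorthand in the CI bands row expands to ten columns. The sketch below spells them out and adds a nesting check. Both helper names are hypothetical, and the nesting rule (wider intervals contain narrower ones and the mean) is an assumed invariant, not one stated in the source.

```python
CI_LEVELS = (50, 68, 80, 90, 95)


def ci_band_columns(levels=CI_LEVELS) -> list[str]:
    """Expand lower_{...}/upper_{...} into explicit column names."""
    return [f"{side}_{lvl}" for side in ("lower", "upper") for lvl in levels]


def bands_are_nested(row: dict[str, float], levels=CI_LEVELS) -> bool:
    """Check lower_95 <= ... <= lower_50 <= mean <= upper_50 <= ... <= upper_95."""
    lowers = [row[f"lower_{lvl}"] for lvl in sorted(levels, reverse=True)]
    uppers = [row[f"upper_{lvl}"] for lvl in sorted(levels)]
    seq = lowers + [row["mean"]] + uppers
    return all(a <= b for a, b in zip(seq, seq[1:]))


columns = ci_band_columns()

# Made-up row: half-widths grow with the confidence level.
row = {"mean": 180.0}
for lvl in CI_LEVELS:
    row[f"lower_{lvl}"] = 180.0 - lvl / 10
    row[f"upper_{lvl}"] = 180.0 + lvl / 10

nested = bands_are_nested(row)
```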
§7.1 Pipeline DAG
features → hindcast: walk-forward → fit-production → postprocess → evaluate → deliver
Forecast mode (branching from fit-production):
fit-production → forecast-features → forecast-predict → postprocess
§7.3 Fold transition
fold_label is a string filesystem key, not a state machine:
"2010", "2011", … "2023" → numeric walk-forward folds (cutoff = date(year, 1, 1))
"production" → no-holdout fit (uses all data)
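The mapping from fold_label to cutoff, together with the expanding-window rule from the glossary (train on years < test_year), can be sketched as follows. The function names and the `first_test_year` parameter are hypothetical; only the label format and the `date(year, 1, 1)` cutoff come from the text above.

```python
from datetime import date
from typing import Iterator, Optional


def fold_cutoff(fold_label: str) -> Optional[date]:
    """Map a fold_label to its training cutoff; None means no holdout."""
    if fold_label == "production":
        return None
    return date(int(fold_label), 1, 1)


def walk_forward_folds(
    years: list[int], first_test_year: int
) -> Iterator[tuple[str, list[int], int]]:
    """Yield (fold_label, train_years, test_year) with an expanding window."""
    for test_year in sorted(y for y in years if y >= first_test_year):
        train_years = [y for y in years if y < test_year]
        yield str(test_year), train_years, test_year


folds = list(walk_forward_folds(list(range(2005, 2013)), first_test_year=2010))
```

Each later fold trains on strictly more years than the one before it, and the "production" label simply skips the holdout.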
§9 Open Questions
| Topic | Question |
|---|---|
| marketing_year | Should collapse into season_year once WASDE alignment is reworked. |
| Layering tech-debt | delivery/conversions.py imports conformal helpers from stages/run_meta_models.py; both belong in lib/. |
| Forecast resumption | Walk-forward has no per-fold checkpoint — restart re-does all earlier folds. Worth the complexity? |
| Custom exception hierarchy | Only stdlib ValueError/SystemExit used today; typed hierarchy useful for downstream catching? |
Appendix B — LinkML schema (key facts)
- Generator: domain-modelling/gen_linkml_schema.py (auto-discovery via pkgutil.walk_packages + inspect).
- Regeneration: bash domain-modelling/regen_schema.sh → 42 classes, 21 enums, 0 import failures, ~5 s, byte-stable.
- Literal[...] annotations are promoted to inline LinkML enums.
- Sub-packages skipped at schema generation time: app/, tests/, scripts/.
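The Literal-to-enum promotion might work along the following lines. This is a sketch, not the code in gen_linkml_schema.py: the `ForecastConfigSketch` class, the enum naming scheme, and the sorted iteration (one simple way to get deterministic output) are all assumptions. Only the `typing` helpers and the LinkML `permissible_values` key are real.

```python
from dataclasses import dataclass
from typing import Literal, get_args, get_origin, get_type_hints


@dataclass
class ForecastConfigSketch:
    """Hypothetical stand-in for a config class; not the real ForecastConfig."""
    mode: Literal["hindcast", "forecast"]
    horizon_days: int


def literal_enums(cls) -> dict[str, dict]:
    """Collect Literal[...] fields of cls as LinkML-style enum definitions."""
    enums: dict[str, dict] = {}
    # Sorting field names keeps the output order deterministic across runs.
    for field_name, annotation in sorted(get_type_hints(cls).items()):
        if get_origin(annotation) is Literal:
            enum_name = f"{cls.__name__}_{field_name}_enum"  # assumed naming
            enums[enum_name] = {
                "permissible_values": {str(v): {} for v in get_args(annotation)}
            }
    return enums


enums = literal_enums(ForecastConfigSketch)
```

Non-Literal fields such as `horizon_days` are skipped because `get_origin` returns `None` for plain types.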
Files / lines touched
| Additions | Deletions | File |
|---|---|---|
| +519 | -561 | market_insights_models/src/commodity_hindcast/domain-modelling/DOMAIN_MODEL.md |
| +728 | -0 | market_insights_models/src/commodity_hindcast/domain-modelling/DOMAIN_MODEL2.md |
| +372 | -0 | market_insights_models/src/commodity_hindcast/domain-modelling/gen_linkml_schema.py |
| +23 | -0 | market_insights_models/src/commodity_hindcast/domain-modelling/regen_schema.sh |
| +0 | -439 | docs/s3_path_support_demo.md (moved out of the showboat location) |
| +2 | -0 | .gitignore |
| +2 | -0 | market_insights_models/src/commodity_hindcast/DESIGN.md |
Lessons captured
- Every domain fact lives in one place in DOMAIN_MODEL2.md; other sections cross-reference rather than repeat.
- The v2 doc is organised by editorial purpose (glossary, bounded contexts, entity catalogue, relationships, aggregates, lifecycles, scenarios, open questions) rather than by structure type (ER, class table, state diagram).
- schema.yaml is auto-generated and gitignored by default; it is never hand-edited.
- The source-of-truth split: Python source owns structure; DOMAIN_MODEL2.md owns editorial meaning (role, context, invariants, scenarios); schema.yaml is the machine-readable export.
- The v2 document explicitly records §9 Open Questions; it is the canonical place to track unresolved design decisions.
- The ForecastSlice.root path in §4.6 is outdated by PR-369, which added the {season_year} level; the domain model should be updated when entity pages are written.