PR #353 — docs(commodity_hindcast): rewrite domain model into canonical v2 structure

At a glance

  • Author: ai-tommytf
  • Merged: 2026-04-29
  • Branch: tl/domain-model-update-refactor
  • Net effect: Replaced the v1 DOMAIN_MODEL.md (disorganised, attribute tables duplicated across ER and Markdown) with a canonical DOMAIN_MODEL2.md (9 sections, single-place-per-fact) and added an auto-generated schema.yaml (LinkML, 42 classes, 21 enums) regenerated via domain-modelling/gen_linkml_schema.py.
  • Why this matters: The v2 domain model is now the authoritative editorial reference for every entity in the pipeline; the machine-readable schema is auto-discovered from the Python codebase and byte-stable across runs.

PR body (faithful extract)

The PR body is the v2 DOMAIN_MODEL.md itself — a comprehensive document reproduced almost verbatim. Key sections are quoted below.

Document structure (table of contents)

§1 Overview
§2 Glossary (single alphabetised table)
§3 Bounded Contexts
§4 Entity Catalogue
  §4.1 ExperimentConfig
  §4.2 CommodityConfig
  §4.3 Builder (value-object union)
  §4.4 ExperimentResult
  §4.5 HindcastSlice
  §4.6 ForecastSlice
  §4.7 HindcastDelivery → DeliveryRow
  §4.8 EditRule (value-object union)
  §4.9 Check
  §4.10 Other Value Objects (compact)
§5 Relationships
§6 Aggregates & Invariants
§7 Lifecycles
  §7.1 Pipeline DAG (linear)
  §7.2 Preflight gate
  §7.3 Fold transition
  §7.4 Mode switch (hindcast vs forecast)
  §7.5 EditRule fire-and-apply
§8 Scenarios (4 end-to-end walkthroughs)
§9 Open Questions
Appendix A — Mermaid ERD
Appendix B — LinkML schema
Appendix C — Change log

§1 Overview

The commodity hindcast pipeline produces yield predictions for agricultural commodities
(corn, soybean, wheat, cotton) at admin levels ADM0/1/2 (country / state / county).
It runs in two modes:

- **Hindcast** — walk-forward cross-validation across historic harvest years; output
  is the audit-grade time series shipped to clients.
- **Forecast** — single in-season prediction at a chosen `init_date`, reusing the
  production-trained model with spliced observed-plus-climatological weather.

Both modes share a single artefact tree on disk (`run_dir`) — the **only** hand-off
contract between pipeline stages. No in-memory objects cross stage boundaries.
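
The "artefact tree as the only hand-off" rule can be illustrated with a minimal sketch. The stage names, the `features/summary.json` path, and its contents are hypothetical and are not taken from the pipeline; the point is only that each stage receives a `run_dir` path and nothing else.

```python
import json
import tempfile
from pathlib import Path


def stage_features(run_dir: Path) -> None:
    """Hypothetical upstream stage: writes its output under run_dir and returns nothing."""
    out = run_dir / "features"
    out.mkdir(parents=True, exist_ok=True)
    (out / "summary.json").write_text(json.dumps({"n_rows": 3}))


def stage_experiment(run_dir: Path) -> int:
    """Hypothetical downstream stage: reads only what the previous stage left on disk."""
    summary = json.loads((run_dir / "features" / "summary.json").read_text())
    return summary["n_rows"]


with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp)
    stage_features(run_dir)               # no return value crosses the stage boundary
    n_rows = stage_experiment(run_dir)    # everything it needs comes from run_dir
```

Because the stages share no in-memory state, either one could be re-run in a fresh process against the same `run_dir`.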

§2 Glossary (selected entries)

| Term | Definition |
| --- | --- |
| fold_label | Filesystem key for a CV fold. Numeric year ("2020") for walk-forward folds; literal "production" for the no-holdout fit. |
| geo_identifier | Canonical lowercase ADM path: ADM0:usa/ADM1:{state}/ADM2:{county}. The single canonical key. lib/geo/identifiers.py. |
| init_date | Calendar date on which a within-season forecast is issued. ISO YYYY-MM-DD. |
| season_year | Crop year label. (commodity, season_year, season_doy) locates any point in a growing season. |
| season_doy | Day-offset from a commodity's season start; can exceed 366 for cross-year crops (winter wheat). |
| run_dir | On-disk root of one experiment run. Sole hand-off contract between stages. |
| walk-forward fold | CV strategy: train on years < test_year, test on test_year. Expanding window. |
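
Two of the glossary entries are easier to read with a concrete value. The sketch below is illustrative only: `make_geo_identifier` and `season_doy` are hypothetical helpers (the real logic lives in `lib/geo/identifiers.py` and elsewhere), the normalisation is assumed to be no more than strip-and-lowercase, and the day offset is assumed to be 0-based.

```python
from datetime import date


def make_geo_identifier(state: str, county: str) -> str:
    """Build an ADM path in the glossary's shape (assumed normalisation: strip + lowercase)."""
    return f"ADM0:usa/ADM1:{state.strip().lower()}/ADM2:{county.strip().lower()}"


def season_doy(season_start: date, day: date) -> int:
    """Day offset from the season start (assumed 0-based: the start day itself is 0)."""
    return (day - season_start).days


geo = make_geo_identifier("Iowa", "Story")

# A cross-year crop: a season assumed to start on 2022-09-01, observed on 2023-09-15.
offset = season_doy(date(2022, 9, 1), date(2023, 9, 15))
```

Here `geo` is `ADM0:usa/ADM1:iowa/ADM2:story` and `offset` is 379, which shows how a season that spans two calendar years pushes `season_doy` past 366.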

Unit convention (canonical): internal storage is always yield_kg_ha, area_harvested_ha, production_kg. Conversion to delivery units (bu_acre for grains, lbs_ac for cotton) happens only at the delivery/ boundary. lib/unit_utils.py.
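
As a worked example of the boundary conversion, the sketch below converts `yield_kg_ha` to bushels per acre and pounds per acre. The function names and the `LB_PER_BUSHEL` table are assumptions for illustration and are not the contents of `lib/unit_utils.py`; the bushel weights are the standard US values (56 lb for corn, 60 lb for soybean and wheat), which the source does not list.

```python
KG_PER_LB = 0.45359237        # exact: definition of the international pound
HA_PER_ACRE = 0.40468564224   # exact: international acre in hectares

# Standard US bushel weights in lb/bu (assumed here; not stated in the source).
LB_PER_BUSHEL = {"corn": 56.0, "soybean": 60.0, "wheat": 60.0}


def kg_ha_to_lb_ac(yield_kg_ha: float) -> float:
    """kg/ha -> lb/ac: convert mass to pounds, then area from hectares to acres."""
    return yield_kg_ha / KG_PER_LB * HA_PER_ACRE


def kg_ha_to_bu_ac(yield_kg_ha: float, commodity: str) -> float:
    """kg/ha -> bu/ac using the commodity's bushel weight."""
    return kg_ha_to_lb_ac(yield_kg_ha) / LB_PER_BUSHEL[commodity]


corn_bu_ac = kg_ha_to_bu_ac(10_000.0, "corn")   # roughly 159.3 bu/ac
cotton_lb_ac = kg_ha_to_lb_ac(1_000.0)          # roughly 892.2 lb/ac
```

Keeping the factors in one place and applying them only at the delivery boundary means every intermediate artefact can be compared without checking its units first.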

§3 Bounded Contexts

| Context | Purpose | Owns |
| --- | --- | --- |
| Configuration | Validate + resolve every YAML setting | ExperimentConfig, CommodityConfig, Builder, ModelConfig, PostprocessConfig, DeliveryConfig, ForecastConfig |
| Feature Assembly | Read sources, apply edits, merge | Per-builder modules, EditRule system, fit.parquet, pred.parquet |
| Experiment | Walk-forward CV + production fit | ExperimentResult, HindcastSlice, detrender + regressor pickles |
| Postprocessing | Aggregate to ADM0, fit bias corrector, compute conformal CIs | bias_corrector, conformal half-widths |
| Delivery | Convert per-fold predictions into validated client CSVs | HindcastDelivery, DeliveryRow |
| Forecast | Splice obs+climatology weather; predict single init_date | ForecastSlice, climo zarr |
| Tracking | Persist run metadata to MLflow | one MLflow run per pipeline invocation |
| Preflight | Gate every stage with declarative Checks | Check, run_preflight() |

§4.7 DeliveryRow fields

| Group | Fields |
| --- | --- |
| Identity | commodity, year, init_date, geo_identifier, variable, model |
| Prediction | mean |
| Benchmarks | nass_actual, nass_actual_area_weighted_all, nass_actual_prod_div_area_all, wasde_in_season |
| Corrections | weather_correction_bu_ac |
| CI bands | lower_{50,68,80,90,95}, upper_{50,68,80,90,95} |
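
The brace shorthand in the CI bands row expands to ten columns. The sketch below spells them out and adds a nesting check; `ci_band_columns` and `bands_are_nested` are hypothetical helpers, and the nesting rule (wider levels enclose narrower ones, with `mean` inside the 50% band) is an assumption about the data rather than a documented invariant.

```python
CI_LEVELS = (50, 68, 80, 90, 95)


def ci_band_columns(levels=CI_LEVELS):
    """Expand lower_{...}, upper_{...} into explicit column names."""
    return [f"{side}_{lvl}" for side in ("lower", "upper") for lvl in levels]


def bands_are_nested(row, levels=CI_LEVELS):
    """Assumed sanity check: each wider band contains the narrower one and the mean."""
    lowers = [row[f"lower_{lvl}"] for lvl in levels]  # should not increase with level
    uppers = [row[f"upper_{lvl}"] for lvl in levels]  # should not decrease with level
    return (
        all(a >= b for a, b in zip(lowers, lowers[1:]))
        and all(a <= b for a, b in zip(uppers, uppers[1:]))
        and lowers[0] <= row["mean"] <= uppers[0]
    )


columns = ci_band_columns()

# Made-up row: symmetric bands that widen by 2 units per level around mean = 170.
row = {"mean": 170.0}
for i, lvl in enumerate(CI_LEVELS, start=1):
    row[f"lower_{lvl}"] = 170.0 - 2 * i
    row[f"upper_{lvl}"] = 170.0 + 2 * i
nested = bands_are_nested(row)
```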

§7.1 Pipeline DAG

features → hindcast: walk-forward → fit-production → postprocess → evaluate → deliver
                                                                 ↖ forecast mode
                                        fit-production → forecast-features → forecast-predict → postprocess

§7.3 Fold transition

fold_label is a string filesystem key, not a state machine:

"2010", "2011", … "2023"      → numeric walk-forward folds  (cutoff = date(year, 1, 1))
"production"                  → no-holdout fit (uses all data)

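The mapping above, combined with the glossary's expanding-window definition (train on years < test_year), can be sketched as follows. `fold_cutoff` and `walk_forward_folds` are hypothetical names, and returning `None` for the "production" fold is an assumption about how "uses all data" might be represented.

```python
from datetime import date
from typing import Iterable, Iterator, List, Optional, Tuple


def fold_cutoff(fold_label: str) -> Optional[date]:
    """Numeric labels map to date(year, 1, 1); 'production' has no cutoff (assumed None)."""
    if fold_label == "production":
        return None
    return date(int(fold_label), 1, 1)


def walk_forward_folds(
    years: Iterable[int], first_test_year: int
) -> Iterator[Tuple[str, List[int], int]]:
    """Yield (fold_label, train_years, test_year) with an expanding training window."""
    ordered = sorted(set(years))
    for test_year in ordered:
        if test_year < first_test_year:
            continue
        train_years = [y for y in ordered if y < test_year]
        yield str(test_year), train_years, test_year


folds = list(walk_forward_folds(range(2008, 2012), first_test_year=2010))
# folds[0] is ("2010", [2008, 2009], 2010); folds[1] adds 2010 to the training set.
```

Keeping `fold_label` a plain string means the same value can serve as a directory name and as a lookup key without any extra encoding step.
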
§9 Open Questions

| Topic | Question |
| --- | --- |
| marketing_year | Should collapse into season_year once WASDE alignment is reworked. |
| Layering tech-debt | delivery/conversions.py imports conformal helpers from stages/run_meta_models.py; both belong in lib/. |
| Forecast resumption | Walk-forward has no per-fold checkpoint — restart re-does all earlier folds. Worth the complexity? |
| Custom exception hierarchy | Only stdlib ValueError/SystemExit used today; typed hierarchy useful for downstream catching? |

Appendix B — LinkML schema (key facts)

  • Generator: domain-modelling/gen_linkml_schema.py (auto-discovery via pkgutil.walk_packages + inspect).
  • Regeneration: bash domain-modelling/regen_schema.sh → 42 classes, 21 enums, 0 import failures, ~5 s, byte-stable.
  • Literal[...] annotations are promoted to inline LinkML enums.
  • Sub-packages skipped at schema generation time: app/, tests/, scripts/.
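
The Literal-to-enum promotion can be sketched with the standard `typing` helpers. This is not the code in `gen_linkml_schema.py`: `ExampleConfig`, the enum naming scheme, and the sorted output are assumptions, and package discovery via `pkgutil.walk_packages` is left out so the block stays self-contained.

```python
from dataclasses import dataclass
from typing import Literal, get_args, get_origin, get_type_hints


@dataclass
class ExampleConfig:  # hypothetical class, not from the codebase
    mode: Literal["hindcast", "forecast"]
    admin_level: Literal["ADM0", "ADM1", "ADM2"]
    n_folds: int


def literal_enums(cls) -> dict:
    """Collect Literal[...] annotations as LinkML-style enum definitions."""
    enums = {}
    # Sorting by attribute name keeps the output order deterministic across runs.
    for name, hint in sorted(get_type_hints(cls).items()):
        if get_origin(hint) is Literal:
            enum_name = f"{cls.__name__}_{name}_enum"  # assumed naming scheme
            enums[enum_name] = {
                "permissible_values": {str(value): {} for value in get_args(hint)}
            }
    return enums


enums = literal_enums(ExampleConfig)
```

Dumping `enums` under a top-level `enums:` key would produce the shape LinkML expects, with `n_folds` left out because it is not a `Literal`.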

Files / lines touched

| Additions | Deletions | File |
| --- | --- | --- |
| +519 | -561 | market_insights_models/src/commodity_hindcast/domain-modelling/DOMAIN_MODEL.md |
| +728 | -0 | market_insights_models/src/commodity_hindcast/domain-modelling/DOMAIN_MODEL2.md |
| +372 | -0 | market_insights_models/src/commodity_hindcast/domain-modelling/gen_linkml_schema.py |
| +23 | -0 | market_insights_models/src/commodity_hindcast/domain-modelling/regen_schema.sh |
| +0 | -439 | docs/s3_path_support_demo.md (moved out of the showboat location) |
| +2 | -0 | .gitignore |
| +2 | -0 | market_insights_models/src/commodity_hindcast/DESIGN.md |

Lessons captured

  • Every domain fact lives in one place in DOMAIN_MODEL2.md; other sections cross-reference rather than repeat.
  • The v2 doc is organised by editorial purpose (glossary, bounded contexts, entity catalogue, relationships, aggregates, lifecycles, scenarios, open questions) rather than by structure type (ER, class table, state diagram).
  • schema.yaml is auto-generated and gitignored by default; it is never hand-edited.
  • The source-of-truth split: Python source owns structure; DOMAIN_MODEL2.md owns editorial meaning (role, context, invariants, scenarios); schema.yaml is the machine-readable export.
  • The v2 document explicitly records §9 Open Questions — it is the canonical place to track unresolved design decisions.
  • The ForecastSlice.root path in §4.6 is out of date: PR-369 added the {season_year} level, so the domain model should be updated when entity pages are written.