Source: In-Package Domain Model

What it is

Four co-located files under market_insights_models/src/commodity_hindcast/domain-modelling/:

DOMAIN_MODEL2.md (v2, 865 lines, 2026-04-29) — the editorial domain model: ubiquitous language, bounded contexts, entity catalogue, relationships, aggregates, pipeline phases with stage-orchestrator cross-reference, key invariants, walk-through scenario, slice abstraction, AbstractSlice protocol, and package import DAG. This is the canonical human-maintained reference.
DOMAIN_MODEL.md (v1 curated rewrite, 729 lines) — an earlier version organised into 9 numbered sections (Overview, Glossary, Bounded Contexts, Entity Catalogue, Relationships, Aggregates & Invariants, Lifecycles, Scenarios, Open Questions). Still load-bearing: it adds Scenarios §8, Open Questions §9, and ExperimentConfig.from_yaml / run_dir_base nuances not duplicated in DOMAIN_MODEL2.md.
schema.yaml — auto-generated LinkML schema: 42 classes, 21 enums, 0 import failures. Machine-readable structural export. Regenerated via regen_schema.sh; never hand-edited.
gen_linkml_schema.py — the codebase-agnostic generator script. Walks every importable module under a package via pkgutil.walk_packages, discovers Pydantic BaseModel subclasses, dataclasses, and Enum subclasses through inspect introspection, and emits LinkML YAML via pyyaml. Deterministic: same source → byte-identical YAML.

Source-of-truth split (from DOMAIN_MODEL.md Appendix B):

Layer	Source of truth
Structure (classes, attributes, types, cardinality)	The Python source: `commodity_hindcast/**/*.py`
Editorial (role, bounded context, identifier, invariants, scenarios, glossary)	`DOMAIN_MODEL2.md` and `DOMAIN_MODEL.md`
Machine-readable export	`schema.yaml` (auto-generated, never hand-edited)

"If the Markdown and the YAML disagree about structure, the YAML is right (regenerate the doc); if they disagree about meaning, the Markdown is right (it's the only place meaning is asserted)."

Section-by-section summary

Ubiquitous language (DOMAIN_MODEL2.md §1)

Complete glossary drawn from DESIGN.md, READMEs, and canonical type names. Key groups:

Temporal vocabulary

Term	Meaning
`season_year`	Crop year label. Paired with `season_doy` + `commodity` to locate any point in a growing season.
`season_doy`	Integer day-offset from season start. Can exceed 366 for cross-year crops (winter wheat). Distinct from `calendar_doy`.
`init_date`	Specific calendar date on which a within-season forecast is issued. Features known up to `init_date − lag_days`.
`harvest_date`	Calendar date on which `season_doy == harvest_season_doy`. Marks end-of-harvest features for `fit.parquet`.
`gstd`	"growing-season-to-date" — accumulators that reset at season start. Never `ytd`.
`MonthDay`	Recurring calendar day `(month, day)`, year-free value object.
`SeasonWindow`	Named aggregation window `(name, sdoy_start, sdoy_end)`.

Geographic vocabulary

Term	Meaning
`geo_identifier`	Lowercase ADM path `ADM0:usa/ADM1:{state}/ADM2:{county}`. The one canonical identifier — no FIPS codes, no mixed case. `lib/geo/identifiers.py`.
`AggregationLevel`	`Literal["ADM0", "ADM1", "ADM2"]`.
`included_geo_identifiers`	`frozenset[str]` of modelled counties (top 95% by production). Required kwarg threaded through eval chain.
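
As a minimal sketch of how the canonical identifier could be validated, the snippet below reuses the regular expression quoted under "Notable claims" in this document. The helper names `parse_geo_identifier` and `infer_level` are hypothetical and are not taken from `lib/geo/identifiers.py`.

```python
import re
from typing import Literal, NewType

# Type-level alias only; at runtime a GeoIdentifier is a plain str.
GeoIdentifier = NewType("GeoIdentifier", str)
AggregationLevel = Literal["ADM0", "ADM1", "ADM2"]

# Pattern as quoted in this document's "Notable claims" section.
_GEO_PATTERN = re.compile(r"^ADM0:[a-z0-9]+(/ADM1:.+(/ADM2:.+)?)?$")


def parse_geo_identifier(value: str) -> GeoIdentifier:
    """Hypothetical helper: validate a raw string and tag it as a GeoIdentifier."""
    if _GEO_PATTERN.match(value) is None:
        raise ValueError(f"not a canonical geo_identifier: {value!r}")
    return GeoIdentifier(value)


def infer_level(geo: GeoIdentifier) -> AggregationLevel:
    """Hypothetical helper: derive the ADM level from the deepest prefix present."""
    if "/ADM2:" in geo:
        return "ADM2"
    if "/ADM1:" in geo:
        return "ADM1"
    return "ADM0"
```

Because the level is derived from the string itself, nothing needs to store it as a separate field; a FIPS-style value such as `"19169"` is rejected at parse time.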

Data-shape vocabulary

Term	Meaning
`Builder`	Plug-in module reading a source into a parquet keyed by `(year, geo_identifier, init_date)`.
`fit.parquet`	End-of-harvest features — one row per (geo, year), target known.
`pred.parquet`	In-season features — all init_dates, target unknown/lagged.
`walk_forward_preds.parquet`	County × init_date simulated yields for a single fold. Same schema across folds.
`fold_label`	CV fold identifier. Numeric year (`"2020"`) or literal `"production"`.
`run_dir`	On-disk artefact root. Sole hand-off contract between stages.
`metadata.json`	Sidecar carrying `index_cols`, `feature_cols`, `target_col`.
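
A small sketch of reading the `metadata.json` sidecar follows. Only the three key names come from the table above; the `FeatureMetadata` container and the `load_feature_metadata` function are hypothetical names chosen for illustration.

```python
import json
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class FeatureMetadata:
    """Hypothetical container for the three keys the sidecar carries."""

    index_cols: tuple[str, ...]
    feature_cols: tuple[str, ...]
    target_col: str


def load_feature_metadata(path: Path) -> FeatureMetadata:
    """Read the sidecar and fail loudly if any expected key is absent."""
    raw = json.loads(path.read_text(encoding="utf-8"))
    missing = {"index_cols", "feature_cols", "target_col"} - raw.keys()
    if missing:
        raise ValueError(f"metadata.json is missing keys: {sorted(missing)}")
    return FeatureMetadata(
        index_cols=tuple(raw["index_cols"]),
        feature_cols=tuple(raw["feature_cols"]),
        target_col=raw["target_col"],
    )
```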

Unit vocabulary (canonical)

Internal storage: yield_kg_ha, area_harvested_ha, production_kg. Delivery boundary only: yield_bu_ac, yield_lbs_ac. "Columns without a unit suffix are forbidden."
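
To illustrate the boundary conversion, here is a hedged sketch of converting between `yield_kg_ha` and `yield_bu_ac`. The function names are hypothetical; the bushel weight would presumably come from `CommodityConfig.bushel_weight_lbs`, and 56 lb (the standard US bushel weight for shelled corn) is used purely as an example value.

```python
KG_PER_LB = 0.45359237        # exact, by definition of the international pound
HA_PER_ACRE = 0.40468564224   # exact, by definition of the international acre


def kg_ha_to_bu_ac(yield_kg_ha: float, bushel_weight_lbs: float) -> float:
    """Convert an internal kg/ha yield to bushels per acre for delivery."""
    lbs_per_acre = yield_kg_ha * HA_PER_ACRE / KG_PER_LB
    return lbs_per_acre / bushel_weight_lbs


def bu_ac_to_kg_ha(yield_bu_ac: float, bushel_weight_lbs: float) -> float:
    """Convert a bushels-per-acre input to the internal kg/ha unit."""
    return yield_bu_ac * bushel_weight_lbs * KG_PER_LB / HA_PER_ACRE


CORN_BUSHEL_WEIGHT_LBS = 56.0  # example value, not read from the real config
one_bu_ac_in_kg_ha = bu_ac_to_kg_ha(1.0, CORN_BUSHEL_WEIGHT_LBS)
```

Keeping both directions in one place makes it easier to guarantee that conversion happens only at the input and output boundaries.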

Lifecycle vocabulary — four stages

Stage	Responsibility
FIT	Detrend → impute → regress → save artefacts + train_preds + walk_forward_preds. Zero metrics, zero plots.
POSTPROCESS	Aggregate to national → fit bias corrector → attach conformal intervals.
EVALUATE	Compute metrics, generate plots. Read-only consumer.
DELIVER	Emit client-facing CSV per ADM level from walk_forward_preds + postprocessed.
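
The POSTPROCESS row aggregates county yields to a national figure, and the invariants listed later forbid an unweighted mean for that step. A minimal sketch of an area-weighted aggregate is shown below; the function name and the tuple layout are assumptions made for illustration, not the package's API.

```python
from collections.abc import Sequence


def area_weighted_yield_kg_ha(rows: Sequence[tuple[float, float]]) -> float:
    """Aggregate (yield_kg_ha, area_harvested_ha) pairs into one yield.

    Equivalent to total production divided by total harvested area.
    """
    total_area_ha = sum(area for _, area in rows)
    if total_area_ha <= 0:
        raise ValueError("total harvested area must be positive")
    total_production_kg = sum(yield_ * area for yield_, area in rows)
    return total_production_kg / total_area_ha


# Two counties: a small high-yield one and a large low-yield one.
counties = [(10_000.0, 100.0), (5_000.0, 300.0)]
national = area_weighted_yield_kg_ha(counties)
```

With these example numbers the weighted result is 6250 kg/ha, whereas a naive mean of the two yields would give 7500 kg/ha and overstate national yield.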

Bounded contexts (DOMAIN_MODEL2.md §2, DOMAIN_MODEL.md §3)

Seven contexts:

Context	Aggregate Roots
Configuration	`ExperimentConfig`, `CommodityConfig`
Feature Assembly	`Builder` (+ registry), `metadata.json`
Experiment / Modelling	`ExperimentResult` (root) → `HindcastSlice`
Post-processing	`bias_corrector`, conformal intervals
Delivery	`HindcastDelivery` → `DeliveryRow`
Forecast	`ForecastSlice` under `ExperimentResult`
Tracking / Preflight	MLflow run, `Check`

Entity catalogue (DOMAIN_MODEL2.md §4.1–4.10, DOMAIN_MODEL.md §4)

ExperimentConfig (config.py, identity: experiment_name) The top-level settings root loaded via pydantic-settings (YAML > env > CLI). Has no run_dir field — only run_dir_base; run_dir is owned by ExperimentResult. Resolved data_root comes from INPUT_DATA_DIR env var via require_input_data_dir(). Invariants: experiment_name matches [a-zA-Z0-9_-]+; all ResolvablePath fields resolve under data_root; forecast.init_date required when forecast is set; output dirs created on resolve.

CommodityConfig (config.py:284, identity: commodity) Source of truth for all commodity-specific constants: calendar, builders, feature/target columns, unit weights. Key fields: season_start (MonthDay), harvest_season_doy, hindcast_init_season_doys, bushel_weight_lbs, delivery_unit, yield_range, feature_cols/target_col, builders dict, climo_windows/weather_windows.

Builder (config.py:243, discriminated union on type) Five variants: YieldsBuilder, WeatherBuilder, ClimoBuilder, NDVIBuilder, StressBuilder. Common base (BaseBuilderConfig, config.py:161): filepath, geo_id_col, required_for_pred_parquet.

ExperimentResult (lib/results/run_result.py:32, identity: run_dir) Aggregate root — frozen dataclass that lazily discovers fold artefacts on disk. Fields: config, hindcast_slices: tuple[HindcastSlice, ...], forecast_slices: tuple[ForecastSlice, ...], run_dir. Constructor: from_run_dir(run_dir). "No in-memory cache — readers are responsible for materialising." Invariant: if forecast_slices non-empty, must contain a production hindcast slice.

HindcastSlice (lib/results/results_slice.py:112, identity: (run_dir, fold_label)) Lazy handle to one fold's artefacts. fold_label is either a numeric year string ("2020") or the literal "production". Path layout:

{run_dir}/models/{experiment_key}/{fold_label}/
  detrender.pkl, feature_fill_values.parquet, model.{ridge|pca_ridge|xgboost}, bias_corrector.pkl (optional)
{run_dir}/preds/{experiment_key}/{fold_label}/
  train_preds.parquet, walk_forward_preds.parquet, year_data.parquet

cutoff property: date(int(fold_label), 1, 1) for numeric folds; sentinel for "production".
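
The path layout and the `cutoff` rule can be sketched as a small frozen dataclass. This is an illustrative stand-in, not the real `HindcastSlice`: the class name is hypothetical, `experiment_key` is passed in explicitly, and `date.max` is an assumed sentinel because the source does not say which sentinel the production fold uses.

```python
from dataclasses import dataclass
from datetime import date
from pathlib import Path

PRODUCTION = "production"


@dataclass(frozen=True)
class HindcastSliceSketch:
    """Lazy handle: holds identifiers only and derives paths on demand."""

    run_dir: Path
    experiment_key: str
    fold_label: str

    @property
    def models_dir(self) -> Path:
        return self.run_dir / "models" / self.experiment_key / self.fold_label

    @property
    def preds_dir(self) -> Path:
        return self.run_dir / "preds" / self.experiment_key / self.fold_label

    @property
    def walk_forward_preds_path(self) -> Path:
        return self.preds_dir / "walk_forward_preds.parquet"

    @property
    def cutoff(self) -> date:
        if self.fold_label == PRODUCTION:
            return date.max  # assumed sentinel; the real value is not documented here
        return date(int(self.fold_label), 1, 1)
```

Nothing is read from disk when the handle is constructed, which matches the description of slices as lazy handles.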

ForecastSlice (lib/results/results_slice.py:299, identity: (run_dir, commodity, season_year, init_date)) Lazy handle to one in-season forecast. Reuses the production HindcastSlice's model; splices its own indices + features. Path layout:

{run_dir}/forecast/{init_date}/
  indices.zarr, features/pred.parquet, postprocessed_{init_date}.parquet
  Treefera_{experiment_key}_{ADM}_Forecast_{init_date}.csv

Invariant: every ForecastSlice references the same production HindcastSlice; cannot exist without it.

HindcastDelivery → DeliveryRow (delivery/schemas.py) HindcastDelivery is a validated list[DeliveryRow] for a single commodity × ADM level. DeliveryRow is one CSV row: identity (commodity, year, init_date, geo_identifier, variable, model), prediction (mean), benchmarks (nass_actual, nass_actual_area_weighted_all, nass_actual_prod_div_area_all, wasde_in_season), corrections (weather_correction_bu_ac), CI bands (lower_{50,68,80,90,95}, upper_{50,68,80,90,95}). Invariants: CI ordering holds; no duplicate (year, init_date, geo_identifier); equal init_date count per (year, geo).
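
The CI-ordering invariant can be expressed as a single monotonicity check over the band columns named above. The sketch below works on a plain mapping rather than the real `DeliveryRow` model, and the function name is hypothetical.

```python
from collections.abc import Mapping

CI_LEVELS = (50, 68, 80, 90, 95)


def ci_ordering_holds(row: Mapping[str, float]) -> bool:
    """True when lower_95 <= ... <= lower_50 <= mean <= upper_50 <= ... <= upper_95."""
    ordered = (
        [row[f"lower_{level}"] for level in reversed(CI_LEVELS)]
        + [row["mean"]]
        + [row[f"upper_{level}"] for level in CI_LEVELS]
    )
    return all(left <= right for left, right in zip(ordered, ordered[1:]))
```

A validator on the real model would presumably raise instead of returning a boolean; returning a boolean keeps the sketch easy to test in isolation.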

EditRule (lib/edit_and_imputation/edit.py:361, discriminated union on kind) Four rule types: RatioEditRule, RangeEditRule, NullImputeRule, PanelNullImputeRule. Six operations: deductive_impute, clip, flag, drop, fail, panel_trailing_median. Rules apply sequentially in YAML order; fires accumulate in EditReport.

Check (run/preflight.py:20) Value object: (name, passed, message, critical). Critical failures abort via SystemExit. run_preflight() (run/preflight.py:42) iterates checks; non-critical fails log WARNING; critical fails log ERROR and raise SystemExit.
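
A hedged sketch of the preflight behaviour described above follows. The field names match the `Check` tuple, but the `run_preflight` signature and the choice to report every failing check before aborting are assumptions; the real function may abort on the first critical failure.

```python
import logging
from collections.abc import Iterable
from dataclasses import dataclass

logger = logging.getLogger("preflight")


@dataclass(frozen=True)
class Check:
    """Value object describing the outcome of one preflight check."""

    name: str
    passed: bool
    message: str
    critical: bool


def run_preflight(checks: Iterable[Check]) -> None:
    """Log every failing check, then abort if any of them was critical."""
    critical_failures = []
    for check in checks:
        if check.passed:
            continue
        if check.critical:
            logger.error("preflight %s failed: %s", check.name, check.message)
            critical_failures.append(check)
        else:
            logger.warning("preflight %s failed: %s", check.name, check.message)
    if critical_failures:
        raise SystemExit(1)
```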

Key invariants (DOMAIN_MODEL2.md §4.6, DOMAIN_MODEL.md §6)

13 named invariants:

1. Geo identifier: always lowercase ADM0:usa/ADM1:{state}/ADM2:{county}.
2. Temporal pairing: season_doy is only meaningful with (season_year, commodity); no crosses_year boolean.
3. Units: kg/ha internally everywhere; conversion only at input/output boundaries. Explicit _kg_ha / _ha / _kg / _bu_ac suffixes required.
4. Column name stability: the same quantity keeps the same column name across all stages (sim_yield_kg_ha end-to-end).
5. Unweighted mean of yields is forbidden: aggregates must weight by area.
6. CI ordering: lower_95 ≤ lower_90 ≤ … ≤ mean ≤ … ≤ upper_95.
7. Yield range: mean within per-commodity YIELD_RANGE bounds.
8. Fold consistency: every (year, geo) group in delivery has the same init_date count.
9. included_geo_identifiers: required kwarg at every level; never optional, never falls back to test-fold geo.
10. Config is pure data: no build_* factories or I/O on config classes.
11. Stage isolation: no stage module imports another stage's internals; ExperimentResult is a handle, never a container of computed data.
12. Forecast isolation: the forecast pipeline must not mutate canonical hindcast artefacts.
13. Atomic writes: every stage writes to temp then renames; has_X flags only flip after rename.
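
The atomic-write invariant follows a common write-to-temp-then-rename pattern. The sketch below shows one way to do it with the standard library; `atomic_write_bytes` is a hypothetical helper, not a function from the package.

```python
import os
import tempfile
from pathlib import Path


def atomic_write_bytes(target: Path, payload: bytes) -> None:
    """Write payload to a temp file beside target, then rename it into place."""
    target.parent.mkdir(parents=True, exist_ok=True)
    # Same directory as the target so the final rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(
        dir=target.parent, prefix=target.name, suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(payload)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_name, target)  # atomic when source and target share a filesystem
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

A reader that checks for the target file therefore sees either the previous complete version or the new complete version, never a partial write.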

Pipeline phases with stage-orchestrator cross-reference (DOMAIN_MODEL2.md §4.5)

Eight stage-orchestrator modules under stages/:

Module	Public entry	Role
`stages/run_features.py`	`preprocess_data`	feature-matrix preprocessor utility
`stages/run_fit.py`	`train`	per-fold FIT kernel
`stages/run_hindcast.py`	`run`, `fit_production`	walk-forward hindcast workflow
`stages/run_predict.py`	`predict`, `write_walk_forward_outputs`, `run_predict`	atomic point-in-time PREDICT kernel
`stages/run_forecast.py`	`run_features`, `run_predict`, `run`	forecast workflow
`stages/run_meta_models.py`	`postprocess_experiment`	POSTPROCESS
`stages/run_diagnostics.py`	`evaluate_experiment`	EVALUATE
`stages/run_deliver.py`	`deliver_experiment`	DELIVER hindcast CSVs

Slice abstraction (DOMAIN_MODEL2.md §7)

The slice invariant: "A slice is the single entry point for everything relevant to processing one addressable portion of the pipeline on disk. It exposes paths and loaders for every artefact involved — whether that artefact is unique to the slice or pointed-to from a shared location at run_dir/."

AbstractSlice protocol (lib/results/results_slice.py:73) is @runtime_checkable and exposes: run_dir, cutoff, features_fit_path, features_pred_path, walk_forward_preds_path, year_data_path, loaders (load_walk_forward_preds, load_year_data, load_model, load_detrender, load_feature_fill_values), bias_corrector_path, has_bias_corrector.
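
To show what `@runtime_checkable` buys a consumer, here is a trimmed sketch with only two of the protocol members. `SliceLike`, `DemoSlice`, and `describe` are hypothetical names; the real `AbstractSlice` exposes many more members.

```python
from dataclasses import dataclass
from datetime import date
from pathlib import Path
from typing import Protocol, runtime_checkable


@runtime_checkable
class SliceLike(Protocol):
    """Trimmed stand-in for AbstractSlice with two members only."""

    @property
    def run_dir(self) -> Path: ...

    @property
    def cutoff(self) -> date: ...


@dataclass(frozen=True)
class DemoSlice:
    """Satisfies SliceLike structurally; it does not inherit from it."""

    run_dir: Path
    cutoff: date


def describe(slice_: SliceLike) -> str:
    """Touches protocol-surface members only, so the wide annotation is safe."""
    return f"{slice_.run_dir.name}@{slice_.cutoff.isoformat()}"
```

Note that an `isinstance` check against a runtime-checkable protocol only verifies that the members exist, not that their types match, so it is a coarse guard rather than full validation.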

Ownership matrix:

Class	Cutoff identity	Owns (unique to slice)	Points at (shared, run-level)
`HindcastSlice`	`fold_label` (`"YYYY"` or `"production"`)	model_path, detrender, fill_values, bias_corrector, preds	`features_fit_path`, `features_pred_path`
`ForecastSlice`	`(season_year, init_date)`	indices_zarr, per-init features, walk_forward_preds, postprocessed_national, delivery CSVs	`training → ExperimentResult.production` model/detrender/fill-values

Feature matrix location: cfg.features_dir / {experiment_key} / (i.e. data_root/features/{experiment_key}/), not run_dir/features/. Forecast per-init splice lives inside the slice at run_dir/forecast/{init_date}/features/pred.parquet.

ExperimentResult.from_run_dir mental model: "A run_dir holds one ExperimentResult; that result contains an optional set of HindcastSlices (past cutoffs with ground truth — producers of trained artefacts) and an optional set of ForecastSlices (present/future cutoffs — consumers of those trained artefacts). Every slice satisfies the AbstractSlice protocol."
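
The "discover on disk" idea behind `from_run_dir` can be sketched by listing the directories that the documented path layouts imply. The function names below are hypothetical, and `experiment_key` is passed in rather than read from the saved config.

```python
from pathlib import Path


def discover_fold_labels(run_dir: Path, experiment_key: str) -> tuple[str, ...]:
    """List fold labels under models/{experiment_key}/, numeric years first."""
    models_root = run_dir / "models" / experiment_key
    if not models_root.is_dir():
        return ()
    labels = [p.name for p in models_root.iterdir() if p.is_dir()]
    numeric = sorted((label for label in labels if label.isdigit()), key=int)
    production = [label for label in labels if label == "production"]
    return tuple(numeric + production)


def discover_forecast_init_dates(run_dir: Path) -> tuple[str, ...]:
    """List init_date directory names under forecast/, sorted for determinism."""
    forecast_root = run_dir / "forecast"
    if not forecast_root.is_dir():
        return ()
    return tuple(sorted(p.name for p in forecast_root.iterdir() if p.is_dir()))


def forecast_invariant_holds(
    fold_labels: tuple[str, ...], init_dates: tuple[str, ...]
) -> bool:
    """Forecast slices may exist only alongside a production hindcast fold."""
    return not init_dates or "production" in fold_labels
```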

Package import DAG (DOMAIN_MODEL2.md §8)

Eight layers (root to leaf):

Layer	Contents
L1 root	`config.py` (`ExperimentConfig` + nested config classes)
L2 pure utilities	`lib/path_utils.py`, `lib/transform_utils.py`, `lib/unit_utils.py`
L3 cross-cutting helpers	`lib/tracking/`, `lib/reference_data/`, `lib/edit_and_imputation/`, `lib/geo/`
L4 aggregate-root + slices	`lib/results/` (`results_slice.py` → `run_result.py`)
L5 domain services	`features/`, `models/`, `delivery/`, `diagnostics/`
L6 execution-frame	`run/` (preflight, experiment_protocol, runner)
L7 stage orchestration	`stages/run_*.py`
L8 entry points (leaves)	`cli.py`, `app/`

Layers rule: "A module SHALL only import from layers closer to the root, with one explicit exception: stages/ modules MAY compose sibling stages/ modules." Runtime cycles are forbidden; if TYPE_CHECKING: blocks are permitted only where a signature would not otherwise resolve.

Cycle audit: the audit found no runtime cycles and two if TYPE_CHECKING: blocks, at cli.py:32-33 (ExperimentConfig) and lib/results/results_slice.py:32-37 (AbstractDetrend, AbstractRegressionImpl). Two known upward edges from delivery/ into the orchestrator layer are tracked as cleanup tech-debt (not cycles).
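
The layers rule lends itself to a mechanical check. The sketch below encodes the layer table as a prefix map and tests a single import edge; it is an illustration under stated assumptions, not the audit tooling the project uses. Module names are written relative to the `commodity_hindcast` package, and imports inside one package (for example within `lib/results/`) are outside what this sketch models.

```python
LAYERS: dict[str, int] = {
    "config": 1,
    "lib.path_utils": 2,
    "lib.transform_utils": 2,
    "lib.unit_utils": 2,
    "lib.tracking": 3,
    "lib.reference_data": 3,
    "lib.edit_and_imputation": 3,
    "lib.geo": 3,
    "lib.results": 4,
    "features": 5,
    "models": 5,
    "delivery": 5,
    "diagnostics": 5,
    "run": 6,
    "stages": 7,
    "cli": 8,
    "app": 8,
}
STAGES_LAYER = 7


def layer_of(module: str) -> int:
    """Resolve a dotted module name to its layer via the longest matching prefix."""
    matches = [p for p in LAYERS if module == p or module.startswith(p + ".")]
    if not matches:
        raise KeyError(f"no layer registered for {module!r}")
    return LAYERS[max(matches, key=len)]


def import_allowed(importer: str, imported: str) -> bool:
    """Allow edges toward the root, plus stages-to-stages composition."""
    src, dst = layer_of(importer), layer_of(imported)
    if dst < src:
        return True
    return src == dst == STAGES_LAYER
```

Under this check the edge from `delivery.conversions` to `stages.run_meta_models`, listed in the open questions as layering tech-debt, is reported as disallowed.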

Open questions (DOMAIN_MODEL.md §9)

Topic	Question
`marketing_year`	Should be collapsed into `season_year` — currently a parallel concept (WASDE Oct-Sep period).
Layering tech-debt	`delivery/conversions.py` imports conformal helpers from `stages/run_meta_models.py`; should move to `lib/`.
Forecast resumption	No per-fold checkpoint for hindcast walk-forward. Is per-fold resumption worth the complexity?
Custom exception hierarchy	Codebase uses only stdlib `ValueError`/`SystemExit`.
MLflow per-fold runs	Currently one MLflow run per pipeline invocation. Should folds be sub-runs?
Sub-commodity wheat	Preprocessor only emits `WHEAT`; sub-types in config are not produced.

Schema.yaml and gen_linkml_schema.py

schema.yaml is the machine-readable structural export: 42 classes, 21 enums. Regenerated by:

bash market_insights_models/src/commodity_hindcast/domain-modelling/regen_schema.sh

Sub-packages skipped during generation: app/ (Streamlit import side-effects), tests/, scripts/. Literal[...] annotations are promoted to inline LinkML enums.

gen_linkml_schema.py is a codebase-agnostic CLI tool (also shipped by the /domain-modeling skill at ~/.claude/skills/domain-modeling/scripts/gen_linkml_schema.py). CLI flags: --package, --skip, --set-env, --schema-id, --output. Determinism guaranteed: same source → byte-identical YAML via sorted keys.
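
The discovery step can be sketched with the standard library alone. The code below is not the real generator: `discover_classes` is a hypothetical function, it covers only dataclasses and `Enum` subclasses (Pydantic `BaseModel` detection is omitted because it needs a third-party import), and it emits no LinkML.

```python
import dataclasses
import enum
import importlib
import inspect
import pkgutil


def discover_classes(package_name: str) -> dict[str, list[str]]:
    """Walk a package and collect qualified names of dataclasses and enums."""
    package = importlib.import_module(package_name)
    module_names = [package_name] + [
        info.name
        for info in pkgutil.walk_packages(
            package.__path__, prefix=f"{package_name}.", onerror=lambda _name: None
        )
    ]
    found_dataclasses: set[str] = set()
    found_enums: set[str] = set()
    import_failures: list[str] = []
    for name in module_names:
        try:
            module = importlib.import_module(name)
        except Exception:
            import_failures.append(name)
            continue
        for _, obj in inspect.getmembers(module, inspect.isclass):
            if obj.__module__ != module.__name__:
                continue  # skip re-exports so each class is reported once
            qualified = f"{obj.__module__}.{obj.__qualname__}"
            if issubclass(obj, enum.Enum):
                found_enums.add(qualified)
            elif dataclasses.is_dataclass(obj):
                found_dataclasses.add(qualified)
    # Sorted output keeps repeated runs over the same source identical.
    return {
        "dataclasses": sorted(found_dataclasses),
        "enums": sorted(found_enums),
        "import_failures": sorted(import_failures),
    }
```

Running it against a standard-library package such as `http` is a quick way to try the sketch; sorting every list before returning mirrors the sorted-keys approach the generator relies on for byte-identical output.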

Downstream codegen from schema.yaml: JSON Schema (gen-json-schema), Markdown docs (gen-markdown), ER diagram (gen-erdiagram). Recommended CI: regen produces non-empty diff → fail the build.

Notable claims (the load-bearing ones)

GeoIdentifier is a NewType("GeoIdentifier", str) alias — not a class. At runtime it is a plain str matching ^ADM0:[a-z0-9]+(/ADM1:.+(/ADM2:.+)?)?$. The ADM level is inferred from the prefix when needed; never stored as a separate field.
ExperimentConfig has no run_dir field — only run_dir_base; run_dir is owned by ExperimentResult and constructed from run_dir_base / experiment_name / <timestamp>.
ExperimentResult is a frozen dataclass handle — it carries NO computed results; disk is the contract.
fold_label = "production" is the no-holdout fit — canonical string, not a numeric year.
AbstractSlice is @runtime_checkable — consumer type annotations should be widened to AbstractSlice only when the body touches protocol-surface members exclusively.
The canonical cutoff term is cutoff (from Nixtla/Prophet/FPP3), not window or split. For numeric folds, fold_label is literally the cutoff year; the "production" fold has no year and uses a sentinel cutoff.
schema.yaml must never be hand-edited — only regen_schema.sh writes it.
marketing_year is an acknowledged imperfect concept — tracked as an open question for collapse into season_year.
ForecastSlice cannot exist without a production HindcastSlice — this is an aggregate invariant.
Feature matrix lives at data_root/features/{experiment_key}/, not under run_dir — slices reach it via _load_config(run_dir).

What this document is NOT

The in-package domain model does not describe the CLI surface in full (that is README.md), the EARS-format design contracts (that is DESIGN.md), or the backlog (that is TODO.md). schema.yaml carries structure only, not editorial meaning.

Cross-references

DESIGN.md — EARS clauses that encode the invariants as requirements
README.md — operator guide including the authoritative run_dir layout
features_README.md — assemble orchestrator that writes fit.parquet, pred.parquet, metadata.json
features_builders_README.md — builder protocol satisfying the (year, geo_identifier, init_date) key contract