Source: In-Package Domain Model

What it is

Four co-located files under market_insights_models/src/commodity_hindcast/domain-modelling/:

  • DOMAIN_MODEL2.md (v2, 865 lines, 2026-04-29) — the editorial domain model: ubiquitous language, bounded contexts, entity catalogue, relationships, aggregates, pipeline phases with stage-orchestrator cross-reference, key invariants, walk-through scenario, slice abstraction, AbstractSlice protocol, and package import DAG. This is the canonical human-maintained reference.
  • DOMAIN_MODEL.md (v1 curated rewrite, 729 lines) — an earlier version organised into 9 numbered sections (Overview, Glossary, Bounded Contexts, Entity Catalogue, Relationships, Aggregates & Invariants, Lifecycles, Scenarios, Open Questions). Still load-bearing: it adds Scenarios §8, Open Questions §9, and ExperimentConfig.from_yaml / run_dir_base nuances not duplicated in DOMAIN_MODEL2.md.
  • schema.yaml — auto-generated LinkML schema: 42 classes, 21 enums, 0 import failures. Machine-readable structural export. Regenerated via regen_schema.sh; never hand-edited.
  • gen_linkml_schema.py — the codebase-agnostic generator script. Walks every importable module under a package via pkgutil.walk_packages, discovers Pydantic BaseModel subclasses, dataclasses, and Enum subclasses through inspect introspection, and emits LinkML YAML via pyyaml. Deterministic: same source → byte-identical YAML.

Source-of-truth split (from DOMAIN_MODEL.md Appendix B):

| Layer | Source of truth |
| --- | --- |
| Structure (classes, attributes, types, cardinality) | The Python source — commodity_hindcast/**/*.py |
| Editorial (role, bounded context, identifier, invariants, scenarios, glossary) | DOMAIN_MODEL2.md and DOMAIN_MODEL.md |
| Machine-readable export | schema.yaml (auto-generated, never hand-edited) |

"If the Markdown and the YAML disagree about structure, the YAML is right (regenerate the doc); if they disagree about meaning, the Markdown is right (it's the only place meaning is asserted)."

Section-by-section summary

Ubiquitous language (DOMAIN_MODEL2.md §1)

Complete glossary drawn from DESIGN.md, READMEs, and canonical type names. Key groups:

Temporal vocabulary

| Term | Meaning |
| --- | --- |
| season_year | Crop year label. Paired with season_doy + commodity to locate any point in a growing season. |
| season_doy | Integer day-offset from season start. Can exceed 366 for cross-year crops (winter wheat). Distinct from calendar_doy. |
| init_date | Specific calendar date on which a within-season forecast is issued. Features known up to init_date − lag_days. |
| harvest_date | Calendar date on which season_doy == harvest_season_doy. Marks end-of-harvest features for fit.parquet. |
| gstd | "growing-season-to-date" — accumulators that reset at season start. Never ytd. |
| MonthDay | Recurring calendar day (month, day), year-free value object. |
| SeasonWindow | Named aggregation window (name, sdoy_start, sdoy_end). |
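
The relationship between MonthDay, a concrete season start, and season_doy can be sketched as below. This is a minimal illustration, not the package's implementation: the `in_year` helper, the convention that the offset is 0 on the start day, and the September 1 start date are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class MonthDay:
    """Year-free recurring calendar day, e.g. a season start."""
    month: int
    day: int

    def in_year(self, year: int) -> date:
        # Hypothetical helper: pin the recurring day to one calendar year.
        return date(year, self.month, self.day)


def season_doy(calendar_date: date, season_start: date) -> int:
    """Day offset from the season start (assumed to be 0 on the start day)."""
    return (calendar_date - season_start).days


# Hypothetical cross-year season that starts in the prior autumn.
start = MonthDay(month=9, day=1).in_year(2023)
late_harvest_doy = season_doy(date(2024, 9, 15), start)  # exceeds 366
```

Because the offset is measured from the season start rather than from January 1, a cross-year crop needs no separate flag: the value simply keeps counting past 366.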

Geographic vocabulary

| Term | Meaning |
| --- | --- |
| geo_identifier | Lowercase ADM path ADM0:usa/ADM1:{state}/ADM2:{county}. The one canonical identifier — no FIPS codes, no mixed case. lib/geo/identifiers.py. |
| AggregationLevel | Literal["ADM0", "ADM1", "ADM2"]. |
| included_geo_identifiers | frozenset[str] of modelled counties (top 95% by production). Required kwarg threaded through eval chain. |
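
A geo_identifier check and ADM-level inference might look like the following sketch. The regular expression is the one quoted in the notable claims of this summary; the function names `parse_geo_identifier` and `aggregation_level` are hypothetical and are not taken from lib/geo/identifiers.py.

```python
import re
from typing import Literal, NewType

GeoIdentifier = NewType("GeoIdentifier", str)
AggregationLevel = Literal["ADM0", "ADM1", "ADM2"]

# Pattern quoted in this summary for a valid identifier.
GEO_PATTERN = re.compile(r"^ADM0:[a-z0-9]+(/ADM1:.+(/ADM2:.+)?)?$")


def parse_geo_identifier(raw: str) -> GeoIdentifier:
    """Validate a raw string; at runtime the result is still a plain str."""
    if not GEO_PATTERN.match(raw):
        raise ValueError(f"not a canonical geo_identifier: {raw!r}")
    return GeoIdentifier(raw)


def aggregation_level(geo: GeoIdentifier) -> AggregationLevel:
    """Infer the ADM level from the prefix of the last path segment."""
    prefix = geo.rsplit("/", 1)[-1].split(":", 1)[0]
    if prefix not in ("ADM0", "ADM1", "ADM2"):
        raise ValueError(f"unknown ADM prefix in {geo!r}")
    return prefix  # type: ignore[return-value]
```

Inferring the level from the last segment keeps it out of storage, which matches the claim that the level is never kept as a separate field.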

Data-shape vocabulary

| Term | Meaning |
| --- | --- |
| Builder | Plug-in module reading a source into a parquet keyed by (year, geo_identifier, init_date). |
| fit.parquet | End-of-harvest features — one row per (geo, year), target known. |
| pred.parquet | In-season features — all init_dates, target unknown/lagged. |
| walk_forward_preds.parquet | County × init_date simulated yields for a single fold. Same schema across folds. |
| fold_label | CV fold identifier. Numeric year ("2020") or literal "production". |
| run_dir | On-disk artefact root. Sole hand-off contract between stages. |
| metadata.json | Sidecar carrying index_cols, feature_cols, target_col. |

Unit vocabulary (canonical)

Internal storage: yield_kg_ha, area_harvested_ha, production_kg. Delivery boundary only: yield_bu_ac, yield_lbs_ac. "Columns without a unit suffix are forbidden."
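
A delivery-boundary conversion from the internal unit to bushels per acre could be sketched as follows. The function name is hypothetical and is not taken from lib/unit_utils.py; the constants are the exact international definitions of the pound and the acre, and 56 lb per bushel is used only as an illustrative bushel_weight_lbs value.

```python
# Exact definitions: 1 lb = 0.45359237 kg, 1 acre = 0.40468564224 ha.
LB_PER_KG = 1 / 0.45359237
HA_PER_ACRE = 0.40468564224


def yield_kg_ha_to_bu_ac(yield_kg_ha: float, bushel_weight_lbs: float) -> float:
    """Convert an internal kg/ha yield to bu/ac for delivery."""
    # kg/ha -> lb/ha -> lb/ac
    lbs_per_acre = yield_kg_ha * LB_PER_KG * HA_PER_ACRE
    # lb/ac -> bu/ac using the per-commodity bushel weight
    return lbs_per_acre / bushel_weight_lbs


# Illustrative call: 10,000 kg/ha at an assumed 56 lb/bu is about 159.3 bu/ac.
example_bu_ac = yield_kg_ha_to_bu_ac(10_000.0, bushel_weight_lbs=56.0)
```

Keeping the unit in both the argument name and the function name mirrors the rule that columns without a unit suffix are forbidden.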

Lifecycle vocabulary — four stages

| Stage | Responsibility |
| --- | --- |
| FIT | Detrend → impute → regress → save artefacts + train_preds + walk_forward_preds. Zero metrics, zero plots. |
| POSTPROCESS | Aggregate to national → fit bias corrector → attach conformal intervals. |
| EVALUATE | Compute metrics, generate plots. Read-only consumer. |
| DELIVER | Emit client-facing CSV per ADM level from walk_forward_preds + postprocessed. |

Bounded contexts (DOMAIN_MODEL2.md §2, DOMAIN_MODEL.md §3)

Seven contexts:

| Context | Aggregate Roots |
| --- | --- |
| Configuration | ExperimentConfig, CommodityConfig |
| Feature Assembly | Builder (+ registry), metadata.json |
| Experiment / Modelling | ExperimentResult (root) → HindcastSlice |
| Post-processing | bias_corrector, conformal intervals |
| Delivery | HindcastDelivery / DeliveryRow |
| Forecast | ForecastSlice under ExperimentResult |
| Tracking / Preflight | MLflow run, Check |

Entity catalogue (DOMAIN_MODEL2.md §4.1–4.10, DOMAIN_MODEL.md §4)

ExperimentConfig (config.py, identity: experiment_name) The top-level settings root loaded via pydantic-settings (YAML > env > CLI). Has no run_dir field — only run_dir_base; run_dir is owned by ExperimentResult. Resolved data_root comes from INPUT_DATA_DIR env var via require_input_data_dir(). Invariants: experiment_name matches [a-zA-Z0-9_-]+; all ResolvablePath fields resolve under data_root; forecast.init_date required when forecast is set; output dirs created on resolve.

CommodityConfig (config.py:284, identity: commodity) Source of truth for all commodity-specific constants: calendar, builders, feature/target columns, unit weights. Key fields: season_start (MonthDay), harvest_season_doy, hindcast_init_season_doys, bushel_weight_lbs, delivery_unit, yield_range, feature_cols/target_col, builders dict, climo_windows/weather_windows.

Builder (config.py:243, discriminated union on type) Five variants: YieldsBuilder, WeatherBuilder, ClimoBuilder, NDVIBuilder, StressBuilder. Common base (BaseBuilderConfig, config.py:161): filepath, geo_id_col, required_for_pred_parquet.

ExperimentResult (lib/results/run_result.py:32, identity: run_dir) Aggregate root — frozen dataclass that lazily discovers fold artefacts on disk. Fields: config, hindcast_slices: tuple[HindcastSlice, ...], forecast_slices: tuple[ForecastSlice, ...], run_dir. Constructor: from_run_dir(run_dir). "No in-memory cache — readers are responsible for materialising." Invariant: if forecast_slices non-empty, must contain a production hindcast slice.
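
The "handle, not container" idea can be illustrated with a much-reduced sketch. Everything here is hypothetical: `RunHandle` stands in for ExperimentResult, slices are reduced to their labels, and `experiment_key` is passed explicitly, whereas the real from_run_dir(run_dir) takes only the run directory. The directory names follow the path layouts quoted in this section.

```python
from dataclasses import dataclass
from pathlib import Path


def _subdir_names(root: Path) -> tuple[str, ...]:
    # A missing directory simply means "no slices of this kind yet".
    if not root.is_dir():
        return ()
    return tuple(sorted(p.name for p in root.iterdir() if p.is_dir()))


@dataclass(frozen=True)
class RunHandle:
    """Hypothetical stand-in for ExperimentResult: labels and paths only."""
    run_dir: Path
    hindcast_fold_labels: tuple[str, ...]
    forecast_init_dates: tuple[str, ...]

    def __post_init__(self) -> None:
        # Aggregate invariant: forecasts reuse the production fit.
        if self.forecast_init_dates and "production" not in self.hindcast_fold_labels:
            raise ValueError("forecast slices require a production hindcast slice")

    @classmethod
    def from_run_dir(cls, run_dir: Path, experiment_key: str) -> "RunHandle":
        # Discover what exists on disk; load nothing into memory.
        return cls(
            run_dir=run_dir,
            hindcast_fold_labels=_subdir_names(run_dir / "preds" / experiment_key),
            forecast_init_dates=_subdir_names(run_dir / "forecast"),
        )
```

The handle holds no DataFrames or models, so two readers constructed from the same run_dir always agree with the disk rather than with each other's caches.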

HindcastSlice (lib/results/results_slice.py:112, identity: (run_dir, fold_label)) Lazy handle to one fold's artefacts. fold_label is either a numeric year string ("2020") or the literal "production". Path layout:

{run_dir}/models/{experiment_key}/{fold_label}/
  detrender.pkl, feature_fill_values.parquet, model.{ridge|pca_ridge|xgboost}, bias_corrector.pkl (optional)
{run_dir}/preds/{experiment_key}/{fold_label}/
  train_preds.parquet, walk_forward_preds.parquet, year_data.parquet
cutoff property: date(int(fold_label), 1, 1) for numeric folds; sentinel for "production".
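
The cutoff rule and the two per-fold directories can be sketched as plain functions. The source does not say what sentinel the "production" fold uses, so `date.max` below is an assumption, as are the function names.

```python
from datetime import date
from pathlib import Path

PRODUCTION_FOLD = "production"
# Assumption: the real sentinel is not specified in this summary.
PRODUCTION_CUTOFF = date.max


def cutoff_for(fold_label: str) -> date:
    """Map a fold label to its cutoff date."""
    if fold_label == PRODUCTION_FOLD:
        return PRODUCTION_CUTOFF
    if not fold_label.isdigit():
        raise ValueError(f"fold_label must be a year or 'production': {fold_label!r}")
    return date(int(fold_label), 1, 1)


def models_dir(run_dir: Path, experiment_key: str, fold_label: str) -> Path:
    # Mirrors {run_dir}/models/{experiment_key}/{fold_label}/
    return run_dir / "models" / experiment_key / fold_label


def preds_dir(run_dir: Path, experiment_key: str, fold_label: str) -> Path:
    # Mirrors {run_dir}/preds/{experiment_key}/{fold_label}/
    return run_dir / "preds" / experiment_key / fold_label
```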

ForecastSlice (lib/results/results_slice.py:299, identity: (run_dir, commodity, season_year, init_date)) Lazy handle to one in-season forecast. Reuses the production HindcastSlice's model; splices its own indices + features. Path layout:

{run_dir}/forecast/{init_date}/
  indices.zarr, features/pred.parquet, postprocessed_{init_date}.parquet
  Treefera_{experiment_key}_{ADM}_Forecast_{init_date}.csv
Invariant: every ForecastSlice references the same production HindcastSlice; cannot exist without it.

HindcastDelivery / DeliveryRow (delivery/schemas.py) HindcastDelivery is a validated list[DeliveryRow] for a single commodity × ADM level. DeliveryRow is one CSV row: identity (commodity, year, init_date, geo_identifier, variable, model), prediction (mean), benchmarks (nass_actual, nass_actual_area_weighted_all, nass_actual_prod_div_area_all, wasde_in_season), corrections (weather_correction_bu_ac), CI bands (lower_{50,68,80,90,95}, upper_{50,68,80,90,95}). Invariants: CI ordering holds; no duplicate (year, init_date, geo_identifier); equal init_date count per (year, geo).
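
The CI-ordering invariant on a delivery row can be checked with a few lines. This is a sketch over a plain dict rather than the actual validated model in delivery/schemas.py; the function name and the sample numbers are made up for illustration.

```python
CI_LEVELS = (50, 68, 80, 90, 95)


def ci_ordering_holds(row: dict[str, float]) -> bool:
    """True when lower_95 <= ... <= lower_50 <= mean <= upper_50 <= ... <= upper_95."""
    lowers = [row[f"lower_{level}"] for level in reversed(CI_LEVELS)]
    uppers = [row[f"upper_{level}"] for level in CI_LEVELS]
    chain = [*lowers, row["mean"], *uppers]
    # Each value must be no greater than its right-hand neighbour.
    return all(left <= right for left, right in zip(chain, chain[1:]))


# Illustrative row with symmetric bands around a mean of 170.
sample_row = {"mean": 170.0}
for level, half_width in zip(CI_LEVELS, (2.0, 3.0, 4.0, 5.0, 6.0)):
    sample_row[f"lower_{level}"] = 170.0 - half_width
    sample_row[f"upper_{level}"] = 170.0 + half_width
```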

EditRule (lib/edit_and_imputation/edit.py:361, discriminated union on kind) Four rule types: RatioEditRule, RangeEditRule, NullImputeRule, PanelNullImputeRule. Six operations: deductive_impute, clip, flag, drop, fail, panel_trailing_median. Rules apply sequentially in YAML order; fires accumulate in EditReport.

Check (run/preflight.py:20) Value object: (name, passed, message, critical). Critical failures abort via SystemExit. run_preflight() (run/preflight.py:42) iterates checks; non-critical fails log WARNING; critical fails log ERROR and raise SystemExit.
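
The preflight behaviour described above can be approximated as follows. This is a sketch, not the code in run/preflight.py: the function is named `run_preflight_sketch` to make that clear, and it assumes all checks are iterated before aborting, which the source does not state.

```python
import logging
from dataclasses import dataclass
from typing import Iterable

logger = logging.getLogger("preflight_sketch")


@dataclass(frozen=True)
class Check:
    """Value object with the four fields listed in the summary."""
    name: str
    passed: bool
    message: str
    critical: bool


def run_preflight_sketch(checks: Iterable[Check]) -> None:
    critical_failures = []
    for check in checks:
        if check.passed:
            continue
        if check.critical:
            logger.error("preflight %s failed: %s", check.name, check.message)
            critical_failures.append(check.name)
        else:
            # Non-critical failures are reported but do not stop the run.
            logger.warning("preflight %s failed: %s", check.name, check.message)
    if critical_failures:
        raise SystemExit(f"critical preflight failures: {', '.join(critical_failures)}")
```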

Key invariants (DOMAIN_MODEL2.md §4.6, DOMAIN_MODEL.md §6)

13 named invariants:

1. Geo identifier — always lowercase ADM0:usa/ADM1:{state}/ADM2:{county}.
2. Temporal pairing — season_doy only meaningful with (season_year, commodity); no crosses_year boolean.
3. Units — kg/ha internally everywhere; conversion only at input/output boundaries. Explicit _kg_ha / _ha / _kg / _bu_ac suffixes required.
4. Column name stability — same quantity keeps same column name across all stages (sim_yield_kg_ha end-to-end).
5. Unweighted mean of yields is forbidden — must weight by area.
6. CI ordering — lower_95 ≤ lower_90 ≤ … ≤ mean ≤ … ≤ upper_95.
7. Yield range — mean within per-commodity YIELD_RANGE bounds.
8. Fold consistency — every (year, geo) group in delivery has the same init_date count.
9. included_geo_identifiers — required kwarg at every level; never optional, never falls back to test-fold geo.
10. Config is pure data — no build_* factories or I/O on config classes.
11. Stage isolation — no stage module imports another stage's internals; ExperimentResult is a handle, never a container of computed data.
12. Forecast isolation — forecast pipeline must not mutate canonical hindcast artefacts.
13. Atomic writes — every stage writes to temp then renames; has_X flags only flip after rename.
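
The atomic-write invariant follows a common write-to-temp-then-rename pattern. The sketch below uses only the standard library; the helper name is hypothetical and the package's own writer may differ in detail.

```python
import os
import tempfile
from pathlib import Path


def atomic_write_bytes(target: Path, payload: bytes) -> None:
    """Write payload so readers see either the old file or the complete new one."""
    target.parent.mkdir(parents=True, exist_ok=True)
    # The temp file must live on the same filesystem for the rename to be atomic.
    fd, tmp_name = tempfile.mkstemp(dir=target.parent, prefix=target.name, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(payload)
            handle.flush()
            os.fsync(handle.fileno())
        # os.replace overwrites the destination in a single step.
        os.replace(tmp_name, target)
    except BaseException:
        # Leave no partial temp file behind on failure.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

A has_X style flag implemented as `target.exists()` then flips only after the rename, so a crashed stage never leaves a half-written artefact that looks complete.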

Pipeline phases with stage-orchestrator cross-reference (DOMAIN_MODEL2.md §4.5)

Eight stage-orchestrator modules under stages/:

| Module | Public entry | Role |
| --- | --- | --- |
| stages/run_features.py | preprocess_data | feature-matrix preprocessor utility |
| stages/run_fit.py | train | per-fold FIT kernel |
| stages/run_hindcast.py | run, fit_production | walk-forward hindcast workflow |
| stages/run_predict.py | predict, write_walk_forward_outputs, run_predict | atomic point-in-time PREDICT kernel |
| stages/run_forecast.py | run_features, run_predict, run | forecast workflow |
| stages/run_meta_models.py | postprocess_experiment | POSTPROCESS |
| stages/run_diagnostics.py | evaluate_experiment | EVALUATE |
| stages/run_deliver.py | deliver_experiment | DELIVER hindcast CSVs |

Slice abstraction (DOMAIN_MODEL2.md §7)

The slice invariant: "A slice is the single entry point for everything relevant to processing one addressable portion of the pipeline on disk. It exposes paths and loaders for every artefact involved — whether that artefact is unique to the slice or pointed-to from a shared location at run_dir/."

AbstractSlice protocol (lib/results/results_slice.py:73) is @runtime_checkable and exposes: run_dir, cutoff, features_fit_path, features_pred_path, walk_forward_preds_path, year_data_path, loaders (load_walk_forward_preds, load_year_data, load_model, load_detrender, load_feature_fill_values), bias_corrector_path, has_bias_corrector.
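
How a @runtime_checkable protocol behaves can be shown with a cut-down version. `SliceLike` carries only four of the members listed above, and `DemoSlice` with its file locations is a hypothetical implementer invented for the example, not one of the real slice classes.

```python
from dataclasses import dataclass
from datetime import date
from pathlib import Path
from typing import Protocol, runtime_checkable


@runtime_checkable
class SliceLike(Protocol):
    """Reduced stand-in for AbstractSlice (subset of its surface)."""

    @property
    def run_dir(self) -> Path: ...
    @property
    def cutoff(self) -> date: ...
    @property
    def walk_forward_preds_path(self) -> Path: ...
    @property
    def has_bias_corrector(self) -> bool: ...


@dataclass(frozen=True)
class DemoSlice:
    run_dir: Path
    cutoff: date

    @property
    def walk_forward_preds_path(self) -> Path:
        # Hypothetical location, for illustration only.
        return self.run_dir / "walk_forward_preds.parquet"

    @property
    def has_bias_corrector(self) -> bool:
        return (self.run_dir / "bias_corrector.pkl").exists()


demo = DemoSlice(run_dir=Path("demo_run"), cutoff=date(2020, 1, 1))
conforms = isinstance(demo, SliceLike)  # structural check, no inheritance needed
```

Note that isinstance against a runtime-checkable protocol only verifies that the members exist, not their types, so it complements static checking rather than replacing it.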

Ownership matrix:

| Class | Cutoff identity | Owns (unique to slice) | Points at (shared, run-level) |
| --- | --- | --- | --- |
| HindcastSlice | fold_label ("YYYY" or "production") | model_path, detrender, fill_values, bias_corrector, preds | features_fit_path, features_pred_path |
| ForecastSlice | (season_year, init_date) | indices_zarr, per-init features, walk_forward_preds, postprocessed_national, delivery CSVs | training → ExperimentResult.production model/detrender/fill-values |

Feature matrix location: cfg.features_dir / {experiment_key} / (i.e. data_root/features/{experiment_key}/), not run_dir/features/. Forecast per-init splice lives inside the slice at run_dir/forecast/{init_date}/features/pred.parquet.

ExperimentResult.from_run_dir mental model: "A run_dir holds one ExperimentResult; that result contains an optional set of HindcastSlices (past cutoffs with ground truth — producers of trained artefacts) and an optional set of ForecastSlices (present/future cutoffs — consumers of those trained artefacts). Every slice satisfies the AbstractSlice protocol."

Package import DAG (DOMAIN_MODEL2.md §8)

Eight layers (root to leaf):

| Layer | Contents |
| --- | --- |
| L1 root | config.py (ExperimentConfig + nested config classes) |
| L2 pure utilities | lib/path_utils.py, lib/transform_utils.py, lib/unit_utils.py |
| L3 cross-cutting helpers | lib/tracking/, lib/reference_data/, lib/edit_and_imputation/, lib/geo/ |
| L4 aggregate-root + slices | lib/results/ (results_slice.py, run_result.py) |
| L5 domain services | features/, models/, delivery/, diagnostics/ |
| L6 execution-frame | run/ (preflight, experiment_protocol, runner) |
| L7 stage orchestration | stages/run_*.py |
| L8 entry points (leaves) | cli.py, app/ |

Layers rule: "A module SHALL only import from layers closer to the root, with one explicit exception: stages/ modules MAY compose sibling stages/ modules." Runtime cycles are forbidden; if TYPE_CHECKING: blocks permitted only where a signature would not otherwise resolve.
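
The layers rule lends itself to a mechanical check. The sketch below encodes the eight layers as module-name prefixes relative to the package root and applies a strict reading of the rule, in which same-layer imports are rejected except between stages/ modules. That strict reading, and both function names, are assumptions; a real checker would also need a policy for imports within a single sub-package.

```python
# (module prefix relative to the package root, layer number)
LAYER_PREFIXES = (
    ("config", 1),
    ("lib.path_utils", 2), ("lib.transform_utils", 2), ("lib.unit_utils", 2),
    ("lib.tracking", 3), ("lib.reference_data", 3),
    ("lib.edit_and_imputation", 3), ("lib.geo", 3),
    ("lib.results", 4),
    ("features", 5), ("models", 5), ("delivery", 5), ("diagnostics", 5),
    ("run", 6),
    ("stages", 7),
    ("cli", 8), ("app", 8),
)
STAGES_LAYER = 7


def layer_of(module: str) -> int:
    for prefix, layer in LAYER_PREFIXES:
        if module == prefix or module.startswith(prefix + "."):
            return layer
    raise ValueError(f"module not assigned to a layer: {module!r}")


def import_allowed(importer: str, imported: str) -> bool:
    """True if `importer` may import `imported` under the strict reading."""
    src, dst = layer_of(importer), layer_of(imported)
    if dst < src:
        return True  # importing from a layer closer to the root
    # Explicit exception: stages/ modules may compose sibling stages/ modules.
    return src == dst == STAGES_LAYER
```

Under this reading, an import of stages.run_meta_models from delivery.conversions is flagged as an upward edge, since delivery sits in L5 and stages in L7.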

Cycle audit: no runtime cycles. Two if TYPE_CHECKING: blocks: cli.py:32-33 (ExperimentConfig), lib/results/results_slice.py:32-37 (AbstractDetrend, AbstractRegressionImpl). Two known upward edges from delivery/ into the orchestrator layer are tracked as cleanup tech-debt (not cycles).

Open questions (DOMAIN_MODEL.md §9)

| Topic | Question |
| --- | --- |
| marketing_year | Should be collapsed into season_year — currently a parallel concept (WASDE Oct-Sep period). |
| Layering tech-debt | delivery/conversions.py imports conformal helpers from stages/run_meta_models.py; should move to lib/. |
| Forecast resumption | No per-fold checkpoint for hindcast walk-forward. Is per-fold resumption worth the complexity? |
| Custom exception hierarchy | Codebase uses only stdlib ValueError/SystemExit. |
| MLflow per-fold runs | Currently one MLflow run per pipeline invocation. Should folds be sub-runs? |
| Sub-commodity wheat | Preprocessor only emits WHEAT; sub-types in config are not produced. |

Schema.yaml and gen_linkml_schema.py

schema.yaml is the machine-readable structural export: 42 classes, 21 enums. Regenerated by:

bash market_insights_models/src/commodity_hindcast/domain-modelling/regen_schema.sh

Sub-packages skipped during generation: app/ (Streamlit import side-effects), tests/, scripts/. Literal[...] annotations are promoted to inline LinkML enums.

gen_linkml_schema.py is a codebase-agnostic CLI tool (also shipped by the /domain-modeling skill at ~/.claude/skills/domain-modeling/scripts/gen_linkml_schema.py). CLI flags: --package, --skip, --set-env, --schema-id, --output. Determinism guaranteed: same source → byte-identical YAML via sorted keys.

Downstream codegen from schema.yaml: JSON Schema (gen-json-schema), Markdown docs (gen-markdown), ER diagram (gen-erdiagram). Recommended CI: regen produces non-empty diff → fail the build.

Notable claims (the load-bearing ones)

  • GeoIdentifier is a NewType("GeoIdentifier", str) alias — not a class. At runtime it is a plain str matching ^ADM0:[a-z0-9]+(/ADM1:.+(/ADM2:.+)?)?$. The ADM level is inferred from the prefix when needed; never stored as a separate field.
  • ExperimentConfig has no run_dir field — only run_dir_base; run_dir is owned by ExperimentResult and constructed from run_dir_base / experiment_name / <timestamp>.
  • ExperimentResult is a frozen dataclass handle — it carries NO computed results; disk is the contract.
  • fold_label = "production" is the no-holdout fit — canonical string, not a numeric year.
  • AbstractSlice is @runtime_checkable — consumer type annotations should be widened to AbstractSlice only when the body touches protocol-surface members exclusively.
  • The canonical cutoff term is cutoff (from Nixtla/Prophet/FPP3), not window or split. For numeric folds, fold_label is literally the cutoff year; "production" is the one non-numeric label.
  • schema.yaml must never be hand-edited — only regen_schema.sh writes it.
  • marketing_year is an acknowledged imperfect concept — tracked as an open question for collapse into season_year.
  • ForecastSlice cannot exist without a production HindcastSlice — this is an aggregate invariant.
  • Feature matrix lives at data_root/features/{experiment_key}/, not under run_dir — slices reach it via _load_config(run_dir).

What this document is NOT

The in-package domain model does not describe the CLI surface in full (that is README.md), the EARS-format design contracts (that is DESIGN.md), or the backlog (that is TODO.md). The schema.yaml does not carry editorial meaning — only structure.

Cross-references

  • DESIGN.md — EARS clauses that encode the invariants as requirements
  • README.md — operator guide including the authoritative run_dir layout
  • features_README.md — assemble orchestrator that writes fit.parquet, pred.parquet, metadata.json
  • features_builders_README.md — builder protocol satisfying the (year, geo_identifier, init_date) key contract