Source: In-Package Domain Model
What it is
Four co-located files under market_insights_models/src/commodity_hindcast/domain-modelling/:
- `DOMAIN_MODEL2.md` (v2, 865 lines, 2026-04-29): the editorial domain model, covering ubiquitous language, bounded contexts, entity catalogue, relationships, aggregates, pipeline phases with stage-orchestrator cross-reference, key invariants, walk-through scenario, slice abstraction, `AbstractSlice` protocol, and package import DAG. This is the canonical human-maintained reference.
- `DOMAIN_MODEL.md` (v1 curated rewrite, 729 lines): an earlier version organised into 9 numbered sections (Overview, Glossary, Bounded Contexts, Entity Catalogue, Relationships, Aggregates & Invariants, Lifecycles, Scenarios, Open Questions). Still load-bearing: it adds Scenarios §8, Open Questions §9, and `ExperimentConfig.from_yaml` / `run_dir_base` nuances not duplicated in `DOMAIN_MODEL2.md`.
- `schema.yaml`: auto-generated LinkML schema with 42 classes, 21 enums, and 0 import failures. Machine-readable structural export. Regenerated via `regen_schema.sh`; never hand-edited.
- `gen_linkml_schema.py`: the codebase-agnostic generator script. Walks every importable module under a package via `pkgutil.walk_packages`, discovers Pydantic `BaseModel` subclasses, dataclasses, and `Enum` subclasses through `inspect` introspection, and emits LinkML YAML via `pyyaml`. Deterministic: same source → byte-identical YAML.
Source-of-truth split (from DOMAIN_MODEL.md Appendix B):
| Layer | Source of truth |
|---|---|
| Structure (classes, attributes, types, cardinality) | The Python source — commodity_hindcast/**/*.py |
| Editorial (role, bounded context, identifier, invariants, scenarios, glossary) | DOMAIN_MODEL2.md and DOMAIN_MODEL.md |
| Machine-readable export | schema.yaml (auto-generated, never hand-edited) |
"If the Markdown and the YAML disagree about structure, the YAML is right (regenerate the doc); if they disagree about meaning, the Markdown is right (it's the only place meaning is asserted)."
Section-by-section summary
Ubiquitous language (DOMAIN_MODEL2.md §1)
Complete glossary drawn from DESIGN.md, READMEs, and canonical type names. Key groups:
Temporal vocabulary
| Term | Meaning |
|---|---|
| `season_year` | Crop year label. Paired with season_doy + commodity to locate any point in a growing season. |
| `season_doy` | Integer day-offset from season start. Can exceed 366 for cross-year crops (winter wheat). Distinct from calendar_doy. |
| `init_date` | Specific calendar date on which a within-season forecast is issued. Features known up to init_date − lag_days. |
| `harvest_date` | Calendar date on which season_doy == harvest_season_doy. Marks end-of-harvest features for fit.parquet. |
| `gstd` | "growing-season-to-date": accumulators that reset at season start. Never ytd. |
| `MonthDay` | Recurring calendar day (month, day), year-free value object. |
| `SeasonWindow` | Named aggregation window (name, sdoy_start, sdoy_end). |
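As a rough illustration of the temporal vocabulary, the sketch below derives a `season_doy` from a calendar date and a year-free `MonthDay`. The 1-based offset, the helper names `season_doy` and `feature_cutoff`, the `in_year` method, and the 1 September season start are assumptions made for this example; the package's own conventions may differ.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class MonthDay:
    """Year-free recurring calendar day (hypothetical minimal version)."""
    month: int
    day: int

    def in_year(self, year: int) -> date:
        return date(year, self.month, self.day)


def season_doy(d: date, season_start: MonthDay, start_calendar_year: int) -> int:
    """Day offset of `d` from the season start (ASSUMPTION: 1-based)."""
    return (d - season_start.in_year(start_calendar_year)).days + 1


def feature_cutoff(init_date: date, lag_days: int) -> date:
    """Latest date whose features are known for a forecast issued at init_date."""
    return init_date - timedelta(days=lag_days)


# A season that starts on 1 Sep 2023 and runs into the next calendar year.
start = MonthDay(9, 1)
print(season_doy(date(2023, 9, 1), start, 2023))     # 1
print(season_doy(date(2024, 7, 15), start, 2023))    # 319 (calendar DOY would be 197)
print(feature_cutoff(date(2024, 6, 1), lag_days=5))  # 2024-05-27
```

Because the offset is measured from the season start rather than from 1 January, the value keeps growing across the year boundary, which is why it is kept distinct from calendar_doy.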
Geographic vocabulary
| Term | Meaning |
|---|---|
| `geo_identifier` | Lowercase ADM path ADM0:usa/ADM1:{state}/ADM2:{county}. The one canonical identifier: no FIPS codes, no mixed case. lib/geo/identifiers.py. |
| `AggregationLevel` | Literal["ADM0", "ADM1", "ADM2"]. |
| `included_geo_identifiers` | frozenset[str] of modelled counties (top 95% by production). Required kwarg threaded through eval chain. |
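A minimal sketch of how such an identifier could be validated and its level inferred, using the pattern `^ADM0:[a-z0-9]+(/ADM1:.+(/ADM2:.+)?)?$` quoted under the notable claims in this document. The function names `parse_geo_identifier` and `aggregation_level` are hypothetical, and the extra per-segment lowercase check is an assumption layered on top of the regex.

```python
import re
from typing import NewType

GeoIdentifier = NewType("GeoIdentifier", str)

_GEO_RE = re.compile(r"^ADM0:[a-z0-9]+(/ADM1:.+(/ADM2:.+)?)?$")
_LEVELS = ("ADM0", "ADM1", "ADM2")


def parse_geo_identifier(raw: str) -> GeoIdentifier:
    """Hypothetical validator: accept only canonical lowercase ADM paths."""
    if not _GEO_RE.match(raw):
        raise ValueError(f"not a canonical geo_identifier: {raw!r}")
    for segment in raw.split("/"):
        prefix, sep, value = segment.partition(":")
        # ASSUMPTION: every value part must be lowercase, at every level.
        if not sep or prefix not in _LEVELS or value != value.lower():
            raise ValueError(f"not a canonical geo_identifier: {raw!r}")
    return GeoIdentifier(raw)


def aggregation_level(geo: GeoIdentifier) -> str:
    """Infer the ADM level from the prefix of the deepest path segment."""
    return geo.rsplit("/", 1)[-1].partition(":")[0]


county = parse_geo_identifier("ADM0:usa/ADM1:iowa/ADM2:story")
print(aggregation_level(county))                         # ADM2
print(aggregation_level(parse_geo_identifier("ADM0:usa")))  # ADM0
```

The level is derived on demand from the string itself, which matches the stated design of never storing it as a separate field.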
Data-shape vocabulary
| Term | Meaning |
|---|---|
| `Builder` | Plug-in module reading a source into a parquet keyed by (year, geo_identifier, init_date). |
| `fit.parquet` | End-of-harvest features: one row per (geo, year), target known. |
| `pred.parquet` | In-season features: all init_dates, target unknown/lagged. |
| `walk_forward_preds.parquet` | County × init_date simulated yields for a single fold. Same schema across folds. |
| `fold_label` | CV fold identifier. Numeric year ("2020") or literal "production". |
| `run_dir` | On-disk artefact root. Sole hand-off contract between stages. |
| `metadata.json` | Sidecar carrying index_cols, feature_cols, target_col. |
Unit vocabulary (canonical)
Internal storage: yield_kg_ha, area_harvested_ha, production_kg. Delivery boundary only: yield_bu_ac, yield_lbs_ac. "Columns without a unit suffix are forbidden."
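The boundary conversion can be sketched as follows. The function names are hypothetical; the two physical constants are the exact definitions of the international pound and acre, and the 56 lb bushel is used only as an example value for `bushel_weight_lbs`.

```python
KG_PER_LB = 0.45359237       # exact, by definition of the international pound
HA_PER_ACRE = 0.40468564224  # exact, by definition of the international acre


def yield_kg_ha_to_lbs_ac(yield_kg_ha: float) -> float:
    """kg/ha -> lb/ac: convert mass to pounds, then area from hectares to acres."""
    return yield_kg_ha / KG_PER_LB * HA_PER_ACRE


def yield_kg_ha_to_bu_ac(yield_kg_ha: float, bushel_weight_lbs: float) -> float:
    """kg/ha -> bu/ac for a commodity with the given bushel weight in pounds."""
    return yield_kg_ha_to_lbs_ac(yield_kg_ha) / bushel_weight_lbs


# Example: 10 000 kg/ha with a 56 lb bushel.
print(round(yield_kg_ha_to_lbs_ac(10_000.0), 1))       # 8921.8
print(round(yield_kg_ha_to_bu_ac(10_000.0, 56.0), 1))  # 159.3
```

Keeping the unit suffix in every function and column name makes it hard to pass a bu/ac value where kg/ha is expected.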
Lifecycle vocabulary — four stages
| Stage | Responsibility |
|---|---|
| FIT | Detrend → impute → regress → save artefacts + train_preds + walk_forward_preds. Zero metrics, zero plots. |
| POSTPROCESS | Aggregate to national → fit bias corrector → attach conformal intervals. |
| EVALUATE | Compute metrics, generate plots. Read-only consumer. |
| DELIVER | Emit client-facing CSV per ADM level from walk_forward_preds + postprocessed. |
Bounded contexts (DOMAIN_MODEL2.md §2, DOMAIN_MODEL.md §3)
Seven contexts:
| Context | Aggregate Roots |
|---|---|
| Configuration | ExperimentConfig, CommodityConfig |
| Feature Assembly | Builder (+ registry), metadata.json |
| Experiment / Modelling | ExperimentResult (root) → HindcastSlice |
| Post-processing | bias_corrector, conformal intervals |
| Delivery | HindcastDelivery → DeliveryRow |
| Forecast | ForecastSlice under ExperimentResult |
| Tracking / Preflight | MLflow run, Check |
Entity catalogue (DOMAIN_MODEL2.md §4.1–4.10, DOMAIN_MODEL.md §4)
ExperimentConfig (config.py, identity: experiment_name)
The top-level settings root loaded via pydantic-settings (YAML > env > CLI). Has no run_dir field — only run_dir_base; run_dir is owned by ExperimentResult. Resolved data_root comes from INPUT_DATA_DIR env var via require_input_data_dir(). Invariants: experiment_name matches [a-zA-Z0-9_-]+; all ResolvablePath fields resolve under data_root; forecast.init_date required when forecast is set; output dirs created on resolve.
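The name invariant, together with the `run_dir_base / experiment_name / <timestamp>` layout noted elsewhere in this document, can be sketched with the standard library alone. The helper names and the timestamp format are assumptions; the source does not state how the timestamp is rendered, and the real class uses pydantic-settings rather than free functions.

```python
import re
from datetime import datetime
from pathlib import Path

_NAME_RE = re.compile(r"[a-zA-Z0-9_-]+")


def validate_experiment_name(name: str) -> str:
    """Reject names containing anything outside [a-zA-Z0-9_-]."""
    if not _NAME_RE.fullmatch(name):
        raise ValueError(f"invalid experiment_name: {name!r}")
    return name


def make_run_dir(run_dir_base: Path, experiment_name: str, now: datetime) -> Path:
    """Compose a run directory path (no directories are created here)."""
    # ASSUMPTION: the timestamp format is illustrative only.
    stamp = now.strftime("%Y%m%dT%H%M%S")
    return run_dir_base / validate_experiment_name(experiment_name) / stamp


path = make_run_dir(Path("runs"), "corn_ridge-v2", datetime(2026, 4, 29, 12, 0, 0))
print(path.as_posix())  # runs/corn_ridge-v2/20260429T120000
```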
CommodityConfig (config.py:284, identity: commodity)
Source of truth for all commodity-specific constants: calendar, builders, feature/target columns, unit weights. Key fields: season_start (MonthDay), harvest_season_doy, hindcast_init_season_doys, bushel_weight_lbs, delivery_unit, yield_range, feature_cols/target_col, builders dict, climo_windows/weather_windows.
Builder (config.py:243, discriminated union on type)
Five variants: YieldsBuilder, WeatherBuilder, ClimoBuilder, NDVIBuilder, StressBuilder. Common base (BaseBuilderConfig, config.py:161): filepath, geo_id_col, required_for_pred_parquet.
ExperimentResult (lib/results/run_result.py:32, identity: run_dir)
Aggregate root — frozen dataclass that lazily discovers fold artefacts on disk. Fields: config, hindcast_slices: tuple[HindcastSlice, ...], forecast_slices: tuple[ForecastSlice, ...], run_dir. Constructor: from_run_dir(run_dir). "No in-memory cache — readers are responsible for materialising." Invariant: if forecast_slices non-empty, must contain a production hindcast slice.
HindcastSlice (lib/results/results_slice.py:112, identity: (run_dir, fold_label))
Lazy handle to one fold's artefacts. fold_label is either a numeric year string ("2020") or the literal "production". Path layout:
{run_dir}/models/{experiment_key}/{fold_label}/
detrender.pkl, feature_fill_values.parquet, model.{ridge|pca_ridge|xgboost}, bias_corrector.pkl (optional)
{run_dir}/preds/{experiment_key}/{fold_label}/
train_preds.parquet, walk_forward_preds.parquet, year_data.parquet
cutoff property: date(int(fold_label), 1, 1) for numeric folds; sentinel for "production".
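The layout and `cutoff` rule above can be sketched as a small frozen dataclass. The class name `HindcastSliceSketch` is hypothetical, `experiment_key` is passed in directly for simplicity, and `date.max` stands in for the production sentinel, whose real value is not given in the source.

```python
from dataclasses import dataclass
from datetime import date
from pathlib import Path


@dataclass(frozen=True)
class HindcastSliceSketch:
    run_dir: Path
    experiment_key: str
    fold_label: str  # numeric year string such as "2020", or "production"

    @property
    def model_dir(self) -> Path:
        return self.run_dir / "models" / self.experiment_key / self.fold_label

    @property
    def preds_dir(self) -> Path:
        return self.run_dir / "preds" / self.experiment_key / self.fold_label

    @property
    def walk_forward_preds_path(self) -> Path:
        return self.preds_dir / "walk_forward_preds.parquet"

    @property
    def has_bias_corrector(self) -> bool:
        # bias_corrector.pkl is optional, so existence is checked on disk.
        return (self.model_dir / "bias_corrector.pkl").exists()

    @property
    def cutoff(self) -> date:
        if self.fold_label == "production":
            return date.max  # ASSUMPTION: placeholder for the real sentinel
        return date(int(self.fold_label), 1, 1)


fold = HindcastSliceSketch(Path("runs/demo"), "corn", "2020")
print(fold.cutoff)                                # 2020-01-01
print(fold.walk_forward_preds_path.as_posix())    # runs/demo/preds/corn/2020/walk_forward_preds.parquet
```

The object holds only paths and identity; nothing is read from disk until a caller asks for it.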
ForecastSlice (lib/results/results_slice.py:299, identity: (run_dir, commodity, season_year, init_date))
Lazy handle to one in-season forecast. Reuses the production HindcastSlice's model; splices its own indices + features. Path layout:
{run_dir}/forecast/{init_date}/
indices.zarr, features/pred.parquet, postprocessed_{init_date}.parquet
Treefera_{experiment_key}_{ADM}_Forecast_{init_date}.csv
ForecastSlice references the same production HindcastSlice; cannot exist without it.
HindcastDelivery → DeliveryRow (delivery/schemas.py)
HindcastDelivery is a validated list[DeliveryRow] for a single commodity × ADM level. DeliveryRow is one CSV row: identity (commodity, year, init_date, geo_identifier, variable, model), prediction (mean), benchmarks (nass_actual, nass_actual_area_weighted_all, nass_actual_prod_div_area_all, wasde_in_season), corrections (weather_correction_bu_ac), CI bands (lower_{50,68,80,90,95}, upper_{50,68,80,90,95}). Invariants: CI ordering holds; no duplicate (year, init_date, geo_identifier); equal init_date count per (year, geo).
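The three invariants can be checked with a few lines of plain Python. This is a sketch over dictionaries rather than the validated models in delivery/schemas.py; the function names `ci_is_ordered` and `validate_rows` are hypothetical.

```python
from collections import Counter

LEVELS = (50, 68, 80, 90, 95)


def ci_is_ordered(row: dict) -> bool:
    """lower_95 <= ... <= lower_50 <= mean <= upper_50 <= ... <= upper_95."""
    chain = (
        [row[f"lower_{lvl}"] for lvl in reversed(LEVELS)]
        + [row["mean"]]
        + [row[f"upper_{lvl}"] for lvl in LEVELS]
    )
    return all(a <= b for a, b in zip(chain, chain[1:]))


def validate_rows(rows: list[dict]) -> list[str]:
    """Return a list of human-readable invariant violations (empty if none)."""
    problems = []
    # No duplicate (year, init_date, geo_identifier).
    keys = Counter((r["year"], r["init_date"], r["geo_identifier"]) for r in rows)
    problems += [f"duplicate key {key}" for key, n in keys.items() if n > 1]
    # CI ordering holds on every row.
    problems += [
        f"CI ordering violated at {(r['year'], r['init_date'], r['geo_identifier'])}"
        for r in rows
        if not ci_is_ordered(r)
    ]
    # Every (year, geo) group carries the same number of init_dates.
    per_group = Counter((r["year"], r["geo_identifier"]) for r in rows)
    if len(set(per_group.values())) > 1:
        problems.append("unequal init_date count per (year, geo)")
    return problems
```

Returning all violations at once, rather than raising on the first, is a design choice of this sketch; the real validator may fail fast.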
EditRule (lib/edit_and_imputation/edit.py:361, discriminated union on kind)
Four rule types: RatioEditRule, RangeEditRule, NullImputeRule, PanelNullImputeRule. Six operations: deductive_impute, clip, flag, drop, fail, panel_trailing_median. Rules apply sequentially in YAML order; fires accumulate in EditReport.
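To illustrate sequential application with an accumulating report, the sketch below implements only a range rule with three of the six operations (`clip`, `flag`, `fail`). The field names `column`, `lower`, `upper`, and `operation`, and the shape of `EditReport.fires`, are assumptions; the source names the classes but not their attributes.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class RangeEditRule:
    """Hypothetical shape; real field names are not given in the source."""
    column: str
    lower: float
    upper: float
    operation: str  # "clip", "flag", or "fail" in this sketch
    kind: str = "range"


@dataclass
class EditReport:
    # Each fire is (column, row index, operation).
    fires: list = field(default_factory=list)


def apply_rules(rows: list[dict], rules: list[RangeEditRule]):
    """Apply rules in the given order; later rules see earlier edits."""
    report = EditReport()
    out = [dict(r) for r in rows]  # copy so the input is not mutated
    for rule in rules:
        for i, row in enumerate(out):
            value = row[rule.column]
            if rule.lower <= value <= rule.upper:
                continue
            report.fires.append((rule.column, i, rule.operation))
            if rule.operation == "clip":
                row[rule.column] = min(max(value, rule.lower), rule.upper)
            elif rule.operation == "fail":
                raise ValueError(f"{rule.column}={value} outside [{rule.lower}, {rule.upper}]")
            # "flag" only records the fire and leaves the value unchanged.
    return out, report
```

With a clip rule followed by a narrower flag rule, a value clipped by the first rule can still be flagged by the second, which is why rule order in the YAML matters.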
Check (run/preflight.py:20)
Value object: (name, passed, message, critical). Critical failures abort via SystemExit. run_preflight() (run/preflight.py:42) iterates checks; non-critical fails log WARNING; critical fails log ERROR and raise SystemExit.
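The described behaviour can be sketched as below. Whether the real `run_preflight()` aborts on the first critical failure or after reporting all checks is not stated in the source; this sketch reports everything first, and the exit code of 1 and the logger name are assumptions.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger("preflight_sketch")


@dataclass(frozen=True)
class Check:
    name: str
    passed: bool
    message: str
    critical: bool


def run_preflight(checks: list[Check]) -> None:
    """Log every failed check, then abort if any failure was critical."""
    critical_failed = False
    for check in checks:
        if check.passed:
            continue
        if check.critical:
            logger.error("%s: %s", check.name, check.message)
            critical_failed = True
        else:
            logger.warning("%s: %s", check.name, check.message)
    if critical_failed:
        raise SystemExit(1)  # ASSUMPTION: exit code is illustrative
```

A non-critical failure therefore leaves the pipeline running with a warning in the log, while a critical one stops it before any stage starts.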
Key invariants (DOMAIN_MODEL2.md §4.6, DOMAIN_MODEL.md §6)
13 named invariants:
1. Geo identifier — always lowercase ADM0:usa/ADM1:{state}/ADM2:{county}.
2. Temporal pairing — season_doy only meaningful with (season_year, commodity); no crosses_year boolean.
3. Units — kg/ha internally everywhere; conversion only at input/output boundaries. Explicit _kg_ha / _ha / _kg / _bu_ac suffixes required.
4. Column name stability — same quantity keeps same column name across all stages (sim_yield_kg_ha end-to-end).
5. Unweighted mean of yields is forbidden — must weight by area.
6. CI ordering — lower_95 ≤ lower_90 ≤ … ≤ mean ≤ … ≤ upper_95.
7. Yield range — mean within per-commodity YIELD_RANGE bounds.
8. Fold consistency — every (year, geo) group in delivery has the same init_date count.
9. included_geo_identifiers — required kwarg at every level; never optional, never falls back to test-fold geo.
10. Config is pure data — no build_* factories or I/O on config classes.
11. Stage isolation — no stage module imports another stage's internals; ExperimentResult is a handle, never a container of computed data.
12. Forecast isolation — forecast pipeline must not mutate canonical hindcast artefacts.
13. Atomic writes — every stage writes to temp then renames; has_X flags only flip after rename.
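Two of the invariants above lend themselves to short sketches: invariant 5 (area-weighted means) and invariant 13 (write to temp, then rename). The function names are hypothetical, and the package may implement both differently, for example on DataFrames rather than plain sequences.

```python
import os
import tempfile
from pathlib import Path


def area_weighted_yield(yields_kg_ha, areas_ha) -> float:
    """Aggregate yield as total production divided by total harvested area."""
    total_area_ha = sum(areas_ha)
    if total_area_ha <= 0:
        raise ValueError("total harvested area must be positive")
    production_kg = sum(y * a for y, a in zip(yields_kg_ha, areas_ha))
    return production_kg / total_area_ha


def atomic_write_bytes(target: Path, payload: bytes) -> None:
    """Write to a temp file in the target's directory, then rename over it."""
    fd, tmp_name = tempfile.mkstemp(dir=target.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(payload)
        os.replace(tmp_name, target)  # atomic when both paths share a filesystem
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


# Two counties: the large one dominates the weighted mean.
print(area_weighted_yield([10_000.0, 6_000.0], [300.0, 100.0]))  # 9000.0
```

The unweighted mean of the same two yields would be 8000 kg/ha, which understates the aggregate because the higher-yielding county harvests three times the area.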
Pipeline phases with stage-orchestrator cross-reference (DOMAIN_MODEL2.md §4.5)
Eight stage-orchestrator modules under stages/:
| Module | Public entry | Role |
|---|---|---|
| `stages/run_features.py` | `preprocess_data` | feature-matrix preprocessor utility |
| `stages/run_fit.py` | `train` | per-fold FIT kernel |
| `stages/run_hindcast.py` | `run`, `fit_production` | walk-forward hindcast workflow |
| `stages/run_predict.py` | `predict`, `write_walk_forward_outputs`, `run_predict` | atomic point-in-time PREDICT kernel |
| `stages/run_forecast.py` | `run_features`, `run_predict`, `run` | forecast workflow |
| `stages/run_meta_models.py` | `postprocess_experiment` | POSTPROCESS |
| `stages/run_diagnostics.py` | `evaluate_experiment` | EVALUATE |
| `stages/run_deliver.py` | `deliver_experiment` | DELIVER hindcast CSVs |
Slice abstraction (DOMAIN_MODEL2.md §7)
The slice invariant: "A slice is the single entry point for everything relevant to processing one addressable portion of the pipeline on disk. It exposes paths and loaders for every artefact involved — whether that artefact is unique to the slice or pointed-to from a shared location at run_dir/."
AbstractSlice protocol (lib/results/results_slice.py:73) is @runtime_checkable and exposes: run_dir, cutoff, features_fit_path, features_pred_path, walk_forward_preds_path, year_data_path, loaders (load_walk_forward_preds, load_year_data, load_model, load_detrender, load_feature_fill_values), bias_corrector_path, has_bias_corrector.
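A reduced sketch of a `@runtime_checkable` protocol shows what the decorator buys. Only three of the members listed above are declared, and `AbstractSliceSketch`, `ToySlice`, and `preds_location` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import date
from pathlib import Path
from typing import Protocol, runtime_checkable


@runtime_checkable
class AbstractSliceSketch(Protocol):
    """Subset of the protocol surface, not the full member list."""

    @property
    def run_dir(self) -> Path: ...

    @property
    def cutoff(self) -> date: ...

    @property
    def walk_forward_preds_path(self) -> Path: ...


@dataclass(frozen=True)
class ToySlice:
    # No inheritance from the protocol: conformance is purely structural.
    run_dir: Path
    cutoff: date

    @property
    def walk_forward_preds_path(self) -> Path:
        return self.run_dir / "walk_forward_preds.parquet"


def preds_location(s: AbstractSliceSketch) -> Path:
    """Touches protocol-surface members only, so the wide annotation is safe."""
    return s.walk_forward_preds_path


toy = ToySlice(Path("runs/demo"), date(2020, 1, 1))
print(isinstance(toy, AbstractSliceSketch))       # True
print(isinstance(object(), AbstractSliceSketch))  # False
```

Note that a runtime `isinstance` check against a protocol only verifies that the members exist; it does not check their types or signatures.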
Ownership matrix:
| Class | Cutoff identity | Owns (unique to slice) | Points at (shared, run-level) |
|---|---|---|---|
| `HindcastSlice` | `fold_label` ("YYYY" or "production") | model_path, detrender, fill_values, bias_corrector, preds | features_fit_path, features_pred_path |
| `ForecastSlice` | `(season_year, init_date)` | indices_zarr, per-init features, walk_forward_preds, postprocessed_national, delivery CSVs | training → ExperimentResult.production model/detrender/fill-values |
Feature matrix location: cfg.features_dir / {experiment_key} / (i.e. data_root/features/{experiment_key}/), not run_dir/features/. Forecast per-init splice lives inside the slice at run_dir/forecast/{init_date}/features/pred.parquet.
ExperimentResult.from_run_dir mental model: "A run_dir holds one ExperimentResult; that result contains an optional set of HindcastSlices (past cutoffs with ground truth — producers of trained artefacts) and an optional set of ForecastSlices (present/future cutoffs — consumers of those trained artefacts). Every slice satisfies the AbstractSlice protocol."
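That mental model can be sketched as a frozen handle that discovers slices by listing directories. `ResultSketch` and `_subdir_names` are hypothetical names, the result stores labels rather than slice objects, and `experiment_key` is passed explicitly here, whereas the real `from_run_dir(run_dir)` takes only the run directory.

```python
from dataclasses import dataclass
from pathlib import Path


def _subdir_names(root: Path) -> tuple:
    """Sorted names of the immediate subdirectories, or () if root is absent."""
    if not root.is_dir():
        return ()
    return tuple(sorted(p.name for p in root.iterdir() if p.is_dir()))


@dataclass(frozen=True)
class ResultSketch:
    run_dir: Path
    hindcast_fold_labels: tuple
    forecast_init_dates: tuple

    @classmethod
    def from_run_dir(cls, run_dir: Path, experiment_key: str) -> "ResultSketch":
        # Fold labels come from {run_dir}/models/{experiment_key}/{fold_label}/.
        folds = _subdir_names(run_dir / "models" / experiment_key)
        # Forecasts come from {run_dir}/forecast/{init_date}/.
        forecasts = _subdir_names(run_dir / "forecast")
        # Aggregate invariant: forecasts need the production hindcast slice.
        if forecasts and "production" not in folds:
            raise ValueError("forecast slices require a production hindcast slice")
        return cls(run_dir, folds, forecasts)
```

Nothing is loaded into memory during discovery; the handle only records what exists on disk, leaving materialisation to the readers.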
Package import DAG (DOMAIN_MODEL2.md §8)
Eight layers (root to leaf):
| Layer | Contents |
|---|---|
| L1 root | config.py (ExperimentConfig + nested config classes) |
| L2 pure utilities | lib/path_utils.py, lib/transform_utils.py, lib/unit_utils.py |
| L3 cross-cutting helpers | lib/tracking/, lib/reference_data/, lib/edit_and_imputation/, lib/geo/ |
| L4 aggregate-root + slices | lib/results/ (results_slice.py → run_result.py) |
| L5 domain services | features/, models/, delivery/, diagnostics/ |
| L6 execution-frame | run/ (preflight, experiment_protocol, runner) |
| L7 stage orchestration | stages/run_*.py |
| L8 entry points (leaves) | cli.py, app/ |
Layers rule: "A module SHALL only import from layers closer to the root, with one explicit exception: stages/ modules MAY compose sibling stages/ modules." Runtime cycles are forbidden; if TYPE_CHECKING: blocks permitted only where a signature would not otherwise resolve.
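The layers rule can be expressed as a small predicate over import edges. The prefix table is transcribed from the layer table above; `layer_of` and `import_allowed` are hypothetical helpers, and rejecting imports between two modules of the same non-stage layer is a strict reading that the quoted rule does not spell out (it is silent on intra-package imports such as those inside lib/results/).

```python
LAYER_PREFIXES = [
    ("config.py", 1),
    ("lib/path_utils.py", 2), ("lib/transform_utils.py", 2), ("lib/unit_utils.py", 2),
    ("lib/tracking/", 3), ("lib/reference_data/", 3),
    ("lib/edit_and_imputation/", 3), ("lib/geo/", 3),
    ("lib/results/", 4),
    ("features/", 5), ("models/", 5), ("delivery/", 5), ("diagnostics/", 5),
    ("run/", 6),
    ("stages/", 7),
    ("cli.py", 8), ("app/", 8),
]


def layer_of(module_path: str) -> int:
    """Map a package-relative module path to its layer number (1 = root)."""
    for prefix, layer in LAYER_PREFIXES:
        if module_path.startswith(prefix):
            return layer
    raise KeyError(f"no layer known for {module_path!r}")


def import_allowed(importer: str, imported: str) -> bool:
    """True if `importer` may import `imported` under the layers rule."""
    if importer.startswith("stages/") and imported.startswith("stages/"):
        return True  # explicit exception: stages may compose sibling stages
    # ASSUMPTION: same-layer imports outside stages/ are rejected here.
    return layer_of(imported) < layer_of(importer)


print(import_allowed("stages/run_hindcast.py", "stages/run_fit.py"))         # True
print(import_allowed("cli.py", "config.py"))                                 # True
print(import_allowed("delivery/conversions.py", "stages/run_meta_models.py"))  # False
```

The last edge is rejected because it points from layer 5 up to layer 7; it is the kind of upward edge this document lists as layering tech-debt.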
Cycle audit: no runtime cycles. Two if TYPE_CHECKING: blocks: cli.py:32-33 (ExperimentConfig), lib/results/results_slice.py:32-37 (AbstractDetrend, AbstractRegressionImpl). Two known upward edges from delivery/ into the orchestrator layer are tracked as cleanup tech-debt (not cycles).
Open questions (DOMAIN_MODEL.md §9)
| Topic | Question |
|---|---|
| `marketing_year` | Should be collapsed into season_year; currently a parallel concept (WASDE Oct-Sep period). |
| Layering tech-debt | delivery/conversions.py imports conformal helpers from stages/run_meta_models.py; should move to lib/. |
| Forecast resumption | No per-fold checkpoint for hindcast walk-forward. Is per-fold resumption worth the complexity? |
| Custom exception hierarchy | Codebase uses only stdlib ValueError/SystemExit. |
| MLflow per-fold runs | Currently one MLflow run per pipeline invocation. Should folds be sub-runs? |
| Sub-commodity wheat | Preprocessor only emits WHEAT; sub-types in config are not produced. |
Schema.yaml and gen_linkml_schema.py
schema.yaml is the machine-readable structural export: 42 classes, 21 enums. It is regenerated by running regen_schema.sh.
Sub-packages skipped during generation: app/ (Streamlit import side-effects), tests/, scripts/. Literal[...] annotations are promoted to inline LinkML enums.
gen_linkml_schema.py is a codebase-agnostic CLI tool (also shipped by the /domain-modeling skill at ~/.claude/skills/domain-modeling/scripts/gen_linkml_schema.py). CLI flags: --package, --skip, --set-env, --schema-id, --output. Determinism guaranteed: same source → byte-identical YAML via sorted keys.
Downstream codegen from schema.yaml: JSON Schema (gen-json-schema), Markdown docs (gen-markdown), ER diagram (gen-erdiagram). Recommended CI: regen produces non-empty diff → fail the build.
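The discovery step of such a generator can be sketched with the standard library. This is not the actual `gen_linkml_schema.py`: the function name `discover` and its return shape are hypothetical, Pydantic `BaseModel` detection and the LinkML emission are omitted, and sorted output is used here to keep results deterministic.

```python
import dataclasses
import enum
import importlib
import inspect
import pkgutil


def discover(package_name: str, skip: tuple = ()) -> dict:
    """Find dataclasses and Enum subclasses defined under a package."""
    root = importlib.import_module(package_name)
    names = [package_name] + [
        info.name
        for info in pkgutil.walk_packages(root.__path__, prefix=package_name + ".")
    ]
    found_dataclasses, found_enums, import_failures = set(), set(), []
    for name in sorted(names):
        if any(name == s or name.startswith(s + ".") for s in skip):
            continue  # e.g. sub-packages with import side-effects
        try:
            module = importlib.import_module(name)
        except Exception:
            import_failures.append(name)
            continue
        for _, obj in inspect.getmembers(module, inspect.isclass):
            if obj.__module__ != name:
                continue  # ignore classes merely re-exported by this module
            qualified = f"{name}.{obj.__name__}"
            if issubclass(obj, enum.Enum):
                found_enums.add(qualified)
            elif dataclasses.is_dataclass(obj):
                found_dataclasses.add(qualified)
    return {
        "dataclasses": sorted(found_dataclasses),
        "enums": sorted(found_enums),
        "import_failures": import_failures,
    }
```

Sorting module names and class names before emitting anything is one simple way to get the same output from the same source on every run, which is what a "regen produces a diff" CI check relies on.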
Notable claims (the load-bearing ones)
- `GeoIdentifier` is a `NewType("GeoIdentifier", str)` alias, not a class. At runtime it is a plain `str` matching `^ADM0:[a-z0-9]+(/ADM1:.+(/ADM2:.+)?)?$`. The ADM level is inferred from the prefix when needed; it is never stored as a separate field.
- `ExperimentConfig` has no `run_dir` field, only `run_dir_base`; `run_dir` is owned by `ExperimentResult` and constructed from `run_dir_base / experiment_name / <timestamp>`.
- `ExperimentResult` is a frozen dataclass handle: it carries NO computed results; disk is the contract.
- `fold_label = "production"` is the no-holdout fit: a canonical string, not a numeric year.
- `AbstractSlice` is `@runtime_checkable`: consumer type annotations should be widened to `AbstractSlice` only when the body touches protocol-surface members exclusively.
- The canonical cutoff term is `cutoff` (from Nixtla/Prophet/FPP3), not `window` or `split`. For numeric folds, `fold_label` is literally a cutoff year.
- `schema.yaml` must never be hand-edited; only `regen_schema.sh` writes it.
- `marketing_year` is an acknowledged imperfect concept, tracked as an open question for collapse into `season_year`.
- `ForecastSlice` cannot exist without a `production` `HindcastSlice`; this is an aggregate invariant.
- Feature matrix lives at `data_root/features/{experiment_key}/`, not under `run_dir`; slices reach it via `_load_config(run_dir)`.
What this document is NOT
The in-package domain model does not describe the CLI surface in full (that is README.md), the EARS-format design contracts (that is DESIGN.md), or the backlog (that is TODO.md). The schema.yaml does not carry editorial meaning — only structure.
Cross-references
- DESIGN.md: EARS clauses that encode the invariants as requirements
- README.md: operator guide including the authoritative `run_dir` layout
- features_README.md: `assemble` orchestrator that writes `fit.parquet`, `pred.parquet`, `metadata.json`
- features_builders_README.md: builder protocol satisfying the `(year, geo_identifier, init_date)` key contract