Entity Catalogue — commodity_hindcast¶

Notes on canonical vocabulary¶

The orchestrator's seed list is accurate for most entries. The adjustments and confirmations below follow from reading the code.

ForecastConfig is not a separate aggregate. It is a subordinate Pydantic model composed inside ExperimentConfig (field forecast: ForecastConfig | None). It holds only three fields (raw_obs_filepath, materialised_climo_filepath, init_date). It is listed under Tier 2 as a component config, not a root.

ReferenceYieldSpec replaces NassSpec/WasdeSpec/ConabSpec from the seed list. The actual discriminated union in lib/reference_data/loader.py:59 is WasdeRefSpec | ConabFinalRefSpec | ConabLevantamentoRefSpec — three variants, not two, and NASS is not a spec at all (it is a builder, loaded by YieldsBuilder). The seed list's NassSpec has no code backing.

ConformalExperiment does not exist in code. The seed list mentions it, but the module models/meta_models/conformalise.py defines only CalibrationResult. The experiment-level entry point is a module-level function (apply_conformal), not a class. The entity is dropped; CalibrationResult is kept.

FoldSchedule belongs to the dashboard layer (app/_dashboard_config.py), not to the core domain. It is noted but not given a full Tier 5 entry; the dashboard scope is out of the pipeline domain model.

HindcastDelivery and DeliveryRow live in Tier 5. The seed list placed them there as "Delivery & validation"; that is correct and is retained.

GeoIdentifier is a NewType, not a class. It is documented as a scalar value in the domain but does not warrant an entity entry. It is captured under the value-objects note inside the Region entry.

Fold in the seed list is represented in code by fold_label: str. There is no Fold class. The concept is carried by HindcastSlice (which owns a fold_label), and the Tier 1 Fold entry documents it as a value object rather than a class.

Tier 1 — Core domain entities¶

Commodity¶

Definition: An agricultural crop being modelled (corn, soybean, wheat, cotton, plus non-US extensions such as brazil_soybean). Commodity is the root discriminator for all crop-specific constants — calendar, feature columns, yield units, and plausibility bounds. In the code it manifests as the CommodityConfig.commodity: str identity field, not a separate class.

Cardinality: One instance per YAML config file loaded. Practically five active commodities (corn, soybean, wheat, cotton, brazil_soybean), each with its own configs/{commodity}_experiment.yaml.

Key attributes: commodity (str, identity), season_start (MonthDay), harvest_season_doy (int), hindcast_init_season_doys (tuple[int, …]), yield_range (tuple[float, float] in delivery units), delivery_unit (str), bushel_weight_lbs (float), feature_cols (list[str]), target_col (str).

Source citations: config.py:271 (CommodityConfig class); config.py:283 (country_code field); config.py:298 (yield_range field).

Notes: country_code (default "USA") controls the ADM0 segment of every geo_identifier minted for this commodity's features. The wheat commodity config lists sub-type labels (WINTER_WHEAT, etc.) that are not produced by the preprocessor — an open issue in DOMAIN_MODEL2.md §9.

SeasonYear¶

Definition: The crop-year label — an integer identifying the harvest year. Paired with (commodity, season_doy) it locates any point within a growing season. Used as the primary grouping key in feature parquets, fold labels, and delivery CSVs.

Cardinality: One integer value per harvest year modelled. Hindcast runs span feature_start_year..feature_end_year (config.py:641).

Key attributes: Integer year (e.g. 2023); resolved via CommodityConfig.season_start_date(season_year) to a calendar date. Cross-year crops (winter wheat) use season_start_year_offset to shift the calendar anchor.

Source citations: config.py:285 (season_start_year_offset); config.py:360 (season_start_date method); DOMAIN_MODEL.md §1 (temporal vocabulary).

Notes: season_doy (int, days since season start, can exceed 366) is distinct from calendar_doy (1–366). A SeasonYear maps to the half-open calendar interval [season_start_date(y), harvest_date(y)).
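
The mapping from (season_year, season_doy) to a calendar date can be sketched as below. This is an illustration only, not the code in config.py: the function signatures, the treatment of season_doy as a zero-based day offset, and the example calendar (1 October start with season_start_year_offset = -1) are assumptions.

```python
from datetime import date, timedelta


def season_start_date(season_year: int, start_month: int, start_day: int,
                      year_offset: int = 0) -> date:
    """Calendar date on which the season labelled `season_year` begins."""
    return date(season_year + year_offset, start_month, start_day)


def to_date(season_doy: int, season_year: int, start_month: int,
            start_day: int, year_offset: int = 0) -> date:
    """Map a season day-of-year (assumed zero-based offset) to a calendar date."""
    start = season_start_date(season_year, start_month, start_day, year_offset)
    return start + timedelta(days=season_doy)


# Hypothetical cross-year crop: the 2023 season starts on 1 October 2022.
print(to_date(100, 2023, 10, 1, year_offset=-1))  # 2023-01-09
# season_doy is not capped at 366, so late offsets still resolve to a date.
print(to_date(400, 2023, 10, 1, year_offset=-1))
```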

InitDate¶

Definition: A specific calendar date on which a within-season forecast is issued, formatted ISO YYYY-MM-DD. Features in pred.parquet are known up to init_date − lag_days. In hindcast mode, init_dates form a weekly grid derived from CommodityConfig.hindcast_init_season_doys; in forecast mode a single runtime-injected date overrides the grid.

Cardinality: Multiple per season_year (weekly grid, typically ~30 per crop season); one per ForecastSlice.

Key attributes: Calendar date (Python date). Resolved via CommodityConfig.to_date(sdoy, season_year). Stored as YYYY-MM-DD string in delivery CSVs and as a column in parquet feature tables.

Source citations: config.py:668 (init_dates_for method); config.py:375 (hindcast_init_dates method); DOMAIN_MODEL.md §1 (init_date definition).

Notes: lag_days (default 1) separates the issue date from the last included observation. A harvest-init training row overrides to lag 0.
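
As a rough sketch of the weekly grid and the lag rule, the snippet below builds init dates from a tuple of season doys and computes the last observation each forecast may use. The helper names, the grid bounds (0 to 203) and the 1 April season start are hypothetical; only the weekly spacing and the lag_days default of 1 come from the text above.

```python
from datetime import date, timedelta


def weekly_season_doys(first: int, last: int) -> tuple[int, ...]:
    """Season doys spaced seven days apart, inclusive of `first`."""
    return tuple(range(first, last + 1, 7))


def last_observation_date(init_date: date, lag_days: int = 1) -> date:
    """Latest observation date visible to a forecast issued on `init_date`."""
    return init_date - timedelta(days=lag_days)


season_start = date(2023, 4, 1)          # hypothetical season start
doys = weekly_season_doys(0, 203)        # 30 weekly init points
init_dates = [season_start + timedelta(days=d) for d in doys]

print(len(init_dates), init_dates[0], init_dates[-1])
print(last_observation_date(init_dates[1]))              # default lag of 1 day
print(last_observation_date(init_dates[-1], lag_days=0)) # harvest-init style lag 0
```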

Region (ADM0 / ADM1 / ADM2)¶

Definition: A geographic administrative unit at one of three levels — ADM0 (country), ADM1 (state), ADM2 (county). Identified by a GeoIdentifier — a NewType("GeoIdentifier", str) alias matching ^ADM0:[a-z0-9]+(/ADM1:.+(/ADM2:.+)?)?$. The level is inferred from the prefix; no separate field is stored.

Cardinality: ~3,000–4,000 county-level GeoIdentifiers per commodity in the full panel; reduced to the top-95%-by-production subset included_geo_identifiers (~800–1,200) for model training. ADM0 and ADM1 are derived by aggregation.

Key attributes: GeoIdentifier string (canonical identity). AggregationLevel: Literal["ADM0","ADM1","ADM2"] (lib/geo/aggregation.py:10). included_geo_identifiers: frozenset[str] (persisted in run_dir/included_geo_identifiers.txt by ExperimentResult.save_included_geo_identifiers).

Source citations: DOMAIN_MODEL.md §1 (geo vocabulary); lib/results/run_result.py:157 (save_included_geo_identifiers); features/builders/interface.py:21 (INDEX_COLS — geo_identifier is the second column).

Notes: No FIPS codes, no mixed case — the canonical identifier is the only representation used across all parquets, CSVs, and in-memory joins. Conversions from state/county names happen exclusively in lib/geo/identifiers.py.
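
To make the identifier rule concrete, here is a small sketch that validates a string against the pattern quoted in the Definition and infers its level from the prefix. The function name and the example identifiers are made up for illustration; the real helpers live in lib/geo/identifiers.py and may differ.

```python
import re

# Pattern copied from the Region definition above.
GEO_ID_PATTERN = re.compile(r"^ADM0:[a-z0-9]+(/ADM1:.+(/ADM2:.+)?)?$")


def aggregation_level(geo_identifier: str) -> str:
    """Return "ADM0", "ADM1" or "ADM2" for a valid identifier, else raise."""
    if GEO_ID_PATTERN.fullmatch(geo_identifier) is None:
        raise ValueError(f"not a canonical geo_identifier: {geo_identifier!r}")
    if "/ADM2:" in geo_identifier:
        return "ADM2"
    if "/ADM1:" in geo_identifier:
        return "ADM1"
    return "ADM0"


# Hypothetical identifiers, lower-case as the Notes require.
for gid in ("ADM0:usa", "ADM0:usa/ADM1:iowa", "ADM0:usa/ADM1:iowa/ADM2:story"):
    print(gid, "->", aggregation_level(gid))
```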

Yield¶

Definition: A scalar crop yield measurement or prediction. Internally always in yield_kg_ha (kilograms per hectare); converted to delivery units (bu/ac for grains, lbs/ac for cotton) only at the delivery/ boundary. The target column in training is target_col (typically yield_kg_ha); the detrended residual is target_detrended_col.

Cardinality: One value per (geo_identifier, season_year, init_date) triple in prediction outputs; one per (geo_identifier, season_year) in fit.parquet.

Key attributes: yield_kg_ha (internal), sim_yield_kg_ha (simulated), obs_yield_kg_ha (observed), area_harvested_ha (area weight), production_kg (area × yield). Delivery columns: mean, lower_{50,68,80,90,95}, upper_{50,68,80,90,95}.

Source citations: DOMAIN_MODEL.md §1 (unit vocabulary); delivery/schemas.py:109 (DeliveryRow.mean); config.py:302 (CommodityConfig.target_col).

Notes: Unweighted mean of yields is forbidden — all national aggregations must weight by area_harvested_ha. CI band ordering lower_95 ≤ … ≤ mean ≤ … ≤ upper_95 is a hard invariant enforced by DeliveryRow._validate_ci_ordering.
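
The two rules above (area weighting and conversion only at the delivery boundary) can be illustrated as follows. The function names are hypothetical, and the 56 lb bushel is used purely as an example value for bushel_weight_lbs; the unit constants are the standard pound and acre definitions.

```python
KG_PER_LB = 0.45359237        # exact definition of the pound
HA_PER_ACRE = 0.40468564224   # exact definition of the international acre


def area_weighted_yield(yields_kg_ha, areas_ha):
    """Aggregate yield as total production divided by total harvested area."""
    total_area = sum(areas_ha)
    if total_area <= 0:
        raise ValueError("total harvested area must be positive")
    production_kg = sum(y * a for y, a in zip(yields_kg_ha, areas_ha))
    return production_kg / total_area


def kg_ha_to_bu_ac(yield_kg_ha: float, bushel_weight_lbs: float) -> float:
    """Convert kg/ha to bu/ac for a crop with the given bushel weight."""
    lbs_per_acre = yield_kg_ha / KG_PER_LB * HA_PER_ACRE
    return lbs_per_acre / bushel_weight_lbs


yields = [10000.0, 5000.0]   # two regions, kg/ha
areas = [3.0, 1.0]           # harvested area, ha
national = area_weighted_yield(yields, areas)
print(national)                          # 8750.0, not the unweighted 7500.0
print(kg_ha_to_bu_ac(national, 56.0))    # delivery-unit value for a 56 lb bushel
```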

Fold¶

Definition: A walk-forward cross-validation fold identified by fold_label: str. Numeric labels (e.g. "2020") mean "train on years < 2020, test on 2020"; the literal "production" means the no-holdout fit on all available data. fold_label is a filesystem key — there is no Fold class.

Cardinality: One per test year in ExperimentProtocolConfig.test_years, plus one "production" fold per commodity run.

Key attributes: fold_label (str, identity as filesystem directory component), cutoff (derived date: date(int(fold_label), 1, 1) for numeric; date(feature_end_year + 1, 1, 1) for production).

Source citations: lib/results/results_slice.py:151 (HindcastSlice.cutoff); DOMAIN_MODEL.md §7.5 (canonical cutoff naming); config.py:456 (ExperimentProtocolConfig.test_years).

Notes: The term "cutoff" (adopted from Nixtla / Prophet / Hyndman FPP3) is the canonical name for the moment dividing known past from predicted future — the sharpest unifier across hindcast folds and forecast init_dates.
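
The cutoff rule stated under Key attributes is small enough to write out directly. The function below is a sketch with a hypothetical name and signature; the real logic is the HindcastSlice.cutoff property.

```python
from datetime import date


def fold_cutoff(fold_label: str, feature_end_year: int) -> date:
    """Date separating known past from predicted future for one fold."""
    if fold_label == "production":
        # No holdout: everything up to the end of the feature range is known.
        return date(feature_end_year + 1, 1, 1)
    # Numeric label: train on years before it, test on that year.
    return date(int(fold_label), 1, 1)


print(fold_cutoff("2020", feature_end_year=2023))        # 2020-01-01
print(fold_cutoff("production", feature_end_year=2023))  # 2024-01-01
```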

Tier 2 — Configuration aggregates¶

ExperimentConfig¶

Definition: The frozen, validated root configuration object. Resolves all paths against data_root (INPUT_DATA_DIR), holds every subordinate config block, and is passed as the single config argument to every stage. Inherits pydantic_settings.BaseSettings; resolution order is CLI > env vars > YAML > defaults.

Cardinality: Singleton per pipeline invocation. One config_resolved.yaml is written to run_dir and re-parsed lazily by slice properties.

Key attributes: experiment_name (str, slug-safe regex [a-zA-Z0-9_-]+), data_root (AnyPath from INPUT_DATA_DIR), commodity (CommodityConfig), experiment_protocol (ExperimentProtocolConfig), model (ModelConfig), reference_data (list[ReferenceYieldSpec]), postprocess (PostprocessConfig), delivery (DeliveryConfig), forecast (ForecastConfig | None), feature_start_year, feature_end_year, random_seed, mlflow_tracking_uri.

Source citations: config.py:573 (ExperimentConfig class); config.py:600 (experiment_name validator); config.py:604 (data_root field); config.py:756 (_fill_defaults_from_data_root validator).

Notes: forecast is None ⇒ hindcast mode; forecast set ⇒ forecast mode. The config carries build_detrender() and build_regressor() factory methods as a pragmatic convenience (flagged as a TODO in the source at config.py:682). All ResolvablePath fields are resolved at construction time via _resolve_data_paths.

CommodityConfig¶

Definition: Source of truth for all commodity-specific constants: crop calendar, builder specs, feature/target column names, unit weights, and plausibility bounds. Frozen Pydantic model nested inside ExperimentConfig.

Cardinality: One per ExperimentConfig (1:1).

Key attributes: See Tier 1 Commodity entry. Additional: climo_windows (tuple[SeasonWindow, …]), weather_windows (tuple[SeasonWindow, …]), weather_vars, climo_zscore_vars, actuals_source_short (default "NASS"), auxiliary_cols, freeze_cap_sdoy.

Source citations: config.py:271 (class); config.py:337 (builders field); config.py:339 (_inject_builder_type_from_key validator).

Notes: The builders dict validator auto-injects the dict key as the discriminator type field so YAML configs need not repeat type: under each builder entry.
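
The effect of that validator can be approximated on plain dictionaries, as in the sketch below. The function name, the example file paths and the choice to keep an explicitly supplied type are assumptions; the actual implementation is the Pydantic validator cited above.

```python
def inject_builder_type_from_key(builders: dict) -> dict:
    """Copy each builder spec and add the dict key as its `type` discriminator."""
    injected = {}
    for key, spec in builders.items():
        spec = dict(spec)             # do not mutate the caller's mapping
        spec.setdefault("type", key)  # assumption: an explicit type is kept as-is
        injected[key] = spec
    return injected


# Roughly what a parsed YAML `builders:` block might look like (paths are made up).
raw = {
    "yields": {"filepath": "yields.parquet"},
    "weather": {"filepath": "weather.parquet"},
}
print(inject_builder_type_from_key(raw))
```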

ModelConfig¶

Definition: Selects the detrend strategy and regression estimator for the FIT stage. Frozen, with extra="forbid" to prevent silent config drift.

Cardinality: One per ExperimentConfig.

Key attributes: detrend: Literal["linear_state","gaussian_state","partial_pooling"] (default "partial_pooling"), regression: Literal["ridge","xgboost","pca_ridge"] (default "ridge"), detrend_params, regression_params, weather_correction_fit_level: Literal["ADM0","ADM1","ADM2"] | None, use_sample_weights: bool, weight_column (default "area_harvested_ha").

Source citations: config.py:464 (class); config.py:491 (_validate_sample_weight_usage — use_sample_weights=True requires weather_correction_fit_level="ADM2").
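
The sample-weight constraint can be pictured with a frozen dataclass standing in for the Pydantic model. This is a sketch: the class name is hypothetical, the defaults for use_sample_weights and weather_correction_fit_level are assumed, and the params fields are omitted.

```python
from dataclasses import dataclass
from typing import Literal, Optional


@dataclass(frozen=True)
class ModelConfigSketch:
    detrend: Literal["linear_state", "gaussian_state", "partial_pooling"] = "partial_pooling"
    regression: Literal["ridge", "xgboost", "pca_ridge"] = "ridge"
    weather_correction_fit_level: Optional[Literal["ADM0", "ADM1", "ADM2"]] = None  # assumed default
    use_sample_weights: bool = False  # assumed default
    weight_column: str = "area_harvested_ha"

    def __post_init__(self) -> None:
        # Mirrors the rule cited above: weights only make sense at county level.
        if self.use_sample_weights and self.weather_correction_fit_level != "ADM2":
            raise ValueError(
                "use_sample_weights=True requires weather_correction_fit_level='ADM2'"
            )


print(ModelConfigSketch())
print(ModelConfigSketch(use_sample_weights=True, weather_correction_fit_level="ADM2"))
```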

FeatureBuilderConfig (and concrete variants)¶

Definition: Discriminated union Builder over five concrete builder configs, each specifying how to load one data source into a parquet keyed by (year, geo_identifier, init_date). Common base is BaseBuilderConfig.

Cardinality: One entry per source in CommodityConfig.builders dict (1:N from CommodityConfig). Typically four to five builders per commodity.

Key attributes (base): filepath (ResolvablePath), geo_id_col, required_for_pred_parquet (bool — inner join if True, left join if False).

Variants and discriminator values:

| Class | `type` |
| --- | --- |
| `YieldsBuilder` | `yields` |
| `WeatherBuilder` | `weather` |
| `ClimoBuilder` | `climo` |
| `NDVIBuilder` | `ndvi` |
| `StressBuilder` | `stress` |

YieldsBuilder additionally carries edits: list[EditRuleConfig] (Fellegi-Holt edit rules applied before the feature pivot).

Source citations: config.py:165 (BaseBuilderConfig); config.py:187 (YieldsBuilder); config.py:247 (Builder union); config.py:213 (StressBuilder); config.py:200 (AssembleStressConfig).

Notes: AssembleStressConfig is a nested value object on StressBuilder that controls regenerating the stress parquet from a raw indices zarr.

ReferenceYieldSpec¶

Definition: Discriminated union over external reference-yield loader specifications. Each spec drives one column prefix of benchmarks (metrics tables, delivery columns, plot traces). Parsed from ExperimentConfig.reference_data: list[ReferenceYieldSpec].

Cardinality: Zero or more per ExperimentConfig; each name must be unique within the list (validated at config load, config.py:799).

Variants:

| Class | `kind` | Source |
| --- | --- | --- |
| `WasdeRefSpec` | `wasde` | USDA WASDE in-season national estimates |
| `ConabFinalRefSpec` | `conab_final` | CONAB final Brazil soybean yield |
| `ConabLevantamentoRefSpec` | `conab_levantamento` | CONAB levantamento (in-season release) |

Source citations: lib/reference_data/loader.py:59 (ReferenceYieldSpec union); lib/reference_data/loader.py:51 (spec-to-loader registration table).

Notes: NassSpec from the orchestrator's seed list does not exist — NASS yield is loaded as a feature via YieldsBuilder, not via this union. This is a deviation from the seed vocabulary.

BiasCorrectorConfig¶

Definition: Configures the national-scale residual bias corrector fitted during POSTPROCESS.

Cardinality: One per PostprocessConfig.

Key attributes: kind: Literal["none","coverage"] — selects NoBiasCorrector (identity) or CoverageBiasCorrector (NASS in/out-county gap); n_lookback_years, reduction_method.

Source citations: config.py:501 (class).

ConformalConfig (PostprocessConfig.conformalise)¶

Definition: Tuple of conformal calibration modes to fit and persist during POSTPROCESS. The first element is the primary mode whose half-widths populate delivery CSVs; later elements exist as diagnostic sidecars.

Cardinality: One tuple per PostprocessConfig; at least one element required.

Key attributes: conformalise: tuple[Literal["hindcast_oos_per_init_date","hindcast_oos_per_year","hindcast_oos_fully_pooled","in_sample_pooled"], …] (default: ("hindcast_oos_per_init_date",)).

Source citations: config.py:518 (PostprocessConfig); config.py:532 (conformalise field); models/meta_models/conformalise.py:64 (ResidualMode literal).

EditRuleConfig¶

Definition: Discriminated union of Fellegi-Holt edit rules applied to raw survey data before feature assembly. Each rule detects a condition (ratio out of tolerance, value out of range, null) then applies a corrective operation (impute, clip, flag, drop, fail, panel trailing median).

Cardinality: Zero or more per YieldsBuilder.edits list; applied sequentially in YAML declaration order.

Variants and operations:

| Rule | `kind` | Fires when |
| --- | --- | --- |
| `RatioEditRule` | `ratio_edit` | `target/derive ∉ [1/tol, tol]` |
| `RangeEditRule` | `range_edit` | `target ∉ [min, max]` |
| `NullImputeRule` | `null_impute` | `target.isna()` |
| `PanelNullImputeRule` | `panel_null_impute` | `target.isna()` (panel-aware) |

Each rule's on_fail field holds an EditOperation — a discriminated union of corrective actions applied when the detection condition fires. The corrective-action members are:

| Class | `edit.py` line | Effect |
| --- | --- | --- |
| `DeductiveImpute` | 91 | Replace `target` with `derive / ratio` (ratio rule only) |
| `Clip` | 113 | Clamp `target` to `[min, max]` |
| `Flag` | 149 | Add boolean column `{target}_flagged`; keep value |
| `Drop` | 165 | Drop the row entirely |
| `Fail` | 181 | Raise `ValueError` naming the offending rows |
| `PanelTrailingMedian` | 201 | Replace with geo-grouped trailing median (window configurable) |

In type terms, EditOperation = DeductiveImpute | Clip | Flag | Drop | Fail | PanelTrailingMedian, attached via _EditRuleBase.on_fail. It must not be confused with the detection-rule union EditRuleConfig, which has four members: RatioEditRule, RangeEditRule, NullImputeRule, PanelNullImputeRule.
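
The detect-then-correct split can be illustrated with a range rule and four of the operations from the table. This is a toy version over lists of dicts, not the code in edit.py: the function name, the string-valued on_fail argument and the example rows are all assumptions.

```python
def apply_range_rule(rows, target, lo, hi, on_fail="clip"):
    """Apply a range check to `target`, handling violations per `on_fail`."""
    if on_fail not in {"clip", "flag", "drop", "fail"}:
        raise ValueError(f"unknown on_fail operation: {on_fail!r}")
    out = []
    for row in rows:
        row = dict(row)  # copy so the input rows are left untouched
        value = row[target]
        fires = not (lo <= value <= hi)  # detection: target outside [lo, hi]
        if on_fail == "flag":
            row[f"{target}_flagged"] = fires  # keep the value, record the hit
        elif fires and on_fail == "clip":
            row[target] = min(max(value, lo), hi)
        elif fires and on_fail == "drop":
            continue
        elif fires and on_fail == "fail":
            raise ValueError(f"{target} out of range [{lo}, {hi}]: {row}")
        out.append(row)
    return out


rows = [{"geo": "a", "yield_kg_ha": 9000.0}, {"geo": "b", "yield_kg_ha": 99000.0}]
clipped = apply_range_rule(rows, "yield_kg_ha", 0.0, 20000.0, on_fail="clip")
flagged = apply_range_rule(rows, "yield_kg_ha", 0.0, 20000.0, on_fail="flag")
print(clipped)
print(flagged)
```

Rules in a YieldsBuilder.edits list run in declaration order, so in this toy form each call would feed its output to the next.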

Source citations: lib/edit_and_imputation/edit.py:361 (EditRuleConfig union); lib/edit_and_imputation/edit.py:370 (EditReport dataclass).

ForecastConfig¶

Definition: Forecast-time path specifications plus the runtime-injected init_date. Present only when ExperimentConfig.forecast is not None; its presence switches the pipeline into forecast mode.

Cardinality: Zero or one per ExperimentConfig.

Key attributes: raw_obs_filepath (ResolvablePath), materialised_climo_filepath (ResolvablePath), init_date: date | None (injected at runtime by build_forecast_features).

Source citations: config.py:552 (class).

Tier 3 — Pipeline artefacts¶

RunDir¶

Definition: The on-disk root directory for a single experiment run. The sole hand-off contract between pipeline stages — no in-memory objects cross stage boundaries. Identified by a timestamped path under run_dir_base / experiment_name /.

Cardinality: One per pipeline invocation. Immutable once created; stages write atomically (temp then rename).

Key attributes (directory layout):

{run_dir}/
  config_resolved.yaml
  included_geo_identifiers.txt
  models/{commodity}/{fold_label}/  — detrender.pkl, model.*, feature_fill_values.parquet
  preds/{commodity}/{fold_label}/   — train_preds, walk_forward_preds, year_data
  postprocessed/national.parquet
  postprocessed/{commodity}/{fold_label}/bias_corrector.pkl
  conformal/{mode}.parquet
  reports/
  delivery/Treefera_{commodity}_{ADM}_Hindcast_{YYYYMMDD}.csv
  forecast/{season_year}/{init_date}/  — see ForecastSlice

Source citations: DOMAIN_MODEL.md §1 (run_dir definition); lib/results/run_result.py:40 (from_run_dir); lib/results/results_slice.py:112 (HindcastSlice path properties).

Notes: has_postprocessed and has_walk_forward_preds are boolean existence checks on specific paths, not stored state fields; stage sequencing is enforced by artefact dependencies alone.
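
The "temp then rename" write and the path-existence checks might look like the sketch below. It covers local paths only (the real code also handles CloudPath), and both function bodies are assumptions; only the postprocessed/national.parquet location comes from the layout above.

```python
import os
import tempfile
from pathlib import Path


def write_atomically(path: Path, data: bytes) -> None:
    """Write to a temp file in the target directory, then rename into place."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
        os.replace(tmp_name, path)  # readers see the old file or the new one
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


def has_postprocessed(run_dir: Path) -> bool:
    """Stage readiness is just 'does the artefact exist'."""
    return (run_dir / "postprocessed" / "national.parquet").exists()


with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp)
    print(has_postprocessed(run_dir))   # False before POSTPROCESS has written
    write_atomically(run_dir / "postprocessed" / "national.parquet", b"placeholder")
    print(has_postprocessed(run_dir))   # True afterwards
```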

ExperimentResult¶

Definition: Frozen dataclass that is a lazy handle to all artefacts under one run_dir. The aggregate root for the Experiment bounded context. Holds ExperimentConfig plus optional tuples of HindcastSlice and ForecastSlice. Carries no computed data in memory — disk is the contract.

Cardinality: One per run_dir. Reconstructed on demand by ExperimentResult.from_run_dir(run_dir).

Key attributes: config (ExperimentConfig), hindcast_slices (tuple[HindcastSlice, …]), forecast_slices (tuple[ForecastSlice, …]), run_dir (Path). Properties: features_fit_path, features_pred_path, has_postprocessed, has_walk_forward_preds, production (HindcastSlice | None), postprocessed_national_path.

Source citations: lib/results/run_result.py:31 (class); lib/results/run_result.py:39 (from_run_dir); lib/results/run_result.py:175 (production property).

Notes: Both slice collections are optional — a fresh run_dir with only config_resolved.yaml yields empty tuples. If forecast_slices is non-empty, a production HindcastSlice must exist (enforced by ForecastSlice.training).

HindcastSlice¶

Definition: Lazy handle to one walk-forward fold's artefacts on disk. Satisfies the AbstractSlice protocol. Identity is (run_dir, fold_label). Exposes paths and loaders for every artefact relevant to that fold.

Cardinality: One per fold_label in ExperimentResult.hindcast_slices (N numeric folds + 1 production fold per commodity run).

Key attributes: run_dir (Path), fold_label (str), train_preds_path, model_path, detrender_path, feature_fill_values_path, year_data_path, walk_forward_preds_path, bias_corrector_path (all Path | CloudPath). Derived: cutoff (date), features_fit_path, features_pred_path, has_bias_corrector.

Source citations: lib/results/results_slice.py:112 (class); lib/results/results_slice.py:131 (from_config); lib/results/results_slice.py:151 (cutoff property).

Notes: Feature matrices (fit.parquet, pred.parquet) are shared across all slices in a run — they live in cfg.features_dir / {commodity} /, not under run_dir/. Trained artefacts (model, detrender, fill values) are unique per fold and owned by the slice.
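
A lazy handle of this kind is essentially a frozen dataclass whose properties compute paths from the RunDir layout. The sketch below is hypothetical: the class name is invented, commodity is passed as a field for simplicity (the real slice identity is (run_dir, fold_label)), and only file names that appear in the layout are used.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class HindcastSliceSketch:
    run_dir: Path
    commodity: str   # assumption: the real slice resolves this from the config
    fold_label: str

    @property
    def model_dir(self) -> Path:
        return self.run_dir / "models" / self.commodity / self.fold_label

    @property
    def detrender_path(self) -> Path:
        return self.model_dir / "detrender.pkl"

    @property
    def feature_fill_values_path(self) -> Path:
        return self.model_dir / "feature_fill_values.parquet"

    @property
    def bias_corrector_path(self) -> Path:
        return (self.run_dir / "postprocessed" / self.commodity
                / self.fold_label / "bias_corrector.pkl")

    @property
    def has_bias_corrector(self) -> bool:
        # Nothing is loaded; the handle only asks the filesystem.
        return self.bias_corrector_path.exists()


fold = HindcastSliceSketch(Path("runs/example"), "corn", "2020")
print(fold.detrender_path)
print(fold.has_bias_corrector)
```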

ForecastSlice¶

Definition: Lazy handle to one in-season forecast identified by (run_dir, commodity, season_year, init_date). Satisfies the AbstractSlice protocol. Holds its own spliced indices and features (unique per init_date) but delegates trained artefacts to the production HindcastSlice via the training property.

Cardinality: One per (season_year, init_date) pair in ExperimentResult.forecast_slices. Multiple (season_year, init_date) pairs can coexist under one run_dir in disjoint subtrees (run_dir/forecast/{season_year}/{init_date}/).

Key attributes: run_dir (Path), commodity (str), season_year (int), init_date (date). Path properties: root, indices_zarr, features_parquet, walk_forward_preds_path, year_data_path, postprocessed_national_path, delivery_csv(level). Delegation: training → production HindcastSlice.

Source citations: lib/results/results_slice.py:299 (class); lib/results/results_slice.py:409 (training property); lib/results/results_slice.py:326 (root path).

Notes: The indices.zarr holds the spliced observed-plus-climatology weather up to init_date; the per-init features/pred.parquet is the forecast feature matrix and never collides with the canonical hindcast copy.

CalibrationResult¶

Definition: Fitted conformal half-widths derived from an experiment's OOS residuals. Persistable dataclass; saves to / loads from a long-format parquet keyed by (fold_year | fold_init_md, level). Exposes predict_interval(init_date, level) for forecast pipelines.

Cardinality: One per (residual_mode, commodity) per run. Written to run_dir/conformal/{mode}.parquet during POSTPROCESS.

Key attributes: residual_mode (ResidualMode literal), method (ConformalMethod), commodity (str), levels (tuple[float, …]), n_residuals (int). Exactly one of per_init_md, per_year, pooled is populated, determined by residual_mode.

Source citations: models/meta_models/conformalise.py:111 (class); models/meta_models/conformalise.py:208 (save); models/meta_models/conformalise.py:217 (load).

Tier 4 — Behavioural roles¶

FeatureBuilder (Protocol)¶

Definition: BuilderFn protocol — a callable (path, cfg, years) → pd.DataFrame. Every concrete builder module exposes one function satisfying this contract. The output must contain INDEX_COLS = ("year", "geo_identifier", "init_date") with no duplicate keys.

Cardinality: One implementation per data source (five: yields, weather, climo, ndvi, stress).

Key attributes (protocol): __call__(path: Path | CloudPath, cfg: ExperimentConfig, years: range) → pd.DataFrame. Validated by validate_builder_output.

Source citations: features/builders/interface.py:25 (BuilderFn protocol); features/builders/interface.py:21 (INDEX_COLS).

Notes: BuilderFn is decorated with runtime_checkable, so tests can assert isinstance(fn, BuilderFn). That check only confirms the object is callable; it does not verify argument or return types. Validation enforces geo_identifier format via assert_valid_geo_identifiers.
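
The protocol-plus-validation pattern can be sketched without pandas by letting a builder return a list of row dicts. Everything here other than INDEX_COLS is an assumption made for illustration: the protocol name, the row representation, the validator and the toy builder.

```python
from typing import Any, Protocol, runtime_checkable

INDEX_COLS = ("year", "geo_identifier", "init_date")


@runtime_checkable
class BuilderFnSketch(Protocol):
    def __call__(self, path: str, cfg: Any, years: range) -> list[dict[str, Any]]: ...


def validate_rows(rows: list[dict[str, Any]]) -> None:
    """Require every index column and reject duplicate index keys."""
    seen = set()
    for row in rows:
        missing = [col for col in INDEX_COLS if col not in row]
        if missing:
            raise ValueError(f"missing index columns: {missing}")
        key = tuple(row[col] for col in INDEX_COLS)
        if key in seen:
            raise ValueError(f"duplicate index key: {key}")
        seen.add(key)


def toy_builder(path: str, cfg: Any, years: range) -> list[dict[str, Any]]:
    """Stand-in builder emitting one row per year for a single region."""
    return [
        {"year": y, "geo_identifier": "ADM0:usa", "init_date": f"{y}-06-01", "value": 1.0}
        for y in years
    ]


print(isinstance(toy_builder, BuilderFnSketch))  # structural check on __call__ only
validate_rows(toy_builder("unused", None, range(2020, 2023)))
```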

Detrender (AbstractDetrend)¶

Definition: Abstract base class for yield trend-removal models. Fits a per-county temporal trend on the training panel and transforms raw yield into a detrended residual; inverts the transformation at prediction time to recover yield-scale predictions.

Cardinality: One fitted instance per fold, persisted as detrender.pkl in models/{commodity}/{fold_label}/.

Key attributes (interface): fit_transform(features) → DataFrame, transform(features) → DataFrame, inverse_transform(features, y_detrended) → Series, fitted_yield_series(features) → Series, save(path), load(path, config) → Self.

Concrete implementations: LinearStateDetrend (linear_state), GaussianWindowStateDetrend (gaussian_state), PartialPoolingDetrend (partial_pooling).

Source citations: models/detrend/base.py:21 (AbstractDetrend); models/detrend/base.py:46 (target_detrended_column).

Regressor (AbstractRegressionImpl)¶

Definition: Abstract base class for the residual regression estimator. Operates on detrended yield residuals after imputation. Fits on feature_cols only; auxiliary columns are carried for weighting/IDs outside the estimator.

Cardinality: One fitted instance per fold, persisted in models/{commodity}/{fold_label}/ alongside the detrender.

Key attributes (interface): fit(X, y, sample_weight=None) → Self, predict(X) → Series, save_model(path), load_model(path) → Self.

Concrete implementations: RidgeRegressor (ridge), PcaRidgeRegressor (pca_ridge), XGBRegressor (xgboost).

Source citations: models/regression/base.py:9 (AbstractRegressionImpl); lib/results/results_slice.py:182 (load_model dispatch logic).

Notes: NaN inputs are rejected at the estimator boundary (nan_policy="raise" is the only permitted policy); imputation must happen upstream in the FIT stage.

MetaModel — BiasCorrector¶

Definition: National-scale residual correction fitted on fold predictions against NASS/CONAB observed yield. Kinds: NoBiasCorrector (no-op, for experiments without bias correction) and CoverageBiasCorrector (scalar correction from in-county / all-county NASS gap over a lookback window). Persisted as bias_corrector.pkl.

Cardinality: One per fold (including production), stored at postprocessed/{commodity}/{fold_label}/bias_corrector.pkl. Presence is optional — has_bias_corrector guards access.

Source citations: config.py:501 (BiasCorrectorConfig); lib/results/results_slice.py:40 (_bias_corrector_path); DOMAIN_MODEL.md §1 (bias_corrector definition).

MetaModel — Conformaliser¶

Definition: Derives conformal prediction intervals from walk-forward OOS residuals. Four residual_mode recipes produce CalibrationResult objects that are saved as parquets and applied at delivery time to populate lower_* / upper_* columns.

Cardinality: One CalibrationResult per (mode, run_dir), written during POSTPROCESS.

Source citations: models/meta_models/conformalise.py:1 (module docstring); config.py:532 (conformalise tuple in PostprocessConfig).
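
For orientation, a fully pooled half-width can be computed with the textbook split-conformal rule on absolute residuals, shown below. The source does not state which ConformalMethod the module uses, so treat the quantile rule, the function names and the residual values as assumptions rather than a description of conformalise.py.

```python
import math


def conformal_half_width(residuals, level: float) -> float:
    """Split-conformal half-width: the ceil((n + 1) * level)-th smallest |residual|."""
    n = len(residuals)
    scores = sorted(abs(r) for r in residuals)
    k = math.ceil((n + 1) * level)
    if k > n:
        return math.inf  # too few residuals to certify this coverage level
    return scores[k - 1]


def predict_interval(mean: float, residuals, level: float) -> tuple[float, float]:
    """Symmetric interval around a point prediction."""
    half_width = conformal_half_width(residuals, level)
    return mean - half_width, mean + half_width


# Made-up out-of-sample residuals (observed minus predicted), pooled over folds.
oos_residuals = [-3.0, 1.0, -2.0, 4.0, 0.5, -1.5, 2.5, 3.5, -0.5]
print(conformal_half_width(oos_residuals, 0.5))   # 2.0
print(predict_interval(170.0, oos_residuals, 0.5))
```

Because the half-width is an order statistic of the same sorted scores, it never shrinks as the level rises, which is what keeps the lower_* and upper_* bands nested.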

ReferenceYieldLoader¶

Definition: Abstract base class plus factory for loading external reference yield time series (WASDE, CONAB). Each ReferenceYieldSpec discriminates to a concrete loader via ReferenceYieldLoader.from_spec(spec, commodity). Returns a DataFrame keyed by marketing_year and release_date.

Cardinality: One per ReferenceYieldSpec entry in cfg.reference_data. Constructed by build_loaders(cfg).

Concrete implementations: WasdeLoader, ConabFinalLoader, ConabLevantamentoLoader.

Source citations: lib/reference_data/loader.py:68 (build_loaders); lib/reference_data/loader.py:51 (loader registration); lib/reference_data/loader.py:59 (ReferenceYieldSpec union).

Tier 5 — Delivery & validation¶

HindcastDelivery¶

Definition: Frozen Pydantic model holding the complete validated delivery dataset for one commodity × ADM level × generated date. Aggregate root of the Delivery bounded context. Enforces structural integrity across rows via model validators.

Cardinality: Three instances per hindcast run (one per ADM level: ADM0, ADM1, ADM2), each written to a separate CSV.

Key attributes: rows: list[DeliveryRow], generated_date: str (ISO YYYY-MM-DD). Invariants: no duplicate (year, init_date, geo_identifier); all (year, geo) groups have equal init_date count; generated_date is ISO format.

Source citations: delivery/schemas.py:227 (class); delivery/schemas.py:249 (_validate_no_duplicate_keys); delivery/schemas.py:260 (_validate_fold_consistency).
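
The two row-level invariants can be expressed as a plain function over row dicts, as a sketch of what the model validators check. The function name and error messages are hypothetical; the real checks are the Pydantic validators cited above.

```python
from collections import Counter


def validate_delivery_rows(rows) -> None:
    """Reject duplicate keys and uneven init_date counts across (year, geo) groups."""
    keys = Counter((r["year"], r["init_date"], r["geo_identifier"]) for r in rows)
    duplicates = [key for key, count in keys.items() if count > 1]
    if duplicates:
        raise ValueError(f"duplicate (year, init_date, geo_identifier): {duplicates}")

    per_group = Counter((r["year"], r["geo_identifier"]) for r in rows)
    if len(set(per_group.values())) > 1:
        raise ValueError(f"unequal init_date counts per (year, geo): {dict(per_group)}")


good = [
    {"year": 2020, "init_date": "2020-06-01", "geo_identifier": "ADM0:usa"},
    {"year": 2020, "init_date": "2020-06-08", "geo_identifier": "ADM0:usa"},
    {"year": 2021, "init_date": "2021-06-01", "geo_identifier": "ADM0:usa"},
    {"year": 2021, "init_date": "2021-06-08", "geo_identifier": "ADM0:usa"},
]
validate_delivery_rows(good)  # passes silently
```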

DeliveryRow (ADM0Row / ADM1Row / ADM2Row)¶

Definition: Frozen Pydantic model representing a single row in the client-facing delivery CSV. Contains yield prediction, conformal intervals, and benchmark columns. All yield values in delivery units (bu/ac or lbs/ac).

Cardinality: One per (geo_identifier, season_year, init_date, variable, model) tuple in a HindcastDelivery.

Key attributes: Identity: commodity, year, init_date (ISO str), geo_identifier, variable (default "yield_bu_acre"), model (default "commodity_hindcast"). Prediction: mean (float). Benchmarks: nass_actual, nass_actual_area_weighted_all, nass_actual_prod_div_area_all, wasde_in_season, conab_final_in_season, conab_lev_in_season. Corrections: weather_correction_bu_ac. CI bands: lower_{50,68,80,90,95}, upper_{50,68,80,90,95} (all optional).

Source citations: delivery/schemas.py:109 (class); delivery/schemas.py:176 (_validate_ci_ordering); delivery/schemas.py:197 (_validate_init_date_year).

Notes: extra="forbid" is load-bearing — unknown columns from new ReferenceYieldSpec names raise a ValidationError rather than silently vanishing. LONG_RANGE_HORIZON_YEARS = 10 (delivery/schemas.py:106) allows init_dates up to 10 years before the target season (long-range forecasts).

Notes on ADM variants: delivery/schemas.py defines a single DeliveryRow class — there are no separate ADM0Row/ADM1Row/ADM2Row classes in the code. The orchestrator's seed list used those names as conceptual variants; the canonical code name is DeliveryRow.

Check (preflight)¶

Definition: Value object representing the result of a single preflight validation check. critical=True failures cause run_preflight() to raise SystemExit, aborting the run before any compute begins.

Cardinality: Multiple per stage entry point; one per declared check in preflight_paths_for_{stage} functions.

Key attributes: name: str, passed: bool, message: str, critical: bool.

Source citations: run/preflight.py:20 (dataclass definition per DOMAIN_MODEL2.md §4.9); DOMAIN_MODEL.md §1 (preflight Check definition).
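
A minimal sketch of the abort behaviour follows. The four fields match the Key attributes above; the run_preflight signature, the message format and the example checks are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Check:
    name: str
    passed: bool
    message: str
    critical: bool


def run_preflight(checks: list[Check]) -> None:
    """Abort before any compute if a critical check failed."""
    failures = [c for c in checks if c.critical and not c.passed]
    if failures:
        raise SystemExit("; ".join(f"{c.name}: {c.message}" for c in failures))


checks = [
    Check("features_exist", True, "fit.parquet found", critical=True),
    Check("mlflow_reachable", False, "tracking server not reachable", critical=False),
]
run_preflight(checks)  # the only failure is non-critical, so the run continues
```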

FoldSchedule¶

Definition: Per-commodity fold calendar used by the Streamlit dashboard to map season dates to available fold_labels. Lives in the dashboard layer (app/_dashboard_config.py:198) and is out of scope for the core pipeline domain model.

Cardinality: One per commodity in the dashboard config.

Source citations: DOMAIN_MODEL2.md §4.10 (other value objects table).

Notes: Included for completeness; downstream wiki entity pages for the dashboard bounded context should document it fully.