Entity: ReferenceYieldSpec
Definition
ReferenceYieldSpec is a pydantic discriminated union over three concrete spec classes, each of which describes how to load one external reference yield series (WASDE or CONAB). Each spec supplies the benchmarks for one column prefix in metrics tables, delivery CSVs, and plot traces. Specs are parsed from ExperimentConfig.reference_data: list[ReferenceYieldSpec]; each entry's name must be unique within the list.
ReferenceYieldSpec is NOT a class — it is a pydantic Annotated type alias:
    # lib/reference_data/loader.py:59
    ReferenceYieldSpec = Annotated[
        WasdeRefSpec | ConabFinalRefSpec | ConabLevantamentoRefSpec,
        Field(discriminator="kind"),
    ]
The discriminator field is kind (not type). This mirrors EditRuleConfig at lib/edit_and_imputation/edit.py:361-364.
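To make the dispatch on kind concrete, the sketch below emulates it with the standard library only. It is a stand-in for what pydantic does with Field(discriminator="kind"), not the project's code: the dataclasses are trimmed to two fields, and parse_spec and _KIND_TO_CLASS are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Any, Literal


@dataclass(frozen=True)
class WasdeRefSpec:  # trimmed stand-in, not the real class
    name: str
    kind: Literal["wasde"] = "wasde"


@dataclass(frozen=True)
class ConabFinalRefSpec:  # trimmed stand-in
    name: str
    kind: Literal["conab_final"] = "conab_final"


@dataclass(frozen=True)
class ConabLevantamentoRefSpec:  # trimmed stand-in
    name: str
    kind: Literal["conab_levantamento"] = "conab_levantamento"


# Hypothetical lookup table: discriminator value -> concrete spec class.
_KIND_TO_CLASS = {
    "wasde": WasdeRefSpec,
    "conab_final": ConabFinalRefSpec,
    "conab_levantamento": ConabLevantamentoRefSpec,
}


def parse_spec(raw: dict[str, Any]) -> object:
    """Pick the concrete class from raw["kind"], then build it."""
    kind = raw.get("kind")
    if kind not in _KIND_TO_CLASS:
        raise ValueError(f"unknown or missing kind: {kind!r}")
    # Unexpected keys make the dataclass constructor raise TypeError,
    # which loosely mirrors extra="forbid".
    return _KIND_TO_CLASS[kind](**raw)


spec = parse_spec({"kind": "conab_final", "name": "conab_final"})
print(type(spec).__name__)  # ConabFinalRefSpec
```

An entry that uses type instead of kind fails in this sketch with "unknown or missing kind", which is the practical consequence of the discriminator being named kind.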
Kind
Pydantic discriminated union type alias. The three concrete classes all inherit from _ReferenceYieldSpecBase (lib/reference_data/base_reference_yield_loader.py:36), which is configured with frozen=True and extra="forbid".
Source of truth
- Union declaration: market_insights_models/src/commodity_hindcast/lib/reference_data/loader.py:59
- Base class: market_insights_models/src/commodity_hindcast/lib/reference_data/base_reference_yield_loader.py:36
- Dispatch table registration: loader.py:51-53
Discriminated-union dispatch
| Class | kind value | Concrete loader | Source |
|---|---|---|---|
| WasdeRefSpec | "wasde" | WasdeLoader | USDA WASDE in-season national estimates |
| ConabFinalRefSpec | "conab_final" | ConabFinalLoader | CONAB série histórica (post-harvest final) |
| ConabLevantamentoRefSpec | "conab_levantamento" | ConabLevantamentoLoader | CONAB levantamento (monthly in-season release) |
Dispatch is performed by ReferenceYieldLoader.from_spec(spec, commodity_cfg) (base_reference_yield_loader.py:129) which looks up type(spec) in the _registry dict populated by three explicit register() calls in loader.py:51-53:
    ReferenceYieldLoader.register(WasdeRefSpec, WasdeLoader)
    ReferenceYieldLoader.register(ConabFinalRefSpec, ConabFinalLoader)
    ReferenceYieldLoader.register(ConabLevantamentoRefSpec, ConabLevantamentoLoader)
The registry is a class-level dict (_registry: ClassVar[dict[...]]); registration is explicit rather than via __init_subclass__ to keep the dispatch table greppable.
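The registry pattern described above can be sketched as follows. This is a minimal illustration under assumptions, not the contents of base_reference_yield_loader.py: the constructor signature, the empty stand-in classes, and the TypeError raised for an unregistered spec are all hypothetical.

```python
from typing import ClassVar


class WasdeRefSpec:  # empty stand-in for the real spec class
    pass


class ConabFinalRefSpec:  # empty stand-in
    pass


class ReferenceYieldLoader:
    # Class-level table shared by all subclasses: spec class -> loader class.
    _registry: ClassVar[dict[type, type]] = {}

    def __init__(self, spec, commodity_cfg):
        # Assumed constructor shape; the real one is not shown in this page.
        self.spec = spec
        self.commodity_cfg = commodity_cfg

    @classmethod
    def register(cls, spec_cls, loader_cls):
        cls._registry[spec_cls] = loader_cls

    @classmethod
    def from_spec(cls, spec, commodity_cfg):
        # Exact-type lookup on type(spec), so subclasses of a registered
        # spec class would not match unless registered themselves.
        try:
            loader_cls = cls._registry[type(spec)]
        except KeyError:
            raise TypeError(
                f"no loader registered for {type(spec).__name__}"
            ) from None
        return loader_cls(spec, commodity_cfg)


class WasdeLoader(ReferenceYieldLoader):
    pass


class ConabFinalLoader(ReferenceYieldLoader):
    pass


# Explicit registration keeps the whole dispatch table in one greppable place.
ReferenceYieldLoader.register(WasdeRefSpec, WasdeLoader)
ReferenceYieldLoader.register(ConabFinalRefSpec, ConabFinalLoader)

loader = ReferenceYieldLoader.from_spec(WasdeRefSpec(), commodity_cfg=None)
print(type(loader).__name__)  # WasdeLoader
```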
Base attributes (_ReferenceYieldSpecBase)
| Field | Type | Default | Meaning | YAML example |
|---|---|---|---|---|
| name | str | required | Metric and column prefix (e.g. "wasde" → columns wasde_in_season, metric wasde_jan_mae) | wasde |
| filepath | ResolvablePath | required | Path to the reference data file; resolved against data_root at config load; preflight-checked automatically | data/wasde/wasde_corn_us_yield.csv |
| commodity | str | required | Lower-case commodity name to filter multi-commodity source files | corn |
| geography | str | required | Lower-case geography token to filter rows | united_states |
| cutoff_month_day | MonthDay | required | Calendar day after which the harvest_year value is treated as final. Used by yield_final() lookups | {month: 2, day: 1} |
| unit | Literal["bu_acre","kg_per_ha"] | "kg_per_ha" | Source unit of value rows emitted by load(). WasdeRefSpec overrides to "bu_acre" | bu_acre |
Each concrete subclass additionally carries a kind: Literal[...] field matching its discriminator value.
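As a rough picture of how these fields fit together, the sketch below models the base and one subclass with frozen dataclasses instead of pydantic. MonthDay is a hypothetical stand-in (its real definition is not shown here), filepath is a plain str rather than ResolvablePath, and the values are taken from the YAML example column above.

```python
from dataclasses import FrozenInstanceError, dataclass
from typing import Literal


@dataclass(frozen=True)
class MonthDay:  # hypothetical stand-in for the real MonthDay type
    month: int
    day: int


@dataclass(frozen=True)
class ReferenceYieldSpecBase:  # dataclass stand-in for _ReferenceYieldSpecBase
    name: str
    filepath: str  # ResolvablePath in the real base; plain str here
    commodity: str
    geography: str
    cutoff_month_day: MonthDay
    unit: Literal["bu_acre", "kg_per_ha"] = "kg_per_ha"


@dataclass(frozen=True)
class WasdeRefSpec(ReferenceYieldSpecBase):
    unit: Literal["bu_acre", "kg_per_ha"] = "bu_acre"  # overrides the base default
    kind: Literal["wasde"] = "wasde"  # discriminator value


wasde = WasdeRefSpec(
    name="wasde",
    filepath="data/wasde/wasde_corn_us_yield.csv",
    commodity="corn",
    geography="united_states",
    cutoff_month_day=MonthDay(month=2, day=1),
)
print(wasde.unit)  # bu_acre

try:
    wasde.name = "other"  # frozen: assignment is rejected
except FrozenInstanceError:
    print("spec is immutable")
```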
Loader contract
ReferenceYieldLoader.load() returns a DataFrame with columns:
- marketing_year (int): harvest calendar year.
- release_date (pd.Timestamp, naive): publication date of this estimate.
- geography, commodity, variable (always "yield"): str.
- value (float): yield in kg/ha (converted from source unit by each concrete loader).
Rows are sorted by (marketing_year, release_date). The abstract base class contract is kg/ha; WasdeLoader converts from bu/acre on read.
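The bu/acre to kg/ha conversion can be worked through as below. The pound and acre factors are exact by definition; the bushel weights are the standard US values (56 lb for corn, 60 lb for soybeans and wheat) and are an assumption about what WasdeLoader uses, since the loader's own constants are not shown here. The function name is hypothetical.

```python
LB_TO_KG = 0.45359237        # exact, by definition of the pound
ACRE_TO_HA = 0.40468564224   # exact, international acre

# Standard US bushel weights in pounds; assumed, not taken from the loader.
BUSHEL_LB = {"corn": 56.0, "soybeans": 60.0, "wheat": 60.0}


def bu_acre_to_kg_ha(value_bu_acre: float, commodity: str) -> float:
    """Convert a yield from bushels per acre to kilograms per hectare."""
    kg_per_bushel = BUSHEL_LB[commodity] * LB_TO_KG
    return value_bu_acre * kg_per_bushel / ACRE_TO_HA


print(round(bu_acre_to_kg_ha(1.0, "corn"), 2))  # 62.77
```

Because the factor depends on the commodity's bushel weight, a single constant would only be correct for one crop.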
Concrete lookup methods on the base:
- yield_asof(harvest_year, init_date): latest release strictly before init_date.
- yield_final(harvest_year): last release before cutoff_month_day of harvest_year.
- yield_asof_array(harvest_year, init_dates): vectorised version for delivery row assembly.
- yield_final_series(): Series of finals indexed by marketing year.
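The lookup semantics can be sketched in plain Python, without pandas. Everything here is illustrative: the rows are made up, returning None when no release qualifies is an assumption, and yield_final takes the cutoff as an explicit date because this page does not state which calendar year cutoff_month_day is anchored to.

```python
from datetime import date
from typing import NamedTuple, Optional


class Release(NamedTuple):
    marketing_year: int
    release_date: date
    value: float  # kg/ha


# Made-up rows, sorted by (marketing_year, release_date).
RELEASES = [
    Release(2023, date(2023, 8, 11), 10980.0),
    Release(2023, date(2023, 9, 12), 10910.0),
    Release(2023, date(2024, 1, 12), 11130.0),
    Release(2024, date(2024, 8, 12), 11500.0),
]


def yield_asof(harvest_year: int, init_date: date) -> Optional[float]:
    """Value of the latest release strictly before init_date, else None."""
    earlier = [
        r for r in RELEASES
        if r.marketing_year == harvest_year and r.release_date < init_date
    ]
    if not earlier:
        return None
    return max(earlier, key=lambda r: r.release_date).value


def yield_final(harvest_year: int, cutoff: date) -> Optional[float]:
    """Value of the last release before an explicit cutoff date."""
    return yield_asof(harvest_year, cutoff)


def yield_asof_array(harvest_year: int, init_dates: list[date]) -> list[Optional[float]]:
    """Element-wise stand-in for the vectorised lookup."""
    return [yield_asof(harvest_year, d) for d in init_dates]


# A release published on the init date itself is excluded (strictly before).
print(yield_asof(2023, date(2023, 9, 12)))   # 10980.0
print(yield_final(2023, date(2024, 2, 1)))   # 11130.0
```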
Results are cached on the loader instance (self._df) to avoid re-parsing the source file.
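A minimal sketch of this kind of instance-level caching is shown below. The class name, the _parse_source helper, and the parse_calls counter are hypothetical and exist only to make the behaviour observable; a list stands in for the DataFrame.

```python
class CachedLoader:
    """Parses the source at most once and reuses the result afterwards."""

    def __init__(self):
        self._df = None       # filled on the first load()
        self.parse_calls = 0  # only here to make the caching visible

    def _parse_source(self):
        self.parse_calls += 1
        return [(2023, "2023-08-11", 10980.0)]  # made-up row

    def load(self):
        if self._df is None:
            self._df = self._parse_source()
        return self._df


loader = CachedLoader()
first = loader.load()
second = loader.load()
print(loader.parse_calls)  # 1
```

Both calls return the same object, so callers that mutate the result would affect later lookups; returning a copy is one way to avoid that, at some cost.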
Lifecycle
- Each ReferenceYieldSpec entry in the YAML reference_data: list is validated by pydantic against the discriminated union, dispatching on kind.
- _reference_data_names_unique (config.py:831) ensures all name values are unique.
- ExperimentConfig._resolve_data_paths resolves all filepath fields.
- build_loaders(cfg) (loader.py:68) constructs one ReferenceYieldLoader per spec at pipeline startup.
- build_references_by_harvest_year(cfg) (loader.py:78) loads all specs and partitions them by marketing year for efficient per-fold access in ExpandingFoldGenerator.
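Two of the steps above, the name-uniqueness check and the per-year partition, can be sketched with the standard library. The function names check_names_unique and partition_by_year are hypothetical, rows are plain tuples rather than DataFrames, and the actual return shape of build_references_by_harvest_year is not documented here.

```python
from collections import Counter, defaultdict


def check_names_unique(names):
    """Raise if any reference_data name appears more than once."""
    dupes = sorted(n for n, count in Counter(names).items() if count > 1)
    if dupes:
        raise ValueError(f"duplicate reference_data names: {dupes}")


def partition_by_year(rows):
    """Group (marketing_year, release_date, value) rows by marketing year."""
    by_year = defaultdict(list)
    for row in rows:
        by_year[row[0]].append(row)
    return dict(by_year)


check_names_unique(["wasde", "conab_final"])  # passes silently

rows = [  # made-up rows
    (2023, "2023-08-11", 10980.0),
    (2023, "2023-09-12", 10910.0),
    (2024, "2024-08-12", 11500.0),
]
by_year = partition_by_year(rows)
print(sorted(by_year))     # [2023, 2024]
print(len(by_year[2023]))  # 2
```

Partitioning once up front means each fold can fetch its slice with a dictionary lookup instead of re-filtering the full table.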
Relationships
- Owned by ExperimentConfig.reference_data (0..N per run; names unique).
- Dispatches to ReferenceYieldLoader concrete implementations via the _registry.
- Consumed by ExpandingFoldGenerator (per-fold reference slice), delivery/conversions.py (benchmark columns), diagnostics/runners.py (metrics), and dashboard plots.
Naming note
NassSpec from the orchestrator's original seed vocabulary does not exist in code. NASS yield is loaded as a feature via YieldsBuilder, not via the reference-data union. This is documented in ENTITIES.md §Tier 2 and confirmed by the absence of any NassSpec or NassRefSpec class in the source tree.
Concepts and pipelines that touch this entity
- Pipeline: hindcast evaluation. Reference data is loaded per fold for WASDE / CONAB benchmark columns.
- Pipeline: forecast delivery. yield_asof_array populates the wasde_in_season, conab_lev_in_season, and conab_final_in_season columns.
- Entity: DeliveryRow. Benchmark columns are named after spec.name.
PRs and commits
- PR #339 (PR-339.md): _ReferenceYieldSpecBase introduced; cutoff_month_day, geography, and unit fields generalised from hardcoded WASDE constants.
- PR #353: ConabFinalRefSpec and ConabLevantamentoRefSpec added for the Brazil soybean config.
Open questions
- ConabLevantamentoRefSpec and ConabFinalRefSpec share the same cutoff_month_day (month 10, day 1) in production configs. Whether any future non-Brazil CONAB user would need a different cutoff is not documented.
- The geography field is a free-form lowercase token ("united_states", "brazil"); there is no enum validation. A typo would produce an empty DataFrame at runtime.
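The geography failure mode can be illustrated with a small sketch. The filter_rows function and its strict flag are hypothetical and not part of the loader; the rows are made up. It shows how a misspelled token passes silently, and one possible guard that fails loudly instead.

```python
ROWS = [  # made-up rows
    {"geography": "united_states", "marketing_year": 2023, "value": 11130.0},
    {"geography": "brazil", "marketing_year": 2023, "value": 3500.0},
]


def filter_rows(rows, geography, strict=False):
    """Keep rows for one geography; optionally fail when nothing matches."""
    matched = [r for r in rows if r["geography"] == geography]
    if strict and not matched:
        known = sorted({r["geography"] for r in rows})
        raise ValueError(
            f"no rows for geography {geography!r}; file contains {known}"
        )
    return matched


print(len(filter_rows(ROWS, "united_state")))  # 0: the typo passes silently

try:
    filter_rows(ROWS, "united_state", strict=True)
except ValueError as exc:
    print(exc)
```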