Entity: ReferenceYieldSpec

Definition

ReferenceYieldSpec is a pydantic discriminated union over three concrete spec classes, each describing how to load one external reference yield series (WASDE or CONAB). Each spec supplies the benchmarks under one column prefix in metrics tables, delivery CSVs, and plot traces. Specs are parsed from ExperimentConfig.reference_data: list[ReferenceYieldSpec]; each entry's name must be unique within the list.

ReferenceYieldSpec is NOT a class — it is a pydantic Annotated type alias:

# lib/reference_data/loader.py:59
ReferenceYieldSpec = Annotated[
    WasdeRefSpec | ConabFinalRefSpec | ConabLevantamentoRefSpec,
    Field(discriminator="kind"),
]

The discriminator field is kind (not type). This mirrors EditRuleConfig at lib/edit_and_imputation/edit.py:361-364.

Kind

Pydantic discriminated union type alias. The three concrete classes all inherit from _ReferenceYieldSpecBase (lib/reference_data/base_reference_yield_loader.py:36), which is configured with frozen=True and extra="forbid".

Source of truth

  • Union declaration: market_insights_models/src/commodity_hindcast/lib/reference_data/loader.py:59
  • Base class: market_insights_models/src/commodity_hindcast/lib/reference_data/base_reference_yield_loader.py:36
  • Dispatch table registration: loader.py:51-53

Discriminated-union dispatch

| Class | kind value | Concrete loader | Source |
|---|---|---|---|
| WasdeRefSpec | "wasde" | WasdeLoader | USDA WASDE in-season national estimates |
| ConabFinalRefSpec | "conab_final" | ConabFinalLoader | CONAB série histórica (post-harvest final) |
| ConabLevantamentoRefSpec | "conab_levantamento" | ConabLevantamentoLoader | CONAB levantamento (monthly in-season release) |

Dispatch is performed by ReferenceYieldLoader.from_spec(spec, commodity_cfg) (base_reference_yield_loader.py:129), which looks up type(spec) in the _registry dict populated by three explicit register() calls in loader.py:51-53:

ReferenceYieldLoader.register(WasdeRefSpec, WasdeLoader)
ReferenceYieldLoader.register(ConabFinalRefSpec, ConabFinalLoader)
ReferenceYieldLoader.register(ConabLevantamentoRefSpec, ConabLevantamentoLoader)

The registry is a class-level dict (_registry: ClassVar[dict[...]]); registration is explicit rather than via __init_subclass__ to keep the dispatch table greppable.
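
The registry pattern described above can be sketched with the standard library alone. The spec and loader classes below are empty stand-ins, the constructor signature is an assumption, and raising TypeError for an unregistered spec type is a hypothetical choice; the source does not show how the real from_spec handles that case.

```python
from dataclasses import dataclass
from typing import ClassVar


@dataclass(frozen=True)
class WasdeRefSpec:  # stand-in spec
    name: str


@dataclass(frozen=True)
class ConabFinalRefSpec:  # stand-in spec
    name: str


class ReferenceYieldLoader:
    # Class-level dispatch table: spec class -> loader class.
    _registry: ClassVar[dict[type, type]] = {}

    def __init__(self, spec, commodity_cfg):  # assumed signature
        self.spec = spec
        self.commodity_cfg = commodity_cfg

    @classmethod
    def register(cls, spec_cls, loader_cls):
        cls._registry[spec_cls] = loader_cls

    @classmethod
    def from_spec(cls, spec, commodity_cfg):
        # Look up the exact type of the spec, not its base classes.
        try:
            loader_cls = cls._registry[type(spec)]
        except KeyError:
            # Hypothetical error handling for an unregistered spec type.
            raise TypeError(f"no loader registered for {type(spec).__name__}") from None
        return loader_cls(spec, commodity_cfg)


class WasdeLoader(ReferenceYieldLoader):
    pass


class ConabFinalLoader(ReferenceYieldLoader):
    pass


# Explicit registration keeps the whole table visible in one place.
ReferenceYieldLoader.register(WasdeRefSpec, WasdeLoader)
ReferenceYieldLoader.register(ConabFinalRefSpec, ConabFinalLoader)

loader = ReferenceYieldLoader.from_spec(ConabFinalRefSpec(name="conab"), commodity_cfg=None)
print(type(loader).__name__)
```

With explicit register() calls, searching for "ReferenceYieldLoader.register(" lists every spec-to-loader pairing, which an __init_subclass__ hook would scatter across the subclass definitions.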

Base attributes (_ReferenceYieldSpecBase)

| Field | Type | Default | Meaning | YAML example |
|---|---|---|---|---|
| name | str | required | Metric and column prefix (e.g. "wasde" → columns wasde_in_season, metric wasde_jan_mae) | wasde |
| filepath | ResolvablePath | required | Path to the reference data file; resolved against data_root at config load; preflight-checked automatically | data/wasde/wasde_corn_us_yield.csv |
| commodity | str | required | Lower-case commodity name to filter multi-commodity source files | corn |
| geography | str | required | Lower-case geography token to filter rows | united_states |
| cutoff_month_day | MonthDay | required | Calendar day after which the harvest_year value is treated as final. Used by yield_final() lookups | {month: 2, day: 1} |
| unit | Literal["bu_acre","kg_per_ha"] | "kg_per_ha" | Source unit of value rows emitted by load(). WasdeRefSpec overrides to "bu_acre" | bu_acre |

Each concrete subclass additionally carries a kind: Literal[...] field matching its discriminator value.
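
A minimal sketch of how these fields and the frozen=True, extra="forbid" configuration behave, assuming pydantic v2. MonthDay is modelled as a small stand-in with month and day, and filepath is a plain str instead of ResolvablePath, since neither type is defined on this page. The field values come from the YAML example column; the "region" key is a made-up unknown field.

```python
from typing import Literal

from pydantic import BaseModel, ConfigDict, ValidationError


class MonthDay(BaseModel):
    """Stand-in for the project's MonthDay type."""
    month: int
    day: int


class _ReferenceYieldSpecBase(BaseModel):
    model_config = ConfigDict(frozen=True, extra="forbid")

    name: str
    filepath: str  # stand-in for ResolvablePath
    commodity: str
    geography: str
    cutoff_month_day: MonthDay
    unit: Literal["bu_acre", "kg_per_ha"] = "kg_per_ha"


class WasdeRefSpec(_ReferenceYieldSpecBase):
    kind: Literal["wasde"]
    # Overrides the base default, as described for WasdeRefSpec.
    unit: Literal["bu_acre", "kg_per_ha"] = "bu_acre"


entry = {
    "kind": "wasde",
    "name": "wasde",
    "filepath": "data/wasde/wasde_corn_us_yield.csv",
    "commodity": "corn",
    "geography": "united_states",
    "cutoff_month_day": {"month": 2, "day": 1},
}
spec = WasdeRefSpec.model_validate(entry)
print(spec.unit)

# extra="forbid": an unknown key fails validation instead of being ignored.
try:
    WasdeRefSpec.model_validate({**entry, "region": "midwest"})
except ValidationError:
    print("unknown field rejected")

# frozen=True: attribute assignment after construction is rejected.
try:
    spec.name = "other"
except ValidationError:
    print("spec is immutable")
```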

Loader contract

ReferenceYieldLoader.load() returns a DataFrame with columns:

  • marketing_year: int — harvest calendar year.
  • release_date: pd.Timestamp (naive) — publication date of this estimate.
  • geography, commodity, variable (always "yield"): str.
  • value: float — yield in kg/ha (converted from source unit by each concrete loader).

Rows are sorted by (marketing_year, release_date). The ABC contract is kg/ha; WasdeLoader converts from bu/acre on read.
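
As a rough illustration of the bu/acre to kg/ha conversion, the sketch below uses the standard US bushel weights (56 lb for corn, 60 lb for soybeans) together with the exact pound and acre definitions. The function name and the commodity keys are hypothetical; the factor WasdeLoader actually applies is not shown on this page.

```python
LB_TO_KG = 0.45359237        # international avoirdupois pound, exact
ACRE_TO_HA = 0.40468564224   # international acre, exact

# Standard US bushel weights in pounds (assumed to apply here).
BUSHEL_LB = {"corn": 56.0, "soybean": 60.0}


def bu_acre_to_kg_ha(value_bu_acre: float, commodity: str) -> float:
    """Convert a yield in bushels per acre to kilograms per hectare."""
    kg_per_bushel = BUSHEL_LB[commodity] * LB_TO_KG
    return value_bu_acre * kg_per_bushel / ACRE_TO_HA


# One bushel per acre of corn is roughly 62.77 kg/ha.
print(round(bu_acre_to_kg_ha(1.0, "corn"), 2))
```

Because the factor depends on the commodity's bushel weight, a single constant would only be correct for one crop.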

Concrete lookup methods on the base:

  • yield_asof(harvest_year, init_date) — latest release strictly before init_date.
  • yield_final(harvest_year) — last release before cutoff_month_day of harvest_year.
  • yield_asof_array(harvest_year, init_dates) — vectorised version for delivery row assembly.
  • yield_final_series() — Series of finals indexed by marketing year.
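
The lookup semantics above can be sketched over a plain list of releases. The rows are synthetic, and the functions are free-standing stand-ins for the loader methods. Two assumptions are made: a lookup with no earlier release returns None, and yield_final places the cutoff in the calendar year harvest_year and treats "before" as strict, reading the descriptions above literally.

```python
from bisect import bisect_left
from datetime import date

# Synthetic rows: (marketing_year, release_date, value_kg_ha),
# already sorted by (marketing_year, release_date).
ROWS = [
    (2024, date(2024, 3, 10), 100.0),
    (2024, date(2024, 6, 10), 110.0),
    (2024, date(2024, 9, 10), 120.0),
    (2024, date(2024, 11, 10), 130.0),
]


def yield_asof(harvest_year, init_date):
    """Value of the latest release strictly before init_date, or None."""
    releases = [(d, v) for (y, d, v) in ROWS if y == harvest_year]
    dates = [d for d, _ in releases]
    # bisect_left counts the releases dated strictly before init_date.
    idx = bisect_left(dates, init_date)
    return releases[idx - 1][1] if idx else None


def yield_asof_array(harvest_year, init_dates):
    """Apply yield_asof to each init date (simple loop, not truly vectorised)."""
    return [yield_asof(harvest_year, d) for d in init_dates]


def yield_final(harvest_year, cutoff_month, cutoff_day):
    """Last release before the cutoff day, taken in the harvest_year itself."""
    cutoff = date(harvest_year, cutoff_month, cutoff_day)
    return yield_asof(harvest_year, cutoff)


# A release dated exactly on init_date is excluded by the strict comparison.
print(yield_asof(2024, date(2024, 6, 10)))
print(yield_final(2024, 10, 1))
```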

The loaded DataFrame is cached on the loader instance (self._df) to avoid re-parsing the source file.
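
The caching behaviour amounts to lazy, per-instance memoisation. A minimal sketch follows, using an in-memory CSV string and a list of dicts in place of a real file and DataFrame; the class name and the parse_count counter are hypothetical and exist only to make the single parse observable.

```python
import csv
import io


class CachingLoader:
    def __init__(self, source_text: str):
        self._source_text = source_text
        self._df = None        # filled on first load()
        self.parse_count = 0   # illustration only

    def load(self):
        # Parse the source once; later calls return the cached object.
        if self._df is None:
            self.parse_count += 1
            self._df = list(csv.DictReader(io.StringIO(self._source_text)))
        return self._df


loader = CachingLoader("marketing_year,value\n2024,100.0\n")
first = loader.load()
second = loader.load()
print(loader.parse_count, first is second)
```

Since every call returns the same cached object, callers that mutate the result would affect later lookups on that loader.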

Lifecycle

  1. Each ReferenceYieldSpec entry in the YAML reference_data: list is validated by pydantic against the discriminated union, dispatching on kind.
  2. _reference_data_names_unique (config.py:831) ensures all name values are unique.
  3. ExperimentConfig._resolve_data_paths resolves all filepath fields.
  4. build_loaders(cfg) (loader.py:68) constructs one ReferenceYieldLoader per spec at pipeline startup.
  5. build_references_by_harvest_year(cfg) (loader.py:78) loads all specs and partitions them by marketing year for efficient per-fold access in ExpandingFoldGenerator.
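
Steps 2 and 5 can be sketched with the standard library. Both functions are hypothetical stand-ins: the real validator lives on ExperimentConfig, and the actual return shape of build_references_by_harvest_year is not shown here, so the nested dict (marketing year, then spec name, then rows) is an assumption.

```python
from collections import Counter, defaultdict


def check_names_unique(names):
    """Raise if any reference_data name appears more than once."""
    dupes = sorted(n for n, count in Counter(names).items() if count > 1)
    if dupes:
        raise ValueError(f"duplicate reference_data names: {dupes}")


def partition_by_marketing_year(rows_by_name):
    """Regroup {name: [row, ...]} into {marketing_year: {name: [row, ...]}}."""
    out = defaultdict(dict)
    for name, rows in rows_by_name.items():
        for row in rows:
            out[row["marketing_year"]].setdefault(name, []).append(row)
    return dict(out)


check_names_unique(["wasde", "conab_final"])  # passes silently

# Synthetic rows keyed by spec name.
rows_by_name = {
    "wasde": [
        {"marketing_year": 2023, "value": 1.0},
        {"marketing_year": 2024, "value": 2.0},
    ],
    "conab_final": [{"marketing_year": 2024, "value": 3.0}],
}
by_year = partition_by_marketing_year(rows_by_name)
print(sorted(by_year))
```

Partitioning once up front means each fold can fetch its reference slice with a single dictionary lookup instead of filtering every series again.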

Relationships

  • Owned by ExperimentConfig.reference_data (0..N per run; names unique).
  • Dispatches to ReferenceYieldLoader concrete implementations via the _registry.
  • Consumed by ExpandingFoldGenerator (per-fold reference slice), delivery/conversions.py (benchmark columns), diagnostics/runners.py (metrics), and dashboard plots.

Naming note

NassSpec from the orchestrator's original seed vocabulary does not exist in code. NASS yield is loaded as a feature via YieldsBuilder, not via the reference-data union. This is documented in ENTITIES.md §Tier 2 and confirmed by the absence of any NassSpec or NassRefSpec class in the source tree.

Concepts and pipelines that touch this entity

PRs and commits

  • PR #339 (PR-339.md) — _ReferenceYieldSpecBase introduced; cutoff_month_day, geography, and unit fields generalised from hardcoded WASDE constants.
  • PR #353: ConabFinalRefSpec and ConabLevantamentoRefSpec added for the Brazil soybean config.

Open questions

  • ConabLevantamentoRefSpec and ConabFinalRefSpec share the same cutoff_month_day: month 10, day 1 in production configs. Whether any future non-Brazil CONAB user would need a different cutoff is not documented.
  • The geography field is a free-form lowercase token ("united_states", "brazil"); there is no enum validation. A typo would produce an empty DataFrame at runtime.
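
The failure mode in the second question can be reproduced in a few lines. The sketch below is not taken from the codebase: filter_rows mimics a plain equality filter, and filter_rows_strict is a hypothetical guard that fails loudly when the filter matches nothing.

```python
def filter_rows(rows, geography, commodity):
    """Plain equality filter: a typo in either token yields an empty list."""
    return [
        r for r in rows
        if r["geography"] == geography and r["commodity"] == commodity
    ]


def filter_rows_strict(rows, geography, commodity):
    """Hypothetical guard: raise instead of returning an empty result."""
    matched = filter_rows(rows, geography, commodity)
    if not matched:
        known = sorted({r["geography"] for r in rows})
        raise ValueError(
            f"no rows for geography={geography!r}, commodity={commodity!r}; "
            f"geographies present: {known}"
        )
    return matched


# Synthetic rows.
rows = [
    {"geography": "united_states", "commodity": "corn", "value": 1.0},
    {"geography": "brazil", "commodity": "soybean", "value": 2.0},
]

print(len(filter_rows(rows, "united_sates", "corn")))  # typo passes silently
try:
    filter_rows_strict(rows, "united_sates", "corn")
except ValueError as exc:
    print(exc)
```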