ReferenceYieldLoader

Definition

A ReferenceYieldLoader loads an external reference yield time series — WASDE, CONAB final, or CONAB levantamento — and exposes it as a standardised DataFrame keyed by (marketing_year, release_date) with values in kg/ha. The ABC is ReferenceYieldLoader at lib/reference_data/base_reference_yield_loader.py:77. Each loader is paired with a corresponding spec class; a discriminated union ReferenceYieldSpec (lib/reference_data/loader.py:59) drives YAML parsing and dispatch.

The ABC contract is kg/ha always. Concrete loaders convert from their source unit on read; consumers see kg/ha only.
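
As an illustration of the convert-on-read contract, here is a minimal sketch of the two conversions the concrete loaders would need. The function names and the bushel-weight table are assumptions made for this example and are not taken from the codebase; the bushel weights are the standard US values of 56 lb for corn and 60 lb for soybeans.

```python
# Exact unit definitions: 1 lb = 0.45359237 kg, 1 acre = 0.40468564224 ha.
KG_PER_LB = 0.45359237
HA_PER_ACRE = 0.40468564224

# Standard US bushel weights in pounds. This table and its keys are
# illustrative assumptions, not the codebase's own lookup.
BUSHEL_WEIGHT_LB = {"corn": 56.0, "soybeans": 60.0}


def bu_acre_to_kg_ha(value: float, commodity: str) -> float:
    """Convert a yield in bushels per acre to kilograms per hectare."""
    kg_per_bushel = BUSHEL_WEIGHT_LB[commodity] * KG_PER_LB
    return value * kg_per_bushel / HA_PER_ACRE


def t_ha_to_kg_ha(value: float) -> float:
    """Convert a yield in tonnes per hectare to kilograms per hectare."""
    return value * 1000.0


if __name__ == "__main__":
    # One bushel per acre is roughly 62.8 kg/ha for corn and 67.3 kg/ha for soybeans.
    print(bu_acre_to_kg_ha(1.0, "corn"))
    print(bu_acre_to_kg_ha(1.0, "soybeans"))
    print(t_ha_to_kg_ha(3.5))
```

The factor depends on the commodity because a bushel is a volume measure with a commodity-specific statutory weight, which is one reason to convert inside each loader rather than downstream.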

NASS is not a ReferenceYieldLoader. NASS county yield is loaded as a feature via YieldsBuilder (FeatureBuilder), not via this hierarchy. The NassSpec name in early design documents has no code backing.

Kind

ABC (ReferenceYieldLoader at lib/reference_data/base_reference_yield_loader.py:77).

Source of truth

market_insights_models/src/commodity_hindcast/lib/reference_data/base_reference_yield_loader.py:77

Spec base

_ReferenceYieldSpecBase (base_reference_yield_loader.py:36) is the frozen Pydantic base shared by every ReferenceYieldSpec subclass:

| Field | Type | Purpose |
| --- | --- | --- |
| name | str | Metric / column prefix downstream (e.g. wasde_in_season). |
| filepath | ResolvablePath | Anchored at data_root; preflight-checked automatically. |
| commodity | str | Lower-case commodity name (e.g. "soybeans", "corn"). |
| geography | str | Lower-case geography token (e.g. "united_states", "brazil"). |
| cutoff_month_day | MonthDay | Calendar day after which the value for harvest_year is treated as final. |
| unit | Literal["bu_acre", "kg_per_ha"] | Source unit; WasdeRefSpec overrides to "bu_acre". Default "kg_per_ha". |

Required interface

| Method | Signature | Notes |
| --- | --- | --- |
| load (abstract) | () → DataFrame | Returns release frame with columns marketing_year, release_date, geography, commodity, variable, value (kg/ha). Sorted by (marketing_year, release_date); no duplicates. Empty input → ValueError. |
| yield_asof (concrete) | (harvest_year, init_date) → float \| None | Latest release strictly before init_date for harvest_year. base_reference_yield_loader.py:163. |
| yield_final (concrete) | (harvest_year) → float \| None | Last release before spec.cutoff_month_day of harvest_year. base_reference_yield_loader.py:185. |
| yield_asof_array (concrete) | (harvest_year, init_dates) → ndarray | Vectorised yield_asof for a fixed harvest_year. Hot path for delivery and diagnostics. base_reference_yield_loader.py:202. |
| yield_final_series (concrete) | () → Series | Series of finals indexed by marketing_year; name is f"{spec.name}_final_kg_ha". base_reference_yield_loader.py:220. |
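
The "strictly before" semantics of the lookup methods can be sketched with the standard library alone. This is a simplified stand-in, not the real implementation: the actual methods work on a pandas DataFrame and return an ndarray, whereas this sketch uses a sorted list of (release_date, value) tuples for a single harvest year. The release values are invented, and passing the cutoff as a full date is an assumption, since how MonthDay is resolved to a calendar year is not described here.

```python
from bisect import bisect_left
from datetime import date
from typing import List, Optional, Sequence, Tuple

Release = Tuple[date, float]  # (release_date, value in kg/ha)

# Invented releases for one harvest year, sorted by release_date.
RELEASES: List[Release] = [
    (date(2023, 5, 12), 3500.0),
    (date(2023, 8, 11), 3420.0),
    (date(2023, 11, 9), 3360.0),
]


def yield_asof(releases: Sequence[Release], init_date: date) -> Optional[float]:
    """Return the latest value released strictly before init_date, else None."""
    dates = [release_date for release_date, _ in releases]
    # bisect_left returns the number of releases dated strictly before init_date.
    n_before = bisect_left(dates, init_date)
    return releases[n_before - 1][1] if n_before else None


def yield_asof_array(
    releases: Sequence[Release], init_dates: Sequence[date]
) -> List[Optional[float]]:
    """Apply yield_asof to many init dates for the same harvest year."""
    return [yield_asof(releases, init_date) for init_date in init_dates]


def yield_final(releases: Sequence[Release], cutoff: date) -> Optional[float]:
    """Return the last value released strictly before the cutoff date."""
    return yield_asof(releases, cutoff)


if __name__ == "__main__":
    # A release dated exactly on init_date is not visible to that init_date.
    print(yield_asof(RELEASES, date(2023, 8, 11)))
    print(yield_asof_array(RELEASES, [date(2023, 1, 1), date(2023, 12, 1)]))
```

Under these assumptions, a release dated exactly on the cutoff is excluded from yield_final, which is consistent with ConabFinalLoader needing its own override.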

The load() result is cached on self._df at first call; subsequent lookups return the cached frame without re-parsing the source file.
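
The caching behaviour can be sketched as follows. The class name ToyLoader, the load_calls counter, and the list standing in for the release frame are all illustrative assumptions; only the _df attribute and the _cached_load name come from this page.

```python
class ToyLoader:
    """Toy loader that counts how often the source is parsed."""

    def __init__(self) -> None:
        self._df = None       # cache slot, filled on first access
        self.load_calls = 0   # illustrative counter, not in the real class

    def load(self):
        # Stands in for parsing the source file into a release frame.
        self.load_calls += 1
        return [("2023/24", 3400.0)]

    def _cached_load(self):
        # Parse once, then reuse the same object on every later lookup.
        if self._df is None:
            self._df = self.load()
        return self._df


if __name__ == "__main__":
    loader = ToyLoader()
    first = loader._cached_load()
    second = loader._cached_load()
    print(first is second, loader.load_calls)
```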

Dispatch

Registry: ReferenceYieldLoader._registry (base_reference_yield_loader.py:97) is a class-level dict[type[_ReferenceYieldSpecBase], type[ReferenceYieldLoader]] populated by explicit ReferenceYieldLoader.register(spec_type, loader_type) calls from loader.py after the concrete modules are imported. Explicit registration (rather than __init_subclass__ magic) keeps the dispatch table greppable.

Factory: ReferenceYieldLoader.from_spec(spec, commodity_cfg) (base_reference_yield_loader.py:129) looks up type(spec) in _registry and instantiates the matched loader. Raises TypeError on unknown spec type.
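
The registry and factory described above can be illustrated with a toy re-creation. This is a sketch under stated assumptions, not the code at the cited locations: the specs are plain frozen dataclasses instead of Pydantic models, the loader constructor signature (spec, commodity_cfg) is assumed, and UnregisteredSpec exists only to trigger the error path.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass(frozen=True)
class WasdeRefSpec:
    """Stand-in for the real Pydantic spec."""
    name: str


@dataclass(frozen=True)
class UnregisteredSpec:
    """Hypothetical spec with no loader, used to show the TypeError path."""
    name: str


class ReferenceYieldLoader:
    # Class-level dispatch table: spec type -> loader type.
    _registry: Dict[type, type] = {}

    def __init__(self, spec: Any, commodity_cfg: Any) -> None:
        self.spec = spec
        self.commodity_cfg = commodity_cfg

    @classmethod
    def register(cls, spec_type: type, loader_type: type) -> None:
        cls._registry[spec_type] = loader_type

    @classmethod
    def from_spec(cls, spec: Any, commodity_cfg: Any) -> "ReferenceYieldLoader":
        try:
            loader_type = cls._registry[type(spec)]
        except KeyError:
            raise TypeError(
                f"no loader registered for {type(spec).__name__}"
            ) from None
        return loader_type(spec, commodity_cfg)


class WasdeLoader(ReferenceYieldLoader):
    pass


# Explicit registration: a single line to grep for per (spec, loader) pair.
ReferenceYieldLoader.register(WasdeRefSpec, WasdeLoader)

if __name__ == "__main__":
    loader = ReferenceYieldLoader.from_spec(WasdeRefSpec(name="wasde_in_season"), None)
    print(type(loader).__name__)
```

Looking up type(spec) exactly, rather than walking the class hierarchy, means a subclass of a registered spec is not dispatched implicitly; whether the real factory behaves the same way is not stated here.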

YAML entry point: ReferenceYieldSpec at loader.py:59:

ReferenceYieldSpec = Annotated[
    WasdeRefSpec | ConabFinalRefSpec | ConabLevantamentoRefSpec,
    Field(discriminator="kind"),
]

Parsed from ExperimentConfig.reference_data: list[ReferenceYieldSpec]. Each name must be unique within the list (validated at config load, config.py:799).
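
Because name becomes a column prefix downstream, two specs sharing a name would collide. A minimal sketch of such a uniqueness check is shown below; the function name and error message are assumptions, and the real validator at config.py:799 may be structured differently.

```python
from collections import Counter
from typing import Iterable


def check_unique_names(names: Iterable[str]) -> None:
    """Raise ValueError if any reference_data name appears more than once."""
    counts = Counter(names)
    duplicates = sorted(name for name, count in counts.items() if count > 1)
    if duplicates:
        raise ValueError(f"duplicate reference_data names: {duplicates}")


if __name__ == "__main__":
    check_unique_names(["wasde_in_season", "conab_final_in_season"])  # passes
    try:
        check_unique_names(["wasde_in_season", "wasde_in_season"])
    except ValueError as exc:
        print(exc)
```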

Concrete implementations

| Spec class | Loader class | kind | File | Source format | Unit | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| WasdeRefSpec | WasdeLoader | wasde | wasde.py:25 / wasde.py:32 | WASDE CSV | bu_acre (default for WasdeRefSpec); converted to kg/ha on read | Filters by geography/commodity/variable; parses marketing_year from "YYYY/YY" strings. |
| ConabFinalRefSpec | ConabFinalLoader | conab_final | conab.py:25 / conab.py:94 | CONAB Serie Historica; latin-1, semicolon-delimited | Source-native (t/ha); converted to kg/ha on read | Aggregates UF rows to national by summing production and area then recomputing yield; produtividade_mil_ha_mil_t column ignored. Overrides yield_final because release_date equals the cutoff. |
| ConabLevantamentoRefSpec | ConabLevantamentoLoader | conab_levantamento | conab.py:37 / conab.py:197 | 12 monthly CONAB Levantamento bulletins per safra; no publication date column | Source-native; converted to kg/ha on read | Release dates derived from _LEV_RELEASE_CALENDAR (conab.py:167), a module-level dict mapping LEV ordinal (1–12) to (calendar_month, year_offset), anchored against three published bulletins. |
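
To make the LEV calendar mapping concrete, here is a sketch of how an ordinal could be turned into a release date. The calendar entries below are placeholders and do not reproduce the real _LEV_RELEASE_CALENDAR; the interpretation of year_offset as relative to the safra's starting year and the fixed day of month are also assumptions.

```python
from datetime import date
from typing import Dict, Tuple

# Placeholder entries only: LEV ordinal -> (calendar_month, year_offset).
# The real module-level dict covers ordinals 1-12 with its own values.
LEV_RELEASE_CALENDAR: Dict[int, Tuple[int, int]] = {
    1: (10, 0),
    2: (11, 0),
    3: (12, 0),
    4: (1, 1),
}


def lev_release_date(safra_start_year: int, lev_ordinal: int, day: int = 15) -> date:
    """Derive a release date for one bulletin from its LEV ordinal.

    The day of month is a hypothetical default; the source bulletins
    carry no publication date column.
    """
    month, year_offset = LEV_RELEASE_CALENDAR[lev_ordinal]
    return date(safra_start_year + year_offset, month, day)


if __name__ == "__main__":
    # With the placeholder calendar, the 4th bulletin falls in the next calendar year.
    print(lev_release_date(2023, 4))
```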

Note on NASS: NASS county panel data is loaded by build_yields (FeatureBuilder) as a training feature, and by load_nass_obs / load_nass_county_panel_yield_area (lib/reference_data/nass.py) for bias-correction fitting. It does not flow through the ReferenceYieldLoader hierarchy.

Lifecycle

Instantiation: build_loaders(cfg) at loader.py:68 constructs one ReferenceYieldLoader per spec on cfg.reference_data. Single centralised dispatch so delivery, diagnostics, and the fold generator share the same loader list.

Pre-build: build_references_by_harvest_year(cfg) at loader.py:78 builds a spec.name → harvest_year → release DataFrame dispatch table for ExpandingFoldGenerator; avoids re-loading source files per fold.
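
The shape of the pre-built dispatch table can be sketched with plain dictionaries. This is an illustrative stand-in: the real table holds DataFrames, the function name group_releases_by_year is hypothetical, and keying on marketing_year as the harvest year is an assumption.

```python
from collections import defaultdict
from datetime import date
from typing import Any, Dict, List

Row = Dict[str, Any]


def group_releases_by_year(
    frames_by_name: Dict[str, List[Row]]
) -> Dict[str, Dict[int, List[Row]]]:
    """Build spec name -> harvest year -> that year's release rows."""
    table: Dict[str, Dict[int, List[Row]]] = {}
    for name, rows in frames_by_name.items():
        by_year: Dict[int, List[Row]] = defaultdict(list)
        for row in rows:
            by_year[row["marketing_year"]].append(row)
        table[name] = dict(by_year)
    return table


if __name__ == "__main__":
    # Invented rows; values are in kg/ha.
    frames = {
        "wasde_in_season": [
            {"marketing_year": 2022, "release_date": date(2022, 5, 12), "value": 3400.0},
            {"marketing_year": 2022, "release_date": date(2022, 8, 12), "value": 3380.0},
            {"marketing_year": 2023, "release_date": date(2023, 5, 12), "value": 3500.0},
        ]
    }
    table = group_releases_by_year(frames)
    print(sorted(table["wasde_in_season"]))
```

Each fold can then look up its releases by (spec name, harvest year) without touching the source files again.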

Invocation: Loaders are called at delivery time (delivery/conversions.py) and in diagnostic runners (diagnostics/runners.py) via yield_asof_array (vectorised hot path) and yield_final_series.

Tear-down: None. The _df cache is per-instance; it is released when the loader object goes out of scope.

Relationships

  • ExperimentConfig — carries reference_data: list[ReferenceYieldSpec]; each spec drives one loader.
  • CommodityConfig — passed to from_spec as commodity_cfg; used for commodity-specific calendar and unit lookups.
  • MonthDay (lib/calendar.py:16) — consumed by _ReferenceYieldSpecBase.cutoff_month_day and ReferenceYieldLoader.yield_final.
  • ResolvablePath (lib/path_utils.py:60) — spec.filepath is anchored at data_root automatically.
  • CoverageBiasCorrector (MetaModel) — uses load_nass_county_panel_yield_area (a separate NASS loader, not this hierarchy) to estimate coverage bias.
  • DeliveryRow: its wasde_in_season, conab_final_in_season, conab_lev_in_season columns are populated from yield_asof_array outputs.

Concepts and pipelines

  • source_lib — full survey of the lib/reference_data/ sub-package.
  • ReferenceYieldSpec discriminated union — mirrors EditRuleConfig (lib/edit_and_imputation/edit.py:361) in structure; both use Annotated[..., Field(discriminator="kind")] and are parsed from YAML by ExperimentConfig.

PRs and commits

No reference-data-loader-specific PRs identified in the recent log. The geography field on _ReferenceYieldSpecBase replaced a hardcoded "united_states" filter that previously lived in wasde.py:66.

Open questions

  • ConabLevantamentoLoader infers release dates from _LEV_RELEASE_CALENDAR because the source file carries no publication date column. This calendar requires manual maintenance as CONAB changes its release schedule.
  • ConabFinalLoader ignores the produtividade_mil_ha_mil_t column and recomputes yield from production and area. Any future schema change to the Serie Historica format may silently produce wrong results if the column names change.
  • yield_final_series computes finals by calling yield_final once per marketing year in a Python loop, and each call goes through _cached_load. Caching avoids re-parsing the source file, but every iteration still scans the cached frame, so the cost is O(n × scan) for n marketing years. yield_asof_array is the vectorised hot path over the same data.
  • There is no NassLoader in this hierarchy. NASS county benchmarks are available via lib/reference_data/nass_benchmarks.py but are not surfaced as ReferenceYieldSpec entries; adding NASS as a spec would require a new loader and spec class.
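
On the ConabFinalLoader question above, the following sketch shows the sum-then-recompute aggregation together with one possible guard that fails loudly when expected columns are missing. The column names producao_mil_t and area_mil_ha are hypothetical, the units (thousand tonnes and thousand hectares) are assumed, and the guard is a suggestion and not a description of the current code.

```python
from typing import Dict, List

# Hypothetical column names; the real Serie Historica headers may differ.
REQUIRED_COLUMNS = ("producao_mil_t", "area_mil_ha")


def national_yield_kg_ha(uf_rows: List[Dict[str, float]]) -> float:
    """Aggregate state (UF) rows to a national yield in kg/ha."""
    if not uf_rows:
        raise ValueError("no UF rows to aggregate")
    missing = [
        column
        for column in REQUIRED_COLUMNS
        if any(column not in row for row in uf_rows)
    ]
    if missing:
        # Fail loudly on a schema change instead of producing a wrong yield.
        raise KeyError(f"missing expected columns: {missing}")
    production = sum(row["producao_mil_t"] for row in uf_rows)
    area = sum(row["area_mil_ha"] for row in uf_rows)
    # thousand t / thousand ha = t/ha; multiply by 1000 for kg/ha.
    return production / area * 1000.0


if __name__ == "__main__":
    # Invented figures: one large state at 3 t/ha, one small state at 2 t/ha.
    rows = [
        {"producao_mil_t": 30000.0, "area_mil_ha": 10000.0},
        {"producao_mil_t": 4000.0, "area_mil_ha": 2000.0},
    ]
    print(national_yield_kg_ha(rows))
```

Summing production and area before dividing gives an area-weighted national yield. With the invented figures above it is about 2833 kg/ha, whereas a plain mean of the two state yields would give 2500 kg/ha.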