ReferenceYieldLoader¶
Definition¶
A ReferenceYieldLoader loads an external reference yield time series — WASDE, CONAB final, or CONAB levantamento — and exposes it as a standardised DataFrame keyed by (marketing_year, release_date) with values in kg/ha. The ABC is ReferenceYieldLoader at lib/reference_data/base_reference_yield_loader.py:77. Each loader is paired with a corresponding spec class; a discriminated union ReferenceYieldSpec (lib/reference_data/loader.py:59) drives YAML parsing and dispatch.
The ABC contract is always kg/ha: concrete loaders convert from their source unit on read, so consumers only ever see kg/ha.
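As a rough illustration of the convert-on-read rule, the sketch below turns bu/acre and t/ha values into kg/ha. The function names and the bushel-weight table are assumptions made for this example (standard US bushel weights of 60 lb for soybeans and 56 lb for corn); the project's own conversion helpers are not shown in this page.

```python
# Illustrative helpers only; names and the bushel-weight table are assumptions,
# not the project's actual conversion code.
LB_TO_KG = 0.45359237        # exact: definition of the avoirdupois pound
ACRE_TO_HA = 0.40468564224   # exact: international acre in hectares

# Standard US bushel weights in pounds (assumed to be the relevant ones here).
BUSHEL_WEIGHT_LB = {"soybeans": 60.0, "corn": 56.0}


def bu_acre_to_kg_ha(value: float, commodity: str) -> float:
    """Convert a yield in bushels/acre to kg/ha for the given commodity."""
    kg_per_bushel = BUSHEL_WEIGHT_LB[commodity] * LB_TO_KG
    return value * kg_per_bushel / ACRE_TO_HA


def t_ha_to_kg_ha(value: float) -> float:
    """Convert a yield in metric tonnes/ha to kg/ha."""
    return value * 1000.0
```

Because the bushel weight differs by commodity, a bu/acre conversion needs the commodity as an input, which is one plausible reason the spec carries a commodity field.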
NASS is not a ReferenceYieldLoader. NASS county yield is loaded as a feature via YieldsBuilder (FeatureBuilder), not via this hierarchy. The NassSpec name in early design documents has no code backing.
Kind¶
ABC (ReferenceYieldLoader at lib/reference_data/base_reference_yield_loader.py:77).
Source of truth¶
market_insights_models/src/commodity_hindcast/lib/reference_data/base_reference_yield_loader.py:77
Spec base¶
_ReferenceYieldSpecBase (base_reference_yield_loader.py:36) is the frozen Pydantic base shared by every ReferenceYieldSpec subclass:
| Field | Type | Purpose |
|---|---|---|
| name | str | Metric / column prefix downstream (e.g. wasde_in_season). |
| filepath | ResolvablePath | Anchored at data_root; preflight-checked automatically. |
| commodity | str | Lower-case commodity name (e.g. "soybeans", "corn"). |
| geography | str | Lower-case geography token (e.g. "united_states", "brazil"). |
| cutoff_month_day | MonthDay | Calendar day after which the value for harvest_year is treated as final. |
| unit | Literal["bu_acre", "kg_per_ha"] | Source unit; WasdeRefSpec overrides to "bu_acre". Default "kg_per_ha". |
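The field layout above can be sketched with frozen dataclasses from the standard library. This is a stand-in, not the real model: the actual base is a frozen Pydantic class, MonthDay and ResolvablePath are project types whose definitions are not shown here, and the class names with a Sketch suffix are hypothetical.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Literal


@dataclass(frozen=True)
class MonthDay:
    """Stand-in for the project's MonthDay type (real definition not shown)."""
    month: int
    day: int


@dataclass(frozen=True)
class ReferenceYieldSpecSketch:
    """Immutable spec mirroring the documented fields of _ReferenceYieldSpecBase."""
    name: str
    filepath: Path                 # the real field is a ResolvablePath
    commodity: str
    geography: str
    cutoff_month_day: MonthDay
    unit: Literal["bu_acre", "kg_per_ha"] = "kg_per_ha"


@dataclass(frozen=True)
class WasdeRefSpecSketch(ReferenceYieldSpecSketch):
    """Overrides only the default source unit, as the table describes."""
    unit: Literal["bu_acre", "kg_per_ha"] = "bu_acre"


# The path and cutoff below are made-up example values.
example_spec = WasdeRefSpecSketch(
    name="wasde_in_season",
    filepath=Path("reference/wasde.csv"),
    commodity="soybeans",
    geography="united_states",
    cutoff_month_day=MonthDay(month=1, day=31),
)
```

Freezing the spec means a loader can hold a reference to it without worrying that configuration changes underneath it after construction.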
Required interface¶
| Method | Signature | Notes |
|---|---|---|
| load (abstract) | () → DataFrame | Returns release frame with columns marketing_year, release_date, geography, commodity, variable, value (kg/ha). Sorted by (marketing_year, release_date); no duplicates. Empty input → ValueError. |
| yield_asof (concrete) | (harvest_year, init_date) → float \| None | Latest release strictly before init_date for harvest_year. base_reference_yield_loader.py:163. |
| yield_final (concrete) | (harvest_year) → float \| None | Last release before spec.cutoff_month_day of harvest_year. base_reference_yield_loader.py:185. |
| yield_asof_array (concrete) | (harvest_year, init_dates) → ndarray | Vectorised yield_asof for a fixed harvest_year. Hot path for delivery and diagnostics. base_reference_yield_loader.py:202. |
| yield_final_series (concrete) | () → Series | Series of finals indexed by marketing_year; name is f"{spec.name}_final_kg_ha". base_reference_yield_loader.py:220. |
The load() result is cached on self._df at first call; subsequent lookups return the cached frame without re-parsing the source file.
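The caching behaviour and the "strictly before" lookup can be sketched without pandas. The classes below are hypothetical stand-ins that use sorted lists of tuples in place of the DataFrame. They assume that marketing_year can be matched directly against harvest_year and that the cutoff falls in calendar year harvest_year and is exclusive; neither assumption is confirmed by this page.

```python
from abc import ABC, abstractmethod
from bisect import bisect_left
from datetime import date
from typing import Optional

# (marketing_year, release_date, value in kg/ha); stands in for one DataFrame row.
Release = tuple[int, date, float]


class LoaderSketch(ABC):
    """Hypothetical stand-in for ReferenceYieldLoader, using lists instead of pandas."""

    def __init__(self, cutoff_month: int, cutoff_day: int) -> None:
        self._cutoff = (cutoff_month, cutoff_day)
        self._rows: Optional[list[Release]] = None  # plays the role of self._df

    @abstractmethod
    def load(self) -> list[Release]:
        """Return releases sorted by (marketing_year, release_date)."""

    def _cached_load(self) -> list[Release]:
        # Parse the source once; later lookups reuse the cached rows.
        if self._rows is None:
            self._rows = self.load()
        return self._rows

    def yield_asof(self, harvest_year: int, init_date: date) -> Optional[float]:
        rows = [r for r in self._cached_load() if r[0] == harvest_year]
        dates = [r[1] for r in rows]
        idx = bisect_left(dates, init_date)  # first release on or after init_date
        return rows[idx - 1][2] if idx > 0 else None

    def yield_final(self, harvest_year: int) -> Optional[float]:
        month, day = self._cutoff
        return self.yield_asof(harvest_year, date(harvest_year, month, day))


class InMemoryLoader(LoaderSketch):
    """Test double that counts how often load() actually runs."""

    def __init__(self, rows: list[Release], cutoff_month: int, cutoff_day: int) -> None:
        super().__init__(cutoff_month, cutoff_day)
        self._source = sorted(rows)
        self.load_calls = 0

    def load(self) -> list[Release]:
        self.load_calls += 1
        if not self._source:
            raise ValueError("reference source produced no rows")
        return list(self._source)
```

With an exclusive cutoff, a release dated exactly on the cutoff would never be returned by the base yield_final, which is consistent with a loader needing its own override when release_date equals the cutoff.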
Dispatch¶
Registry: ReferenceYieldLoader._registry (base_reference_yield_loader.py:97) is a class-level dict[type[_ReferenceYieldSpecBase], type[ReferenceYieldLoader]] populated by explicit ReferenceYieldLoader.register(spec_type, loader_type) calls from loader.py after the concrete modules are imported. Explicit registration (rather than __init_subclass__ magic) keeps the dispatch table greppable.
Factory: ReferenceYieldLoader.from_spec(spec, commodity_cfg) (base_reference_yield_loader.py:129) looks up type(spec) in _registry and instantiates the matched loader. Raises TypeError on unknown spec type.
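A minimal sketch of the explicit-registration pattern described above follows. All class names are hypothetical stand-ins, and the constructor signature (spec, commodity_cfg) is an assumption based on the from_spec arguments.

```python
from typing import Any, ClassVar


class SpecSketch:
    """Stand-in for _ReferenceYieldSpecBase."""

    def __init__(self, name: str) -> None:
        self.name = name


class LoaderSketch:
    """Stand-in for ReferenceYieldLoader with a class-level dispatch table."""

    # Maps spec type -> loader type; shared by the base class and all subclasses.
    _registry: ClassVar[dict[type, type]] = {}

    def __init__(self, spec: SpecSketch, commodity_cfg: Any) -> None:
        self.spec = spec
        self.commodity_cfg = commodity_cfg

    @classmethod
    def register(cls, spec_type: type, loader_type: type) -> None:
        cls._registry[spec_type] = loader_type

    @classmethod
    def from_spec(cls, spec: SpecSketch, commodity_cfg: Any) -> "LoaderSketch":
        try:
            loader_type = cls._registry[type(spec)]
        except KeyError:
            raise TypeError(f"no loader registered for {type(spec).__name__}") from None
        return loader_type(spec, commodity_cfg)


class WasdeSpecSketch(SpecSketch):
    pass


class WasdeLoaderSketch(LoaderSketch):
    pass


# Explicit registration, done once after the concrete modules are imported.
LoaderSketch.register(WasdeSpecSketch, WasdeLoaderSketch)
```

The lookup uses the exact type of the spec, so a subclass of a registered spec would not match unless it is registered itself; whether the real factory behaves the same way is not stated here.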
YAML entry point: ReferenceYieldSpec at loader.py:59:
```python
ReferenceYieldSpec = Annotated[
    WasdeRefSpec | ConabFinalRefSpec | ConabLevantamentoRefSpec,
    Field(discriminator="kind"),
]
```
Parsed from ExperimentConfig.reference_data: list[ReferenceYieldSpec]. Each name must be unique within the list (validated at config load, config.py:799).
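The two checks applied to the parsed list (a recognised kind, and unique names) can be illustrated in plain Python. In the real code Pydantic's discriminator handles the kind dispatch and config.py enforces uniqueness; the function below is a hypothetical stand-in operating on raw dicts such as those produced by a YAML parser.

```python
from collections import Counter
from typing import Any

# The three kind values documented for the concrete spec classes.
KNOWN_KINDS = {"wasde", "conab_final", "conab_levantamento"}


def check_reference_data(entries: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Validate raw reference_data entries; returns them unchanged if valid."""
    for entry in entries:
        kind = entry.get("kind")
        if kind not in KNOWN_KINDS:
            raise ValueError(f"unknown reference kind: {kind!r}")

    # Names become column prefixes downstream, so duplicates would collide.
    counts = Counter(entry["name"] for entry in entries)
    duplicates = sorted(name for name, n in counts.items() if n > 1)
    if duplicates:
        raise ValueError(f"duplicate reference names: {duplicates}")
    return entries
```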
Concrete implementations¶
| Spec class | Loader class | kind | File | Source format | Unit | Notes |
|---|---|---|---|---|---|---|
| WasdeRefSpec | WasdeLoader | wasde | wasde.py:25 / wasde.py:32 | WASDE CSV | bu_acre (default for WasdeRefSpec); converted to kg/ha on read | Filters by geography/commodity/variable; parses marketing_year from "YYYY/YY" strings. |
| ConabFinalRefSpec | ConabFinalLoader | conab_final | conab.py:25 / conab.py:94 | CONAB Serie Historica; latin-1, semicolon-delimited | Source-native (t/ha); converted to kg/ha on read | Aggregates UF rows to national by summing production and area then recomputing yield; produtividade_mil_ha_mil_t column ignored. Overrides yield_final because release_date equals the cutoff. |
| ConabLevantamentoRefSpec | ConabLevantamentoLoader | conab_levantamento | conab.py:37 / conab.py:197 | 12 monthly CONAB Levantamento bulletins per safra; no publication date column | Source-native; converted to kg/ha on read | Release dates derived from _LEV_RELEASE_CALENDAR (conab.py:167), a module-level dict mapping LEV ordinal (1–12) to (calendar_month, year_offset) anchored against three published bulletins. |
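Three of the transformations named in the Notes column can be sketched as small functions. All names here are hypothetical. The marketing-year convention (take the first year of "YYYY/YY"), the fixed day of month used for derived release dates, and the calendar values in the usage example are assumptions for illustration, not the project's actual choices or CONAB's schedule.

```python
from datetime import date


def parse_marketing_year(label: str) -> int:
    """Parse a 'YYYY/YY' label; returns the first year (assumed convention)."""
    start, _, end = label.strip().partition("/")
    if len(start) != 4 or len(end) != 2 or not (start + end).isdigit():
        raise ValueError(f"expected 'YYYY/YY', got {label!r}")
    return int(start)


def national_yield_kg_ha(uf_rows: list[tuple[float, float]]) -> float:
    """Aggregate (production_t, area_ha) rows per UF into a national kg/ha yield.

    Production and area are summed first and the yield recomputed, which
    weights each UF by its area rather than averaging per-UF yields.
    """
    production_t = sum(production for production, _ in uf_rows)
    area_ha = sum(area for _, area in uf_rows)
    if area_ha <= 0:
        raise ValueError("total area must be positive")
    return production_t / area_ha * 1000.0


def lev_release_date(
    safra_start_year: int,
    ordinal: int,
    calendar: dict[int, tuple[int, int]],
    day: int = 15,  # assumed day of month; the calendar only gives month and offset
) -> date:
    """Derive a release date from an ordinal -> (calendar_month, year_offset) map."""
    month, year_offset = calendar[ordinal]
    return date(safra_start_year + year_offset, month, day)


# Placeholder calendar entries for illustration only.
example_calendar = {1: (10, 0), 5: (2, 1)}
example_date = lev_release_date(2023, 5, example_calendar)
```

Summing before dividing matters whenever UFs differ in size: two UFs at 3000 and 1000 kg/ha average to 2000 kg/ha, but if the second has three times the area the national figure is 1500 kg/ha.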
Note on NASS: NASS county panel data is loaded by build_yields (FeatureBuilder) as a training feature, and by load_nass_obs / load_nass_county_panel_yield_area (lib/reference_data/nass.py) for bias-correction fitting. It does not flow through the ReferenceYieldLoader hierarchy.
Lifecycle¶
Instantiation: build_loaders(cfg) at loader.py:68 constructs one ReferenceYieldLoader per spec in cfg.reference_data. Dispatch is centralised here so that delivery, diagnostics, and the fold generator all share the same loader list.
Pre-build: build_references_by_harvest_year(cfg) at loader.py:78 builds a nested lookup table (spec.name → harvest_year → release DataFrame) for ExpandingFoldGenerator, so that source files are not re-loaded for each fold.
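The shape of the pre-built lookup can be sketched with nested dicts. The function name and the tuple-based rows are hypothetical; the real table holds release DataFrames, and this sketch assumes marketing_year serves as the harvest_year key.

```python
from collections import defaultdict
from datetime import date

Release = tuple[int, date, float]  # (marketing_year, release_date, value in kg/ha)
ByYear = dict[int, list[tuple[date, float]]]


def group_releases_by_harvest_year(
    releases_by_name: dict[str, list[Release]],
) -> dict[str, ByYear]:
    """Build a name -> harvest_year -> sorted releases table in a single pass."""
    table: dict[str, ByYear] = {}
    for name, rows in releases_by_name.items():
        by_year: defaultdict[int, list[tuple[date, float]]] = defaultdict(list)
        for marketing_year, release_date, value in sorted(rows):
            by_year[marketing_year].append((release_date, value))
        table[name] = dict(by_year)
    return table
```

Building the table once up front means each fold only performs dictionary lookups instead of touching the source files again.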
Invocation: Loaders are called at delivery time (delivery/conversions.py) and in diagnostic runners (diagnostics/runners.py) via yield_asof_array (vectorised hot path) and yield_final_series.
Tear-down: None. The _df cache is per-instance; it is released when the loader object goes out of scope.
Relationships¶
- ExperimentConfig: carries the reference_data field (list[ReferenceYieldSpec]); each spec drives one loader.
- CommodityConfig: passed to from_spec as commodity_cfg; used for commodity-specific calendar and unit lookups.
- MonthDay (lib/calendar.py:16): consumed by _ReferenceYieldSpecBase.cutoff_month_day and ReferenceYieldLoader.yield_final.
- ResolvablePath (lib/path_utils.py:60): spec.filepath is anchored at data_root automatically.
- CoverageBiasCorrector (MetaModel): uses load_nass_county_panel_yield_area (a separate NASS loader, not this hierarchy) to estimate coverage bias.
- DeliveryRow: the wasde_in_season, conab_final_in_season, and conab_lev_in_season columns are populated from yield_asof_array outputs.
Concepts and pipelines¶
- source_lib: full survey of the lib/reference_data/ sub-package.
- ReferenceYieldSpec discriminated union: mirrors EditRuleConfig (lib/edit_and_imputation/edit.py:361) in structure; both use Annotated[..., Field(discriminator="kind")] and are parsed from YAML by ExperimentConfig.
PRs and commits¶
No reference-data-loader-specific PRs identified in the recent log. The geography field on _ReferenceYieldSpecBase replaced a hardcoded "united_states" filter that previously lived in wasde.py:66.
Open questions¶
- ConabLevantamentoLoader infers release dates from _LEV_RELEASE_CALENDAR because the source file carries no publication date column. The calendar must be updated by hand whenever CONAB changes its release schedule.
- ConabFinalLoader ignores the produtividade_mil_ha_mil_t column and recomputes yield from production and area. A future change to the Serie Historica column names may silently produce wrong results.
- yield_final_series iterates over all marketing years in a Python loop, calling yield_final (and therefore _cached_load) on each iteration. Caching avoids re-parsing the file, but each call still scans the cached frame, so for large reference files with many years the cost is O(n × scan); yield_asof_array is the hot path for the same data.
- There is no NassLoader in this hierarchy. NASS county benchmarks are available via lib/reference_data/nass_benchmarks.py but are not surfaced as ReferenceYieldSpec entries; adding NASS as a spec would require a new loader and spec class.
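For the yield_final_series cost noted in the list above, one possible direction is a single pass over the sorted releases. This is an untested sketch with hypothetical names, not a proposed patch: it assumes rows are already sorted by (marketing_year, release_date), that the cutoff falls in the calendar year equal to marketing_year, and that the cutoff is exclusive.

```python
from datetime import date

Release = tuple[int, date, float]  # (marketing_year, release_date, value in kg/ha)


def finals_single_pass(
    rows: list[Release], cutoff_month: int, cutoff_day: int
) -> dict[int, float]:
    """Return the last pre-cutoff value per marketing year in one scan.

    Requires rows sorted by (marketing_year, release_date), so that a later
    qualifying release simply overwrites an earlier one for the same year.
    """
    finals: dict[int, float] = {}
    for marketing_year, release_date, value in rows:
        if release_date < date(marketing_year, cutoff_month, cutoff_day):
            finals[marketing_year] = value
    return finals
```

Years with no release before the cutoff are simply absent from the result, whereas the per-year yield_final returns None; a real replacement would need to reconcile that difference.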