PR #360 — feat(commodity_hindcast): BRAZIL SOY¶
At a glance¶
- Author: ai-tommytf
- Merged: 2026-05-02
- Branch: tl/brazil_soy
- Net effect: Replaces the single hard-coded evaluation.wasde_path on ExperimentConfig with reference_data: list[ReferenceYieldSpec], a discriminated union supporting WASDE (US), CONAB Levantamento, CONAB Série Histórica, and any future national reference series. Introduces the ReferenceYieldLoader ABC and concrete subclasses. Adds configs/brazil_soybean.yaml. All 22 consumer sites already iterate cfg.reference_data, so adding a new source requires only a new spec + loader + YAML entry, with no consumer changes.
- Why this matters: Brazil soy was previously shimmed with a fake WASDE-shaped CSV carrying a silent factor-67 unit bug; this PR makes non-US reference data a proper first-class citizen of the pipeline.
PR body (faithful extract)¶
What this PR does — in one sentence¶
It rips out a single hard-coded WASDE-shaped USDA reference path from the experiment config and
replaces it with a discriminated-union list of reference yield specs that can describe **any**
national reference series — WASDE for the US, CONAB for Brazil, IBGE-PAM for Brazil-truth, and
(future) anything else — without further code changes.
Why we needed it¶
Until this PR every yield-reference lookup was hard-coded to the USDA WASDE schema:
- geography == "united_states" filter literal at lib/reference_data/wasde.py:66
- February 1st cutoff literal at wasde.py:106
- bu/ac unit assumption everywhere downstream
- A single evaluation.wasde_path field on the experiment config
For Brazil, the workaround was a fake CSV pretending to be a US WASDE file with a silent factor-67 unit bug.
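A factor of roughly 67 is what one would expect when soybean yields in bu/ac are read as kg/ha (or the reverse). The check below is a minimal sketch, assuming the standard 60 lb soybean bushel and the international pound and acre; the constant and function names are illustrative and not taken from the PR.

```python
# Assumed constants (not from the PR): 60 lb soybean bushel,
# international pound (kg) and international acre (ha).
LB_TO_KG = 0.45359237
ACRE_TO_HA = 0.40468564224
SOYBEAN_LB_PER_BU = 60.0

# kg/ha represented by one bu/ac of soybeans
BU_ACRE_TO_KG_HA = SOYBEAN_LB_PER_BU * LB_TO_KG / ACRE_TO_HA


def bu_acre_to_kg_ha(value_bu_acre: float) -> float:
    """Convert a soybean yield from bu/ac to kg/ha."""
    return value_bu_acre * BU_ACRE_TO_KG_HA


print(round(BU_ACRE_TO_KG_HA, 2))     # 67.25
print(round(bu_acre_to_kg_ha(50.0)))  # 3363
```

A 50 bu/ac yield is therefore about 3363 kg/ha, so mislabelling the unit shifts every value by a factor of about 67 without raising any error.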
Before / after mental model¶
Before:
ExperimentConfig
└── evaluation: EvaluationConfig
    └── wasde_path: AnyPath          ← single hardcoded WASDE CSV
            │
            ▼
    lib/reference_data/wasde.py      ← hardcoded "united_states" filter
                                       hardcoded Feb-1 cutoff
                                       hardcoded bu/ac unit
After:
ExperimentConfig
└── reference_data: list[ReferenceYieldSpec]
    ├── kind: wasde                  ← geography, cutoff, unit declared per-spec
    ├── kind: conab_levantamento
    └── kind: conab_final
            │
            ▼
    ReferenceYieldLoader (ABC)
    ├── WasdeLoader                  ← bu/ac → kg/ha at read-time
    ├── ConabLevantamentoLoader      ← latin-1 + LEV calendar
    └── ConabFinalLoader             ← Série Histórica per safra
The new config schema¶
US YAML (one spec, preserves prior behaviour):
reference_data:
  - kind: wasde
    name: wasde
    filepath: data/wasde/wasde_soybeans_us_yield.csv
    commodity: soybeans
    geography: united_states
    cutoff_month_day: {month: 2, day: 1}
    unit: bu_acre
Brazil soy YAML (two specs; conab_lev is primary in-season comparator):
reference_data:
  - kind: conab_levantamento
    name: conab_lev
    filepath: data/conab/conab_levantamento_graos.txt
    commodity: soybeans
    geography: brazil
    cutoff_month_day: { month: 10, day: 1 }
    unit: kg_per_ha
  - kind: conab_final
    name: conab_final
    filepath: data/conab/conab_serie_historica_graos.txt
    commodity: soybeans
    geography: brazil
    cutoff_month_day: { month: 10, day: 1 }
    unit: kg_per_ha
Spec field roles:
| field | role |
|---|---|
| kind | discriminated-union tag; picks the concrete loader class |
| name | drives downstream column / metric prefixes (e.g. conab_lev_in_season); must be unique |
| filepath | ResolvablePath, anchored at data_root; preflight-checked automatically |
| commodity | filters multi-commodity sources |
| geography | replaces the "united_states" literal |
| cutoff_month_day | replaces the hard-coded Timestamp(year, 2, 1) |
| unit | source-of-truth unit; loader converts to canonical kg/ha at read-time |
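The way the kind tag selects a spec class, and the uniqueness rule on name, can be sketched with the standard library alone. This is an illustrative stand-in, not the PR's code: the class names WasdeRefSpec, ConabLevantamentoRefSpec, ConabFinalRefSpec, MonthDay and the helper parse_reference_data are hypothetical, and the real implementation may rely on a validation library rather than dataclasses.

```python
from dataclasses import dataclass
from typing import Literal, Union


@dataclass(frozen=True)
class MonthDay:
    month: int
    day: int


@dataclass(frozen=True)
class BaseRefSpec:
    name: str
    filepath: str
    commodity: str
    geography: str
    cutoff_month_day: MonthDay
    unit: Literal["bu_acre", "kg_per_ha"]


# Hypothetical concrete specs; each one fixes its own discriminator value.
@dataclass(frozen=True)
class WasdeRefSpec(BaseRefSpec):
    kind: Literal["wasde"] = "wasde"


@dataclass(frozen=True)
class ConabLevantamentoRefSpec(BaseRefSpec):
    kind: Literal["conab_levantamento"] = "conab_levantamento"


@dataclass(frozen=True)
class ConabFinalRefSpec(BaseRefSpec):
    kind: Literal["conab_final"] = "conab_final"


ReferenceYieldSpec = Union[WasdeRefSpec, ConabLevantamentoRefSpec, ConabFinalRefSpec]

_SPEC_BY_KIND = {
    "wasde": WasdeRefSpec,
    "conab_levantamento": ConabLevantamentoRefSpec,
    "conab_final": ConabFinalRefSpec,
}


def parse_reference_data(raw: list) -> list:
    """Build specs from YAML-style dicts, dispatching on the 'kind' tag."""
    specs = []
    for entry in raw:
        fields = dict(entry)  # copy so the caller's dict is not mutated
        kind = fields.pop("kind")
        if kind not in _SPEC_BY_KIND:
            raise ValueError(f"unknown reference kind: {kind!r}")
        fields["cutoff_month_day"] = MonthDay(**fields["cutoff_month_day"])
        specs.append(_SPEC_BY_KIND[kind](**fields))
    names = [spec.name for spec in specs]
    if len(names) != len(set(names)):
        raise ValueError(f"reference_data names must be unique, got {names}")
    return specs


brazil_raw = [
    {"kind": "conab_levantamento", "name": "conab_lev",
     "filepath": "data/conab/conab_levantamento_graos.txt",
     "commodity": "soybeans", "geography": "brazil",
     "cutoff_month_day": {"month": 10, "day": 1}, "unit": "kg_per_ha"},
    {"kind": "conab_final", "name": "conab_final",
     "filepath": "data/conab/conab_serie_historica_graos.txt",
     "commodity": "soybeans", "geography": "brazil",
     "cutoff_month_day": {"month": 10, "day": 1}, "unit": "kg_per_ha"},
]
specs = parse_reference_data(brazil_raw)
print([type(spec).__name__ for spec in specs])
```

Because name feeds column and metric prefixes, rejecting duplicates at parse time keeps two specs from silently writing to the same downstream column.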
The loader contract¶
class ReferenceYieldLoader:
    def load(self) -> pd.DataFrame: ...  # abstract: each subclass owns parse+filter+convert
    def yield_asof(self, harvest_year, init_date) -> float | None: ...
    def yield_final(self, harvest_year) -> float | None: ...
    def yield_asof_array(...) -> ...: ...
    def yield_final_series(self) -> pd.Series: ...
The ABC contract is kg/ha throughout. Conversion to bu/ac happens at the delivery / dashboard boundary.
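To make the contract concrete, here is a minimal standard-library sketch of an abstract loader whose subclasses return kg/ha. It is an assumption-laden illustration: the real load() returns a pd.DataFrame, whereas this version returns a list of Release records; the names ReferenceYieldLoaderSketch, Release and InMemoryBuAcreLoader are hypothetical; and the "latest release on or before init_date" reading of yield_asof is inferred, not confirmed by the source.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Release:
    harvest_year: int
    release_date: date
    yield_kg_ha: float  # canonical unit of the contract


class ReferenceYieldLoaderSketch(ABC):
    @abstractmethod
    def load(self) -> list:
        """Parse, filter and convert the source; yields come back in kg/ha."""

    def yield_asof(self, harvest_year: int, init_date: date) -> Optional[float]:
        """Assumed semantics: latest estimate released on or before init_date."""
        known = [r for r in self.load()
                 if r.harvest_year == harvest_year and r.release_date <= init_date]
        if not known:
            return None
        return max(known, key=lambda r: r.release_date).yield_kg_ha

    def yield_final(self, harvest_year: int) -> Optional[float]:
        """Assumed semantics: the last estimate released for the harvest year."""
        rows = [r for r in self.load() if r.harvest_year == harvest_year]
        if not rows:
            return None
        return max(rows, key=lambda r: r.release_date).yield_kg_ha


class InMemoryBuAcreLoader(ReferenceYieldLoaderSketch):
    """Toy subclass: source rows are in bu/ac and converted on read."""

    BU_ACRE_TO_KG_HA = 67.25  # approximate soybean factor, assumed

    def __init__(self, rows):
        self._rows = rows  # tuples of (harvest_year, release_date, yield_bu_acre)

    def load(self) -> list:
        return [Release(year, released, value * self.BU_ACRE_TO_KG_HA)
                for year, released, value in self._rows]


# Hypothetical values, for illustration only.
loader = InMemoryBuAcreLoader([
    (2023, date(2023, 5, 12), 52.0),
    (2023, date(2023, 9, 12), 50.0),
])
print(loader.yield_asof(2023, date(2023, 6, 1)))  # 3497.0
print(loader.yield_asof(2023, date(2023, 1, 1)))  # None
print(loader.yield_final(2023))                   # 3362.5
```

Keeping the conversion inside load() means callers of yield_asof and yield_final never see the source unit.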
What broke and how it was found (three rounds)¶
Round 1 — the type bug: polars.exceptions.InvalidOperationError: 'clip' only supports physical numeric types. Root cause: conab_final with cutoff_month_day: {month: 10} returned None for all in-season inits; polars inferred Null dtype; clip_yield_to_delivery_range raised. Fix: cast to Float64 at construction. Commit 06b43eac, +7 LOC.
Round 2 — "ALL HISTORY" body was empty. Two sub-problems: (1) ConabFinalLoader.yield_final used a strict release_date < cutoff filter that excluded the one row whose release_date == cutoff; fixed by matching on marketing_year directly. (2) "NASS" hard-coded in three header lines; fixed by CommodityConfig.actuals_source_short / _label with Brazil override to IBGE.
Round 3 — YAML spec ordering. With conab_final first it became the primary in-season comparator, but that source is post-harvest so its in-season column was all-None. Swapped to conab_lev first.
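The Round 2 boundary problem is easy to reproduce in isolation. The sketch below uses hypothetical rows and helper names (nothing here is copied from ConabFinalLoader) to show why a strict release_date < cutoff filter returns nothing when the source pins release_date to the cutoff, and why matching on marketing_year avoids the issue.

```python
from datetime import date

# Hypothetical row: (marketing_year, release_date, yield_kg_ha).
# The release_date sits exactly on the cutoff, as described for the final series.
CUTOFF = date(2024, 10, 1)
rows = [(2024, date(2024, 10, 1), 3200.0)]


def final_by_strict_cutoff(rows, cutoff):
    """Buggy variant: strict '<' drops the row released exactly on the cutoff."""
    hits = [value for _, released, value in rows if released < cutoff]
    return hits[-1] if hits else None


def final_by_marketing_year(rows, marketing_year):
    """Fixed variant: select the row by marketing year, ignoring release_date."""
    hits = [value for year, _, value in rows if year == marketing_year]
    return hits[-1] if hits else None


print(final_by_strict_cutoff(rows, CUTOFF))  # None
print(final_by_marketing_year(rows, 2024))   # 3200.0
```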
stage5_metrics.txt after the fix (Brazil run)¶
IBGE benchmark: national area-weighted survey yield (bu/acre) per harvest year
IBGE values (bu/acre): {2020: 48.7, 2021: 51.23, 2022: 43.93, 2023: 50.93, 2024: 46.8}
ALL HISTORY (2020-2024, 5 OOS years)
==================================================================================
Fold      Model                  | CONAB_LEV              | Improv%  Win
          vs_IBGE  vs_CONAB_LEV  | vs_IBGE  vs_CONAB_LEV  |
----------------------------------------------------------------------------------
w40       3.82     3.49          | 4.29     3.01          | 11.0%    3/5
...
OVERALL   3.49     3.61          | 3.49     2.27          | -0.1%
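The Improv% column appears consistent with the relative reduction in error against IBGE when moving from CONAB_LEV to the model. That formula is inferred from the w40 row and is not stated in the source; the function name is hypothetical.

```python
def improvement_pct(model_err: float, reference_err: float) -> float:
    """Relative error reduction of the model versus the reference, in percent."""
    return 100.0 * (reference_err - model_err) / reference_err


# w40 row: Model vs_IBGE = 3.82, CONAB_LEV vs_IBGE = 4.29
print(f"{improvement_pct(3.82, 4.29):.1f}%")  # 11.0%
```

Under the same assumption, the OVERALL row's -0.1% with both errors printed as 3.49 would have to come from unrounded values that differ slightly.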
How to add a future source (e.g. Argentina BCRA)¶
- Add BcraRefSpec (discriminated by kind: "bcra") to lib/reference_data/loader.py.
- Add BcraLoader(ReferenceYieldLoader) with its own load().
- Wire into the ReferenceYieldSpec union and from_spec dispatch.
- Add the spec to configs/argentina_soybean.yaml.
No edits to diagnostics/, delivery/, stages/ or run/.
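The steps above can be sketched as a kind-keyed dispatch. This is a simplified illustration, assuming from_spec maps a spec's kind to a loader class; the registry name _LOADER_BY_KIND and the field set on BcraRefSpec are hypothetical, the loader omits the ReferenceYieldLoader base class to stay self-contained, and the file path is made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BcraRefSpec:
    """Step 1: a new spec with its own discriminator value."""
    name: str
    filepath: str
    kind: str = "bcra"


class BcraLoader:
    """Step 2: a new loader that owns its own load()."""

    def __init__(self, spec: BcraRefSpec) -> None:
        self.spec = spec

    def load(self):
        # Parsing is source-specific and deliberately left out of this sketch.
        raise NotImplementedError("parse, filter and convert to kg/ha here")


# Step 3: one new registry entry; existing entries are untouched.
_LOADER_BY_KIND = {
    "bcra": BcraLoader,
}


def from_spec(spec):
    """Return the loader instance registered for spec.kind."""
    try:
        loader_cls = _LOADER_BY_KIND[spec.kind]
    except KeyError:
        raise ValueError(f"no loader registered for kind {spec.kind!r}") from None
    return loader_cls(spec)


# Hypothetical path, for illustration only.
loader = from_spec(BcraRefSpec(name="bcra", filepath="data/bcra/example.csv"))
print(type(loader).__name__)  # BcraLoader
```

Consumers that only call from_spec and the loader methods need no changes when a new kind is registered, which matches the claim that no consumer directories are edited.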
Test results¶
Files / lines touched¶
| Additions | Deletions | File |
|---|---|---|
| +375 | -0 | market_insights_models/src/commodity_hindcast/configs/brazil_soybean.yaml |
| +246 | -0 | market_insights_models/src/commodity_hindcast/lib/reference_data/base_reference_yield_loader.py |
| +245 | -0 | market_insights_models/src/commodity_hindcast/lib/reference_data/conab.py |
| +132 | -61 | market_insights_models/src/commodity_hindcast/diagnostics/runners.py |
| +110 | -59 | market_insights_models/src/commodity_hindcast/diagnostics/plots/fns/delivery.py |
| +67 | -94 | market_insights_models/src/commodity_hindcast/lib/reference_data/wasde.py |
| +119 | -0 | market_insights_models/src/commodity_hindcast/lib/reference_data/loader.py |
| +64 | -35 | market_insights_models/src/commodity_hindcast/diagnostics/metrics.py |
| +60 | -35 | market_insights_models/src/commodity_hindcast/delivery/conversions.py |
| +45 | -35 | market_insights_models/src/commodity_hindcast/config.py |
Cross-references¶
- Related entity pages: ExperimentConfig, CommodityConfig, ReferenceYieldLoader
- Related concept pages: unit conventions, reference data discriminated union
- Related PR: PR-363 (dashboard fix that had to migrate to the new WasdeLoader API introduced here)
Lessons captured¶
- evaluation.wasde_path is deleted; consumers must use cfg.reference_data (a list[ReferenceYieldSpec]).
- The ABC contract is kg/ha throughout; each loader converts on read.
- The first spec in reference_data is the primary in-season comparator; ordering matters.
- ConabFinalLoader.yield_final uses marketing-year matching (not strict release_date < cutoff) because the Série Histórica pins release_date equal to the cutoff.
- CommodityConfig.actuals_source_short / _label default to "NASS" / "NASS Survey Yield (area-weighted)"; Brazil overrides to IBGE / "IBGE-PAM municipal yield (area-weighted)".
- delivery/conversions.py still aliases the obs-yield column to "nass_actual" regardless of geography. Values are correct but the label is misleading for non-US runs; tracked as a follow-up.
- 34 new tests in tests/unit/commodity_hindcast/lib/reference_data/ cover the discriminator round-trip, name-uniqueness validator, ABC dispatch, geography/commodity filters, bu/ac→kg/ha conversion, cutoff boundary behaviour, and ConabFinal year-mapping.