Entity: Yield
Definition
A scalar crop yield measurement or prediction. The pipeline maintains a single internal unit — kilograms per hectare (yield_kg_ha) — throughout all training, prediction, and postprocessing stages. Unit conversion to delivery units (bu/ac for grains; lbs/ac for cotton) happens exclusively at the delivery/ boundary via lib/unit_utils.py. The predicted yield is never a standalone class; it is a named column in the parquet schema, a field on DeliveryRow, and the target of CommodityConfig.target_col.
Kind
Value object (float-valued column in parquets and delivery rows). No dedicated class. The canonical internal column name is yield_kg_ha; the canonical delivery column is mean (delivery/schemas.py:137).
Source of truth
market_insights_models/src/commodity_hindcast/lib/unit_utils.py:33 — kg_ha_to_bu_acre(kg_ha, bushel_weight_lbs) and bu_acre_to_kg_ha(bu_acre, bushel_weight_lbs) are the single source of truth for unit conversion constants (HA_PER_ACRE = 0.404686, KG_PER_LB = 0.453592).
delivery/schemas.py:109 — DeliveryRow class, with mean: float at line 137 as the primary predicted yield in delivery units.
config.py:304 — CommodityConfig.yield_range: tuple[float, float] defines plausibility bounds in delivery units.
Key attributes / structure
Internal (pipeline) columns:
| Column | Units | Notes |
|---|---|---|
| yield_kg_ha | kg/ha | Primary target; canonical internal representation |
| target_col | kg/ha | Config-defined alias for yield_kg_ha (set per commodity) |
| target_detrended_col | kg/ha (detrended residual) | After AbstractDetrend.fit_transform; fed to Regressor |
| sim_yield_kg_ha | kg/ha | Simulated yield output from Regressor.predict |
| obs_yield_kg_ha | kg/ha | Observed yield from NASS/CONAB builder |
| area_harvested_ha | ha | Area weight; required for all national aggregations |
| production_kg | kg | area_harvested_ha × yield_kg_ha |
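The production_kg definition implies how a national yield has to be aggregated: total production divided by total harvested area. The following is a minimal sketch of that area-weighted mean, not the pipeline's aggregation code; the function name national_yield_kg_ha and the two sample regions are hypothetical.

```python
from typing import Iterable, Tuple


def national_yield_kg_ha(rows: Iterable[Tuple[float, float]]) -> float:
    """Area-weighted yield from (yield_kg_ha, area_harvested_ha) pairs."""
    total_production_kg = 0.0
    total_area_ha = 0.0
    for yield_kg_ha, area_harvested_ha in rows:
        # production_kg = area_harvested_ha * yield_kg_ha
        total_production_kg += area_harvested_ha * yield_kg_ha
        total_area_ha += area_harvested_ha
    if total_area_ha <= 0:
        raise ValueError("total harvested area must be positive")
    return total_production_kg / total_area_ha


# Hypothetical data: a large low-yield region and a small high-yield region.
regions = [(8000.0, 900.0), (12000.0, 100.0)]
weighted = national_yield_kg_ha(regions)                 # 8400.0 kg/ha
unweighted = sum(y for y, _ in regions) / len(regions)   # 10000.0 kg/ha
```

In this example the unweighted mean overstates the national yield by 1600 kg/ha, because the small region counts as much as the large one.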
Delivery (output) columns on DeliveryRow:
| Column | Units | Notes |
|---|---|---|
| mean | bu/ac or lbs/ac | Primary predicted yield in delivery units |
| lower_{50,68,80,90,95} | bu/ac or lbs/ac | Conformal lower CI bounds (optional) |
| upper_{50,68,80,90,95} | bu/ac or lbs/ac | Conformal upper CI bounds (optional) |
| nass_actual | bu/ac or lbs/ac | Observed survey yield benchmark |
| wasde_in_season | bu/ac or lbs/ac | WASDE in-season national estimate |
| conab_final_in_season | bu/ac or lbs/ac | CONAB final Brazil yield |
| weather_correction_bu_ac | bu/ac | National-scale bias correction applied |
Unit conversion formula:
bu_acre = kg_ha * HA_PER_ACRE / (bushel_weight_lbs * KG_PER_LB)
# HA_PER_ACRE = 0.404686; KG_PER_LB = 0.453592
# cotton: bushel_weight_lbs = 1.0 → output is lbs/acre
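As a runnable illustration of the formula, the sketch below re-implements the two helpers named under Source of truth. It is a stand-alone approximation, not a copy of lib/unit_utils.py; the 56 lb bushel weight for corn is an assumption chosen for the example.

```python
HA_PER_ACRE = 0.404686  # hectares per acre, as quoted in this page
KG_PER_LB = 0.453592    # kilograms per pound, as quoted in this page


def kg_ha_to_bu_acre(kg_ha: float, bushel_weight_lbs: float) -> float:
    """Convert kg/ha to bu/ac; bushel_weight_lbs=1.0 yields lbs/ac."""
    return kg_ha * HA_PER_ACRE / (bushel_weight_lbs * KG_PER_LB)


def bu_acre_to_kg_ha(bu_acre: float, bushel_weight_lbs: float) -> float:
    """Inverse of kg_ha_to_bu_acre."""
    return bu_acre * bushel_weight_lbs * KG_PER_LB / HA_PER_ACRE


# Assumed bushel weight for corn: 56 lb/bu.
corn_bu_ac = kg_ha_to_bu_acre(10_000.0, bushel_weight_lbs=56.0)
# Cotton: a bushel weight of 1.0 turns the result into lbs/ac.
cotton_lbs_ac = kg_ha_to_bu_acre(1_000.0, bushel_weight_lbs=1.0)
# Converting back should recover the internal kg/ha value.
round_trip_kg_ha = bu_acre_to_kg_ha(corn_bu_ac, bushel_weight_lbs=56.0)
```

With these inputs, 10,000 kg/ha of corn is roughly 159.3 bu/ac and 1,000 kg/ha of cotton is roughly 892.2 lbs/ac.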
Invariants:
- CI ordering: lower_95 ≤ lower_90 ≤ … ≤ mean ≤ … ≤ upper_90 ≤ upper_95 enforced by DeliveryRow._validate_ci_ordering (delivery/schemas.py:176).
- Unweighted mean of yields is forbidden — all national aggregations must weight by area_harvested_ha.
- yield_range clamping applied by clip_yield_to_delivery_range (lib/unit_utils.py:93) before DeliveryRow construction.
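The clamping and CI-ordering invariants can be sketched as two small checks. This is an illustration under assumptions: clip_to_range and ci_is_ordered are hypothetical names, and the real clip_yield_to_delivery_range and DeliveryRow._validate_ci_ordering may differ in signature and behaviour.

```python
from typing import Dict, Tuple


def clip_to_range(value: float, yield_range: Tuple[float, float]) -> float:
    """Clamp a delivery-unit yield into the (low, high) plausibility bounds."""
    low, high = yield_range
    return min(max(value, low), high)


def ci_is_ordered(mean: float, lower: Dict[int, float], upper: Dict[int, float]) -> bool:
    """Check lower_95 <= ... <= lower_50 <= mean <= upper_50 <= ... <= upper_95.

    CI bounds are optional, so only the levels present in each dict are compared.
    """
    lowers = [lower[level] for level in sorted(lower, reverse=True)]  # widest first
    uppers = [upper[level] for level in sorted(upper)]                # narrowest first
    chain = lowers + [mean] + uppers
    return all(a <= b for a, b in zip(chain, chain[1:]))


# Hypothetical values in bu/ac.
clipped = clip_to_range(412.0, (50.0, 300.0))
ordered = ci_is_ordered(180.0, {90: 160.0, 50: 172.0}, {50: 188.0, 90: 201.0})
crossed = ci_is_ordered(180.0, {90: 175.0, 50: 172.0}, {50: 188.0})
```

Here clipped is 300.0, ordered is True, and crossed is False because the 90 % lower bound sits above the 50 % lower bound.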
Lifecycle
Created:
- Observed yield: loaded by YieldsBuilder from NASS or CONAB parquets; pivoted into the feature matrix with yield_kg_ha as the target column.
- Simulated yield: Regressor.predict() produces sim_yield_kg_ha (detrended residual), then Detrender.inverse_transform() recovers yield scale; result written to walk_forward_preds.parquet.
- Delivery yield: walk_forward_preds_to_delivery_rows() converts sim_yield_kg_ha to bu/ac via kg_ha_to_bu_acre, applies yield_range clipping, and constructs DeliveryRow.
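The detrend and inverse-transform round trip in the simulated-yield step can be sketched with a simple linear trend. This is an assumption-heavy illustration: LinearDetrend is a hypothetical stand-in, the pipeline's detrender may use a different trend model and a different method signature, and the yields below are made up.

```python
from typing import List, Sequence


class LinearDetrend:
    """Hypothetical detrender: ordinary least-squares line in season year."""

    def fit_transform(self, years: Sequence[float], yields_kg_ha: Sequence[float]) -> List[float]:
        n = len(years)
        if n < 2:
            raise ValueError("need at least two seasons to fit a trend")
        mean_x = sum(years) / n
        mean_y = sum(yields_kg_ha) / n
        sxx = sum((x - mean_x) ** 2 for x in years)
        if sxx == 0:
            raise ValueError("years must not all be identical")
        sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, yields_kg_ha))
        self.slope = sxy / sxx
        self.intercept = mean_y - self.slope * mean_x
        # Residual = observed yield minus the fitted trend value.
        return [y - self.trend(x) for x, y in zip(years, yields_kg_ha)]

    def trend(self, year: float) -> float:
        return self.intercept + self.slope * year

    def inverse_transform(self, years: Sequence[float], residuals: Sequence[float]) -> List[float]:
        # Adding the trend back restores the kg/ha yield scale.
        return [r + self.trend(x) for x, r in zip(years, residuals)]


years = [2018, 2019, 2020, 2021]
observed = [9000.0, 9300.0, 9100.0, 9600.0]  # made-up kg/ha values
detrender = LinearDetrend()
residuals = detrender.fit_transform(years, observed)
restored = detrender.inverse_transform(years, residuals)
# A regressor would predict a residual for 2022; -150.0 is a placeholder.
sim_2022 = detrender.inverse_transform([2022], [-150.0])[0]
```

The fitted slope here is 160 kg/ha per year, so a predicted residual of -150 for 2022 maps back to 9500 kg/ha.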
Consumed:
- FIT stage: target_col (yield_kg_ha) is the detrending and regression target.
- POSTPROCESS: national aggregation weights by area_harvested_ha; bias corrector adjusts the national mean.
- DELIVER: DeliveryRow.mean is the client-facing prediction; CI bands are populated from CalibrationResult.
Destroyed: Never destroyed; each stage writes a new parquet with updated columns rather than modifying in place.
Relationships to other entities
- Commodity (governs units for): bushel_weight_lbs and delivery_unit on CommodityConfig determine the conversion formula and the variable column string in delivery CSVs
- Region (weighted by): area-weighted national aggregation uses area_harvested_ha per GeoIdentifier
- SeasonYear (indexed by): every yield value is keyed by (season_year, geo_identifier) or (season_year, geo_identifier, init_date)
- InitDate (indexed by): predictions are a function of which weather observations are visible at each init date
- Fold (evaluated per): walk-forward CV measures prediction skill by comparing sim_yield_kg_ha against obs_yield_kg_ha in each fold's test year
Concepts and pipelines that touch this entity
- Pipeline: hindcast (P5): FIT trains on yield_kg_ha; DELIVER outputs mean in delivery units
- Pipeline: forecast (P5): forecast prediction uses the same unit conversion path; DeliveryRow is the shared delivery contract
- Concept: unit conversion (P5): lib/unit_utils.py constants, vectorised helpers, and delivery-boundary clipping
- Concept: conformal calibration (P5): CalibrationResult derives CI half-widths from OOS yield residuals
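To make the conformal calibration entry concrete, here is a minimal sketch of one common recipe: take a finite-sample quantile of absolute out-of-sample residuals as the half-width for each CI level. It is not claimed to match any of CalibrationResult's residual modes; conformal_half_widths and the residual values are hypothetical.

```python
import math
from typing import Dict, Sequence


def conformal_half_widths(
    oos_residuals: Sequence[float],
    levels: Sequence[int] = (50, 68, 80, 90, 95),
) -> Dict[int, float]:
    """Half-width per CI level from absolute out-of-sample residuals."""
    abs_sorted = sorted(abs(r) for r in oos_residuals)
    n = len(abs_sorted)
    if n == 0:
        raise ValueError("need at least one residual")
    widths = {}
    for level in levels:
        # Finite-sample rank ceil((n + 1) * coverage). Capping at n is a
        # simplification: strict split-conformal would widen without bound.
        rank = min(math.ceil((n + 1) * level / 100), n)
        widths[level] = abs_sorted[rank - 1]
    return widths


# Made-up residuals (prediction minus observation) in bu/ac.
residuals = [-6.0, 2.0, -1.0, 4.0, -3.0, 8.0, -5.0, 1.0, 7.0]
widths = conformal_half_widths(residuals)
mean = 180.0
lower_90, upper_90 = mean - widths[90], mean + widths[90]
```

Because the half-widths are quantiles of the same sorted list, they never shrink as the level rises, so bands built this way satisfy the CI ordering invariant by construction.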
PRs and commits
- PR-360: Fixes a silent factor-67 unit bug in CONAB yield loading (CONAB data was in t/ha, not kg/ha); adds conab_final_in_season and conab_lev_in_season to DeliveryRow
- PR-331: Fixes always-null weather_correction_bu_ac (was a structural identity: detrended_residual × unit_scale); adds missing lower_90/upper_90 CI columns to wheat/cotton/soybean configs
- PR-361: Introduces CalibrationResult with four residual modes for deriving CI band half-widths from OOS yield residuals
Open questions
- HA_PER_ACRE = 0.404686 and KG_PER_LB = 0.453592 are rounded to six decimal places; the module docstring notes a ~0.00017 % systematic bias versus the exact legal constants. Should the exact values be adopted?
- Unweighted national yield averaging is forbidden by convention but is not programmatically blocked. Is there a guard at the aggregation call sites?
- The weather_correction_bu_ac field represents the national bias correction contribution; its interpretation relative to mean is not documented in DeliveryRow's docstring.
- yield_range clamping at delivery silently modifies predictions that fall outside plausibility bounds; no flag column records that clamping occurred.
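The size of the constant bias in the first question can be checked with a few lines of arithmetic. The exact values used below are the standard definitions (international acre = 4046.8564224 m², avoirdupois pound = 0.45359237 kg); the variable names are illustrative only.

```python
# Six-decimal constants quoted in this page
HA_PER_ACRE = 0.404686
KG_PER_LB = 0.453592

# Exact definitions: 1 acre = 4046.8564224 m^2, 1 lb = 0.45359237 kg
HA_PER_ACRE_EXACT = 0.40468564224
KG_PER_LB_EXACT = 0.45359237

# kg/ha -> bu/ac scales with HA_PER_ACRE / KG_PER_LB, so the relative bias
# of a converted yield is the ratio of the two factors minus one.
rounded_factor = HA_PER_ACRE / KG_PER_LB
exact_factor = HA_PER_ACRE_EXACT / KG_PER_LB_EXACT
bias_pct = (rounded_factor / exact_factor - 1.0) * 100.0

print(f"relative bias: {bias_pct:.5f} %")  # prints "relative bias: 0.00017 %"
```

The bias is positive, so delivery-unit yields come out very slightly high: about 0.0003 bu/ac on a 180 bu/ac yield.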