Entity: Yield

Definition

A scalar crop yield measurement or prediction. The pipeline maintains a single internal unit, kilograms per hectare (yield_kg_ha), throughout all training, prediction, and postprocessing stages. Conversion to delivery units (bu/ac for grains; lbs/ac for cotton) happens exclusively at the delivery/ boundary via lib/unit_utils.py. Yield is never a standalone class; it appears as a named column in the parquet schema, as a field on DeliveryRow, and as the column named by CommodityConfig.target_col.

Kind

Value object (float-valued column in parquets and delivery rows). No dedicated class. The canonical internal column name is yield_kg_ha; the canonical delivery column is mean (delivery/schemas.py:137).

Source of truth

  • market_insights_models/src/commodity_hindcast/lib/unit_utils.py:33: kg_ha_to_bu_acre(kg_ha, bushel_weight_lbs) and bu_acre_to_kg_ha(bu_acre, bushel_weight_lbs), together with the module constants (HA_PER_ACRE = 0.404686, KG_PER_LB = 0.453592), are the single source of truth for unit conversion.
  • delivery/schemas.py:109: the DeliveryRow class, with mean: float at line 137 as the primary predicted yield in delivery units.
  • config.py:304: CommodityConfig.yield_range: tuple[float, float] defines plausibility bounds in delivery units.

Key attributes / structure

Internal (pipeline) columns:

| Column | Units | Notes |
|---|---|---|
| yield_kg_ha | kg/ha | Primary target; canonical internal representation |
| target_col | kg/ha | Config-defined alias for yield_kg_ha (set per commodity) |
| target_detrended_col | kg/ha (detrended residual) | After AbstractDetrend.fit_transform; fed to Regressor |
| sim_yield_kg_ha | kg/ha | Simulated yield output from Regressor.predict |
| obs_yield_kg_ha | kg/ha | Observed yield from NASS/CONAB builder |
| area_harvested_ha | ha | Area weight; required for all national aggregations |
| production_kg | kg | area_harvested_ha × yield_kg_ha |
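The production_kg relation above is what makes area weighting work: a national yield is total production divided by total harvested area. The following is a minimal sketch using plain dictionaries rather than the pipeline's parquet frames; the function name national_yield_kg_ha and the sample numbers are hypothetical and chosen only for illustration.

```python
from typing import Dict, List


def national_yield_kg_ha(rows: List[Dict[str, float]]) -> float:
    """Area-weighted mean yield: sum(production_kg) / sum(area_harvested_ha)."""
    total_area = sum(r["area_harvested_ha"] for r in rows)
    if total_area <= 0:
        raise ValueError("total harvested area must be positive")
    # production_kg = area_harvested_ha * yield_kg_ha for each region
    total_production = sum(r["area_harvested_ha"] * r["yield_kg_ha"] for r in rows)
    return total_production / total_area


# Two made-up regions: one large and high-yielding, one small and low-yielding.
rows = [
    {"area_harvested_ha": 1000.0, "yield_kg_ha": 9000.0},
    {"area_harvested_ha": 100.0, "yield_kg_ha": 3000.0},
]
weighted = national_yield_kg_ha(rows)                          # 9_300_000 kg / 1_100 ha
unweighted = sum(r["yield_kg_ha"] for r in rows) / len(rows)   # ignores area entirely
```

In this toy case the unweighted mean (6000 kg/ha) understates the area-weighted value (about 8455 kg/ha) by roughly 29 %, because the small region counts as much as the large one.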

Delivery (output) columns on DeliveryRow:

| Column | Units | Notes |
|---|---|---|
| mean | bu/ac or lbs/ac | Primary predicted yield in delivery units |
| lower_{50,68,80,90,95} | bu/ac or lbs/ac | Conformal lower CI bounds (optional) |
| upper_{50,68,80,90,95} | bu/ac or lbs/ac | Conformal upper CI bounds (optional) |
| nass_actual | bu/ac or lbs/ac | Observed survey yield benchmark |
| wasde_in_season | bu/ac or lbs/ac | WASDE in-season national estimate |
| conab_final_in_season | bu/ac or lbs/ac | CONAB final Brazil yield |
| weather_correction_bu_ac | bu/ac | National-scale bias correction applied |

Unit conversion formula:

bu_acre = kg_ha * HA_PER_ACRE / (bushel_weight_lbs * KG_PER_LB)
# HA_PER_ACRE = 0.404686; KG_PER_LB = 0.453592
# cotton: bushel_weight_lbs = 1.0 → output is lbs/acre
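As a runnable illustration of the formula above, here is a minimal sketch of the two conversion helpers. The function names and constants are taken from this page; the bodies are a reconstruction from the formula, not a copy of lib/unit_utils.py, and the 56 lb corn bushel weight is the standard US value assumed here for the example.

```python
HA_PER_ACRE = 0.404686  # hectares per acre (6-decimal value quoted on this page)
KG_PER_LB = 0.453592    # kilograms per pound (6-decimal value quoted on this page)


def kg_ha_to_bu_acre(kg_ha: float, bushel_weight_lbs: float) -> float:
    """Convert kg/ha to bu/ac (or to lbs/ac when bushel_weight_lbs == 1.0)."""
    return kg_ha * HA_PER_ACRE / (bushel_weight_lbs * KG_PER_LB)


def bu_acre_to_kg_ha(bu_acre: float, bushel_weight_lbs: float) -> float:
    """Inverse of kg_ha_to_bu_acre."""
    return bu_acre * bushel_weight_lbs * KG_PER_LB / HA_PER_ACRE


# Assumed bushel weight for illustration: 56 lb per bushel of corn.
corn_bu_ac = kg_ha_to_bu_acre(10_000.0, 56.0)   # roughly 159.3 bu/ac
cotton_lbs_ac = kg_ha_to_bu_acre(1_000.0, 1.0)  # roughly 892.2 lbs/ac
round_trip = bu_acre_to_kg_ha(corn_bu_ac, 56.0)  # back to ~10_000 kg/ha
```

Passing bushel_weight_lbs = 1.0 reuses the same code path for cotton, so the delivery unit is governed entirely by configuration rather than by a separate function.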

Invariants:

  • CI ordering: lower_95 ≤ lower_90 ≤ … ≤ mean ≤ … ≤ upper_90 ≤ upper_95, enforced by DeliveryRow._validate_ci_ordering (delivery/schemas.py:176).
  • Unweighted mean of yields is forbidden: all national aggregations must weight by area_harvested_ha.
  • yield_range clamping is applied by clip_yield_to_delivery_range (lib/unit_utils.py:93) before DeliveryRow construction.
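The CI ordering invariant can be sketched as a construction-time check. The class below is a hypothetical stand-in, not DeliveryRow itself: it stores bounds in dictionaries keyed by confidence level (an assumption made to keep the example short) and skips levels that are absent, since the CI columns are optional.

```python
from dataclasses import dataclass, field
from typing import Dict

LEVELS = (50, 68, 80, 90, 95)  # confidence levels listed in the delivery table


@dataclass
class YieldInterval:
    """Hypothetical stand-in for the mean and CI fields of a delivery row."""
    mean: float
    lower: Dict[int, float] = field(default_factory=dict)
    upper: Dict[int, float] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Build lower_95, ..., lower_50, mean, upper_50, ..., upper_95,
        # omitting any level that was not supplied.
        chain = [self.lower[lvl] for lvl in reversed(LEVELS) if lvl in self.lower]
        chain.append(self.mean)
        chain += [self.upper[lvl] for lvl in LEVELS if lvl in self.upper]
        # The chain must be non-decreasing from left to right.
        for left, right in zip(chain, chain[1:]):
            if left > right:
                raise ValueError(f"CI ordering violated: {left} > {right}")


row = YieldInterval(
    mean=150.0,
    lower={90: 140.0, 50: 146.0},
    upper={50: 154.0, 90: 161.0},
)
```

Validating at construction means a mis-ordered band fails where the row is built instead of surfacing later in a delivery file.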

Lifecycle

Created:

  • Observed yield: loaded by YieldsBuilder from NASS or CONAB parquets; pivoted into the feature matrix with yield_kg_ha as the target column.
  • Simulated yield: Regressor.predict() produces sim_yield_kg_ha (detrended residual), then Detrender.inverse_transform() recovers yield scale; the result is written to walk_forward_preds.parquet.
  • Delivery yield: walk_forward_preds_to_delivery_rows() converts sim_yield_kg_ha to bu/ac via kg_ha_to_bu_acre, applies yield_range clipping, and constructs DeliveryRow.
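The delivery step described above (convert, then clip to yield_range) can be sketched in a few lines. This is an illustration under stated assumptions, not the body of walk_forward_preds_to_delivery_rows() or clip_yield_to_delivery_range: the helper name to_delivery_units, the returned was_clamped flag, and the example yield_range values are all hypothetical.

```python
from typing import Tuple

HA_PER_ACRE = 0.404686  # constants as quoted on this page
KG_PER_LB = 0.453592


def to_delivery_units(
    sim_yield_kg_ha: float,
    bushel_weight_lbs: float,
    yield_range: Tuple[float, float],
) -> Tuple[float, bool]:
    """Convert kg/ha to delivery units, then clamp to the plausibility range.

    Returns (value, was_clamped); the flag is an addition for illustration only.
    """
    low, high = yield_range
    value = sim_yield_kg_ha * HA_PER_ACRE / (bushel_weight_lbs * KG_PER_LB)
    clipped = min(max(value, low), high)
    return clipped, clipped != value


# Hypothetical corn example: 10_000 kg/ha is about 159.3 bu/ac.
inside, inside_flag = to_delivery_units(10_000.0, 56.0, (0.0, 300.0))
outside, outside_flag = to_delivery_units(10_000.0, 56.0, (0.0, 150.0))
```

Note the order: yield_range is expressed in delivery units, so clipping has to happen after the conversion, never on the kg/ha value.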

Consumed:

  • FIT stage: target_col (yield_kg_ha) is the detrending and regression target.
  • POSTPROCESS: national aggregation weights by area_harvested_ha; the bias corrector adjusts the national mean.
  • DELIVER: DeliveryRow.mean is the client-facing prediction; CI bands are populated from CalibrationResult.

Destroyed: Never destroyed; each stage writes a new parquet with updated columns rather than modifying in place.

Relationships to other entities

  • Commodity — governs units for — bushel_weight_lbs and delivery_unit on CommodityConfig determine the conversion formula and the variable column string in delivery CSVs
  • Region — weighted by — area-weighted national aggregation uses area_harvested_ha per GeoIdentifier
  • SeasonYear — indexed by — every yield value is keyed by (season_year, geo_identifier) or (season_year, geo_identifier, init_date)
  • InitDate — indexed by — predictions are a function of which weather observations are visible at each init date
  • Fold — evaluated per — walk-forward CV measures prediction skill by comparing sim_yield_kg_ha against obs_yield_kg_ha in each fold's test year

Concepts and pipelines that touch this entity

  • Pipeline: hindcast (P5) — FIT trains on yield_kg_ha; DELIVER outputs mean in delivery units
  • Pipeline: forecast (P5) — forecast prediction uses the same unit conversion path; DeliveryRow is the shared delivery contract
  • Concept: unit conversion (P5) — lib/unit_utils.py constants, vectorised helpers, and delivery-boundary clipping
  • Concept: conformal calibration (P5) — CalibrationResult derives CI half-widths from OOS yield residuals
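To make the conformal calibration entry concrete, the sketch below shows the generic split-conformal recipe for turning out-of-sample absolute residuals into a symmetric half-width. It is a textbook construction offered as an assumption about the general idea; it does not reproduce CalibrationResult or any of its residual modes, and the function name and residual values are hypothetical.

```python
import math
from typing import Sequence


def conformal_half_width(abs_residuals: Sequence[float], coverage: float) -> float:
    """Split-conformal half-width: the ceil((n + 1) * coverage)-th smallest |residual|."""
    n = len(abs_residuals)
    if n == 0:
        raise ValueError("need at least one out-of-sample residual")
    if not 0.0 < coverage < 1.0:
        raise ValueError("coverage must be strictly between 0 and 1")
    rank = math.ceil((n + 1) * coverage)
    if rank > n:
        # Too few residuals to certify this coverage level with a finite band.
        return math.inf
    return sorted(abs_residuals)[rank - 1]


# Made-up |obs - sim| residuals in delivery units, one per held-out season.
residuals = [4.0, 1.0, 7.0, 2.0, 9.0, 3.0, 6.0, 5.0, 8.0]
half_50 = conformal_half_width(residuals, 0.50)  # 5th smallest of 9 residuals
half_95 = conformal_half_width(residuals, 0.95)  # rank 10 exceeds n = 9
```

With the half-width in hand, a band at that level would be mean ± half-width; wider coverage levels use larger ranks, which keeps nested bands ordered as the CI invariant requires.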

PRs and commits

  • PR-360 — Fixes a silent factor-67 unit bug in CONAB yield loading (CONAB data was in t/ha, not kg/ha); adds conab_final_in_season and conab_lev_in_season to DeliveryRow
  • PR-331 — Fixes always-null weather_correction_bu_ac (was a structural identity: detrended_residual × unit_scale); adds missing lower_90/upper_90 CI columns to wheat/cotton/soybean configs
  • PR-361 — Introduces CalibrationResult with four residual modes for deriving CI band half-widths from OOS yield residuals

Open questions

  • HA_PER_ACRE = 0.404686 and KG_PER_LB = 0.453592 are 6-decimal truncations; the module docstring notes a ~0.00017 % systematic bias versus the exact legal constants. Should the exact values be adopted?
  • Unweighted national yield averaging is forbidden by convention but is not programmatically blocked — is there a guard at the aggregation call sites?
  • The weather_correction_bu_ac field represents the national bias correction contribution; its interpretation relative to mean is not documented in DeliveryRow's docstring.
  • yield_range clamping at delivery silently modifies predictions that fall outside plausibility bounds; no flag column records that clamping occurred.
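The size of the truncation bias raised in the first question can be checked directly. The sketch below compares the 6-decimal constants with the exact definitions of the international acre (4046.8564224 m²) and the avoirdupois pound (0.45359237 kg); the variable names are illustrative and not taken from the module.

```python
# 6-decimal constants as quoted on this page
HA_PER_ACRE = 0.404686
KG_PER_LB = 0.453592

# Exact definitions: 1 acre = 4046.8564224 m^2, 1 lb = 0.45359237 kg
EXACT_HA_PER_ACRE = 0.40468564224
EXACT_KG_PER_LB = 0.45359237

# kg/ha -> bu/ac scales with HA_PER_ACRE / KG_PER_LB, so the bias is the
# relative error of that ratio; the bushel weight cancels out.
ratio_truncated = HA_PER_ACRE / KG_PER_LB
ratio_exact = EXACT_HA_PER_ACRE / EXACT_KG_PER_LB
bias_percent = (ratio_truncated / ratio_exact - 1.0) * 100.0

print(f"systematic bias: {bias_percent:.5f} %")
```

The result is positive and matches the ~0.00017 % figure in the module docstring: the two truncations push the ratio in the same direction, so delivered values are very slightly overstated, on the order of 0.0003 bu/ac at 160 bu/ac.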