Data lineage — commodity_hindcast

Every input path in commodity_hindcast resolves through the INPUT_DATA_DIR contract: a single env var anchors all relative paths, while absolute and s3:// URIs pass through unchanged (config.py:50, wiki/commodity_hindcast/concepts/input_data_dir_contract.md). Each typed ResolvablePath field on the config tree is resolved at config-load time and existence-checked at preflight (config.py:854, wiki/commodity_hindcast/concepts/resolvable_path.md). Below: every external source the pipeline reads, where it is rooted, and what breaks if it is missing.

External sources

The "Path / config field" column shows the YAML key (where the path lives) and the typical resolved location for INPUT_DATA_DIR=<repo_root> (per MEMORY.md and the contract wiki). Owner / refresh-cadence values are placeholders unless the source code or wiki documents them explicitly.

Source | Path / config field | Owner | Refresh cadence | Freshness check | Failure mode
NASS yields parquet (corn) | commodity.builders.yields.filepath = data/nass/preprocessed_corn.parquet (configs/corn_usa.yaml:192); typed ResolvablePath (config.py:179) | [PLACEHOLDER: NASS preprocessing pipeline owner] | [PLACEHOLDER: cadence; NASS publishes annually post-harvest] | Inspect max(year) after _load_nass pivots; inputs not surfaced in a CLI freshness command | Preflight fails closed with a single-line "field + resolved path" error (run/preflight.py:77); YieldsBuilder is not required_for_pred_parquet (configs/corn_usa.yaml:217) so a stale file shifts the last labelled year, but does not block forecast
NASS yields parquet (soybeans) | data/nass/soybeans.parquet (configs/soybeans_usa.yaml:189) | [PLACEHOLDER] | [PLACEHOLDER] | Same as corn | Same as corn
NASS yields parquet (wheat) | data/nass/wheat.parquet (configs/wheat_usa.yaml:210) | [PLACEHOLDER] | [PLACEHOLDER] | Same as corn | Same as corn; note MEMORY.md flags that only WHEAT survives preprocessing; sub-types are config-only
NASS yields parquet (cotton) | data/nass/preprocessed_cotton.parquet (configs/cotton_usa.yaml:196) | [PLACEHOLDER] | [PLACEHOLDER] | Same as corn | Same as corn
IBGE-PAM yields parquet (Brazil soy) | data/ibge/soja_brazil_municipios.parquet (configs/soybeans_bra.yaml:253) | [PLACEHOLDER: IBGE preprocessing pipeline owner] | [PLACEHOLDER: PAM is annual] | Inspect parquet max(year); PAM lags ~1 year vs harvest | Same as NASS: feeds YieldsBuilder; also drives actuals_source_short=IBGE (configs/soybeans_bra.yaml:179)
Stress parquet (corn, wheat) | data/stress/preprocessed_<crop>_stress.parquet (configs/corn_usa.yaml:220, configs/wheat_usa.yaml:265); regenerated from assemble_stress_from_indices.indices_zarr (config.py:208) | [PLACEHOLDER: produced by compute_commodity_stress in this repo from the indices zarr] | Rebuilt on demand; gated by overwrite flag or COMMODITY_HINDCAST_FORCE_STRESS_ASSEMBLY=1 (configs/corn_usa.yaml:228) | File exists ⇒ skipped; otherwise re-derived from indices zarr | required_for_pred_parquet: true (configs/corn_usa.yaml:239) ⇒ pipeline aborts at preflight if missing AND indices zarr is also missing
Weather indices zarr (per-crop) | s3://{env}-treefera-greenprint-data/weather/processed/indices/conus_adm2_<crop>.zarr (corn configs/corn_usa.yaml:242, wheat :291, cotton configs/cotton_usa.yaml:225); local cleaned copy for Brazil soy (configs/soybeans_bra.yaml:285) and US soy (configs/soybeans_usa.yaml:217) | Greenprint weather pipeline | [PLACEHOLDER: incremental daily; see incremental-run skill] | Open zarr, read time dim max | required_for_pred_parquet: true ⇒ stage aborts at preflight (run/preflight.py:77)
Weather-stress (YTD) zarr | s3://{env}-treefera-greenprint-data/weather/processed/stress/conus_adm2_<crop>_ytd_stress.zarr (configs/corn_usa.yaml:249, configs/wheat_usa.yaml:302, configs/cotton_usa.yaml:232); local clean copy for Brazil soy (configs/soybeans_bra.yaml:297) | Greenprint weather pipeline | [PLACEHOLDER: incremental daily] | time dim max in zarr; values are pre-cumulated, so the snapshot at each init_date IS the freshness gauge | required_for_pred_parquet: true ⇒ preflight abort. Phenoweighted z-score features (z_<crop>_*_phenoweight_cumsum) become unavailable
Climatology indices zarr | s3://{env}-treefera-greenprint-data/weather/processed/climo_indices/conus_adm2.zarr (configs/corn_usa.yaml:256, configs/wheat_usa.yaml:309); local copies for soy (configs/soybeans_usa.yaml:230, configs/cotton_usa.yaml:239); Brazil clean copy (configs/soybeans_bra.yaml:305) | Greenprint weather pipeline (climatology stage) | [PLACEHOLDER: rebuilt when baseline window updates; see wiki/commodity_hindcast/concepts/climo_materialisation.md] | Baseline window encoded in filename (e.g. baseline_1980_2025_w31) | required_for_pred_parquet: true ⇒ preflight abort; z-score features all become NaN
Materialised climatology zarr (forecast) | forecast.materialised_climo_filepath (typed ResolvablePath, config.py:632); e.g. s3://{env}-treefera-greenprint-data/weather/processed/climatology/conus_adm2_baseline_1980_2025_w31_materialised.zarr (configs/corn_usa.yaml:348) | Greenprint weather pipeline | [PLACEHOLDER: rebuilt when baseline window changes] | Baseline window in filename; consumed by materialise_for_forecast(...) per MEMORY.md/feedback_centralised_climo | Forecast stage validate_residual_mode rejects with actionable message; hindcast unaffected
Raw observed weather zarr (forecast) | forecast.raw_obs_filepath (config.py:631); s3://{env}-treefera-greenprint-data/weather/processed/areal_aggregation/conus_adm2.zarr (configs/corn_usa.yaml:346); note configs/soybeans_usa.yaml:288 uses an absolute local path /data/processing/weather/... | Greenprint weather pipeline (areal aggregation stage) | [PLACEHOLDER: incremental daily] | time dim max ≥ requested init_date | Forecast preflight aborts; hindcast unaffected
WASDE CSV | reference_data[kind=wasde].filepath = data/wasde/wasde_<crop>_us_yield.csv (configs/corn_usa.yaml:361, configs/soybeans_usa.yaml:301, configs/wheat_usa.yaml:393, configs/cotton_usa.yaml:309); typed ResolvablePath on _ReferenceYieldSpecBase.filepath (lib/reference_data/base_reference_yield_loader.py:47) | USDA WASDE; ingestion owner [PLACEHOLDER] | Monthly WASDE release (cutoff_month_day defaults to Feb 1, the marketing-year close, see lib/reference_data/wasde.py:24) | Inspect max(release_date) per spec name | Empty filter ⇒ WasdeLoader.load raises ValueError (lib/reference_data/wasde.py:67); diagnostics + delivery skip the WASDE column prefix when reference_data is empty (config.py:731)
CONAB Levantamento bulletin | reference_data[kind=conab_levantamento].filepath = data/conab/conab_levantamento_graos.txt (configs/soybeans_bra.yaml:375); semicolon, latin-1 (lib/reference_data/conab.py:75) | CONAB; ingestion owner [PLACEHOLDER] | Monthly | id_levantamento / dsc_levantamento per row | Same as WASDE: loader raises if filtered frame empty; spec absent ⇒ skipped silently
CONAB Série Histórica | reference_data[kind=conab_final].filepath = data/conab/conab_serie_historica_graos.txt (configs/soybeans_bra.yaml:381) | CONAB | Annual post-harvest | safra max | Same as Levantamento
Geo boundaries (delivery join) | geometry.parquet, located via _resolve_geometry_parquet_path searching config_resolved.yaml's data_root, then INPUT_DATA_DIR, then run_dir (delivery/export.py:98-127) | Treefera boundaries pipeline; per MEMORY.md env vars: CROP_YIELD_GEOBOUNDARIES_FILE for crop_yield, but commodity_hindcast resolves locally via INPUT_DATA_DIR | [PLACEHOLDER: rebuilt when boundaries refresh] | File modtime; row count | FileNotFoundError at delivery export (delivery/export.py:124); only the export step needs it, so the hindcast/forecast core is unaffected
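
The "Freshness check" column is manual today: nothing in the table points at a CLI command that reports input age. As a rough illustration of the WASDE check (inspect max(release_date)), the sketch below reads a CSV with the standard library and flags it as stale past a chosen age. The function names, the ISO date format, and the age threshold are assumptions made for this example; only the release_date column name comes from the table above.

```python
import csv
import io
from datetime import date
from typing import Optional


def latest_release_date(csv_text: str, column: str = "release_date") -> Optional[date]:
    """Return the newest date found in `column`, or None if there are no rows.

    Assumes ISO-formatted dates (YYYY-MM-DD); the real file layout may differ.
    """
    latest: Optional[date] = None
    for row in csv.DictReader(io.StringIO(csv_text)):
        released = date.fromisoformat(row[column])
        if latest is None or released > latest:
            latest = released
    return latest


def is_stale(latest: date, today: date, max_age_days: int) -> bool:
    """True when the newest release is older than the allowed age (hypothetical rule)."""
    return (today - latest).days > max_age_days
```

For a monthly source such as WASDE, a max_age_days a little over one month would be a plausible starting point; annual sources (NASS, IBGE-PAM, CONAB Série Histórica) would need a much longer window.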

Reference yield series

Reference yields drive the WASDE/CONAB metric columns and the in-season comparator on the rolling-forecast plot. Each spec's name becomes the column prefix downstream and must be unique within reference_data (config.py:870-883). Empty list ⇒ no reference series at all (config.py:731).

Spec kind | Loader | Source unit | Emitted unit | Cutoff semantics | Configured by
wasde | WasdeLoader (lib/reference_data/wasde.py:38) | bu_acre (default, wasde.py:35) | kg/ha (ABC contract, base_reference_yield_loader.py:11) | release_date < cutoff_month_day of harvest_year + 1 (default Feb 1, wasde.py:24-28) | All four US YAMLs (corn, soy, wheat, cotton)
conab_final | ConabFinalLoader (lib/reference_data/conab.py:94) | kg_per_ha (conab.py:33) | kg/ha | cutoff_month_day (Brazil-soy YAML uses Oct 1, configs/soybeans_bra.yaml:385) | soybeans_bra.yaml only
conab_levantamento | ConabLevantamentoLoader | kg_per_ha (conab.py:44) | kg/ha | cutoff_month_day (Oct 1 in soy-BRA, configs/soybeans_bra.yaml:378) | soybeans_bra.yaml only
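
To make the wasde row concrete, here is a minimal sketch of the two transformations it describes: converting bu/acre to kg/ha and keeping only releases before the cutoff. It is not the WasdeLoader code. The function names are hypothetical, the bushel weights are the standard US values (56 lb for corn, 60 lb for soybeans and wheat), and the cutoff is read as "strictly before cutoff_month_day in the year after harvest_year", which is an interpretation of the table rather than something confirmed against wasde.py. Cotton is left out because its yield is not normally quoted in bushels.

```python
from datetime import date
from typing import Iterable, List

LB_TO_KG = 0.45359237         # exact definition of the avoirdupois pound
ACRE_TO_HA = 0.40468564224    # exact definition of the international acre

# Standard US bushel weights in pounds (assumption: the loader uses these).
BUSHEL_LB = {"corn": 56.0, "soybeans": 60.0, "wheat": 60.0}


def bu_acre_to_kg_ha(value: float, crop: str) -> float:
    """Convert a yield in bushels per acre to kilograms per hectare."""
    return value * BUSHEL_LB[crop] * LB_TO_KG / ACRE_TO_HA


def cutoff_date(harvest_year: int, month: int = 2, day: int = 1) -> date:
    """Cutoff in the year after harvest; the default mirrors the Feb 1 rule."""
    return date(harvest_year + 1, month, day)


def releases_before_cutoff(
    release_dates: Iterable[date], harvest_year: int, month: int = 2, day: int = 1
) -> List[date]:
    """Keep releases strictly earlier than the cutoff for `harvest_year`."""
    cutoff = cutoff_date(harvest_year, month, day)
    return [d for d in release_dates if d < cutoff]
```

Under these assumptions one bushel per acre is roughly 62.8 kg/ha for corn and roughly 67.3 kg/ha for soybeans or wheat, and a release dated exactly on the cutoff is excluded.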

Order matters in reference_data: the first spec is the in-season comparator on the rolling-forecast plot (configs/soybeans_bra.yaml:366-371).
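
The two rules on reference_data (unique names, and the first entry acting as the in-season comparator) can be sketched as follows. This is an illustration under assumptions, not the validator at config.py:870-883; both function names are hypothetical and the specs are reduced to their names.

```python
from collections import Counter
from typing import Optional, Sequence


def check_reference_names(names: Sequence[str]) -> None:
    """Raise ValueError if any spec name appears more than once."""
    duplicates = sorted(n for n, count in Counter(names).items() if count > 1)
    if duplicates:
        raise ValueError(f"duplicate reference_data names: {duplicates}")


def in_season_comparator(names: Sequence[str]) -> Optional[str]:
    """Return the first spec name, or None when reference_data is empty."""
    check_reference_names(names)
    return names[0] if names else None
```

Because the name is also the downstream column prefix, a duplicate would make two series write to the same columns, which is why rejecting it at config time is the safer choice.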

How resolution works

Every ResolvablePath field is registered by its annotation. At ExperimentConfig construction, _resolve_data_paths walks the config tree via _iter_resolvable_fields(self) and calls resolve_data_path(value, self.data_root) on each (config.py:853-867). Relative values anchor at data_root (the INPUT_DATA_DIR AnyPath, config.py:50); absolute paths and s3:// URIs pass through after expand_env_template substitutes {env} (config.py:82-100). Before any stage runs, run_preflight(preflight_paths_for_<stage>(config)) calls check_path_exists on every yielded path (run/preflight.py:59, :77, :110); a missing input fails with a single line naming the field and the resolved location (DESIGN.md:69). Adding a new ResolvablePath field auto-extends preflight coverage with no changes to preflight code (wiki/commodity_hindcast/concepts/resolvable_path.md).
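
The resolution rule and the preflight check described above can be approximated in a few lines. The sketch below is a simplified stand-in, not the code in config.py or run/preflight.py: the function names are hypothetical, {env} is substituted from an explicit argument, and s3:// URIs are skipped in the existence check rather than probed.

```python
from pathlib import Path
from typing import Dict, List


def expand_env_sketch(value: str, env: str) -> str:
    """Substitute the literal {env} placeholder in a path or URI."""
    return value.replace("{env}", env)


def resolve_path_sketch(value: str, data_root: Path, env: str = "dev") -> str:
    """Anchor relative paths at data_root; pass absolute paths and s3:// URIs through."""
    expanded = expand_env_sketch(value, env)
    if expanded.startswith("s3://") or Path(expanded).is_absolute():
        return expanded
    return str(data_root / expanded)


def preflight_sketch(resolved_fields: Dict[str, str]) -> List[str]:
    """Return one error line per missing local input, naming field and location.

    Assumption: remote URIs are not checked here; the real preflight may do so.
    """
    errors = []
    for field, resolved in resolved_fields.items():
        if resolved.startswith("s3://"):
            continue
        if not Path(resolved).exists():
            errors.append(f"{field}: missing input at {resolved}")
    return errors
```

With data_root set to the repository root, data/nass/preprocessed_corn.parquet resolves under that root, while an s3://{env}-... URI only has its {env} token replaced.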

Known data risks

  • MLflow DB locking — concurrent same-commodity runs corrupt mlruns.db (MEMORY.md "Known Issues"). Run sequentially.
  • Wheat sub-types absent post-preprocessing — only WHEAT survives; WINTER_WHEAT and the spring-wheat split listed in some configs are not produced by the NASS preprocessor (MEMORY.md).
  • Brazil-soy clean-zarr workaround — 4 hash-suffixed identifiers (varzea/quixaba) in the upstream brazil_prod_adm2_brazil_soy_*.zarr fail GEO_ID_PATTERN; locally cleaned copies are used (configs/soybeans_bra.yaml:280-309). If upstream is regenerated without re-cleaning, preflight passes but assert_valid_geo_identifiers fails mid-run.
  • Brazil-soy climo NaN/inf features: edd_zscore is 100% NaN and dry_days_zscore_gstd carries +inf upstream; both are excluded from feature_cols until republished (configs/soybeans_bra.yaml:122-137).
  • TMI selection-bias correction — broken by upstream NaN dropouts (MEMORY.md project_sbc_tmi_bug); SBC formula matches QUBE but core corn-belt counties are dropped from training before SBC is applied.
  • Mixed forecast paths between US YAMLs: corn_usa.yaml and wheat_usa.yaml use s3://{env}-... for forecast.raw_obs_filepath; soybeans_usa.yaml hardcodes /data/processing/weather/... (configs/soybeans_usa.yaml:288). Soy will fail in environments without that mount; corn/wheat are environment-portable.
  • Geometry parquet lookup fragility: _resolve_geometry_parquet_path searches three locations; in fresh worktrees the symlink to local_data/ is needed (MEMORY.md project_local_data_symlink).
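
The geometry lookup risk is easier to reason about with the search order written out. The sketch below is a hypothetical stand-in for _resolve_geometry_parquet_path, not a copy of it: it tries each root in the order given (data_root, then INPUT_DATA_DIR, then run_dir in the real pipeline) and reports every location it tried when nothing is found.

```python
from pathlib import Path
from typing import Iterable, List, Optional, Union

PathLike = Union[str, Path]


def find_geometry_parquet(
    search_roots: Iterable[Optional[PathLike]], filename: str = "geometry.parquet"
) -> Path:
    """Return the first existing `filename` under the given roots, in order.

    Roots that are None (for example an unset env var) are skipped.
    """
    tried: List[str] = []
    for root in search_roots:
        if root is None:
            continue
        candidate = Path(root) / filename
        if candidate.is_file():
            return candidate
        tried.append(str(candidate))
    raise FileNotFoundError(f"{filename} not found; tried: {tried}")
```

Listing the tried locations in the error makes a missing local_data/ symlink in a fresh worktree easier to spot than a bare FileNotFoundError.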

[PLACEHOLDER: upstream owners and formal refresh cadences for NASS, IBGE, WASDE, CONAB, Greenprint weather indices, and the geometry pipeline — none of these are documented in the live code or the wiki concept pages cited above.]