Data lineage: commodity_hindcast
Every input path in commodity_hindcast resolves through the INPUT_DATA_DIR contract: a single env var anchors all relative paths, while absolute and s3:// URIs pass through unchanged (config.py:50, wiki/commodity_hindcast/concepts/input_data_dir_contract.md). Each typed ResolvablePath field on the config tree is resolved at config-load time and existence-checked at preflight (config.py:854, wiki/commodity_hindcast/concepts/resolvable_path.md). Below: every external source the pipeline reads, where it is rooted, and what breaks if it is missing.
External sources
The "Path / config field" column shows the YAML key (where the path lives) and the typical resolved location for INPUT_DATA_DIR=<repo_root> (per MEMORY.md and the contract wiki). Owner / refresh-cadence values are placeholders unless the source code or wiki documents them explicitly.
| Source | Path / config field | Owner | Refresh cadence | Freshness check | Failure mode |
|---|---|---|---|---|---|
| NASS yields parquet (corn) | commodity.builders.yields.filepath = data/nass/preprocessed_corn.parquet (configs/corn_usa.yaml:192); typed ResolvablePath (config.py:179) | [PLACEHOLDER: NASS preprocessing pipeline owner] | [PLACEHOLDER: cadence; NASS publishes annually post-harvest] | Inspect max(year) after _load_nass pivots; inputs not surfaced in a CLI freshness command | Preflight fails closed with a single-line "field + resolved path" error (run/preflight.py:77); YieldsBuilder is not required_for_pred_parquet (configs/corn_usa.yaml:217), so a stale file shifts the last labelled year but does not block forecast |
| NASS yields parquet (soybeans) | data/nass/soybeans.parquet (configs/soybeans_usa.yaml:189) | [PLACEHOLDER] | [PLACEHOLDER] | Same as corn | Same as corn |
| NASS yields parquet (wheat) | data/nass/wheat.parquet (configs/wheat_usa.yaml:210) | [PLACEHOLDER] | [PLACEHOLDER] | Same as corn | Same as corn; note MEMORY.md flags that only WHEAT survives preprocessing (sub-types are config-only) |
| NASS yields parquet (cotton) | data/nass/preprocessed_cotton.parquet (configs/cotton_usa.yaml:196) | [PLACEHOLDER] | [PLACEHOLDER] | Same as corn | Same as corn |
| IBGE-PAM yields parquet (Brazil soy) | data/ibge/soja_brazil_municipios.parquet (configs/soybeans_bra.yaml:253) | [PLACEHOLDER: IBGE preprocessing pipeline owner] | [PLACEHOLDER: PAM is annual] | Inspect parquet max(year); PAM lags ~1 year vs harvest | Same as NASS: feeds YieldsBuilder; also drives actuals_source_short=IBGE (configs/soybeans_bra.yaml:179) |
| Stress parquet (corn, wheat) | data/stress/preprocessed_<crop>_stress.parquet (configs/corn_usa.yaml:220, configs/wheat_usa.yaml:265); regenerated from assemble_stress_from_indices.indices_zarr (config.py:208) | [PLACEHOLDER: produced by compute_commodity_stress in this repo from the indices zarr] | Rebuilt on demand; gated by overwrite flag or COMMODITY_HINDCAST_FORCE_STRESS_ASSEMBLY=1 (configs/corn_usa.yaml:228) | File exists ⇒ skipped; otherwise re-derived from indices zarr | required_for_pred_parquet: true (configs/corn_usa.yaml:239) ⇒ pipeline aborts at preflight if missing AND indices zarr is also missing |
| Weather indices zarr (per-crop) | s3://{env}-treefera-greenprint-data/weather/processed/indices/conus_adm2_<crop>.zarr (corn configs/corn_usa.yaml:242, wheat :291, cotton configs/cotton_usa.yaml:225); local cleaned copy for Brazil soy (configs/soybeans_bra.yaml:285) and US soy (configs/soybeans_usa.yaml:217) | Greenprint weather pipeline | [PLACEHOLDER: incremental daily; see incremental-run skill] | Open zarr, read time dim max | required_for_pred_parquet: true ⇒ stage aborts at preflight (run/preflight.py:77) |
| Weather-stress (YTD) zarr | s3://{env}-treefera-greenprint-data/weather/processed/stress/conus_adm2_<crop>_ytd_stress.zarr (configs/corn_usa.yaml:249, configs/wheat_usa.yaml:302, configs/cotton_usa.yaml:232); local clean copy for Brazil soy (configs/soybeans_bra.yaml:297) | Greenprint weather pipeline | [PLACEHOLDER: incremental daily] | time dim max in zarr; values are pre-cumulated, so the snapshot at each init_date IS the freshness gauge | required_for_pred_parquet: true ⇒ preflight abort. Phenoweighted z-score features (z_<crop>_*_phenoweight_cumsum) become unavailable |
| Climatology indices zarr | s3://{env}-treefera-greenprint-data/weather/processed/climo_indices/conus_adm2.zarr (configs/corn_usa.yaml:256, configs/wheat_usa.yaml:309); local copies for soy (configs/soybeans_usa.yaml:230, configs/cotton_usa.yaml:239); Brazil clean copy (configs/soybeans_bra.yaml:305) | Greenprint weather pipeline (climatology stage) | [PLACEHOLDER: rebuilt when baseline window updates; see wiki/commodity_hindcast/concepts/climo_materialisation.md] | Baseline window encoded in filename (e.g. baseline_1980_2025_w31) | required_for_pred_parquet: true ⇒ preflight abort; z-score features all become NaN |
| Materialised climatology zarr (forecast) | forecast.materialised_climo_filepath (typed ResolvablePath, config.py:632); e.g. s3://{env}-treefera-greenprint-data/weather/processed/climatology/conus_adm2_baseline_1980_2025_w31_materialised.zarr (configs/corn_usa.yaml:348) | Greenprint weather pipeline | [PLACEHOLDER: rebuilt when baseline window changes] | Baseline window in filename; consumed by materialise_for_forecast(...) per MEMORY.md/feedback_centralised_climo | Forecast stage validate_residual_mode rejects with actionable message; hindcast unaffected |
| Raw observed weather zarr (forecast) | forecast.raw_obs_filepath (config.py:631); s3://{env}-treefera-greenprint-data/weather/processed/areal_aggregation/conus_adm2.zarr (configs/corn_usa.yaml:346); note that configs/soybeans_usa.yaml:288 uses an absolute local path /data/processing/weather/... | Greenprint weather pipeline (areal aggregation stage) | [PLACEHOLDER: incremental daily] | time dim max ≥ requested init_date | Forecast preflight aborts; hindcast unaffected |
| WASDE CSV | reference_data[kind=wasde].filepath = data/wasde/wasde_<crop>_us_yield.csv (configs/corn_usa.yaml:361, configs/soybeans_usa.yaml:301, configs/wheat_usa.yaml:393, configs/cotton_usa.yaml:309); typed ResolvablePath on _ReferenceYieldSpecBase.filepath (lib/reference_data/base_reference_yield_loader.py:47) | USDA WASDE; ingestion owner [PLACEHOLDER] | Monthly WASDE release (cutoff_month_day defaults to Feb 1, the marketing-year close, see lib/reference_data/wasde.py:24) | Inspect max(release_date) per spec name | Empty filter ⇒ WasdeLoader.load raises ValueError (lib/reference_data/wasde.py:67); diagnostics + delivery skip the WASDE column prefix when reference_data is empty (config.py:731) |
| CONAB Levantamento bulletin | reference_data[kind=conab_levantamento].filepath = data/conab/conab_levantamento_graos.txt (configs/soybeans_bra.yaml:375); semicolon, latin-1 (lib/reference_data/conab.py:75) | CONAB; ingestion owner [PLACEHOLDER] | Monthly | id_levantamento / dsc_levantamento per row | Same as WASDE: loader raises if filtered frame empty; spec absent ⇒ skipped silently |
| CONAB Série Histórica | reference_data[kind=conab_final].filepath = data/conab/conab_serie_historica_graos.txt (configs/soybeans_bra.yaml:381) | CONAB | Annual post-harvest | safra max | Same as Levantamento |
| Geo boundaries (delivery join) | geometry.parquet, located via _resolve_geometry_parquet_path searching config_resolved.yaml's data_root, then INPUT_DATA_DIR, then run_dir (delivery/export.py:98-127) | Treefera boundaries pipeline; per MEMORY.md env vars: CROP_YIELD_GEOBOUNDARIES_FILE for crop_yield, but commodity_hindcast resolves locally via INPUT_DATA_DIR | [PLACEHOLDER: rebuilt when boundaries refresh] | File modtime; row count | FileNotFoundError at delivery export (delivery/export.py:124); only the export step needs it, so hindcast/forecast core is unaffected |
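The geometry lookup in the last table row is an ordered search where the first hit wins. The following is a minimal sketch of that pattern, not the code in delivery/export.py: the function name find_geometry_parquet, its parameters, and the treatment of an unset root as None are all assumptions made for illustration.

```python
from pathlib import Path
from typing import Iterable, Optional


def find_geometry_parquet(
    search_roots: Iterable[Optional[Path]],
    filename: str = "geometry.parquet",
) -> Path:
    """Return the first existing <root>/<filename>, trying roots in order.

    Hypothetical helper: the caller is assumed to pass the roots in the
    documented order (data_root, then INPUT_DATA_DIR, then run_dir).
    """
    tried = []
    for root in search_roots:
        if root is None:  # assumption: an unset root is simply skipped
            continue
        candidate = Path(root) / filename
        tried.append(str(candidate))
        if candidate.is_file():
            return candidate
    # Listing every candidate makes the failure actionable in a fresh worktree.
    raise FileNotFoundError(f"{filename} not found; tried: " + ", ".join(tried))
```

Listing every attempted location in the error message is a design choice of the sketch; it makes a missing symlink or unset variable visible from the traceback alone.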
Reference yield series
Reference yields drive the WASDE/CONAB metric columns and the in-season comparator on the rolling-forecast plot. Each spec's name becomes the column prefix downstream and must be unique within reference_data (config.py:870-883). Empty list ⇒ no reference series at all (config.py:731).
| Spec kind | Loader | Source unit | Emitted unit | Cutoff semantics | Configured by |
|---|---|---|---|---|---|
| wasde | WasdeLoader (lib/reference_data/wasde.py:38) | bu_acre (default, wasde.py:35) | kg/ha (ABC contract, base_reference_yield_loader.py:11) | release_date earlier than cutoff_month_day of harvest_year + 1 (default Feb 1, wasde.py:24-28) | All four US YAMLs (corn, soy, wheat, cotton) |
| conab_final | ConabFinalLoader (lib/reference_data/conab.py:94) | kg_per_ha (conab.py:33) | kg/ha | cutoff_month_day (Brazil-soy YAML uses Oct 1, configs/soybeans_bra.yaml:385) | soybeans_bra.yaml only |
| conab_levantamento | ConabLevantamentoLoader | kg_per_ha (conab.py:44) | kg/ha | cutoff_month_day (Oct 1 in soy-BRA, configs/soybeans_bra.yaml:378) | soybeans_bra.yaml only |
Order matters in reference_data: the first spec is the in-season comparator on the rolling-forecast plot (configs/soybeans_bra.yaml:366-371).
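The WASDE row combines two mechanical steps: a unit conversion from bu/acre to kg/ha and a release-date cutoff. The sketch below illustrates both under stated assumptions and is not the WasdeLoader implementation. The bushel weights are the standard US values (56 lb for corn, 60 lb for soybeans and wheat); whether the loader uses exactly these, how it handles cotton, and the tuple form of cutoff_month_day are assumptions. The function names are hypothetical.

```python
from datetime import date

LB_TO_KG = 0.45359237        # exact, by definition of the avoirdupois pound
ACRE_TO_HA = 0.40468564224   # exact, international acre

# Standard US bushel weights in lb per bushel (assumed, not read from the repo).
BUSHEL_LB = {"corn": 56.0, "soybeans": 60.0, "wheat": 60.0}


def bu_acre_to_kg_ha(value: float, crop: str) -> float:
    """Convert a yield in bushels per acre to kilograms per hectare."""
    return value * BUSHEL_LB[crop] * LB_TO_KG / ACRE_TO_HA


def released_before_cutoff(
    release_date: date,
    harvest_year: int,
    cutoff_month_day: tuple = (2, 1),  # hypothetical (month, day) encoding
) -> bool:
    """True if the release is strictly earlier than the cutoff in harvest_year + 1."""
    month, day = cutoff_month_day
    return release_date < date(harvest_year + 1, month, day)
```

With the default Feb 1 cutoff, a January release in the year after harvest is kept for that harvest year, while a release dated Feb 1 or later is not.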
How resolution works
Every ResolvablePath field is registered by its annotation. At ExperimentConfig construction, _resolve_data_paths walks the config tree via _iter_resolvable_fields(self) and calls resolve_data_path(value, self.data_root) on each (config.py:853-867). Relative values anchor at data_root (the INPUT_DATA_DIR AnyPath, config.py:50); absolute paths and s3:// URIs pass through after expand_env_template substitutes {env} (config.py:82-100). Before any stage runs, run_preflight(preflight_paths_for_<stage>(config)) calls check_path_exists on every yielded path (run/preflight.py:59, :77, :110); a missing input fails with a single line naming the field and the resolved location (DESIGN.md:69). Adding a new ResolvablePath field auto-extends preflight coverage with no changes to preflight code (wiki/commodity_hindcast/concepts/resolvable_path.md).
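The resolution and preflight rules described above can be sketched in a few lines. This is a simplified stand-in rather than the code in config.py or run/preflight.py: sketch_resolve and sketch_preflight are hypothetical names, the {env} value is assumed to be passed in explicitly, and remote s3:// existence checks are deliberately left out.

```python
from pathlib import Path


def sketch_resolve(value: str, data_root: Path, env: str) -> str:
    """Resolve one configured path against data_root.

    Relative values anchor at data_root; absolute paths and s3:// URIs
    pass through unchanged once the {env} placeholder is substituted.
    """
    expanded = value.replace("{env}", env)
    if expanded.startswith("s3://") or Path(expanded).is_absolute():
        return expanded
    return str(data_root / expanded)


def sketch_preflight(resolved: dict) -> list:
    """Return one 'field: resolved path' line per missing local input."""
    errors = []
    for field, location in resolved.items():
        if location.startswith("s3://"):
            continue  # assumption: remote checks are out of scope here
        if not Path(location).exists():
            errors.append(f"{field}: {location}")
    return errors
```

A caller would resolve every field first and abort the stage if sketch_preflight returns any lines, which mirrors the fail-closed behaviour with one line per missing input.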
Known data risks
- MLflow DB locking: concurrent same-commodity runs corrupt mlruns.db (MEMORY.md "Known Issues"). Run sequentially.
- Wheat sub-types absent post-preprocessing: only WHEAT survives; WINTER_WHEAT and the spring-wheat split listed in some configs are not produced by the NASS preprocessor (MEMORY.md).
- Brazil-soy clean-zarr workaround: 4 hash-suffixed identifiers (varzea/quixaba) in the upstream brazil_prod_adm2_brazil_soy_*.zarr fail GEO_ID_PATTERN; locally cleaned copies are used (configs/soybeans_bra.yaml:280-309). If upstream is regenerated without re-cleaning, preflight passes but assert_valid_geo_identifiers fails mid-run.
- Brazil-soy climo NaN/inf features: edd_zscore is 100% NaN and dry_days_zscore_gstd carries +inf upstream; both are excluded from feature_cols until republished (configs/soybeans_bra.yaml:122-137).
- TMI selection-bias correction: broken by upstream NaN dropouts (MEMORY.md project_sbc_tmi_bug); the SBC formula matches QUBE, but core corn-belt counties are dropped from training before SBC is applied.
- Mixed forecast paths between US YAMLs: corn_usa.yaml and wheat_usa.yaml use s3://{env}-... for forecast.raw_obs_filepath; soybeans_usa.yaml hardcodes /data/processing/weather/... (configs/soybeans_usa.yaml:288). Soy will fail in environments without that mount; corn/wheat are environment-portable.
- Geometry parquet lookup fragility: _resolve_geometry_parquet_path searches three locations; in fresh worktrees the symlink to local_data/ is needed (MEMORY.md project_local_data_symlink).
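The NaN/inf risk above is the kind of problem a small audit can surface before a feature is added back to feature_cols. The sketch below is illustrative only and is not part of the pipeline: non_finite_report is a hypothetical helper, and plain lists of floats stand in for whatever array type the real features use.

```python
import math


def non_finite_report(columns: dict) -> dict:
    """Map each feature name with NaN or inf values to its NaN fraction and inf count.

    Features whose values are all finite are omitted from the report.
    """
    report = {}
    for name, values in columns.items():
        nan_count = sum(1 for v in values if math.isnan(v))
        inf_count = sum(1 for v in values if math.isinf(v))
        if nan_count or inf_count:
            report[name] = {
                "nan_frac": nan_count / len(values),
                "inf_count": inf_count,
            }
    return report
```

A feature reported with nan_frac equal to 1.0 matches the "100% NaN" case, and any nonzero inf_count matches the "+inf upstream" case.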
[PLACEHOLDER: upstream owners and formal refresh cadences for NASS, IBGE, WASDE, CONAB, Greenprint weather indices, and the geometry pipeline — none of these are documented in the live code or the wiki concept pages cited above.]