Brazil Soybean (Soja) Experiment Config¶
Top-level fields¶
| Field | Value | Meaning |
|---|---|---|
| `random_seed` | `42` | Global RNG seed |
| `mlflow_tracking_uri` | `sqlite:///mlruns.db` | Local SQLite MLflow store |
| `experiment_name` | `brazil_soybean_yield_prediction` | MLflow experiment name |
| `feature_start_year` | `1990` | IBGE-PAM has data from 1974, but soja is sparse before 1990 |
| `feature_end_year` | `2025` | Latest feature year |
| `check_data_exists` | `[]` | No extra preflight checks |
| `commodity.commodity` | `soybeans` | Internal key kept as `soybeans`; required by the `COMMODITY_STRESS_VARIABLES` branch in `stress_compute.py` |
| `commodity.country_code` | `BRA` | ISO-3 code; drives the `geo_identifier` ADM0 segment and `experiment_key` |
| `commodity.season_start` | month 9, day 1 | Season epoch: 1 Sep of the prior calendar year |
| `commodity.season_start_year_offset` | `-1` | Cross-year: `season_start_date(2024)` = Sep 1, 2023 |
| `commodity.harvest_season_doy` | `212` | End of productive growing season: Mar 31 (sdoy 212, counting Sep 1 as sdoy 1) |
| `commodity.bushel_weight_lbs` | `60.0` | Universal soybean bushel weight |
| `commodity.delivery_unit` | `bu_acre` | Kept as `bu_acre` to avoid breaking the hardcoded column-rename map in `delivery/export.py` |
| `commodity.yield_range` | `[10.0, 100.0]` | Bounds in bu/acre; PAM range [800, 6500] kg/ha maps to ~[11.9, 96.6] bu/acre |
| `commodity.actuals_source_short` | `IBGE` | Short label used in metrics output (`vs_IBGE`) |
| `commodity.actuals_source_label` | `IBGE-PAM municipal yield (area-weighted)` | Full label for plots |
| `commodity.freeze_cap_sdoy` | `184` | sdoy 184 under the Sep 1 epoch falls in early March; same numeric cap as US configs |
| `experiment_protocol.cv_strategy` | `expanding` | Walk-forward expanding CV |
| `experiment_protocol.test_years` | 2020–2024 | Hold-out years |
| `experiment_protocol.production_cumulative_threshold` | `0.90` | Keep the top-producing municípios that together account for 90% of cumulative production |
| `experiment_protocol.production_recent_years` | `5` | Number of recent years used for the production ranking |
| `model.detrend` | `partial_pooling` | Partial-pooling detrender |
| `model.detrend_params` | `{}` | No fixed slope |
| `model.regression` | `pca_ridge` | PCA(2) → Ridge |
| `model.weather_correction_fit_level` | `ADM0` | National-level weather correction |
| `model.regression_params.n_components` | `2` | PCA components |
| `model.regression_params.alpha` | `10.0` | Ridge regularisation |
| `model.regression_params.nan_policy` | `raise` | Regressors reject NaNs |
| `forecast.residual_mode` | `hindcast_oos_per_init_date` | Conformal calibration mode |
| `postprocess.bias_corrector.kind` | `none` | No bias correction |
| `delivery.model_public_name` | `TFFS_BRA_SOJA_V0` | Distinct public model name (only non-`TFFS_V0` label) |
| `delivery.ci_levels` | `[0.5, 0.68, 0.80, 0.90, 0.95]` | Conformal interval levels |
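One plausible reading of `production_cumulative_threshold` and `production_recent_years` is sketched below: sum each município's production over the most recent years, rank in descending order, and keep municípios until the cumulative share reaches the threshold. The function name, the record layout, and the choice to sum over years are assumptions made for illustration; this is not the pipeline's actual selection code.

```python
from collections import defaultdict


def select_top_producers(records, threshold=0.90, recent_years=5):
    """Hypothetical helper: pick municípios covering `threshold` of production.

    records: iterable of (municipio_id, year, production) tuples.
    """
    records = list(records)
    latest = max(year for _, year, _ in records)
    first = latest - recent_years + 1  # earliest year that still counts

    # Total production per município over the recent window only.
    totals = defaultdict(float)
    for geo, year, production in records:
        if year >= first:
            totals[geo] += production

    grand_total = sum(totals.values())
    selected, running = [], 0.0
    for geo, production in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
        if running >= threshold * grand_total:
            break  # threshold already covered by larger producers
        selected.append(geo)
        running += production
    return selected


# Toy data with made-up identifiers; the 2010 record falls outside the window.
toy_records = [
    ("A", 2024, 60.0),
    ("B", 2024, 25.0),
    ("C", 2024, 10.0),
    ("D", 2024, 5.0),
    ("D", 2010, 1000.0),
]
selected = select_top_producers(toy_records)
print(selected)
```

In this toy case the município that crosses the 90% line is included, so the selected set covers at least the threshold share; whether the real pipeline includes or excludes the crossing unit is not stated in the config.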
Builders¶
| Builder | Kind | Notable parameters |
|---|---|---|
| `yields` | `YieldsBuilder` | `filepath: data/ibge/soja_brazil_municipios.parquet` (IBGE-PAM source, not NASS); standard 2-rule edit chain; `required_for_pred_parquet: false` |
| `weather` | `WeatherBuilder` | `filepath: data/weather/indices/brazil_prod_adm2_brazil_soy_clean.zarr` (cleaned copy, 4 hash-disambiguated geoids removed); `required_for_pred_parquet: true` |
| `weather_stress` | `WeatherBuilder` | `filepath: data/weather/stress/brazil_prod_adm2_brazil_soy_ytd_stress_clean.zarr` (locally-cleaned copy of the upstream S3 ytd-stress zarr, same 4 hash-disambiguated geoids removed); `required_for_pred_parquet: true` |
| `climo` | `ClimoBuilder` | `filepath: data/weather/climo_indices/brazil_prod_adm2_clean.zarr`; `geo_id_col: identifier` (not `geoid` as in US zarrs); `required_for_pred_parquet: true` |
This config has a weather_stress builder and no stress parquet builder. The builder reads data/weather/stress/brazil_prod_adm2_brazil_soy_ytd_stress_clean.zarr directly; that file is the locally-cleaned copy of the S3 upstream brazil_prod_adm2_brazil_soy_ytd_stress.zarr. The climo builder uses geo_id_col: "identifier", a key difference from all US configs, which use geo_id_col: "geoid" (soybeans_bra.yaml:commodity.builders.climo.geo_id_col).
Data-fix notes (from config comments)¶
- Weather and climo zarrs had 4 hash-disambiguated geoids (e.g. ADM0:BRA/ADM1:rio granda do norte/ADM2:varzea:szdne4om) that failed GEO_ID_PATTERN. Cleaned zarrs drop those 4 rows.
- Forecast zarrs were also cleaned for the same issue.
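The cleaning step described above can be sketched as a simple pattern filter. The regular expression here is a hypothetical stand-in: the real `GEO_ID_PATTERN` is not shown in this document, and the only property assumed is that an ADM2 segment carrying a trailing `:hash` suffix does not match. The helper name and the valid example identifier are likewise illustrative.

```python
import re

# Hypothetical stand-in for the pipeline's GEO_ID_PATTERN (assumption):
# ADM0 is a 3-letter code; ADM1/ADM2 segments contain neither "/" nor ":".
GEO_ID_PATTERN = re.compile(r"ADM0:[A-Z]{3}(/ADM1:[^/:]+(/ADM2:[^/:]+)?)?")


def split_valid_geoids(geoids):
    """Partition identifiers into (valid, rejected) by full pattern match."""
    valid, rejected = [], []
    for geo_id in geoids:
        if GEO_ID_PATTERN.fullmatch(geo_id):
            valid.append(geo_id)
        else:
            rejected.append(geo_id)
    return valid, rejected


example_ids = [
    "ADM0:BRA/ADM1:mato grosso/ADM2:sorriso",  # made-up well-formed id
    "ADM0:BRA/ADM1:rio granda do norte/ADM2:varzea:szdne4om",  # from the note above
]
valid_ids, rejected_ids = split_valid_geoids(example_ids)
print(rejected_ids)
```

Applied to a zarr, the rejected list would give the rows to drop along the geo-identifier dimension before writing the `_clean` copy.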
Weather and climo windows¶
Climo windows (soybeans_bra.yaml:commodity.climo_windows):
| Name | sdoy_start | sdoy_end | Calendar |
|---|---|---|---|
| `gstd` | 1 | null | Full growing season to date (Sep 1 onwards) |
| `sep_dec` | 1 | 122 | Sep 1 – Dec 31 (early-season reference, analogous to US `apr_jul`) |
Weather windows (soybeans_bra.yaml:commodity.weather_windows) — four phenological phases under Sep-1 epoch:
| Name | sdoy_start | sdoy_end | Calendar | Phase |
|---|---|---|---|---|
| `sep_oct` | 1 | 61 | Sep 1 – Oct 31 | Planting |
| `nov` | 62 | 91 | Nov 1 – Nov 30 | Flowering |
| `dec_jan` | 92 | 153 | Dec 1 – Jan 31 | Pod set |
| `feb_mar` | 154 | 212 | Feb 1 – Mar 31 | Pod fill + early harvest |
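The season-DOY arithmetic behind these windows can be checked with a short sketch. It assumes sdoy is 1-based with Sep 1 as sdoy 1 (consistent with the `sdoy_start` values in the tables) and reuses the `season_start` and `season_start_year_offset` values from the top-level fields. The function names `sdoy_to_date` and `window_for` are illustrative, not the pipeline's API.

```python
from datetime import date, timedelta

SEASON_START_MONTH, SEASON_START_DAY = 9, 1  # commodity.season_start
SEASON_START_YEAR_OFFSET = -1                # commodity.season_start_year_offset

# (sdoy_start, sdoy_end), both inclusive, copied from the weather-window table.
WEATHER_WINDOWS = {
    "sep_oct": (1, 61),
    "nov": (62, 91),
    "dec_jan": (92, 153),
    "feb_mar": (154, 212),
}


def season_start_date(season_year):
    """Season epoch: Sep 1 of the calendar year before `season_year`."""
    return date(season_year + SEASON_START_YEAR_OFFSET,
                SEASON_START_MONTH, SEASON_START_DAY)


def sdoy_to_date(season_year, sdoy):
    """Convert a 1-based season day-of-year to a calendar date."""
    return season_start_date(season_year) + timedelta(days=sdoy - 1)


def window_for(sdoy):
    """Return the weather-window name containing `sdoy`, or None."""
    for name, (start, end) in WEATHER_WINDOWS.items():
        if start <= sdoy <= end:
            return name
    return None


print(season_start_date(2024))   # epoch for the 2024 season
print(sdoy_to_date(2023, 212))   # harvest_season_doy, non-leap February
print(sdoy_to_date(2024, 212))   # same sdoy when the season spans Feb 29
print(window_for(100))
```

The calendar labels in the tables hold exactly when the season's February has 28 days; when the season spans a Feb 29, every sdoy after 181 lands one calendar day earlier, so sdoy 212 falls on Mar 30.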
Feature columns¶
17 features (soybeans_bra.yaml:commodity.feature_cols) — trimmed from a larger candidate set due to upstream data issues:
Dropped features (config-fix, 2026-04-30):
- edd_zscore_*: 100% NaN in Brazil climo zarr (EDD near-zero in tropical Brazil; zero standard deviation makes z-score undefined).
- dry_days_zscore_gstd: ~30% ±inf values (same zero-variance pathology manifesting as inf rather than NaN).
Active features use sep_dec climo window references in place of the US apr_jul equivalents. Z-score variables in climo_zscore_vars also exclude edd for the same reason.
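The zero-variance pathology can be reproduced in a few lines. The sketch below mimics IEEE-754 division (which is what array libraries such as NumPy apply elementwise): a zero climatological standard deviation yields NaN when the anomaly is also zero and ±inf otherwise. The helper names, the toy numbers, and the column names `edd_zscore_gstd` and `precip_zscore_sep_dec` are assumptions for illustration; only `dry_days_zscore_gstd` is taken from the notes above.

```python
import math


def ieee_div(num, den):
    """Divide with IEEE-754 semantics: 0/0 -> NaN, x/0 -> +/-inf."""
    if den != 0.0:
        return num / den
    if num == 0.0 or math.isnan(num):
        return math.nan
    return math.copysign(math.inf, num)


def zscore(value, climo_mean, climo_std):
    """Standardised anomaly against a climatological mean and std."""
    return ieee_div(value - climo_mean, climo_std)


def nonfinite_fraction(values):
    """Share of NaN or +/-inf entries in a column."""
    values = list(values)
    if not values:
        return 0.0
    return sum(not math.isfinite(v) for v in values) / len(values)


def usable_features(columns, max_bad_fraction=0.0):
    """Keep columns whose non-finite share does not exceed the limit."""
    return [name for name, vals in columns.items()
            if nonfinite_fraction(vals) <= max_bad_fraction]


# Toy columns (made-up numbers) reproducing the two failure modes.
columns = {
    # EDD is always zero, so mean = std = 0 and every z-score is 0/0.
    "edd_zscore_gstd": [zscore(0.0, 0.0, 0.0) for _ in range(3)],
    # One location has zero variance but a non-zero anomaly: x/0 -> inf.
    "dry_days_zscore_gstd": [zscore(12.0, 10.0, 2.0),
                             zscore(3.0, 0.0, 0.0),
                             zscore(8.0, 10.0, 2.0)],
    "precip_zscore_sep_dec": [0.3, -0.8, 1.1],
}
kept = usable_features(columns)
print(kept)
```

Because `model.regression_params.nan_policy` is `raise`, any column like the first two would abort fitting, which is consistent with dropping them from `feature_cols` rather than imputing.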
Season-DOY weather ramp¶
Same immediate step-on as US soybeans (soybeans_bra.yaml:model.regression_params.season_doy_weather_weight):
Full weather correction weight from sdoy 2 onwards.
Reference data¶
Two CONAB series (vs a single WASDE series for US crops) — the first is treated as the primary in-season comparator:
| kind | name | filepath | unit | cutoff_month_day |
|---|---|---|---|---|
| `conab_levantamento` | `conab_lev` | `data/conab/conab_levantamento_graos.txt` | `kg_per_ha` | month 10, day 1 |
| `conab_final` | `conab_final` | `data/conab/conab_serie_historica_graos.txt` | `kg_per_ha` | month 10, day 1 |
Key difference from US configs: units are kg_per_ha (not bu_acre), and the cutoff is 1 Oct (not 1 Feb). The Levantamento series (monthly crop assessment) goes first; Série Histórica (post-harvest final) is secondary (soybeans_bra.yaml:reference_data).
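Since the reference series are in kg_per_ha while the delivery unit stays bu_acre, a conversion is needed somewhere before the two can be compared. A minimal sketch of that arithmetic follows, using the 60 lb bushel from `commodity.bushel_weight_lbs` and the exact definitions of the pound and the international acre. The function names are illustrative, and where the pipeline actually performs this conversion is not stated here.

```python
LB_TO_KG = 0.45359237        # exact, by definition of the avoirdupois pound
ACRE_TO_HA = 0.40468564224   # exact, international acre in hectares
BUSHEL_WEIGHT_LBS = 60.0     # commodity.bushel_weight_lbs
YIELD_RANGE_BU_ACRE = (10.0, 100.0)  # commodity.yield_range


def kg_per_ha_to_bu_acre(kg_per_ha, bushel_weight_lbs=BUSHEL_WEIGHT_LBS):
    """Convert a yield from kg/ha to bushels per acre."""
    kg_per_bushel = bushel_weight_lbs * LB_TO_KG
    kg_per_acre = kg_per_ha * ACRE_TO_HA
    return kg_per_acre / kg_per_bushel


def within_yield_range(bu_acre, bounds=YIELD_RANGE_BU_ACRE):
    """Check a converted yield against the configured bounds."""
    low, high = bounds
    return low <= bu_acre <= high


# The PAM bounds quoted in the top-level table.
for kg_ha in (800.0, 6500.0):
    converted = kg_per_ha_to_bu_acre(kg_ha)
    print(kg_ha, round(converted, 2), within_yield_range(converted))
```

With these constants one bu/acre is roughly 67.25 kg/ha, which is how the PAM range of [800, 6500] kg/ha lands inside the configured [10.0, 100.0] bu/acre bounds.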
Forecast paths¶
| Field | Value |
|---|---|
| `raw_obs_filepath` | `data/weather/areal_aggregation/brazil_prod_adm2_clean.zarr` |
| `materialised_climo_filepath` | `data/weather/climatology/brazil_prod_adm2_baseline_1980_2025_w31_materialised_clean.zarr` |
Local relative paths (resolved under data_root), not S3 templates.
What makes this config distinctive¶
This is the only non-US config in the pipeline. It combines three structural departures from the US configs: (1) southern-hemisphere cross-year season (Sep–Mar), requiring season_start_year_offset: -1 with a Sep-1 epoch rather than Oct-1; (2) IBGE-PAM yields parquet instead of NASS, with actuals_source_short: IBGE labels throughout metrics and plots; (3) dual CONAB reference series (conab_levantamento + conab_final) in units of kg_per_ha with an Oct-1 cutoff. It also carries the only non-TFFS_V0 public model name (TFFS_BRA_SOJA_V0), and the climo builder requires geo_id_col: "identifier" rather than the US-standard "geoid". The feature set was explicitly trimmed from the US candidate list due to EDD z-score NaN pathology unique to tropical climatology.