Brazil Soybean (Soja) Experiment Config¶

Top-level fields¶

Field	Value	Meaning
`random_seed`	42	Global RNG seed
`mlflow_tracking_uri`	`sqlite:///mlruns.db`	Local SQLite MLflow store
`experiment_name`	`brazil_soybean_yield_prediction`	MLflow experiment name
`feature_start_year`	1990	IBGE-PAM has data from 1974 but soja is sparse before 1990
`feature_end_year`	2025	Latest feature year
`check_data_exists`	`[]`	No extra preflight checks
`commodity.commodity`	`soybeans`	Internal key kept as `soybeans` — required by `COMMODITY_STRESS_VARIABLES` branch in `stress_compute.py`
`commodity.country_code`	`BRA`	ISO-3 code; drives `geo_identifier` ADM0 segment and `experiment_key`
`commodity.season_start`	month 9, day 1	Season epoch: 1 Sep of the prior calendar year
`commodity.season_start_year_offset`	-1	Cross-year: `season_start_date(2024) = Sep 1, 2023`
`commodity.harvest_season_doy`	212	End of productive growing season: sdoy 212 is Mar 31 in a non-leap season (Sep 1 counts as sdoy 1)
`commodity.bushel_weight_lbs`	60.0	Universal soybean bushel weight
`commodity.delivery_unit`	`bu_acre`	Kept as `bu_acre` to avoid breaking hardcoded column-rename map in `delivery/export.py`
`commodity.yield_range`	`[10.0, 100.0]`	Bounds in bu/acre; PAM range [800, 6500] kg/ha maps to ~[11.9, 96.7] bu/acre at 60 lb/bu
`commodity.actuals_source_short`	`IBGE`	Short label used in metrics output (`vs_IBGE`)
`commodity.actuals_source_label`	`IBGE-PAM municipal yield (area-weighted)`	Full label for plots
`commodity.freeze_cap_sdoy`	184	184 d from Sep 1 = early March; same numeric cap as US configs
`experiment_protocol.cv_strategy`	`expanding`	Walk-forward expanding CV
`experiment_protocol.test_years`	2020–2024	Hold-out years
`experiment_protocol.production_cumulative_threshold`	0.90	Keeps the largest-producing municípios that together account for 90% of production
`experiment_protocol.production_recent_years`	5	Years for production ranking
`model.detrend`	`partial_pooling`	Partial-pooling detrender
`model.detrend_params`	`{}`	No fixed slope
`model.regression`	`pca_ridge`	PCA(2) → Ridge
`model.weather_correction_fit_level`	`ADM0`	National-level weather correction
`model.regression_params.n_components`	2	PCA components
`model.regression_params.alpha`	10.0	Ridge regularisation
`model.regression_params.nan_policy`	`raise`	Regressors reject NaNs
`forecast.residual_mode`	`hindcast_oos_per_init_date`	Conformal calibration mode
`postprocess.bias_corrector.kind`	`none`	No bias correction
`delivery.model_public_name`	`TFFS_BRA_SOJA_V0`	Distinct public model name (only non-`TFFS_V0` label)
`delivery.ci_levels`	`[0.5, 0.68, 0.80, 0.90, 0.95]`	Conformal interval levels
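
The `production_cumulative_threshold` and `production_recent_years` fields can be read as a coverage rule: rank municípios by their production over the recent-years window, then keep the smallest set whose cumulative share reaches 0.90. The sketch below illustrates that reading only. The function name, the input shape, and the tie-breaking order are assumptions, not the pipeline's actual code.

```python
def select_top_producers(production_by_geo, threshold=0.90):
    """Return region ids, largest producer first, until their cumulative
    share of total production reaches `threshold`.

    production_by_geo: dict mapping region id -> mean production over the
    recent-years window (hypothetical input shape).
    """
    total = sum(production_by_geo.values())
    if total <= 0:
        return []
    # Sort from largest to smallest producer.
    ranked = sorted(production_by_geo.items(), key=lambda kv: kv[1], reverse=True)
    selected, cumulative = [], 0.0
    for geo_id, production in ranked:
        selected.append(geo_id)
        cumulative += production
        if cumulative / total >= threshold:
            break
    return selected


# Toy numbers, not real IBGE-PAM production figures.
example = {"A": 50.0, "B": 30.0, "C": 15.0, "D": 5.0}
print(select_top_producers(example))  # ['A', 'B', 'C']
```

Under this reading, 0.90 is a share of production, not a share of the município count, so a handful of large producers can satisfy it.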

Builders¶

Builder	Kind	Notable parameters
`yields`	`YieldsBuilder`	`filepath: data/ibge/soja_brazil_municipios.parquet` — IBGE-PAM source (not NASS); standard 2-rule edit chain; `required_for_pred_parquet: false`
`weather`	`WeatherBuilder`	`filepath: data/weather/indices/brazil_prod_adm2_brazil_soy_clean.zarr` — cleaned copy (4 hash-disambiguated geoids removed); `required_for_pred_parquet: true`
`weather_stress`	`WeatherBuilder`	`filepath: data/weather/stress/brazil_prod_adm2_brazil_soy_ytd_stress_clean.zarr` — locally-cleaned copy of the upstream S3 ytd-stress zarr (same 4 hash-disambiguated geoids removed); `required_for_pred_parquet: true`
`climo`	`ClimoBuilder`	`filepath: data/weather/climo_indices/brazil_prod_adm2_clean.zarr`; `geo_id_col: identifier` (not `geoid` as in US zarrs); `required_for_pred_parquet: true`

This config has a `weather_stress` builder and no stress parquet builder. The builder reads `data/weather/stress/brazil_prod_adm2_brazil_soy_ytd_stress_clean.zarr` directly, which is the locally-cleaned copy of the upstream S3 zarr `brazil_prod_adm2_brazil_soy_ytd_stress.zarr`. The `climo` builder uses `geo_id_col: "identifier"`, a key difference from all US configs, which use `geo_id_col: "geoid"` (soybeans_bra.yaml:commodity.builders.climo.geo_id_col).

Data-fix notes (from config comments)¶

The weather and climo zarrs contained 4 hash-disambiguated geoids (e.g. `ADM0:BRA/ADM1:rio granda do norte/ADM2:varzea:szdne4om`) that failed `GEO_ID_PATTERN`. The cleaned zarrs drop those 4 entries.
The forecast zarrs were cleaned for the same issue.
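
The cleaning step above amounts to validating each geoid against a pattern and dropping the failures. The real `GEO_ID_PATTERN` is not shown in the config, so the regex below is a hypothetical stand-in that rejects a trailing `:hash` suffix after the ADM2 segment. Treat it as a sketch of the idea, not the pipeline's rule.

```python
import re

# Hypothetical stand-in for the pipeline's GEO_ID_PATTERN: three ADM segments,
# none of which may contain a further ":" (so a trailing ":hash" suffix fails).
ASSUMED_GEO_ID_PATTERN = re.compile(r"ADM0:[^/:]+/ADM1:[^/:]+/ADM2:[^/:]+")


def split_valid_geoids(geoids):
    """Partition geoids into (valid, rejected) under the assumed pattern."""
    valid, rejected = [], []
    for geoid in geoids:
        if ASSUMED_GEO_ID_PATTERN.fullmatch(geoid):
            valid.append(geoid)
        else:
            rejected.append(geoid)
    return valid, rejected


geoids = [
    "ADM0:BRA/ADM1:mato grosso/ADM2:sorriso",  # illustrative clean id
    "ADM0:BRA/ADM1:rio granda do norte/ADM2:varzea:szdne4om",  # example from the notes
]
valid, rejected = split_valid_geoids(geoids)
print(valid)
print(rejected)
```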

Weather and climo windows¶

Climo windows (soybeans_bra.yaml:commodity.climo_windows):

Name	sdoy_start	sdoy_end	Calendar
`gstd`	1	null	Full growing season to date (Sep 1 onwards)
`sep_dec`	1	122	Sep 1 – Dec 31 (early-season reference, analogous to US `apr_jul`)

Weather windows (soybeans_bra.yaml:commodity.weather_windows): four phenological phases under the Sep-1 epoch (calendar dates shown for a non-leap season):

Name	sdoy_start	sdoy_end	Calendar	Phase
`sep_oct`	1	61	Sep 1 – Oct 31	Planting
`nov`	62	91	Nov 1 – Nov 30	Flowering
`dec_jan`	92	153	Dec 1 – Jan 31	Pod set
`feb_mar`	154	212	Feb 1 – Mar 31	Pod fill + early harvest
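
The window boundaries above follow from counting Sep 1 of the prior calendar year as sdoy 1. The sketch below reproduces that arithmetic with the standard library. `season_start_date` is named in the config notes, but this implementation, and the helper names `season_doy` and `window_for`, are assumptions made for illustration.

```python
from datetime import date

# Values taken from the config tables above.
SEASON_START_MONTH, SEASON_START_DAY = 9, 1
SEASON_START_YEAR_OFFSET = -1
WEATHER_WINDOWS = {
    "sep_oct": (1, 61),
    "nov": (62, 91),
    "dec_jan": (92, 153),
    "feb_mar": (154, 212),
}


def season_start_date(season_year):
    """Season epoch: Sep 1 of the calendar year before `season_year`."""
    return date(season_year + SEASON_START_YEAR_OFFSET,
                SEASON_START_MONTH, SEASON_START_DAY)


def season_doy(day, season_year):
    """1-based day count from the season epoch (the epoch itself is sdoy 1)."""
    return (day - season_start_date(season_year)).days + 1


def window_for(sdoy):
    """Name of the weather window containing `sdoy`, or None if outside all."""
    for name, (start, end) in WEATHER_WINDOWS.items():
        if start <= sdoy <= end:
            return name
    return None


print(season_start_date(2024))               # 2023-09-01
print(season_doy(date(2023, 3, 31), 2023))   # 212
print(window_for(season_doy(date(2022, 11, 1), 2023)))  # nov
```

In a season that spans a leap-year February, calendar dates after Feb 28 land one sdoy later (Mar 31, 2024 is sdoy 213 for season 2024). How the pipeline treats that shift is not stated here.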

Feature columns¶

17 features (soybeans_bra.yaml:commodity.feature_cols) — trimmed from a larger candidate set due to upstream data issues:

Dropped features (config-fix, 2026-04-30):
- `edd_zscore_*`: 100% NaN in the Brazil climo zarr (EDD is near-zero in tropical Brazil; a zero standard deviation makes the z-score undefined).
- `dry_days_zscore_gstd`: ~30% ±inf values (the same zero-variance pathology, manifesting as inf rather than NaN).

Active features use sep_dec climo window references in place of the US apr_jul equivalents. Z-score variables in climo_zscore_vars also exclude edd for the same reason.
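
The two failure modes share one cause. Under IEEE 754 floating-point arithmetic (as used by NumPy), a z-score `(x - mean) / std` with `std == 0` evaluates to NaN when the numerator is also zero and to ±inf when it is not. A simple screen for such columns is to measure the fraction of non-finite values. The sketch below uses only the standard library. The helper names, the tolerance, and the toy column values are assumptions, not the pipeline's checks.

```python
import math


def nonfinite_fraction(values):
    """Fraction of entries that are NaN or +/-inf (0.0 for an empty list)."""
    if not values:
        return 0.0
    bad = sum(1 for v in values if not math.isfinite(v))
    return bad / len(values)


def usable_features(columns, max_nonfinite=0.0):
    """Names of columns whose non-finite fraction is within tolerance."""
    return [
        name
        for name, values in columns.items()
        if nonfinite_fraction(values) <= max_nonfinite
    ]


# Toy columns mimicking the reported pathologies; values are invented.
columns = {
    "edd_zscore_example": [math.nan, math.nan, math.nan],        # all NaN
    "dry_days_zscore_gstd": [0.4, math.inf, -1.2, -math.inf],    # some +/-inf
    "finite_example": [0.3, -0.8, 1.1],
}
print(usable_features(columns))  # ['finite_example']
```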

Season-DOY weather ramp¶

Same immediate step-on as US soybeans (soybeans_bra.yaml:model.regression_params.season_doy_weather_weight):

1: 0.0
2: 1.0

Full weather correction weight from sdoy 2 onwards.
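
One way to read the mapping is as knots of a weight curve over sdoy. The sketch below assumes piecewise-linear interpolation between knots and clamping outside them. That interpolation rule is an assumption, and for integer sdoy values with knots at 1 and 2 any rule gives the same result: weight 0.0 on sdoy 1 and 1.0 from sdoy 2 onwards.

```python
def weather_weight(sdoy, knots):
    """Weight at `sdoy` from a {sdoy: weight} mapping.

    Assumes linear interpolation between knots and clamping at both ends
    (an illustrative choice, not confirmed by the config).
    """
    points = sorted(knots.items())
    if sdoy <= points[0][0]:
        return points[0][1]
    if sdoy >= points[-1][0]:
        return points[-1][1]
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= sdoy <= x1:
            return y0 + (y1 - y0) * (sdoy - x0) / (x1 - x0)


ramp = {1: 0.0, 2: 1.0}  # values from season_doy_weather_weight above
print([weather_weight(d, ramp) for d in (1, 2, 100)])  # [0.0, 1.0, 1.0]
```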

Reference data¶

Two CONAB series (vs a single WASDE series for US crops) — the first is treated as the primary in-season comparator:

`kind`	`name`	`filepath`	`unit`	`cutoff_month_day`
`conab_levantamento`	`conab_lev`	`data/conab/conab_levantamento_graos.txt`	`kg_per_ha`	month 10, day 1
`conab_final`	`conab_final`	`data/conab/conab_serie_historica_graos.txt`	`kg_per_ha`	month 10, day 1

Key difference from US configs: units are kg_per_ha (not bu_acre), and the cutoff is 1 Oct (not 1 Feb). The Levantamento series (monthly crop assessment) goes first; Série Histórica (post-harvest final) is secondary (soybeans_bra.yaml:reference_data).
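
Because the reference series are in kg/ha while `delivery_unit` stays `bu_acre`, any side-by-side comparison needs a unit conversion based on `bushel_weight_lbs` (60.0). The sketch below shows that arithmetic. The function names are hypothetical, and where (or whether) the pipeline performs this conversion is not stated in the config.

```python
LB_TO_KG = 0.45359237        # exact, by definition of the international pound
ACRE_TO_HA = 0.40468564224   # exact for the international acre


def kg_per_ha_to_bu_per_acre(kg_per_ha, bushel_weight_lbs=60.0):
    """Convert a yield in kg/ha to bu/acre for a given bushel weight."""
    kg_per_bushel = bushel_weight_lbs * LB_TO_KG
    return kg_per_ha * ACRE_TO_HA / kg_per_bushel


def bu_per_acre_to_kg_per_ha(bu_per_acre, bushel_weight_lbs=60.0):
    """Inverse conversion: bu/acre back to kg/ha."""
    kg_per_bushel = bushel_weight_lbs * LB_TO_KG
    return bu_per_acre * kg_per_bushel / ACRE_TO_HA


# The PAM range quoted for yield_range: 800 and 6500 kg/ha.
for kg_ha in (800.0, 6500.0):
    print(kg_ha, round(kg_per_ha_to_bu_per_acre(kg_ha), 2))
```

At 60 lb per bushel, 1 bu/acre is about 67.25 kg/ha, which is why the configured `yield_range` of `[10.0, 100.0]` bu/acre brackets the PAM range.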

Forecast paths¶

Field	Value
`raw_obs_filepath`	`data/weather/areal_aggregation/brazil_prod_adm2_clean.zarr`
`materialised_climo_filepath`	`data/weather/climatology/brazil_prod_adm2_baseline_1980_2025_w31_materialised_clean.zarr`

Both are local relative paths (resolved under `data_root`), not S3 templates.

What makes this config distinctive¶

This is the only non-US config in the pipeline. It combines three structural departures from the US configs:

(1) a southern-hemisphere cross-year season (Sep–Mar), requiring `season_start_year_offset: -1` with a Sep-1 epoch rather than Oct-1;
(2) an IBGE-PAM yields parquet instead of NASS, with `actuals_source_short: IBGE` labels throughout metrics and plots;
(3) dual CONAB reference series (`conab_levantamento` + `conab_final`) in units of `kg_per_ha` with an Oct-1 cutoff.

It also carries the only non-`TFFS_V0` public model name (`TFFS_BRA_SOJA_V0`), and the `climo` builder requires `geo_id_col: "identifier"` rather than the US-standard `"geoid"`. The feature set was explicitly trimmed from the US candidate list because zero-variance climatology in tropical Brazil produces NaN EDD z-scores and ±inf values in `dry_days_zscore_gstd`.