Climatology Materialisation
What it is
When the forecast pipeline runs, it must produce a full daily weather-index series
covering the growing season from season_start_date to harvest_date. For dates
after the forecast init_date, observed weather is not yet available. The pipeline
fills those future dates by splicing in materialised climatology — pre-computed
long-run daily statistics (mean, standard deviation) for each location and day-of-year.
This mechanism has two components:
- `materialise_forecast_indices` (`features/forecast_weather.py:43`): the function that orchestrates the splice for a single `(season_year, init_date)` pair.
- `materialised_climo_filepath` (`config.py:593`): the `ResolvablePath` field on `ForecastConfig` that points to the pre-built climatology zarr.
A third component was added in PR-369: a long-range stub that fires when the
climatology zarr does not cover all calendar years needed by the requested
season_year.
Where it lives
| Symbol | File | Line |
|---|---|---|
| `materialise_forecast_indices` | `features/forecast_weather.py` | 43 |
| `ForecastConfig.materialised_climo_filepath` | `config.py` | 593 |
| `ForecastConfig.raw_obs_filepath` | `config.py` | 592 |
| Long-range stub | `features/forecast_long_range_stub.py` | (added PR-369) |
materialise_forecast_indices step by step
- Guard: raises `ValueError` if `config.forecast is None`; the function is only valid in the forecast pipeline context.
- Season bounds: computes `season_start_date` and `harvest_date` from the commodity config for the given `season_year`.
- Obs slice: opens `config.forecast.raw_obs_filepath` as an xarray zarr and slices to `[season_start_date, init_date]`, the observed portion.
- Climatology: calls `materialise_for_forecast(str(config.forecast.materialised_climo_filepath), start=..., end=...)` (from `tf_data_ml_utils.weather.stages.climatology`). This function reads the pre-built climo zarr and returns a dataset covering `[season_start_date, harvest_date]`.
- Splice: if the obs slice is non-empty, `extend_with_climatology(obs_slice, materialised_climo=mat, end=str(harvest_date))` merges the two; if the obs slice is empty (e.g. a pure future forecast with no observed dates in the season), the climatology mean is used directly (`mat[shared].sel(statistic="mean", drop=True)`).
- Index computation: computes `daily_gdd`, `daily_edd`, and `daily_precip` from the extended dataset using `tf_data_ml_utils.weather.stages.indices.indices.compute`.
- Write: packs the three index DataArrays into a dataset with the `conus_adm2_corn.zarr` root-attrs schema and writes to `results.indices_zarr` (i.e. `run_dir/forecast/{season_year}/{init_date}/indices.zarr`).
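The splice described in the steps above can be illustrated with a small, dependency-free sketch. It is not the pipeline's implementation, which works on xarray datasets through `extend_with_climatology`. The function name `splice_obs_with_climo`, the dict-based inputs, and the day-of-year keying of the climatology are all assumptions made for illustration.

```python
from datetime import date, timedelta


def splice_obs_with_climo(
    obs: dict[date, float],
    climo_mean_by_doy: dict[int, float],
    season_start: date,
    init_date: date,
    harvest: date,
) -> dict[date, float]:
    """Return one value per day in [season_start, harvest].

    Hypothetical helper: observed values are used up to and including
    init_date where available; the climatological mean for the
    day-of-year fills every other day.
    """
    series: dict[date, float] = {}
    day = season_start
    while day <= harvest:
        if day <= init_date and day in obs:
            series[day] = obs[day]
        else:
            series[day] = climo_mean_by_doy[day.timetuple().tm_yday]
        day += timedelta(days=1)
    return series


# Made-up inputs: three observed days, then climatology until harvest.
season_start, init_date, harvest = date(2025, 5, 1), date(2025, 5, 3), date(2025, 5, 6)
obs = {date(2025, 5, 1): 12.0, date(2025, 5, 2): 14.5, date(2025, 5, 3): 13.0}
climo = {doy: 15.0 for doy in range(1, 367)}

spliced = splice_obs_with_climo(obs, climo, season_start, init_date, harvest)
pure_future = splice_obs_with_climo({}, climo, season_start, init_date, harvest)
```

With an empty `obs` mapping the result is the climatological mean for every day, which corresponds to the pure-future branch of the splice step.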
The three indices are defined as a module-level tuple at forecast_weather.py:29–33:
_CORN_DAILY_INDICES: tuple[tuple[str, str, dict[str, float]], ...] = (
    ("daily_gdd", "gdd", {"thresh": 10.0}),
    ("daily_edd", "edd", {"thresh": 29.0}),
    ("daily_precip", "precip", {}),
)
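To make the thresholds concrete, here is a hedged sketch of how degree-day indices with `thresh` values of 10.0 and 29.0 are commonly computed. The real formulas live in `tf_data_ml_utils.weather.stages.indices` and are not reproduced on this page. The helper names, the inputs (`tmean_c`, `tmax_c`, `precip_mm`), and the choice of which temperature feeds each index are assumptions for illustration only.

```python
def degree_days(temp_c: float, thresh: float) -> float:
    """Degrees above `thresh`, floored at zero (a common degree-day form)."""
    return max(temp_c - thresh, 0.0)


# Illustrative spec with the same shape as _CORN_DAILY_INDICES:
# (output name, index kind, keyword parameters).
INDEX_SPECS = (
    ("daily_gdd", "gdd", {"thresh": 10.0}),
    ("daily_edd", "edd", {"thresh": 29.0}),
    ("daily_precip", "precip", {}),
)


def compute_daily_indices(tmean_c: float, tmax_c: float, precip_mm: float) -> dict[str, float]:
    """Dispatch on the index kind; the input choice per kind is an assumption."""
    out: dict[str, float] = {}
    for out_name, kind, params in INDEX_SPECS:
        if kind == "gdd":
            out[out_name] = degree_days(tmean_c, **params)
        elif kind == "edd":
            out[out_name] = degree_days(tmax_c, **params)
        else:  # precip passes through unchanged
            out[out_name] = precip_mm
    return out


day = compute_daily_indices(tmean_c=24.0, tmax_c=32.5, precip_mm=4.2)
```

Keeping the parameters in the spec tuple means a different commodity could swap thresholds without touching the dispatch loop.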
ForecastConfig.materialised_climo_filepath
class ForecastConfig(BaseModel):
    raw_obs_filepath: ResolvablePath
    materialised_climo_filepath: ResolvablePath
    residual_mode: ResidualMode
    init_date: date | None = None
Both raw_obs_filepath and materialised_climo_filepath are ResolvablePath fields
(config.py:592–593), so they are resolved against data_root at config-load time
and preflight-checked for existence before any stage runs. They are required only on
the forecast code path; in hindcast mode config.forecast is None.
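The resolve-then-preflight behaviour can be pictured with the sketch below. It does not show how `ResolvablePath` is implemented. Both helper functions are hypothetical, only local paths are handled (remote URIs are out of scope here), and the temporary directory stands in for `data_root`.

```python
import tempfile
from pathlib import Path


def resolve_against_root(raw: str, data_root: Path) -> Path:
    """Hypothetical: absolute paths pass through, relative ones join data_root."""
    path = Path(raw)
    return path if path.is_absolute() else data_root / path


def missing_inputs(paths: dict[str, Path]) -> list[str]:
    """Return the names of configured inputs that do not exist on disk."""
    return [name for name, path in paths.items() if not path.exists()]


with tempfile.TemporaryDirectory() as tmp:
    data_root = Path(tmp)
    (data_root / "climo.zarr").mkdir()  # zarr stores are directories

    resolved = {
        "materialised_climo_filepath": resolve_against_root("climo.zarr", data_root),
        "raw_obs_filepath": resolve_against_root("obs.zarr", data_root),
    }
    # A preflight check would fail the run here, before any stage starts.
    missing = missing_inputs(resolved)
```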
The DESIGN.md forecast pipeline isolation clause states:
"WHEN the forecast pipeline runs, the system SHALL treat canonical hindcast artefacts as read-only reference data. External sources (`raw_obs_filepath`, `materialised_climo_filepath`) are consumed exclusively in feature-creation stages, baking outputs into `run_dir/forecast/{init_date}/`. All subsequent stages read only from that subdirectory."
The baked output is indices.zarr inside run_dir/forecast/{season_year}/{init_date}/
(layout changed from {init_date}/ to {season_year}/{init_date}/ in PR-369).
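As a quick illustration of the post-PR-369 layout, the sketch below assembles the output location from its parts. The helper `forecast_output_dir` is hypothetical, and rendering `init_date` as an ISO date (`YYYY-MM-DD`) is an assumption; this page does not state the format used on disk.

```python
from datetime import date
from pathlib import PurePosixPath


def forecast_output_dir(run_dir: str, season_year: int, init_date: date) -> PurePosixPath:
    """Build run_dir/forecast/{season_year}/{init_date}/ (hypothetical helper)."""
    # Assumption: init_date is rendered as an ISO date.
    return PurePosixPath(run_dir) / "forecast" / str(season_year) / init_date.isoformat()


out_dir = forecast_output_dir("runs/example", 2027, date(2026, 8, 1))
indices_zarr = out_dir / "indices.zarr"
```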
Long-range stub (PR-369)
The climatology zarr is built from observed historical data and its year coordinate
stops at the latest observed season year. When a future season_year requires calendar
years not yet in the zarr, materialise_for_forecast emits zero rows and the feature
builder fails.
PR-369 added features/forecast_long_range_stub.py to handle this case. The stub:
- Computes `needed_cal_years`, the set of calendar years touched by the target `season_year` (e.g. `{2026, 2027}` for US wheat season_year 2027).
- Reads `available_cal_years` from the zarr's `year` coordinate.
- If `needed_cal_years.issubset(available_cal_years)`, returns immediately; the normal climo builder handles it.
- Otherwise, emits three `logger.warning` lines listing the missing years and falls back to a panel trailing-median imputation of the missing z-score feature rows.
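The coverage check at the heart of the stub can be sketched as follows. This is an illustration under assumptions, not the module's code: the function names and the season dates are hypothetical, and a single warning is logged here where the stub emits three.

```python
import logging
from datetime import date

logger = logging.getLogger("forecast_long_range_stub_sketch")


def calendar_years_touched(season_start: date, harvest: date) -> set[int]:
    """Every calendar year between season start and harvest, inclusive."""
    return set(range(season_start.year, harvest.year + 1))


def needs_stub(needed_cal_years: set[int], available_cal_years: set[int]) -> bool:
    """True when the climatology does not cover all needed calendar years."""
    if needed_cal_years.issubset(available_cal_years):
        return False  # the normal climo builder can handle this season
    missing = sorted(needed_cal_years - available_cal_years)
    logger.warning("climatology zarr is missing calendar years: %s", missing)
    return True


# Hypothetical winter-wheat style season spanning two calendar years.
needed = calendar_years_touched(date(2026, 9, 15), date(2027, 7, 15))
available = set(range(1990, 2026))  # year coordinate ends at 2025
use_stub = needs_stub(needed, available)
```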
Panel trailing-median imputation method
The stub delegates to impute_missing_panel_columns (generalised in PR-369 from the
existing area-fill imputer). The method:
- For each county (`geo_identifier`) and each missing feature column, computes the trailing 3-year median of that column from the canonical hindcast `pred.parquet` feature matrix.
- The imputed value is bounded to the county's historical `[min, max]`. If the bound is violated, a `ValueError` is raised naming the violating rows; this is always a bug, not a data anomaly to absorb silently.
- Method dispatch supports `"trailing_median"`, `"trailing_mean"`, and `"zero"`. The stub uses `"trailing_median"` by default.
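A minimal sketch of the trailing-median method for one county and one column is shown below. It is not `impute_missing_panel_columns` itself: the helper name, the dict-based history, and the reading of "trailing 3-year" as the three most recent years before the target are assumptions.

```python
from statistics import median


def trailing_median_impute(
    history: dict[int, float], target_year: int, window: int = 3
) -> float:
    """Impute one county/column value from its own history (hypothetical helper)."""
    prior_years = sorted(year for year in history if year < target_year)[-window:]
    if not prior_years:
        raise ValueError(f"no history before {target_year}")

    value = median(history[year] for year in prior_years)

    # Bound check: a median of historical values can never leave [min, max],
    # so a violation would indicate a bug rather than unusual data.
    lo, hi = min(history.values()), max(history.values())
    if not lo <= value <= hi:
        raise ValueError(f"imputed value {value} outside historical range [{lo}, {hi}]")
    return value


# One county's z-score history for a single feature column (made-up numbers).
county_history = {2021: -0.4, 2022: 0.9, 2023: 0.1, 2024: 0.3}
imputed = trailing_median_impute(county_history, target_year=2027)
```

Here the three trailing years are 2022 to 2024, so the imputed value is the median of `0.9`, `0.1`, and `0.3`.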
Why long-range forecasts collapse to trend-only
For an init_date that falls before the target season_year's season start,
season_doy ≤ 1 and the commodity's season_doy_weather_weight schedule evaluates to
w = 0. The regression prediction is therefore yield = trend(year, county) regardless
of the imputed z-score feature values. This is a deliberate model design choice, not a
limitation of the stub.
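The collapse can be seen in a toy version of the prediction. The additive form and the linear ramp below are assumptions chosen for illustration; the actual `season_doy_weather_weight` schedule and regression are not shown on this page, and `ramp_days` is a hypothetical parameter.

```python
def weather_weight(season_doy: int, ramp_days: int = 60) -> float:
    """Hypothetical schedule: 0 at or before season start, ramping linearly to 1."""
    if season_doy <= 1:
        return 0.0
    return min((season_doy - 1) / ramp_days, 1.0)


def predict_yield(trend: float, weather_effect: float, season_doy: int) -> float:
    """Assumed additive form: trend plus a weighted weather contribution."""
    return trend + weather_weight(season_doy) * weather_effect


# Before the season starts, the weather term is multiplied by zero, so any
# imputed z-score values feeding weather_effect cannot move the prediction.
long_range_a = predict_yield(trend=180.0, weather_effect=-12.5, season_doy=-30)
long_range_b = predict_yield(trend=180.0, weather_effect=40.0, season_doy=-30)
in_season = predict_yield(trend=180.0, weather_effect=-12.5, season_doy=31)
```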
The long-range stub is explicitly temporary. Its module docstring documents removal criteria: delete it when the climo zarr is extended to cover the required forecast horizon and no caller imports from the module.
Key invariants
- `materialised_climo_filepath` and `raw_obs_filepath` are both `ResolvablePath` fields and are preflight-checked before any forecast stage runs.
- The output zarr schema mirrors `conus_adm2_corn.zarr` so downstream builders can read it unchanged; this is enforced via `_pack_indices_dataset` in `forecast_weather.py:108`.
- The climo splice uses `statistic="mean"` for pure-future forecasts; for mixed observed/future seasons it uses `extend_with_climatology`, which blends observed and climatological values at the `init_date` boundary.
How it interacts with the pipeline
materialise_forecast_indices is called by stages/run_forecast.py as the first step
of the forecast feature build. Its output (indices.zarr) is consumed by the weather
and stress feature builders in the same stage. All subsequent stages (run_predict,
run_meta_models, run_deliver) read from the baked run_dir/forecast/{season_year}/{init_date}/
subtree and never touch the canonical materialised_climo_filepath again (DESIGN.md
forecast isolation clause).
Pitfalls
- If `config.forecast` is `None` (hindcast mode), calling `materialise_forecast_indices` raises immediately; the guard at line 54 prevents silent misconfiguration.
- The obs slice uses `xr.open_zarr(str(config.forecast.raw_obs_filepath))`; the `str()` conversion is required because xarray does not accept `CloudPath` objects (same rule as polars; see s3_path_safety).
- The long-range stub fires automatically based on zarr coverage, with no explicit flag. Callers that expect normal climo behaviour will get silently degraded stub output for future season years; the three `WARNING` log lines are the only signal.
Related entities and concepts
- s3_path_safety: `materialised_climo_filepath` is a `ResolvablePath`; S3 URIs are valid
- RunDir: output zarr lands under `run_dir/forecast/{season_year}/{init_date}/`
- PR-369: restructured forecast layout; added long-range stub
- DESIGN.md: forecast isolation clause
PRs
| PR | Relevance |
|---|---|
| PR-369 | Forecast path restructure; long-range stub; panel imputer generalisation |
Open questions
- The long-range stub only handles z-score weather features; `stress_score` has a bounded range and was explicitly excluded from PR-369 with a note that it needs explicit per-column method choices before a stub can be wired in.
- The `_CORN_DAILY_INDICES` tuple is corn-specific despite living in a module called `forecast_weather.py`. Other commodities (soybeans, wheat, cotton) may need their own index specs or a registry-based dispatch similar to the feature builder pattern.