Climatology Materialisation

What it is

When the forecast pipeline runs, it must produce a full daily weather-index series covering the growing season from season_start_date to harvest_date. For dates after the forecast init_date, observed weather is not yet available. The pipeline fills those future dates by splicing in materialised climatology — pre-computed long-run daily statistics (mean, standard deviation) for each location and day-of-year.

This mechanism has two components:

  1. materialise_forecast_indices (features/forecast_weather.py:43) — the function that orchestrates the splice for a single (season_year, init_date) pair.
  2. materialised_climo_filepath (config.py:593) — the ResolvablePath field on ForecastConfig that points to the pre-built climatology zarr.

A third component was added in PR-369: a long-range stub that fires when the climatology zarr does not cover all calendar years needed by the requested season_year.

Where it lives

| Symbol | File | Line |
| --- | --- | --- |
| materialise_forecast_indices | features/forecast_weather.py | 43 |
| ForecastConfig.materialised_climo_filepath | config.py | 593 |
| ForecastConfig.raw_obs_filepath | config.py | 592 |
| Long-range stub | features/forecast_long_range_stub.py | (added PR-369) |

materialise_forecast_indices step by step

features/forecast_weather.py:43
  1. Guard: raises ValueError if config.forecast is None — the function is only valid in the forecast pipeline context.
  2. Season bounds: computes season_start_date and harvest_date from the commodity config for the given season_year.
  3. Obs slice: opens config.forecast.raw_obs_filepath as an xarray zarr and slices to [season_start_date, init_date] — the observed portion.
  4. Climatology: calls materialise_for_forecast(str(config.forecast.materialised_climo_filepath), start=..., end=...) (from tf_data_ml_utils.weather.stages.climatology). This function reads the pre-built climo zarr and returns a dataset covering [season_start_date, harvest_date].
  5. Splice: if the obs slice is non-empty, extend_with_climatology(obs_slice, materialised_climo=mat, end=str(harvest_date)) merges the two; if the obs slice is empty (e.g. a pure future forecast with no observed dates in the season), the climatology mean is used directly (mat[shared].sel(statistic="mean", drop=True)).
  6. Index computation: computes daily_gdd, daily_edd, and daily_precip from the extended dataset using tf_data_ml_utils.weather.stages.indices.indices.compute.
  7. Write: packs the three index DataArrays into a dataset with the conus_adm2_corn.zarr root-attrs schema and writes to results.indices_zarr (i.e. run_dir/forecast/{season_year}/{init_date}/indices.zarr).
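
The obs slice and splice in steps 3 to 5 can be illustrated with plain Python containers. This is only a sketch: the real code operates on xarray datasets through extend_with_climatology, and the names splice_obs_with_climo, obs_by_date, and climo_mean_by_doy are hypothetical. Filling gaps inside the observed window from climatology is also an assumption of the sketch, not confirmed behaviour.

```python
from datetime import date, timedelta


def splice_obs_with_climo(
    obs_by_date: dict[date, float],
    climo_mean_by_doy: dict[int, float],
    season_start: date,
    init_date: date,
    harvest: date,
) -> dict[date, float]:
    """Return one value per day in [season_start, harvest].

    Days up to and including init_date use the observation when one exists
    (assumption: gaps fall back to climatology); later days always use the
    climatological mean for that day-of-year.
    """
    series: dict[date, float] = {}
    day = season_start
    while day <= harvest:
        if day <= init_date and day in obs_by_date:
            series[day] = obs_by_date[day]
        else:
            series[day] = climo_mean_by_doy[day.timetuple().tm_yday]
        day += timedelta(days=1)
    return series


# Toy inputs: three observed days and a flat climatology of 20.0.
obs = {date(2025, 5, 1) + timedelta(days=i): 15.0 + i for i in range(3)}
climo = {doy: 20.0 for doy in range(1, 367)}

mixed = splice_obs_with_climo(
    obs, climo, date(2025, 5, 1), date(2025, 5, 3), date(2025, 5, 6)
)
# Pure-future case: init_date precedes the season, so every day is climatology.
pure_future = splice_obs_with_climo(
    {}, climo, date(2025, 5, 1), date(2025, 4, 1), date(2025, 5, 6)
)
```

The pure-future call mirrors the empty-obs branch of step 5, where the climatology mean is used on its own.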

The three indices are defined as a module-level tuple at forecast_weather.py:29–33:

_CORN_DAILY_INDICES: tuple[tuple[str, str, dict[str, float]], ...] = (
    ("daily_gdd", "gdd", {"thresh": 10.0}),
    ("daily_edd", "edd", {"thresh": 29.0}),
    ("daily_precip", "precip", {}),
)
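
To make the thresholds concrete, here is a minimal sketch of how the three indices could be derived for a single day. It assumes the common degree-day definition (daily mean temperature above the threshold, floored at zero); the actual formulas live in tf_data_ml_utils.weather.stages.indices and may differ, for example by using daily maximum temperature for EDD. The helper names degree_days and daily_indices are hypothetical.

```python
# Thresholds copied from _CORN_DAILY_INDICES; everything else is illustrative.
CORN_THRESHOLDS_C = {"daily_gdd": 10.0, "daily_edd": 29.0}


def degree_days(tmean_c: float, thresh: float) -> float:
    """Degrees by which the daily mean temperature exceeds thresh, never negative."""
    return max(tmean_c - thresh, 0.0)


def daily_indices(tmean_c: float, precip_mm: float) -> dict[str, float]:
    """Compute the three daily indices for one location-day (assumed formulas)."""
    out = {name: degree_days(tmean_c, t) for name, t in CORN_THRESHOLDS_C.items()}
    out["daily_precip"] = precip_mm  # precipitation passes through unchanged
    return out


hot_day = daily_indices(tmean_c=31.5, precip_mm=4.0)
cool_day = daily_indices(tmean_c=8.0, precip_mm=0.0)
```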

ForecastConfig.materialised_climo_filepath

class ForecastConfig(BaseModel):
    raw_obs_filepath: ResolvablePath
    materialised_climo_filepath: ResolvablePath
    residual_mode: ResidualMode
    init_date: date | None = None

Both raw_obs_filepath and materialised_climo_filepath are ResolvablePath fields (config.py:592–593), so they are resolved against data_root at config-load time and preflight-checked for existence before any stage runs. They are required only on the forecast code path; in hindcast mode config.forecast is None.
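
The resolve-then-preflight behaviour can be sketched for local paths as follows. This is not the ResolvablePath implementation: resolve_against_root and missing_inputs are hypothetical helpers, and the sketch ignores S3 URIs, which the real type also accepts.

```python
import tempfile
from pathlib import Path


def resolve_against_root(raw: str, data_root: Path) -> Path:
    """Join a relative path onto data_root; leave an absolute path unchanged."""
    candidate = Path(raw)
    return candidate if candidate.is_absolute() else data_root / candidate


def missing_inputs(resolved: dict[str, Path]) -> list[str]:
    """Names of config fields whose resolved path does not exist on disk."""
    return sorted(name for name, path in resolved.items() if not path.exists())


with tempfile.TemporaryDirectory() as tmp:
    data_root = Path(tmp)
    (data_root / "climo.zarr").mkdir()  # only the climatology store exists
    resolved = {
        "raw_obs_filepath": resolve_against_root("obs.zarr", data_root),
        "materialised_climo_filepath": resolve_against_root("climo.zarr", data_root),
    }
    missing = missing_inputs(resolved)
```

A preflight step built this way can report every missing input at once, before any stage has started writing output.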

The DESIGN.md forecast pipeline isolation clause states:

"WHEN the forecast pipeline runs, the system SHALL treat canonical hindcast artefacts as read-only reference data. External sources (raw_obs_filepath, materialised_climo_filepath) are consumed exclusively in feature-creation stages, baking outputs into run_dir/forecast/{init_date}/. All subsequent stages read only from that subdirectory."

The baked output is indices.zarr inside run_dir/forecast/{season_year}/{init_date}/ (layout changed from {init_date}/ to {season_year}/{init_date}/ in PR-369).
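
As a small illustration of the post-PR-369 layout, the output location could be assembled as below. The helper forecast_indices_path is hypothetical, and rendering init_date in ISO format is an assumption; the source only gives the {season_year}/{init_date}/ pattern.

```python
from datetime import date
from pathlib import Path


def forecast_indices_path(run_dir: Path, season_year: int, init_date: date) -> Path:
    """Build run_dir/forecast/{season_year}/{init_date}/indices.zarr."""
    # Assumption: init_date is written as YYYY-MM-DD in the directory name.
    return run_dir / "forecast" / str(season_year) / init_date.isoformat() / "indices.zarr"


example_path = forecast_indices_path(Path("runs/r1"), 2026, date(2025, 11, 1))
```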

Long-range stub (PR-369)

The climatology zarr is built from observed historical data and its year coordinate stops at the latest observed season year. When a future season_year requires calendar years not yet in the zarr, materialise_for_forecast emits zero rows and the feature builder fails.

PR-369 added features/forecast_long_range_stub.py to handle this case. The stub:

  1. Computes needed_cal_years — the set of calendar years touched by the target season_year (e.g. {2026, 2027} for US wheat season_year 2027).
  2. Reads available_cal_years from the zarr's year coordinate.
  3. If needed_cal_years.issubset(available_cal_years), returns immediately — the normal climo builder handles it.
  4. Otherwise, emits three logger.warning lines listing the missing years and falls back to a panel trailing-median imputation of the missing z-score feature rows.
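
The coverage check in steps 1 to 3 reduces to a set comparison, sketched below. The function names and the season dates are illustrative assumptions, and the sketch logs a single warning rather than reproducing the stub's three warning lines, whose wording is not given here.

```python
import logging
from datetime import date

logger = logging.getLogger("long_range_stub_sketch")


def needed_calendar_years(season_start: date, harvest: date) -> set[int]:
    """Every calendar year touched by the season, inclusive of both ends."""
    return set(range(season_start.year, harvest.year + 1))


def missing_calendar_years(needed: set[int], available: set[int]) -> set[int]:
    """Return the years the zarr lacks; an empty set means normal climo applies."""
    missing = needed - available
    if missing:
        logger.warning("climatology zarr lacks calendar years %s", sorted(missing))
    return missing


# Hypothetical wheat-like season that starts in autumn and is harvested next summer.
needed = needed_calendar_years(date(2026, 10, 1), date(2027, 7, 15))
covered = missing_calendar_years(needed, set(range(1991, 2028)))
gap = missing_calendar_years(needed, set(range(1991, 2027)))
```

An empty result corresponds to the early return in step 3; a non-empty result is the case that triggers the imputation fallback.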

Panel trailing-median imputation method

The stub delegates to impute_missing_panel_columns (generalised in PR-369 from the existing area-fill imputer). The method:

  • For each county (geo_identifier) and each missing feature column, computes the trailing 3-year median of that column from the canonical hindcast pred.parquet feature matrix.
  • The imputed value is bounded to the county's historical [min, max]. If the bound is violated, a ValueError is raised naming the violating rows — this is always a bug, not a data anomaly to absorb silently.
  • Method dispatch supports "trailing_median", "trailing_mean", and "zero". The stub uses "trailing_median" by default.
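
A minimal sketch of the "trailing_median" method for one county and one column follows. It is not the impute_missing_panel_columns implementation: the function name, the dict-based input, and the choice of "the three most recent years before the target" as the trailing window are assumptions, and the other dispatch methods are omitted.

```python
from statistics import median


def impute_trailing_median(
    history: dict[int, float], target_year: int, window: int = 3
) -> float:
    """Median of the last `window` values before target_year, bounds-checked."""
    prior = [history[y] for y in sorted(history) if y < target_year][-window:]
    if not prior:
        raise ValueError(f"no history before {target_year}")
    value = median(prior)
    lo, hi = min(history.values()), max(history.values())
    # A median of historical values cannot leave [lo, hi], so a violation
    # signals a bug upstream rather than unusual data.
    if not lo <= value <= hi:
        raise ValueError(f"imputed {value} outside historical range [{lo}, {hi}]")
    return value


# Toy z-score history for a single county and feature column.
county_history = {2020: -0.5, 2021: 0.2, 2022: 1.1, 2023: 0.4, 2024: -0.1}
imputed = impute_trailing_median(county_history, target_year=2026)
```

The bounds check can never trip for a correctly computed median, which is consistent with treating a violation as a bug rather than a data anomaly.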

Why long-range forecasts collapse to trend-only

For an init_date that falls before the target season_year's season start, season_doy ≤ 1 and the commodity's season_doy_weather_weight schedule evaluates to w = 0. The regression prediction is therefore yield = trend(year, county) regardless of the imputed z-score feature values. This is a deliberate model design choice, not a limitation of the stub.
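
The collapse can be seen in a toy version of the prediction. The additive form below (trend plus a weighted weather term) is an assumption made for illustration, as are the name predict_yield and the coefficient values; only the w = 0 behaviour is taken from the description above.

```python
def predict_yield(
    trend: float, z_scores: list[float], betas: list[float], w: float
) -> float:
    """Assumed form: trend plus the weather term scaled by the season weight w."""
    weather_term = sum(b * z for b, z in zip(betas, z_scores))
    return trend + w * weather_term


betas = [3.0, 5.0]  # hypothetical regression coefficients
# With w = 0 the imputed z-scores have no effect on the prediction.
pred_a = predict_yield(180.0, [0.4, -1.2], betas, w=0.0)
pred_b = predict_yield(180.0, [2.0, 2.0], betas, w=0.0)
```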

The long-range stub is explicitly temporary. Its module docstring documents removal criteria: delete it when the climo zarr is extended to cover the required forecast horizon and no caller imports from the module.

Key invariants

  • materialised_climo_filepath and raw_obs_filepath are both ResolvablePath fields and are preflight-checked before any forecast stage runs.
  • The output zarr schema mirrors conus_adm2_corn.zarr so downstream builders can read it unchanged — this is enforced via _pack_indices_dataset in forecast_weather.py:108.
  • The climo splice uses statistic="mean" directly for pure-future forecasts; for mixed observed/future seasons it calls extend_with_climatology, which keeps the observed values up to init_date and fills the remaining dates through harvest_date from climatology.

How it interacts with the pipeline

materialise_forecast_indices is called by stages/run_forecast.py as the first step of the forecast feature build. Its output (indices.zarr) is consumed by the weather and stress feature builders in the same stage. All subsequent stages (run_predict, run_meta_models, run_deliver) read from the baked run_dir/forecast/{season_year}/{init_date}/ subtree and never touch the canonical materialised_climo_filepath again (DESIGN.md forecast isolation clause).

Pitfalls

  • If config.forecast is None (hindcast mode), calling materialise_forecast_indices raises immediately — the guard at line 54 prevents silent misconfiguration.
  • The obs slice uses xr.open_zarr(str(config.forecast.raw_obs_filepath)) — the str() conversion is required because xarray does not accept CloudPath objects (same rule as polars; see s3_path_safety).
  • The long-range stub fires automatically based on zarr coverage, with no explicit flag. Callers that expect normal climo behaviour get degraded stub output for future season years without any exception being raised; the three WARNING log lines are the only signal.

Related

  • s3_path_safety: materialised_climo_filepath is a ResolvablePath; S3 URIs are valid
  • RunDir: output zarr lands under run_dir/forecast/{season_year}/{init_date}/
  • PR-369: restructured forecast layout; added long-range stub
  • DESIGN.md: forecast isolation clause

PRs

| PR | Relevance |
| --- | --- |
| PR-369 | Forecast path restructure; long-range stub; panel imputer generalisation |

Open questions

  • The long-range stub only handles z-score weather features; stress_score has a bounded range and was explicitly excluded from PR-369 with a note that it needs explicit per-column method choices before a stub can be wired in.
  • The _CORN_DAILY_INDICES tuple is corn-specific despite living in a module called forecast_weather.py. Other commodities (soybeans, wheat, cotton) may need their own index specs or a registry-based dispatch similar to the feature builder pattern.