Pipeline: Forecast

Purpose

The FORECAST pipeline issues a point-in-time yield forecast for a single (season_year, init_date) pair against an already-fitted run_dir. It builds weather-spliced features, scores them using the production model, postprocesses with the calibrated bias corrector and conformal CI bands, and writes ADM0/ADM1/ADM2 delivery CSVs. It is entirely separate from the hindcast walk-forward CV loop and never writes to canonical hindcast artefacts.

Inputs

| Artefact | Path | Role |
|---|---|---|
| run_dir/ | caller-supplied | Completed hindcast or fit-production run |
| config_resolved.yaml | run_dir/ | ExperimentConfig including mandatory forecast.residual_mode |
| Production model | models/{experiment_key}/production/ | Detrender + imputer + regressor |
| CV-fold walk_forward_preds | preds/{experiment_key}/{fold_label}/ | Required for OOS residual modes |
| production/train_preds.parquet | preds/{experiment_key}/ | Required for in_sample_pooled |
| conformal/{mode}.parquet | run_dir/conformal/ | CI calibration sidecar |
| Raw weather observations | config.forecast.raw_obs_path | Daily obs up to init_date |
| Climatology zarr | config.forecast.materialised_climo_path | Historical indices for climo splice |
| Canonical pred.parquet | {features_dir}/{experiment_key}/pred.parquet | Historical area for _impute_forecast_area (read-only) |

Outputs

All artefacts land under run_dir/forecast/{season_year}/{init_date}/ (path restructured by PR #369):

| Artefact | Path |
|---|---|
| indices.zarr | forecast/{season_year}/{init_date}/ |
| features/pred.parquet | forecast/{season_year}/{init_date}/features/ |
| preds/{experiment_key}/production/walk_forward_preds.parquet | forecast/{season_year}/{init_date}/ |
| postprocessed/national.parquet | forecast/{season_year}/{init_date}/postprocessed/ |
| bias_corrector.pkl | forecast/{season_year}/{init_date}/ |
| Treefera_{key}_ADM0_Forecast_{init_date}.csv | forecast/{season_year}/{init_date}/delivery/ |
| Treefera_{key}_ADM1_Forecast_{init_date}.csv | forecast/{season_year}/{init_date}/delivery/ |
| Treefera_{key}_ADM2_Forecast_{init_date}.csv | forecast/{season_year}/{init_date}/delivery/ |

Step-by-step

1. Entry — run() (run_forecast.py:143)

def run(run_dir, *, season_year: int, init_date: date, force: bool = False) -> None

The CLI invokes this as cli run forecast --run-dir <path> --season-year <Y> --init-date <D>. The --config shortcut was removed by PR #372.

2. validate_residual_mode (run_forecast.py:91)

This is the first call in run(). It inspects the on-disk run_dir and the YAML's forecast.residual_mode and fails fast if the two are incompatible, before any feature build or artefact write begins:

| Condition | Outcome |
|---|---|
| No CV folds and no production fold | FileNotFoundError: run make hindcast |
| OOS mode + no CV folds | FileNotFoundError: run hindcast or switch to in_sample_pooled |
| in_sample_pooled + no production fold | FileNotFoundError: run cli run fit-production |
| Otherwise | pass |

forecast.residual_mode is mandatory on ForecastConfig since PR #372; there is no default.
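
The decision table above can be sketched as a small pure function. This is an illustration only, not the code in run_forecast.py: the function name check_residual_mode, its boolean inputs, and the message wording are assumptions. The sketch also assumes that every mode other than in_sample_pooled counts as an OOS mode.

```python
def check_residual_mode(
    residual_mode: str,
    *,
    has_cv_folds: bool,
    has_production_fold: bool,
) -> None:
    """Hypothetical stand-in for the compatibility check described above.

    Returns None when the run_dir is compatible, otherwise raises
    FileNotFoundError with a hint about the next command to run.
    """
    # Row 1: nothing has been fitted at all.
    if not has_cv_folds and not has_production_fold:
        raise FileNotFoundError(
            "no CV folds and no production fold: run `make hindcast`"
        )

    # Assumption: any mode other than in_sample_pooled is an OOS mode.
    is_oos = residual_mode != "in_sample_pooled"

    # Row 2: OOS residuals need the CV-fold predictions.
    if is_oos and not has_cv_folds:
        raise FileNotFoundError(
            "OOS residual mode needs CV folds: run hindcast "
            "or switch to in_sample_pooled"
        )

    # Row 3: pooled in-sample residuals need the production fold.
    if not is_oos and not has_production_fold:
        raise FileNotFoundError(
            "in_sample_pooled needs a production fold: "
            "run `cli run fit-production`"
        )
    # Row 4: otherwise pass.
```

Keeping the check free of side effects is what makes it cheap enough to run before everything else.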

3. run_features() — Forecast feature build (run_forecast.py:167)

Skip-if-exists guard: when force=False and both indices_zarr and features_parquet already exist, the build is a no-op (logged at INFO).

Otherwise:

  1. run_preflight(preflight_paths_for_forecast_features(config)) — checks required input paths.
  2. ForecastSlice(run_dir, ..., season_year, init_date) — the single source of truth for all artefact paths under forecast/{season_year}/{init_date}/.
  3. materialise_forecast_indices(config, results) — builds the daily weather indices zarr at results.indices_zarr. Splices real observations up to init_date with climatology analogue beyond it.
  4. _build_forecast_features(config, results) (run_forecast.py:264) — overrides the weather builder to read results.indices_zarr, narrows the year window to season_year, calls the long-range stubs (synthesise_long_range_climo_for_unseen_years, synthesise_long_range_stress_for_unseen_years) for years beyond zarr coverage, then build_features(forecast_cfg, output_dir=results.features_dir).
  5. _impute_forecast_area(config, results) — fills NaN area_harvested_ha via 3-year trailing median from the canonical pred.parquet (read-only reference).
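
The trailing-median fill in the last step can be illustrated with plain Python. This is a sketch under stated assumptions, not the body of _impute_forecast_area: the helper name impute_area and the dict-based input are hypothetical, and "3-year trailing" is read here as the three most recent non-missing years strictly before season_year.

```python
from statistics import median
from typing import Optional


def impute_area(
    history: dict[int, Optional[float]],
    season_year: int,
    window: int = 3,
) -> float:
    """Median of the last `window` observed areas before `season_year`.

    `history` maps year -> area_harvested_ha (None where missing) for one
    region; in the pipeline these values would come from the canonical
    pred.parquet, which is only read, never written.
    """
    # Keep observed values from years strictly before the forecast season,
    # ordered oldest to newest.
    prior = [
        history[year]
        for year in sorted(history)
        if year < season_year and history[year] is not None
    ]
    if not prior:
        raise ValueError(f"no historical area before {season_year}")
    return median(prior[-window:])


# Usage: 2022 is missing, so the window is 2020, 2021 and 2023.
area = impute_area({2020: 100.0, 2021: 120.0, 2022: None, 2023: 110.0}, 2024)
```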

4. run_predict() — Predict + postprocess + deliver (run_forecast.py:221)

  1. run_preflight(preflight_paths_for_forecast_predict(...)).
  2. ForecastSlice(...) — path handle.
  3. run_predict_stage(run_dir, season_year=season_year, init_date=init_date) — alias to stages.run_predict.run_predict; loads the production model, scores the forecast feature parquet, writes walk_forward_preds.parquet under the production fold path.
  4. _postprocess_forecast(experiment, results) — single-production-fold postprocess: area-weights to ADM0, fits and saves the bias corrector, calls primary_calibration to load the CI sidecar, calls build_rows to assemble national rows with bias and CI columns, writes results.postprocessed_national_path.
  5. _deliver_forecast(experiment, results) — for each of ("ADM0", "ADM1", "ADM2"): calls walk_forward_preds_to_delivery_rows(mode="forecast"), validates through HindcastDelivery, serialises to a Polars DataFrame, applies post-transforms, writes Treefera_{key}_{level}_Forecast_{init_date}.csv.
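
The "area-weights to ADM0" part of the postprocess step amounts to a weighted mean of sub-national yields. The following is a minimal sketch assuming each sub-national row carries a predicted yield and a harvested area; the function name and the tuple layout are hypothetical and do not mirror _postprocess_forecast.

```python
def area_weighted_yield(rows: list[tuple[float, float]]) -> float:
    """Aggregate (predicted_yield, area_harvested_ha) pairs to one value.

    Each region contributes in proportion to its harvested area, so the
    result is total production divided by total area.
    """
    total_area = sum(area for _, area in rows)
    if total_area <= 0:
        raise ValueError("total harvested area must be positive")
    total_production = sum(yield_ * area for yield_, area in rows)
    return total_production / total_area


# Usage: a small region at 2.0 and a larger region at 4.0.
national = area_weighted_yield([(2.0, 100.0), (4.0, 300.0)])
```

This is also why imputing missing area_harvested_ha matters upstream: a NaN weight would propagate into the national figure.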

Mermaid Flow

flowchart TD
    CLI["cli run forecast\n--run-dir --season-year --init-date"]
    ENTRY["run()\nrun_forecast.py:143"]
    GUARD["validate_residual_mode()\nrun_forecast.py:91\nfail-fast before any I/O"]
    FE["run_features()\nrun_forecast.py:167"]
    IDXZ["materialise_forecast_indices()\n→ indices.zarr"]
    STUB["long-range climo/stress stubs\nforecast_long_range_stub.py"]
    BFEAT["build_features(forecast_cfg)\n→ features/pred.parquet"]
    AREA["_impute_forecast_area()\ntrailing-3yr median\nfrom canonical pred.parquet\n(read-only)"]
    RP["run_predict()\nrun_forecast.py:221"]
    PRED["run_predict_stage()\nstages/run_predict.py\n→ walk_forward_preds.parquet"]
    POST["_postprocess_forecast()\naggregate ADM0 + bias_corrector\n+ primary_calibration\n→ postprocessed/national.parquet"]
    DEL["_deliver_forecast()\nADM0/ADM1/ADM2 CSVs\n→ forecast/{sy}/{id}/delivery/"]

    CLI --> ENTRY
    ENTRY --> GUARD
    GUARD --> FE
    FE --> IDXZ
    IDXZ --> STUB
    STUB --> BFEAT
    BFEAT --> AREA
    AREA --> RP
    RP --> PRED
    PRED --> POST
    POST --> DEL

    HINDCAST["Completed run_dir\n(production model + conformal sidecars)"]
    HINDCAST -. "read-only reference" .-> ENTRY

Invariants

Read-only-on-canonical-hindcast invariant (DESIGN.md:125): the forecast pipeline SHALL NOT write to canonical hindcast artefacts under {features_dir}/{experiment_key}/. _impute_forecast_area reads the canonical pred.parquet but writes only to results.features_parquet under the forecast sub-directory.

ForecastSlice.root as single path source of truth (lib/results/results_slice.py): all forecast artefact paths are derived from run_dir / 'forecast' / str(season_year) / f'{init_date:%Y-%m-%d}'. Hard-coding the path string anywhere else is a bug.
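
A path handle of this kind might look like the sketch below. The class name ForecastPaths is hypothetical and stands in for ForecastSlice, whose real attributes are not reproduced here; only the root expression and the sub-paths listed under Outputs are taken from this page.

```python
from dataclasses import dataclass
from datetime import date
from pathlib import Path


@dataclass(frozen=True)
class ForecastPaths:
    """Hypothetical stand-in for ForecastSlice: derives every path from root."""

    run_dir: Path
    season_year: int
    init_date: date

    @property
    def root(self) -> Path:
        # The single place where the directory layout is spelled out.
        return (
            self.run_dir
            / "forecast"
            / str(self.season_year)
            / f"{self.init_date:%Y-%m-%d}"
        )

    @property
    def indices_zarr(self) -> Path:
        return self.root / "indices.zarr"

    @property
    def features_parquet(self) -> Path:
        return self.root / "features" / "pred.parquet"

    @property
    def delivery_dir(self) -> Path:
        return self.root / "delivery"


# Usage: callers ask the handle for paths instead of formatting strings.
paths = ForecastPaths(Path("runs/demo"), 2025, date(2025, 6, 1))
```

Because every property goes through root, a layout change such as the one in PR #369 touches one expression rather than every call site.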

forecast.residual_mode is mandatory: there is no default value. A config YAML without this field will fail Pydantic validation before run() is entered.

validate_residual_mode must be the first call: PR #372 mandates this. Moving it after feature build would waste minutes of compute on an incompatible run_dir.

Production model fold resolution (run_predict.py:86–98): tries models/{experiment_key}/production/ first; falls back to models/{experiment_key}/{season_year}/. Raises FileNotFoundError if neither exists.
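
The fallback order can be sketched as follows. This is an illustration rather than the code at run_predict.py:86–98: the function name resolve_model_dir is hypothetical, and the sketch assumes that models/ sits directly under run_dir.

```python
from pathlib import Path


def resolve_model_dir(run_dir: Path, experiment_key: str, season_year: int) -> Path:
    """Return the first existing model directory, preferring production."""
    # Assumption: model folds live under run_dir/models/{experiment_key}/.
    base = Path(run_dir) / "models" / experiment_key
    candidates = (base / "production", base / str(season_year))
    for candidate in candidates:
        if candidate.is_dir():
            return candidate
    tried = ", ".join(str(c) for c in candidates)
    raise FileNotFoundError(f"no model fold found; tried: {tried}")
```

Listing the attempted paths in the error keeps the failure actionable, in line with the other errors raised by this pipeline.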

Failure Modes

  • validate_residual_mode raises: the most likely cause is invoking make forecast without a prior make hindcast or cli run fit-production. The error message is actionable and includes the exact next command.
  • Long-range climo stub emits warnings: three WARNING log lines appear when season_year is beyond the zarr's coverage. This is normal behaviour; the stub fills with trailing medians and the forecast collapses to trend-only (see multi_year_forecast.md).
  • _impute_forecast_area FileNotFoundError: canonical pred.parquet is absent (features have not been built for the hindcast). The forecast must reference the historical feature matrix.
  • Skip-if-exists reuse of stale artefacts: if force=False and an earlier forecast for the same (season_year, init_date) exists, the feature build is skipped. This is correct for reruns but will not pick up updated raw-weather files. Pass force=True to rebuild.
  • OOS calibration empty rows: CalibrationResult.to_frame() now raises with diagnostic context (PR #372 defensive guard) instead of a bare KeyError: 'fold_year'.
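
The skip-if-exists behaviour behind the stale-artefact failure mode above reduces to a small predicate. The sketch below is an assumption about its shape, not the code in run_features(); the function name needs_rebuild is hypothetical, while the two path arguments follow the indices_zarr and features_parquet names used in step 3.

```python
from pathlib import Path


def needs_rebuild(indices_zarr: Path, features_parquet: Path, *, force: bool) -> bool:
    """Return True when the forecast feature build should run.

    The build is skipped only when force is False and both artefacts
    already exist. File contents and timestamps are not inspected, which
    is why updated raw-weather files are not picked up without force=True.
    """
    if force:
        return True
    return not (indices_zarr.exists() and features_parquet.exists())
```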

See Also

  • ForecastSlice — the path-handle entity for each (season_year, init_date) artefact subtree; single source of truth for all forecast paths

Cross-references

  • stages.md: run_forecast.py full function signatures
  • delivery.md: walk_forward_preds_to_delivery_rows, DeliveryRow, HindcastDelivery
  • PR-369.md: path restructure forecast/{init_date}/ → forecast/{season_year}/{init_date}/
  • PR-372.md: forecast.residual_mode mandatory + validate_residual_mode gate
  • multi_year_forecast.md: multi-season_year orchestration and long-range stub
  • deliver.md: hindcast (CV-fold) delivery, parallel code path

PRs

  • PR #369 — restructured forecast artefact paths from forecast/{init_date}/ to forecast/{season_year}/{init_date}/ and introduced the long-range climo stub.
  • PR #372 — made forecast.residual_mode mandatory and added validate_residual_mode as the first call in run().