Pipeline: Forecast

Purpose

The FORECAST pipeline issues a point-in-time yield forecast for a single (season_year, init_date) pair against an already-fitted run_dir. It builds weather-spliced features, scores them using the production model, postprocesses with the calibrated bias corrector and conformal CI bands, and writes ADM0/ADM1/ADM2 delivery CSVs. It is entirely separate from the hindcast walk-forward CV loop and never writes to canonical hindcast artefacts.

Inputs

| Artefact | Path | Role |
|---|---|---|
| `run_dir/` | caller-supplied | Completed hindcast or fit-production run |
| `config_resolved.yaml` | `run_dir/` | `ExperimentConfig` including mandatory `forecast.residual_mode` |
| Production model | `models/{experiment_key}/production/` | Detrender + imputer + regressor |
| CV-fold `walk_forward_preds` | `preds/{experiment_key}/{fold_label}/` | Required for OOS residual modes |
| `production/train_preds.parquet` | `preds/{experiment_key}/` | Required for `in_sample_pooled` |
| `conformal/{mode}.parquet` | `run_dir/conformal/` | CI calibration sidecar |
| Raw weather observations | `config.forecast.raw_obs_path` | Daily obs up to `init_date` |
| Climatology zarr | `config.forecast.materialised_climo_path` | Historical indices for climo splice |
| Canonical `pred.parquet` | `{features_dir}/{experiment_key}/pred.parquet` | Historical area for `_impute_forecast_area` (read-only) |

Outputs

All artefacts land under run_dir/forecast/{season_year}/{init_date}/ (path restructured by PR #369):

| Artefact | Path |
|---|---|
| `indices.zarr` | `forecast/{season_year}/{init_date}/` |
| `features/pred.parquet` | `forecast/{season_year}/{init_date}/features/` |
| `preds/{experiment_key}/production/walk_forward_preds.parquet` | `forecast/{season_year}/{init_date}/` |
| `postprocessed/national.parquet` | `forecast/{season_year}/{init_date}/postprocessed/` |
| `bias_corrector.pkl` | `forecast/{season_year}/{init_date}/` |
| `Treefera_{key}_ADM0_Forecast_{init_date}.csv` | `forecast/{season_year}/{init_date}/delivery/` |
| `Treefera_{key}_ADM1_Forecast_{init_date}.csv` | `forecast/{season_year}/{init_date}/delivery/` |
| `Treefera_{key}_ADM2_Forecast_{init_date}.csv` | `forecast/{season_year}/{init_date}/delivery/` |

Step-by-step

1. Entry: `run()` (`run_forecast.py:143`)

def run(run_dir, *, season_year: int, init_date: date, force: bool = False) -> None

The CLI invokes this as `cli run forecast --run-dir <path> --season-year <Y> --init-date <D>`. PR #372 removed the `--config` shortcut.

2. `validate_residual_mode` (`run_forecast.py:91`)

This is the first call in `run()`. It checks the on-disk `run_dir` against the YAML's `forecast.residual_mode` and fails fast on an incompatible pair, before any pipeline I/O:

| Condition | Outcome |
|---|---|
| No CV folds and no production fold | `FileNotFoundError`; run `make hindcast` |
| OOS mode + no CV folds | `FileNotFoundError`; run hindcast or switch to `in_sample_pooled` |
| `in_sample_pooled` + no production fold | `FileNotFoundError`; run `cli run fit-production` |
| Otherwise | pass |

forecast.residual_mode is mandatory on ForecastConfig since PR #372; there is no default.
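The decision table above can be sketched as a small pure function. This is only an illustration, not the code in `run_forecast.py`: the function name `check_residual_mode`, its boolean parameters, and the error wording are hypothetical, and it assumes that every mode other than `in_sample_pooled` counts as an OOS mode.

```python
def check_residual_mode(
    residual_mode: str, *, has_cv_folds: bool, has_production_fold: bool
) -> None:
    """Hypothetical stand-in for the validate_residual_mode decision table."""
    # Assumption: any mode other than "in_sample_pooled" is an OOS mode.
    is_oos = residual_mode != "in_sample_pooled"

    # Row 1: nothing has been fitted in this run_dir at all.
    if not has_cv_folds and not has_production_fold:
        raise FileNotFoundError("no fitted folds found; run `make hindcast`")
    # Row 2: OOS residuals need the CV-fold predictions.
    if is_oos and not has_cv_folds:
        raise FileNotFoundError(
            "OOS residual mode needs CV folds; run hindcast or switch to "
            "`in_sample_pooled`"
        )
    # Row 3: pooled in-sample residuals need the production fold.
    if not is_oos and not has_production_fold:
        raise FileNotFoundError(
            "`in_sample_pooled` needs a production fold; run "
            "`cli run fit-production`"
        )
    # Row 4: every other combination passes.
```

Because the function only compares flags, it can be exercised without a real `run_dir`, which is one reason to keep such a gate free of side effects.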

3. `run_features()`: Forecast feature build (`run_forecast.py:167`)

Skip-if-exists guard: when force=False and both indices_zarr and features_parquet already exist, the build is a no-op (logged at INFO).

Otherwise:

1. `run_preflight(preflight_paths_for_forecast_features(config))`: checks the required input paths.
2. `ForecastSlice(run_dir, ..., season_year, init_date)`: the single source of truth for all artefact paths under `forecast/{season_year}/{init_date}/`.
3. `materialise_forecast_indices(config, results)`: builds the daily weather indices zarr at `results.indices_zarr`, splicing real observations up to `init_date` with a climatology analogue beyond it.
4. `_build_forecast_features(config, results)` (`run_forecast.py:264`): overrides the weather builder to read `results.indices_zarr`, narrows the year window to `season_year`, and calls the long-range stubs (`synthesise_long_range_climo_for_unseen_years`, `synthesise_long_range_stress_for_unseen_years`) for years beyond zarr coverage. It then calls `build_features(forecast_cfg, output_dir=results.features_dir)`.
5. `_impute_forecast_area(config, results)`: fills NaN `area_harvested_ha` with a 3-year trailing median taken from the canonical `pred.parquet` (read-only reference).
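The observation/climatology splice performed while materialising the indices can be illustrated with plain dictionaries keyed by date. This is a minimal sketch under stated assumptions, not the zarr-based implementation: the function name `splice_daily_series` is hypothetical, `init_date` is treated as inclusive, and days with no observation on or before `init_date` fall back to climatology.

```python
from datetime import date, timedelta


def splice_daily_series(
    observed: dict[date, float],
    climatology: dict[date, float],
    init_date: date,
) -> dict[date, float]:
    """Use observations up to init_date (inclusive) and climatology after it."""
    spliced: dict[date, float] = {}
    for day in sorted(set(observed) | set(climatology)):
        if day <= init_date and day in observed:
            # Real observation available at or before the forecast init date.
            spliced[day] = observed[day]
        elif day in climatology:
            # Beyond init_date (or an observation gap): use the climatology value.
            spliced[day] = climatology[day]
    return spliced


start = date(2025, 7, 1)
climo = {start + timedelta(days=d): 20.0 for d in range(5)}
obs = {start + timedelta(days=d): 25.0 for d in range(3)}
season = splice_daily_series(obs, climo, init_date=date(2025, 7, 2))
```

In this example the observation for 3 July is ignored because it lies after `init_date`, so a forecast can never see weather that was not yet available when it was issued.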

4. `run_predict()`: Predict + postprocess + deliver (`run_forecast.py:221`)

1. `run_preflight(preflight_paths_for_forecast_predict(...))`.
2. `ForecastSlice(...)`: path handle.
3. `run_predict_stage(run_dir, season_year=season_year, init_date=init_date)`: an alias of `stages.run_predict.run_predict`. It loads the production model, scores the forecast feature parquet, and writes `walk_forward_preds.parquet` under the production fold path.
4. `_postprocess_forecast(experiment, results)`: single-production-fold postprocess. It area-weights predictions to ADM0, fits and saves the bias corrector, calls `primary_calibration` to load the CI sidecar, calls `build_rows` to assemble national rows with bias and CI columns, and writes `results.postprocessed_national_path`.
5. `_deliver_forecast(experiment, results)`: for each of `("ADM0", "ADM1", "ADM2")`, calls `walk_forward_preds_to_delivery_rows(mode="forecast")`, validates the rows through `HindcastDelivery`, serialises them to a Polars DataFrame, applies post-transforms, and writes `Treefera_{key}_{level}_Forecast_{init_date}.csv`.
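The area-weighting step in the postprocess can be pictured as a weighted mean of sub-national yields, with harvested area as the weight. The sketch below is an assumption about the arithmetic only: the function name `area_weighted_yield` and the `(yield, area)` tuple layout are hypothetical and do not reflect the actual column names or the Polars code path.

```python
def area_weighted_yield(rows: list[tuple[float, float]]) -> float:
    """Aggregate (yield, area_harvested_ha) pairs to one national yield."""
    total_area = sum(area for _, area in rows)
    if total_area <= 0:
        raise ValueError("total harvested area must be positive")
    # Each region contributes in proportion to its harvested area.
    return sum(yld * area for yld, area in rows) / total_area


# Two regions: 2.0 t/ha on 100 ha and 4.0 t/ha on 300 ha.
national = area_weighted_yield([(2.0, 100.0), (4.0, 300.0)])
print(national)  # 3.5
```

An unweighted mean of the same two regions would give 3.0, which understates the contribution of the larger region; this is also why NaN areas have to be imputed before this step.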

Mermaid Flow

flowchart TD
    CLI["cli run forecast\n--run-dir --season-year --init-date"]
    ENTRY["run()\nrun_forecast.py:143"]
    GUARD["validate_residual_mode()\nrun_forecast.py:91\nfail-fast before any I/O"]
    FE["run_features()\nrun_forecast.py:167"]
    IDXZ["materialise_forecast_indices()\n→ indices.zarr"]
    STUB["long-range climo/stress stubs\nforecast_long_range_stub.py"]
    BFEAT["build_features(forecast_cfg)\n→ features/pred.parquet"]
    AREA["_impute_forecast_area()\ntrailing-3yr median\nfrom canonical pred.parquet\n(read-only)"]
    RP["run_predict()\nrun_forecast.py:221"]
    PRED["run_predict_stage()\nstages/run_predict.py\n→ walk_forward_preds.parquet"]
    POST["_postprocess_forecast()\naggregate ADM0 + bias_corrector\n+ primary_calibration\n→ postprocessed/national.parquet"]
    DEL["_deliver_forecast()\nADM0/ADM1/ADM2 CSVs\n→ forecast/{sy}/{id}/delivery/"]

    CLI --> ENTRY
    ENTRY --> GUARD
    GUARD --> FE
    FE --> IDXZ
    IDXZ --> STUB
    STUB --> BFEAT
    BFEAT --> AREA
    AREA --> RP
    RP --> PRED
    PRED --> POST
    POST --> DEL

    HINDCAST["Completed run_dir\n(production model + conformal sidecars)"]
    HINDCAST -. "read-only reference" .-> ENTRY

Invariants

Read-only-on-canonical-hindcast invariant (DESIGN.md:125): the forecast pipeline SHALL NOT write to canonical hindcast artefacts under {features_dir}/{experiment_key}/. _impute_forecast_area reads the canonical pred.parquet but writes only to results.features_parquet under the forecast sub-directory.
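A minimal sketch of how an imputation step can honour this invariant: it reads the canonical history, returns new forecast rows, and mutates neither input. The function name `impute_area`, the dictionary layout, and the `region` key are hypothetical. It also assumes that "3-year trailing median" means the median over the three seasons immediately before `season_year`, skipping missing values.

```python
from math import isnan
from statistics import median


def impute_area(
    forecast_rows: list[dict],
    canonical_history: dict[str, dict[int, float]],
    season_year: int,
    window: int = 3,
) -> list[dict]:
    """Fill NaN area_harvested_ha from history without mutating any input."""
    filled = []
    for row in forecast_rows:
        area = row["area_harvested_ha"]
        if isnan(area):
            past = canonical_history.get(row["region"], {})
            # Assumption: the window is the `window` seasons before season_year.
            trailing = [
                past[year]
                for year in range(season_year - window, season_year)
                if year in past and not isnan(past[year])
            ]
            if trailing:
                area = median(trailing)
        # Build a new row instead of editing the caller's dictionary.
        filled.append({**row, "area_harvested_ha": area})
    return filled


history = {"A": {2021: 999.0, 2022: 100.0, 2023: 140.0, 2024: 120.0}}
rows = [
    {"region": "A", "area_harvested_ha": float("nan")},
    {"region": "B", "area_harvested_ha": 50.0},
]
imputed = impute_area(rows, history, season_year=2025)
```

Here region `A` receives the median of its 2022 to 2024 areas, while the 2021 value falls outside the window and is ignored.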

ForecastSlice.root as single path source of truth (lib/results/results_slice.py): all forecast artefact paths are derived from run_dir / 'forecast' / str(season_year) / f'{init_date:%Y-%m-%d}'. Hard-coding the path string anywhere else is a bug.
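The path expression quoted above can be wrapped in one helper so that every other path is derived from it. The sketch below is illustrative and is not the `ForecastSlice` class: the function names `forecast_root` and `delivery_dir` are hypothetical, and only the path expression and the `delivery/` sub-directory come from this page.

```python
from datetime import date
from pathlib import Path


def forecast_root(run_dir: Path, season_year: int, init_date: date) -> Path:
    """Single place where the forecast slice root is assembled."""
    return run_dir / "forecast" / str(season_year) / f"{init_date:%Y-%m-%d}"


def delivery_dir(run_dir: Path, season_year: int, init_date: date) -> Path:
    """Derived path: built from the root, never from a hard-coded string."""
    return forecast_root(run_dir, season_year, init_date) / "delivery"


root = forecast_root(Path("runs/exp"), 2025, date(2025, 7, 1))
print(root.as_posix())  # runs/exp/forecast/2025/2025-07-01
```

Deriving `delivery_dir` from `forecast_root` means a future layout change, such as the one in PR #369, only has to touch one function.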

forecast.residual_mode is mandatory: there is no default value. A config YAML without this field will fail Pydantic validation before run() is entered.

validate_residual_mode must be the first call: PR #372 mandates this. Moving it after feature build would waste minutes of compute on an incompatible run_dir.

Production model fold resolution (run_predict.py:86–98): tries models/{experiment_key}/production/ first; falls back to models/{experiment_key}/{season_year}/. Raises FileNotFoundError if neither exists.
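The fallback order can be sketched with `pathlib` as follows. This is not the code at `run_predict.py:86–98`: the function name `resolve_model_dir` is hypothetical, and it assumes that `models/` lives directly under `run_dir`.

```python
from pathlib import Path


def resolve_model_dir(run_dir: Path, experiment_key: str, season_year: int) -> Path:
    """Prefer the production fold, then fall back to the season_year fold."""
    base = run_dir / "models" / experiment_key
    candidates = (base / "production", base / str(season_year))
    for candidate in candidates:
        if candidate.is_dir():
            return candidate
    tried = ", ".join(str(c) for c in candidates)
    raise FileNotFoundError(f"no model fold found; tried: {tried}")
```

Listing the attempted paths in the error message keeps the failure actionable, in the same spirit as the messages raised by `validate_residual_mode`.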

Failure Modes

validate_residual_mode raises: most likely cause is invoking make forecast without a prior make hindcast or cli run fit-production. Error message is actionable and includes the exact next command.
Long-range climo stub emits warnings: three WARNING log lines appear when season_year is beyond the zarr's coverage. This is normal behaviour; the stub fills with trailing medians and the forecast collapses to trend-only (see multi_year_forecast.md).
_impute_forecast_area FileNotFoundError: canonical pred.parquet is absent (features have not been built for the hindcast). The forecast must reference the historical feature matrix.
Skip-if-exists reuse of stale artefacts: if force=False and an earlier forecast for the same (season_year, init_date) exists, the feature build is skipped. This is correct for reruns but will not pick up updated raw-weather files. Pass force=True to rebuild.
OOS calibration empty rows: CalibrationResult.to_frame() now raises with diagnostic context (PR #372 defensive guard) instead of bare KeyError: 'fold_year'.
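The stale-artefact failure mode follows directly from the shape of the skip-if-exists guard, which only looks at whether the outputs exist and not at whether the inputs changed. A minimal sketch, assuming a hypothetical helper named `needs_feature_build` (the real guard lives inside `run_features()`):

```python
from pathlib import Path


def needs_feature_build(
    indices_zarr: Path, features_parquet: Path, *, force: bool
) -> bool:
    """Return True when the forecast feature build should run."""
    if force:
        # force=True always rebuilds, even if both artefacts are present.
        return True
    # Skip only when BOTH artefacts already exist. Note that this check
    # cannot detect that the raw-weather files were updated afterwards.
    return not (indices_zarr.exists() and features_parquet.exists())
```

Because the check is existence-based, refreshed raw observations for the same `(season_year, init_date)` are only picked up with `force=True`.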

Cross-references

stages.md — run_forecast.py full function signatures
delivery.md — walk_forward_preds_to_delivery_rows, DeliveryRow, HindcastDelivery
PR-369.md — path restructure forecast/{init_date}/ → forecast/{season_year}/{init_date}/
PR-372.md — forecast.residual_mode mandatory + validate_residual_mode gate
multi_year_forecast.md — multi-season_year orchestration and long-range stub
deliver.md — hindcast (CV-fold) delivery, parallel code path

PRs

PR #369 — restructured forecast artefact paths from forecast/{init_date}/ to forecast/{season_year}/{init_date}/ and introduced the long-range climo stub.
PR #372 — made forecast.residual_mode mandatory and added validate_residual_mode as the first call in run().