Pipeline: Forecast
Purpose
The FORECAST pipeline issues a point-in-time yield forecast for a single (season_year, init_date) pair against an already-fitted run_dir. It builds weather-spliced features, scores them using the production model, postprocesses with the calibrated bias corrector and conformal CI bands, and writes ADM0/ADM1/ADM2 delivery CSVs. It is entirely separate from the hindcast walk-forward CV loop and never writes to canonical hindcast artefacts.
Inputs
| Artefact | Path | Role |
|---|---|---|
| `run_dir/` | caller-supplied | Completed hindcast or fit-production run |
| `config_resolved.yaml` | `run_dir/` | ExperimentConfig including mandatory `forecast.residual_mode` |
| Production model | `models/{experiment_key}/production/` | Detrender + imputer + regressor |
| CV-fold `walk_forward_preds` | `preds/{experiment_key}/{fold_label}/` | Required for OOS residual modes |
| `production/train_preds.parquet` | `preds/{experiment_key}/` | Required for `in_sample_pooled` |
| `conformal/{mode}.parquet` | `run_dir/conformal/` | CI calibration sidecar |
| Raw weather observations | `config.forecast.raw_obs_path` | Daily obs up to `init_date` |
| Climatology zarr | `config.forecast.materialised_climo_path` | Historical indices for climo splice |
| Canonical `pred.parquet` | `{features_dir}/{experiment_key}/pred.parquet` | Historical area for `_impute_forecast_area` (read-only) |
Outputs
All artefacts land under run_dir/forecast/{season_year}/{init_date}/ (path restructured by PR #369):
| Artefact | Path |
|---|---|
| `indices.zarr` | `forecast/{season_year}/{init_date}/` |
| `features/pred.parquet` | `forecast/{season_year}/{init_date}/features/` |
| `preds/{experiment_key}/production/walk_forward_preds.parquet` | `forecast/{season_year}/{init_date}/` |
| `postprocessed/national.parquet` | `forecast/{season_year}/{init_date}/postprocessed/` |
| `bias_corrector.pkl` | `forecast/{season_year}/{init_date}/` |
| `Treefera_{key}_ADM0_Forecast_{init_date}.csv` | `forecast/{season_year}/{init_date}/delivery/` |
| `Treefera_{key}_ADM1_Forecast_{init_date}.csv` | `forecast/{season_year}/{init_date}/delivery/` |
| `Treefera_{key}_ADM2_Forecast_{init_date}.csv` | `forecast/{season_year}/{init_date}/delivery/` |
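
The output layout can be sketched as a small path handle. This is an illustration only: `ForecastPaths` is a hypothetical stand-in, not the real `ForecastSlice`, and it assumes that `pred.parquet` sits directly under `features/`, that `{key}` is a plain string, and that `{init_date}` in the CSV name uses the same `YYYY-MM-DD` form as the directory.

```python
from dataclasses import dataclass
from datetime import date
from pathlib import Path


@dataclass(frozen=True)
class ForecastPaths:
    """Hypothetical stand-in for ForecastSlice; not the real class."""

    run_dir: Path
    season_year: int
    init_date: date
    key: str  # assumption: the {key} placeholder used in delivery file names

    @property
    def root(self) -> Path:
        # run_dir / 'forecast' / season_year / init_date as YYYY-MM-DD
        return (
            self.run_dir
            / "forecast"
            / str(self.season_year)
            / f"{self.init_date:%Y-%m-%d}"
        )

    @property
    def indices_zarr(self) -> Path:
        return self.root / "indices.zarr"

    @property
    def features_parquet(self) -> Path:
        # Assumption: pred.parquet lives directly under features/.
        return self.root / "features" / "pred.parquet"

    def delivery_csv(self, level: str) -> Path:
        # Assumption: the file name reuses the YYYY-MM-DD date format.
        name = f"Treefera_{self.key}_{level}_Forecast_{self.init_date:%Y-%m-%d}.csv"
        return self.root / "delivery" / name
```

Deriving every path from one `root` property means a later layout change touches a single place.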
Step-by-step
1. Entry: run() (run_forecast.py:143)
The CLI invokes this as `cli run forecast --run-dir <path> --season-year <Y> --init-date <D>`. The `--config` shortcut was removed by PR #372.
2. validate_residual_mode (run_forecast.py:91)
The first call in `run()`. It compares the on-disk contents of `run_dir` with `forecast.residual_mode` from the YAML and fails fast, before any feature-build or scoring work starts, if the two are incompatible:
| Condition | Outcome |
|---|---|
| No CV folds and no production fold | `FileNotFoundError`; run `make hindcast` |
| OOS mode + no CV folds | `FileNotFoundError`; run hindcast or switch to `in_sample_pooled` |
| `in_sample_pooled` + no production fold | `FileNotFoundError`; run `cli run fit-production` |
| Otherwise | pass |
forecast.residual_mode is mandatory on ForecastConfig since PR #372; there is no default.
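
The decision table above can be read as a short guard function. The sketch below is hypothetical and is not the code in `run_forecast.py`: the function name, the boolean arguments, and the assumption that every mode other than `in_sample_pooled` counts as out-of-sample are all illustrative.

```python
# Assumption: any residual mode other than "in_sample_pooled" is treated as OOS.
IN_SAMPLE_MODE = "in_sample_pooled"


def check_residual_mode(
    residual_mode: str,
    has_cv_folds: bool,
    has_production_fold: bool,
) -> None:
    """Raise FileNotFoundError when the run_dir cannot serve the requested mode."""
    # Row 1: nothing usable on disk at all.
    if not has_cv_folds and not has_production_fold:
        raise FileNotFoundError(
            "no CV folds and no production fold; run `make hindcast`"
        )
    # Row 2: an OOS mode needs CV-fold predictions.
    if residual_mode != IN_SAMPLE_MODE and not has_cv_folds:
        raise FileNotFoundError(
            "OOS residual mode needs CV folds; run hindcast or switch to "
            "in_sample_pooled"
        )
    # Row 3: the pooled in-sample mode needs the production fold.
    if residual_mode == IN_SAMPLE_MODE and not has_production_fold:
        raise FileNotFoundError(
            "in_sample_pooled needs a production fold; run `cli run fit-production`"
        )
    # Row 4: otherwise the combination is compatible.
```

Because the rows are checked in table order, the most general failure (nothing on disk) is reported before the mode-specific ones.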
3. run_features(): Forecast feature build (run_forecast.py:167)
Skip-if-exists guard: when force=False and both indices_zarr and features_parquet already exist, the build is a no-op (logged at INFO).
Otherwise:
1. `run_preflight(preflight_paths_for_forecast_features(config))`: checks required input paths.
2. `ForecastSlice(run_dir, ..., season_year, init_date)`: the single source of truth for all artefact paths under `forecast/{season_year}/{init_date}/`.
3. `materialise_forecast_indices(config, results)`: builds the daily weather indices zarr at `results.indices_zarr`. Splices real observations up to `init_date` with climatology analogue beyond it.
4. `_build_forecast_features(config, results)` (run_forecast.py:264): overrides the weather builder to read `results.indices_zarr`, narrows the year window to `season_year`, calls the long-range stubs (`synthesise_long_range_climo_for_unseen_years`, `synthesise_long_range_stress_for_unseen_years`) for years beyond zarr coverage, then calls `build_features(forecast_cfg, output_dir=results.features_dir)`.
5. `_impute_forecast_area(config, results)`: fills NaN `area_harvested_ha` via 3-year trailing median from the canonical `pred.parquet` (read-only reference).
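
The area imputation in the last step can be illustrated with a minimal sketch. It assumes that "3-year trailing" means the three seasons immediately before `season_year` and that missing seasons inside the window are skipped; the real `_impute_forecast_area` may differ on both points. The function name and the dict-based input are hypothetical.

```python
from statistics import median
from typing import Optional


def trailing_median_area(
    history: dict[int, Optional[float]],
    season_year: int,
    window: int = 3,
) -> Optional[float]:
    """Median harvested area over the `window` seasons before `season_year`.

    `history` maps season -> area_harvested_ha (None where unknown) for one
    region, as read from the canonical feature table. Nothing is written back.
    """
    years = range(season_year - window, season_year)
    observed = [history[y] for y in years if history.get(y) is not None]
    # With no usable history the gap stays unfilled.
    return median(observed) if observed else None


history = {2021: 100.0, 2022: 120.0, 2023: 110.0, 2024: None}
filled = trailing_median_area(history, 2024)  # median of 100, 120, 110
```

The historical table is used as a lookup only, which matches the read-only constraint on canonical hindcast artefacts.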
4. run_predict(): Predict + postprocess + deliver (run_forecast.py:221)
1. `run_preflight(preflight_paths_for_forecast_predict(...))`.
2. `ForecastSlice(...)`: path handle.
3. `run_predict_stage(run_dir, season_year=season_year, init_date=init_date)`: alias to `stages.run_predict.run_predict`; loads the production model, scores the forecast feature parquet, and writes `walk_forward_preds.parquet` under the production fold path.
4. `_postprocess_forecast(experiment, results)`: single-production-fold postprocess. Area-weights to ADM0, fits and saves the bias corrector, calls `primary_calibration` to load the CI sidecar, calls `build_rows` to assemble national rows with bias and CI columns, and writes `results.postprocessed_national_path`.
5. `_deliver_forecast(experiment, results)`: for each of `("ADM0", "ADM1", "ADM2")`, calls `walk_forward_preds_to_delivery_rows(mode="forecast")`, validates through `HindcastDelivery`, serialises to a Polars DataFrame, applies post-transforms, and writes `Treefera_{key}_{level}_Forecast_{init_date}.csv`.
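
The area-weighting step of the postprocess can be sketched as follows, assuming it means the usual harvested-area-weighted mean of sub-national yields. The function name and the tuple input are hypothetical; the real code works on the prediction parquet.

```python
def area_weighted_yield(rows: list[tuple[float, float]]) -> float:
    """National yield from (predicted_yield, area_harvested_ha) pairs.

    Computes sum(yield_i * area_i) / sum(area_i) over sub-national units.
    """
    total_area = sum(area for _, area in rows)
    if total_area <= 0:
        raise ValueError("total harvested area must be positive")
    return sum(yld * area for yld, area in rows) / total_area


# Two regions: 2.0 t/ha on 100 ha and 4.0 t/ha on 300 ha.
national = area_weighted_yield([(2.0, 100.0), (4.0, 300.0)])
```

This weighting is also why NaN areas have to be imputed during the feature build: a region with a missing area would otherwise drop out of the weighted sum or poison it.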
Mermaid Flow
flowchart TD
CLI["cli run forecast\n--run-dir --season-year --init-date"]
ENTRY["run()\nrun_forecast.py:143"]
GUARD["validate_residual_mode()\nrun_forecast.py:91\nfail-fast before any I/O"]
FE["run_features()\nrun_forecast.py:167"]
IDXZ["materialise_forecast_indices()\n→ indices.zarr"]
STUB["long-range climo/stress stubs\nforecast_long_range_stub.py"]
BFEAT["build_features(forecast_cfg)\n→ features/pred.parquet"]
AREA["_impute_forecast_area()\ntrailing-3yr median\nfrom canonical pred.parquet\n(read-only)"]
RP["run_predict()\nrun_forecast.py:221"]
PRED["run_predict_stage()\nstages/run_predict.py\n→ walk_forward_preds.parquet"]
POST["_postprocess_forecast()\naggregate ADM0 + bias_corrector\n+ primary_calibration\n→ postprocessed/national.parquet"]
DEL["_deliver_forecast()\nADM0/ADM1/ADM2 CSVs\n→ forecast/{sy}/{id}/delivery/"]
CLI --> ENTRY
ENTRY --> GUARD
GUARD --> FE
FE --> IDXZ
IDXZ --> STUB
STUB --> BFEAT
BFEAT --> AREA
AREA --> RP
RP --> PRED
PRED --> POST
POST --> DEL
HINDCAST["Completed run_dir\n(production model + conformal sidecars)"]
HINDCAST -. "read-only reference" .-> ENTRY
Invariants
- Read-only-on-canonical-hindcast invariant (DESIGN.md:125): the forecast pipeline SHALL NOT write to canonical hindcast artefacts under `{features_dir}/{experiment_key}/`. `_impute_forecast_area` reads the canonical `pred.parquet` but writes only to `results.features_parquet` under the forecast sub-directory.
- `ForecastSlice.root` as single path source of truth (lib/results/results_slice.py): all forecast artefact paths are derived from `run_dir / 'forecast' / str(season_year) / f'{init_date:%Y-%m-%d}'`. Hard-coding the path string anywhere else is a bug.
- `forecast.residual_mode` is mandatory: there is no default value. A config YAML without this field fails Pydantic validation before `run()` is entered.
- `validate_residual_mode` must be the first call: PR #372 mandates this. Moving it after the feature build would waste minutes of compute on an incompatible `run_dir`.
- Production model fold resolution (run_predict.py:86–98): tries `models/{experiment_key}/production/` first, then falls back to `models/{experiment_key}/{season_year}/`. Raises `FileNotFoundError` if neither exists.
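
The fold-resolution invariant amounts to an ordered lookup. A minimal sketch, assuming `models/` sits directly under `run_dir`; `resolve_model_dir` is a hypothetical name, not the function in `run_predict.py`.

```python
from pathlib import Path


def resolve_model_dir(run_dir: Path, experiment_key: str, season_year: int) -> Path:
    """Return the model directory, preferring the production fold."""
    base = run_dir / "models" / experiment_key  # assumption: models/ is under run_dir
    candidates = (base / "production", base / str(season_year))
    for candidate in candidates:
        if candidate.is_dir():
            return candidate
    tried = ", ".join(str(c) for c in candidates)
    raise FileNotFoundError(f"no model fold found; tried: {tried}")
```

Listing the candidates in the error message keeps the failure actionable, in the same spirit as the `validate_residual_mode` messages.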
Failure Modes
- `validate_residual_mode` raises: the most likely cause is invoking `make forecast` without a prior `make hindcast` or `cli run fit-production`. The error message is actionable and includes the exact next command.
- Long-range climo stub emits warnings: three `WARNING` log lines appear when `season_year` is beyond the zarr's coverage. This is normal behaviour; the stub fills with trailing medians and the forecast collapses to trend-only (see multi_year_forecast.md).
- `_impute_forecast_area` `FileNotFoundError`: canonical `pred.parquet` is absent (features have not been built for the hindcast). The forecast must reference the historical feature matrix.
- Skip-if-exists reuse of stale artefacts: if `force=False` and an earlier forecast for the same `(season_year, init_date)` exists, the feature build is skipped. This is correct for reruns but will not pick up updated raw-weather files. Pass `force=True` to rebuild.
- OOS calibration empty rows: `CalibrationResult.to_frame()` now raises with diagnostic context (PR #372 defensive guard) instead of bare `KeyError: 'fold_year'`.
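
The stale-artefact failure mode follows from a guard that tests only for existence, never for freshness. A minimal sketch of such a guard; `build_if_needed` and its `build` callback are hypothetical names, not the code in `run_features()`.

```python
import logging
from pathlib import Path
from typing import Callable

logger = logging.getLogger(__name__)


def build_if_needed(
    indices_zarr: Path,
    features_parquet: Path,
    build: Callable[[], None],
    force: bool = False,
) -> bool:
    """Run `build` unless both outputs already exist; return True if it ran."""
    if not force and indices_zarr.exists() and features_parquet.exists():
        # Existence check only: newer raw-weather inputs are not detected.
        logger.info("forecast features already present; skipping build")
        return False
    build()
    return True
```

Since timestamps and input hashes are not compared, `force=True` is the only way to pick up changed inputs.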
See Also
- ForecastSlice: the path-handle entity for each `(season_year, init_date)` artefact subtree; single source of truth for all forecast paths
Cross-references
- stages.md: `run_forecast.py` full function signatures
- delivery.md: `walk_forward_preds_to_delivery_rows`, `DeliveryRow`, `HindcastDelivery`
- PR-369.md: path restructure `forecast/{init_date}/` → `forecast/{season_year}/{init_date}/`
- PR-372.md: `forecast.residual_mode` mandatory + `validate_residual_mode` gate
- multi_year_forecast.md: multi-season_year orchestration and long-range stub
- deliver.md: hindcast (CV-fold) delivery, parallel code path
PRs
- PR #369: restructured forecast artefact paths from `forecast/{init_date}/` to `forecast/{season_year}/{init_date}/` and introduced the long-range climo stub.
- PR #372: made `forecast.residual_mode` mandatory and added `validate_residual_mode` as the first call in `run()`.