Pipeline: Evaluate

Purpose

The EVALUATE pipeline is the read-only audit layer that runs immediately after POSTPROCESS in a full hindcast. It accepts a completed run_dir (FIT + PREDICT + POSTPROCESS artefacts present), scores each out-of-sample fold against NASS and WASDE/CONAB benchmarks, writes text and CSV reports, and generates 15 diagnostic PNG files. No models are re-fitted and no predictions are recomputed.

Inputs

| Artefact | Path | Consumer |
| --- | --- | --- |
| Per-fold walk_forward_preds.parquet | preds/{experiment_key}/{fold_label}/ | metrics.py, runners.py, all plot prep modules |
| Per-fold bias_corrector.pkl | models/{experiment_key}/{fold_label}/ | metrics.py::gen_metrics |
| postprocessed/national.parquet | run_dir/postprocessed/ | prep/delivery.py for stage-7 plots |
| delivery/Treefera_*_ADM0_Hindcast_*.csv | run_dir/delivery/ | prep/delivery.py |
| conformal/{mode}.parquet | run_dir/conformal/ | prep/delivery.py |
| Fitted detrender (detrender.pkl) | models/{experiment_key}/{fold_label}/ | prep/detrend.py, prep/trend_evolution.py |

Outputs

| Artefact | Path | Producer |
| --- | --- | --- |
| metrics/raw_fold_metrics.json | run_dir/metrics/ | metrics.py::compute_metrics |
| metrics_table.csv | run_dir/reports/ | metrics.py::_write_metrics_table_csv |
| stage5_metrics.txt | run_dir/reports/ | runners.py::write_rolling_forecast_metrics_files |
| stage5_metrics_ADM1.txt | run_dir/reports/ | runners.py::write_rolling_forecast_metrics_files |
| stage5_metrics_ADM2.txt | run_dir/reports/ | runners.py::write_rolling_forecast_metrics_files |
| rolling_forecast.png | run_dir/reports/ | plots/fns/rolling_forecast.py (cross-fold) |
| improvement_heatmap.png | run_dir/reports/ | plots/fns/improvement_heatmap.py (cross-fold) |
| information_advantage.png | run_dir/reports/ | plots/fns/information_advantage.py (cross-fold) |
| benchmark_grid.png | run_dir/reports/ | plots/fns/benchmark_grid.py (cross-fold) |
| scatter.png | run_dir/reports/ | plots/fns/scatter.py (cross-fold) |
| detrended_scatter.png | run_dir/reports/ | plots/fns/detrended_scatter.py (cross-fold) |
| pdp_pc_space.png | run_dir/reports/ | plots/fns/pdp.py::plot_pdp_pc_space (cross-fold) |
| stage7a_delivery_forecast.png | run_dir/reports/ | plots/fns/delivery.py (cross-fold) |
| stage7b_delivery_uncertainty.png | run_dir/reports/ | plots/fns/delivery.py (cross-fold) |
| stage7c_delivery_reference_comparison.png | run_dir/reports/ | plots/fns/delivery.py (cross-fold) |
| stage7d_delivery_weather_correction.png | run_dir/reports/ | plots/fns/delivery.py (cross-fold) |
| {fold_label}_detrend_quality.png | run_dir/reports/hindcast/ | plots/fns/detrend.py (per-fold) |
| {fold_label}_trend_fit_grid.png | run_dir/reports/hindcast/ | plots/fns/trend_evolution.py (per-fold) |
| {fold_label}_trend_fit_detrended.png | run_dir/reports/hindcast/ | plots/fns/trend_evolution.py (per-fold) |
| {fold_label}_residual_predictability.png | run_dir/reports/hindcast/ | plots/fns/residual_predictability.py (per-fold) |
| {fold_label}_pdp_feature_space_*.png | run_dir/reports/hindcast/ | plots/fns/pdp.py::plot_pdp_feature_space (per-fold, chunked) |

Step-by-step

1. Entry — evaluate_experiment (run_diagnostics.py:12)

run_hindcast.run() calls evaluate_experiment(run_root) as step 10 after POSTPROCESS. The function can also be invoked independently via the CLI for incremental re-runs.

2. compute_metrics (metrics.py:239)

Iterates over all hindcast_slices (one per OOS fold). For each fold:

  • Area-weights sim_yield_kg_ha to national via aggregate_weighted_frame.
  • Loads the fold's persisted AbstractBiasCorrector and calls apply_national.
  • Computes per-spec MAE against each ReferenceYieldLoader (WASDE, CONAB, etc.).
  • Computes county-level MAE/RMSE vs NASS prod/area and survey yield.
  • All values remain in kg/ha here; conversion is deferred.

Returns list[dict[str, float]]; also persists metrics/raw_fold_metrics.json.
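The area-weighting and error-scoring steps above can be sketched as follows. This is a minimal illustration, not the real implementation: the column name harvested_area_ha and the helper names are assumptions, and aggregate_weighted_frame's actual signature may differ.

```python
import pandas as pd

def aggregate_weighted(df: pd.DataFrame,
                       value_col: str = "sim_yield_kg_ha",
                       weight_col: str = "harvested_area_ha") -> float:
    # National yield as the harvested-area-weighted mean of county yields
    return float((df[value_col] * df[weight_col]).sum() / df[weight_col].sum())

def mae(pred: pd.Series, ref: pd.Series) -> float:
    # Mean absolute error; rows with a missing reference value are dropped
    diff = (pred - ref).dropna()
    return float(diff.abs().mean())
```

All values stay in kg/ha here, matching the invariant that unit conversion is deferred to the reporting boundary.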

3. write_metrics_artefacts (metrics.py:407)

Only executed when skip_plots=False (the default). For each fold:

  • Re-calls gen_metrics for the raw kg/ha dict.
  • Calls add_rolling_forecast_metrics_for_reporting (runners.py:148) which populates rolling_forecast_data, rolling_forecast_adm1_oos, rolling_forecast_adm2_oos in the dict.

Then:

  • Calls write_rolling_forecast_metrics_files → stage5_metrics*.txt (ISO weeks 19+, bu/ac).
  • Calls _convert_metrics_to_bu_acre in-place (renames _kg_ha keys to _bu_ac).
  • Calls _write_metrics_table_csv → metrics_table.csv with per-fold rows plus a trailing mean row.
  • Calls _log_metrics_table_to_mlflow when an active MLflow run exists.
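The key-renaming conversion can be sketched like this (a hedged illustration, assuming corn's 56 lb/bu test weight; the factor in _convert_metrics_to_bu_acre may be commodity-specific):

```python
KG_PER_BUSHEL_CORN = 25.4012  # 56 lb test weight (assumption)
HA_PER_ACRE = 0.404686

def kg_ha_to_bu_ac(value_kg_ha: float) -> float:
    # 62.77 kg/ha is roughly 1 bu/ac for corn
    return value_kg_ha * HA_PER_ACRE / KG_PER_BUSHEL_CORN

def convert_metrics(metrics: dict) -> dict:
    # Rename *_kg_ha keys to *_bu_ac, converting values; pass others through
    out = {}
    for key, value in metrics.items():
        if key.endswith("_kg_ha"):
            out[key[:-len("_kg_ha")] + "_bu_ac"] = kg_ha_to_bu_ac(value)
        else:
            out[key] = value
    return out
```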

4. generate_plots (plots/__init__.py:21)

Constructs PlotRunner(result, output_dir) and calls runner.run().

The registry (registry.py:77) defines eight PlotGroup instances in two scopes:

Cross-fold groups (run once on the full ExperimentResult): rolling_forecast, improvement_heatmap, information_advantage, benchmark_grid, scatter, detrended_scatter, pdp_pc_space, delivery (four specs).

Per-fold groups (run once per HindcastSlice): detrend, trend_evolution, residual_predictability, pdp.
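The two-scope registry shape can be pictured with a hypothetical sketch (field names here are illustrative; the real PlotSpec/PlotGroup definitions in registry.py may differ):

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass(frozen=True)
class PlotSpec:
    filename: str
    plot_fn: Callable            # pure: (df, **kwargs) -> matplotlib Figure

@dataclass(frozen=True)
class PlotGroup:
    name: str
    scope: str                   # "cross_fold" or "per_fold"
    prepare_data: Callable       # (result[, fold]) -> DataFrame
    specs: List[PlotSpec]

# Illustrative cross-fold group with a single spec
scatter_group = PlotGroup(
    name="scatter",
    scope="cross_fold",
    prepare_data=lambda result: result,            # placeholder
    specs=[PlotSpec("scatter.png", lambda df: None)],
)
```

The split matters for cost: cross-fold groups prepare their DataFrame once, while per-fold groups repeat preparation for every HindcastSlice.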

For each group, runner.run() calls group.prepare_data(result[, fold]) once, then _run_spec(spec, df, ...) per spec. Each spec calls the pure plot function and passes the returned Figure to _save_png. Local saves are atomic (write to .tmp.png, then rename). Cloud saves use CloudPath.write_bytes.
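The local atomic-save pattern can be sketched as follows (a minimal sketch; the real _save_png also handles Figure serialisation and cloud destinations):

```python
from pathlib import Path

def save_png_atomic(png_bytes: bytes, dest: Path) -> None:
    # Write the full payload to a temporary sibling first, then rename;
    # a reader listing run_dir/reports/ never observes a partial PNG
    tmp = dest.with_name(dest.name + ".tmp.png")
    tmp.write_bytes(png_bytes)
    tmp.rename(dest)
```

The rename is atomic on POSIX filesystems, which is why the cloud path needs a different mechanism (CloudPath.write_bytes) rather than the same tmp-and-rename dance.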

Mermaid Flow

```mermaid
flowchart TD
    A["evaluate_experiment(run_dir)\nrun_diagnostics.py:12"]
    B["compute_metrics(run_dir)\nmetrics.py:239"]
    C["write_metrics_artefacts(run_dir)\nmetrics.py:407"]
    D["gen_metrics per fold\nmetrics.py:109"]
    E["add_rolling_forecast_metrics_for_reporting\nrunners.py:148"]
    F["write_rolling_forecast_metrics_files\nrunners.py:676\n→ stage5_metrics*.txt"]
    G["_convert_metrics_to_bu_acre\n→ metrics_table.csv + MLflow"]
    H["generate_plots(run_dir)\nplots/__init__.py:21"]
    I["PlotRunner.run()\nget_plot_registry()"]
    J["cross_fold groups\nprepare_data(result)\n→ reports/*.png"]
    K["per_fold groups\nprepare_data(result, fold)\n→ reports/hindcast/*.png"]

    A --> B
    A --> C
    B --> D
    C --> D
    C --> E
    E --> F
    C --> G
    A --> H
    H --> I
    I --> J
    I --> K
```

Invariants

  • All scoring in gen_metrics uses kg/ha throughout; bu/ac conversion happens only in _convert_metrics_to_bu_acre at the metrics_table.csv boundary (metrics.py:336–354).
  • _METRICS_MIN_WEEK = 19 (runners.py:208) filters ISO weeks 1–18 from the text reports, excluding pre-WASDE rows from rolling metrics.
  • NaN reference values (no release before init_date) are kept in the rolling DataFrame for plot completeness but excluded from numeric error averages.
  • Plot functions in fns/ are pure: they accept a DataFrame and typed kwargs and return a Figure. No disk access, no config loading inside a plot function.
  • skip_plots=True allows the CLI to compute metrics without the heavy matplotlib import chain.
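The ISO-week cutoff invariant can be sketched as a standalone predicate (an illustration mirroring _METRICS_MIN_WEEK; the real filter in runners.py operates on the rolling DataFrame, not row by row):

```python
import datetime

METRICS_MIN_WEEK = 19  # mirrors _METRICS_MIN_WEEK in runners.py

def in_rolling_window(init_date: datetime.date) -> bool:
    # ISO weeks 1-18 fall before the first WASDE print and are excluded
    # from the stage5_metrics*.txt reports
    return init_date.isocalendar()[1] >= METRICS_MIN_WEEK
```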

Failure Modes

  • FileNotFoundError on bias_corrector.pkl: POSTPROCESS did not complete or was interrupted. Re-run cli run postprocess.
  • Empty hindcast_slices: no walk-forward CV folds exist (only a fit_production run_dir). compute_metrics returns an empty list; write_metrics_artefacts is a no-op.
  • Cloud PNG write failure: CloudPath.write_bytes requires network access. On dev EC2 with $INPUT_DATA_DIR pointing at S3, plots will fail if credentials are not configured.
  • prep/pdp.py calling .predict(): this is the only prep module that re-invokes a model method (on a static grid, not live data). If the regressor's predict signature changes, PDPs fail at plot time, not at fit time.
  • MLflow DB locking (tracked in project MEMORY.md): concurrent evaluate calls for the same commodity can cause OperationalError. Run evaluate sequentially per commodity.

Cross-references

  • stages.md — run_diagnostics.py function signatures
  • diagnostics.md — full module-level detail for metrics.py and runners.py
  • plots.md — full plot inventory, PlotSpec/PlotGroup registry
  • deliver.md — downstream stage that consumes the same walk_forward_preds.parquet
  • forecast.md — forecast pipeline which runs its own postprocess but reuses metrics concepts

PRs

No dedicated PR for the evaluate pipeline in the tracked window. Plot subsystem introduced progressively; PR #340 added window-aware MAPE (dashboard side only). See dashboard.md for window-aware scoring context.