# Pipeline: Evaluate

## Purpose
The EVALUATE pipeline is the read-only audit layer that runs immediately after POSTPROCESS in a full hindcast. It accepts a completed run_dir (FIT + PREDICT + POSTPROCESS artefacts present), scores each out-of-sample fold against NASS and WASDE/CONAB benchmarks, writes text and CSV reports, and generates 15 diagnostic PNG files. No models are re-fitted and no predictions are recomputed.
## Inputs

| Artefact | Path | Consumer |
|---|---|---|
| Per-fold `walk_forward_preds.parquet` | `preds/{experiment_key}/{fold_label}/` | `metrics.py`, `runners.py`, all plot prep modules |
| Per-fold `bias_corrector.pkl` | `models/{experiment_key}/{fold_label}/` | `metrics.py::gen_metrics` |
| `postprocessed/national.parquet` | `run_dir/postprocessed/` | `prep/delivery.py` for stage-7 plots |
| `delivery/Treefera_*_ADM0_Hindcast_*.csv` | `run_dir/delivery/` | `prep/delivery.py` |
| `conformal/{mode}.parquet` | `run_dir/conformal/` | `prep/delivery.py` |
| Fitted detrender (`detrender.pkl`) | `models/{experiment_key}/{fold_label}/` | `prep/detrend.py`, `prep/trend_evolution.py` |
## Outputs

| Artefact | Path | Producer |
|---|---|---|
| `metrics/raw_fold_metrics.json` | `run_dir/metrics/` | `metrics.py::compute_metrics` |
| `metrics_table.csv` | `run_dir/reports/` | `metrics.py::_write_metrics_table_csv` |
| `stage5_metrics.txt` | `run_dir/reports/` | `runners.py::write_rolling_forecast_metrics_files` |
| `stage5_metrics_ADM1.txt` | `run_dir/reports/` | `runners.py::write_rolling_forecast_metrics_files` |
| `stage5_metrics_ADM2.txt` | `run_dir/reports/` | `runners.py::write_rolling_forecast_metrics_files` |
| `rolling_forecast.png` | `run_dir/reports/` | `plots/fns/rolling_forecast.py` (cross-fold) |
| `improvement_heatmap.png` | `run_dir/reports/` | `plots/fns/improvement_heatmap.py` (cross-fold) |
| `information_advantage.png` | `run_dir/reports/` | `plots/fns/information_advantage.py` (cross-fold) |
| `benchmark_grid.png` | `run_dir/reports/` | `plots/fns/benchmark_grid.py` (cross-fold) |
| `scatter.png` | `run_dir/reports/` | `plots/fns/scatter.py` (cross-fold) |
| `detrended_scatter.png` | `run_dir/reports/` | `plots/fns/detrended_scatter.py` (cross-fold) |
| `pdp_pc_space.png` | `run_dir/reports/` | `plots/fns/pdp.py::plot_pdp_pc_space` (cross-fold) |
| `stage7a_delivery_forecast.png` | `run_dir/reports/` | `plots/fns/delivery.py` (cross-fold) |
| `stage7b_delivery_uncertainty.png` | `run_dir/reports/` | `plots/fns/delivery.py` (cross-fold) |
| `stage7c_delivery_reference_comparison.png` | `run_dir/reports/` | `plots/fns/delivery.py` (cross-fold) |
| `stage7d_delivery_weather_correction.png` | `run_dir/reports/` | `plots/fns/delivery.py` (cross-fold) |
| `{fold_label}_detrend_quality.png` | `run_dir/reports/hindcast/` | `plots/fns/detrend.py` (per-fold) |
| `{fold_label}_trend_fit_grid.png` | `run_dir/reports/hindcast/` | `plots/fns/trend_evolution.py` (per-fold) |
| `{fold_label}_trend_fit_detrended.png` | `run_dir/reports/hindcast/` | `plots/fns/trend_evolution.py` (per-fold) |
| `{fold_label}_residual_predictability.png` | `run_dir/reports/hindcast/` | `plots/fns/residual_predictability.py` (per-fold) |
| `{fold_label}_pdp_feature_space_*.png` | `run_dir/reports/hindcast/` | `plots/fns/pdp.py::plot_pdp_feature_space` (per-fold, chunked) |
## Step-by-step

### 1. Entry — `evaluate_experiment` (run_diagnostics.py:12)
`run_hindcast.run()` calls `evaluate_experiment(run_root)` as step 10, after POSTPROCESS. The function can also be invoked independently via the CLI for incremental re-runs. A minimal invocation sketch follows.
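The sketch below shows the two entry routes under stated assumptions: it assumes `evaluate_experiment(run_dir, skip_plots=...)` as the public signature and a flat `run_diagnostics` module path; the run directory is a placeholder.

```python
# Sketch only: signature, module path, and run path are assumptions.
from pathlib import Path

from run_diagnostics import evaluate_experiment

run_root = Path("runs/2024_maize_hindcast")  # placeholder run_dir

# Route 1: what run_hindcast.run() does as step 10, after POSTPROCESS.
evaluate_experiment(run_root)

# Route 2: incremental re-run that skips the matplotlib import chain.
evaluate_experiment(run_root, skip_plots=True)
```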
### 2. compute_metrics (metrics.py:239)

Iterates all hindcast_slices (one per OOS fold). For each fold (see the sketch after this list):

- Area-weights `sim_yield_kg_ha` to national via `aggregate_weighted_frame`.
- Loads the fold's persisted `AbstractBiasCorrector` and calls `apply_national`.
- Computes per-spec MAE against each `ReferenceYieldLoader` (WASDE, CONAB, etc.).
- Computes county-level MAE/RMSE vs NASS prod/area and survey yield.
- All values remain in kg/ha here; conversion is deferred.

Returns `list[dict[str, float]]`; also persists `metrics/raw_fold_metrics.json`.
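A hedged sketch of the scoring path, assuming `aggregate_weighted_frame` is roughly an area-weighted groupby; the `area_ha` column, `init_date` key, and both signatures are illustrative stand-ins, not the real schema:

```python
# Illustrative stand-ins for the per-fold scoring path; column names and
# helper signatures are assumptions.
import pandas as pd

def aggregate_weighted_frame(preds: pd.DataFrame) -> pd.Series:
    """Area-weight county sim_yield_kg_ha up to one national value per init_date."""
    grouped = preds.assign(_w=preds["sim_yield_kg_ha"] * preds["area_ha"]).groupby("init_date")
    return grouped["_w"].sum() / grouped["area_ha"].sum()

def national_mae_kg_ha(preds: pd.DataFrame, reference: pd.Series) -> float:
    """MAE of the national series against one reference loader, still in kg/ha."""
    joined = pd.concat(
        {"pred": aggregate_weighted_frame(preds), "ref": reference}, axis=1
    ).dropna()  # NaN references (no release yet) drop out of the average
    return float((joined["pred"] - joined["ref"]).abs().mean())
```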
### 3. write_metrics_artefacts (metrics.py:407)

Only executed when `skip_plots=False` (the default). For each fold:

- Re-calls `gen_metrics` for the raw kg/ha dict.
- Calls `add_rolling_forecast_metrics_for_reporting` (runners.py:148), which populates `rolling_forecast_data`, `rolling_forecast_adm1_oos`, and `rolling_forecast_adm2_oos` in the dict.

Then:

- Calls `write_rolling_forecast_metrics_files` → `stage5_metrics*.txt` (ISO weeks 19+, bu/ac).
- Calls `_convert_metrics_to_bu_acre` in place (renames `_kg_ha` keys to `_bu_ac`); see the sketch after this list.
- Calls `_write_metrics_table_csv` → `metrics_table.csv` with per-fold rows plus a trailing `mean` row.
- Calls `_log_metrics_table_to_mlflow` when an active MLflow run exists.
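A minimal sketch of the rename-and-rescale that `_convert_metrics_to_bu_acre` is described as performing; the constant assumes maize (56 lb/bu, so 25.401 kg per bushel over 0.40469 ha per acre), and the real helper may key the factor by commodity:

```python
# Sketch of the in-place *_kg_ha -> *_bu_ac conversion, assuming a maize
# factor; soy (60 lb/bu) would use ~67.25 instead.
KG_HA_PER_BU_AC = 62.77  # 25.401 kg/bu / 0.40469 ha/ac for maize

def convert_metrics_to_bu_acre(fold_metrics: dict[str, float]) -> None:
    """Rename every *_kg_ha key to *_bu_ac and rescale its value in place."""
    for key in [k for k in fold_metrics if k.endswith("_kg_ha")]:
        fold_metrics[key.removesuffix("_kg_ha") + "_bu_ac"] = (
            fold_metrics.pop(key) / KG_HA_PER_BU_AC
        )
```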
### 4. generate_plots (plots/__init__.py:21)

Constructs `PlotRunner(result, output_dir)` and calls `runner.run()`.

The registry (registry.py:77) defines twelve PlotGroup instances in two scopes:

Cross-fold groups (run once on the full `ExperimentResult`): rolling_forecast, improvement_heatmap, information_advantage, benchmark_grid, scatter, detrended_scatter, pdp_pc_space, delivery (four specs).

Per-fold groups (run once per `HindcastSlice`): detrend, trend_evolution, residual_predictability, pdp.

For each group, `runner.run()` calls `group.prepare_data(result[, fold])` once, then `_run_spec(spec, df, ...)` per spec. Each spec calls the pure plot function and passes the returned `Figure` to `_save_png`. Local saves are atomic (`.tmp.png` → rename); a sketch follows. Cloud saves use `CloudPath.write_bytes`.
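The atomic local save can be pictured as below; `_save_png`'s actual signature is not documented in this section, so the function name and arguments here are assumptions:

```python
# Minimal sketch of the atomic .tmp.png -> rename local save; cloud
# destinations instead go through CloudPath.write_bytes as noted above.
from pathlib import Path

from matplotlib.figure import Figure

def save_png_atomic(fig: Figure, dest: Path) -> None:
    """Write to a sibling .tmp.png, then rename so readers never see a partial file."""
    tmp = dest.with_suffix(".tmp.png")
    fig.savefig(tmp, format="png")
    tmp.replace(dest)  # atomic on POSIX when both paths share a filesystem
```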
## Mermaid Flow

```mermaid
flowchart TD
    A["evaluate_experiment(run_dir)\nrun_diagnostics.py:12"]
    B["compute_metrics(run_dir)\nmetrics.py:239"]
    C["write_metrics_artefacts(run_dir)\nmetrics.py:407"]
    D["gen_metrics per fold\nmetrics.py:109"]
    E["add_rolling_forecast_metrics_for_reporting\nrunners.py:148"]
    F["write_rolling_forecast_metrics_files\nrunners.py:676\n→ stage5_metrics*.txt"]
    G["_convert_metrics_to_bu_acre\n→ metrics_table.csv + MLflow"]
    H["generate_plots(run_dir)\nplots/__init__.py:21"]
    I["PlotRunner.run()\nget_plot_registry()"]
    J["cross_fold groups\nprepare_data(result)\n→ reports/*.png"]
    K["per_fold groups\nprepare_data(result, fold)\n→ reports/hindcast/*.png"]
    A --> B
    A --> C
    B --> D
    C --> D
    C --> E
    E --> F
    C --> G
    A --> H
    H --> I
    I --> J
    I --> K
```
## Invariants

- All scoring in `gen_metrics` uses kg/ha throughout; bu/ac conversion happens only in `_convert_metrics_to_bu_acre` at the `metrics_table.csv` boundary (metrics.py:336–354).
- `_METRICS_MIN_WEEK = 19` (runners.py:208) filters ISO weeks 1–18 from the text reports, excluding pre-WASDE rows from rolling metrics; see the sketch after this list.
- NaN reference values (no release before `init_date`) are kept in the rolling DataFrame for plot completeness but excluded from numeric error averages.
- Plot functions in `fns/` are pure: they accept a DataFrame and typed kwargs and return a `Figure`. No disk access, no config loading inside a plot function.
- `skip_plots=True` allows the CLI to compute metrics without the heavy matplotlib import chain.
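The week-filter invariant could be expressed roughly as follows; the `init_date` column name is an assumption about the rolling DataFrame's schema:

```python
# Illustration of the _METRICS_MIN_WEEK invariant, not the module's code.
import pandas as pd

_METRICS_MIN_WEEK = 19  # runners.py:208

def drop_pre_wasde_weeks(rolling: pd.DataFrame) -> pd.DataFrame:
    """Exclude ISO weeks 1-18 before computing the stage5 text-report metrics."""
    iso_week = pd.to_datetime(rolling["init_date"]).dt.isocalendar().week
    return rolling[iso_week >= _METRICS_MIN_WEEK]
```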
## Failure Modes

- `FileNotFoundError` on `bias_corrector.pkl`: POSTPROCESS did not complete or was interrupted. Re-run `cli run postprocess` (a pre-flight sketch follows this list).
- Empty `hindcast_slices`: no walk-forward CV folds exist (only a `fit_production` run_dir). `compute_metrics` returns an empty list; `write_metrics_artefacts` is a no-op.
- Cloud PNG write failure: `CloudPath.write_bytes` requires network access. On dev EC2 with `$INPUT_DATA_DIR` pointing at S3, plots will fail if credentials are not configured.
- `prep/pdp.py` calling `.predict()`: this is the only prep module that re-invokes a model method (on a static grid, not live data). If the regressor's `predict` signature changes, PDPs fail at plot time, not at fit time.
- MLflow DB locking (tracked in project MEMORY.md): concurrent evaluate calls for the same commodity can cause `OperationalError`. Run evaluate sequentially per commodity.
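A hypothetical pre-flight guard covering the first two failure modes; this is not project code, and the path layout simply mirrors the Inputs table:

```python
# Hypothetical guard: fail fast on missing POSTPROCESS artefacts, no-op on a
# fold-less run_dir. All names are placeholders.
from pathlib import Path

def preflight(run_dir: Path, experiment_key: str, fold_labels: list[str]) -> bool:
    """Return False for a fold-less run_dir; raise if POSTPROCESS is incomplete."""
    if not fold_labels:  # fit_production run_dir: nothing to score
        return False
    for fold in fold_labels:
        pkl = run_dir / "models" / experiment_key / fold / "bias_corrector.pkl"
        if not pkl.exists():
            raise FileNotFoundError(f"{pkl} missing; re-run `cli run postprocess`")
    return True
```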
## Cross-references

- stages.md — `run_diagnostics.py` function signatures
- diagnostics.md — full module-level detail for `metrics.py` and `runners.py`
- plots.md — full plot inventory, `PlotSpec`/`PlotGroup` registry
- deliver.md — downstream stage that consumes the same `walk_forward_preds.parquet`
- forecast.md — forecast pipeline, which runs its own postprocess but reuses metrics concepts
## PRs

No dedicated PR touched the evaluate pipeline in the tracked window. The plot subsystem was introduced progressively; PR #340 added window-aware MAPE (dashboard side only). See dashboard.md for window-aware scoring context.