Pipeline: Evaluate

Purpose

The EVALUATE pipeline is the read-only audit layer that runs immediately after POSTPROCESS in a full hindcast. It accepts a completed run_dir (FIT + PREDICT + POSTPROCESS artefacts present), scores each out-of-sample fold against NASS and WASDE/CONAB benchmarks, writes text and CSV reports, and generates 15 diagnostic PNG files. No models are re-fitted and no predictions are recomputed.

Inputs

| Artefact | Path | Consumer |
| --- | --- | --- |
| Per-fold walk_forward_preds.parquet | preds/{experiment_key}/{fold_label}/ | metrics.py, runners.py, all plot prep modules |
| Per-fold bias_corrector.pkl | models/{experiment_key}/{fold_label}/ | metrics.py::gen_metrics |
| postprocessed/national.parquet | run_dir/postprocessed/ | prep/delivery.py for stage-7 plots |
| delivery/Treefera_*_ADM0_Hindcast_*.csv | run_dir/delivery/ | prep/delivery.py |
| conformal/{mode}.parquet | run_dir/conformal/ | prep/delivery.py |
| Fitted detrender (detrender.pkl) | models/{experiment_key}/{fold_label}/ | prep/detrend.py, prep/trend_evolution.py |

Outputs

| Artefact | Path | Producer |
| --- | --- | --- |
| metrics/raw_fold_metrics.json | run_dir/metrics/ | metrics.py::compute_metrics |
| metrics_table.csv | run_dir/reports/ | metrics.py::_write_metrics_table_csv |
| stage5_metrics.txt | run_dir/reports/ | runners.py::write_rolling_forecast_metrics_files |
| stage5_metrics_ADM1.txt | run_dir/reports/ | runners.py::write_rolling_forecast_metrics_files |
| stage5_metrics_ADM2.txt | run_dir/reports/ | runners.py::write_rolling_forecast_metrics_files |
| rolling_forecast.png | run_dir/reports/ | plots/fns/rolling_forecast.py (cross-fold) |
| improvement_heatmap.png | run_dir/reports/ | plots/fns/improvement_heatmap.py (cross-fold) |
| information_advantage.png | run_dir/reports/ | plots/fns/information_advantage.py (cross-fold) |
| benchmark_grid.png | run_dir/reports/ | plots/fns/benchmark_grid.py (cross-fold) |
| scatter.png | run_dir/reports/ | plots/fns/scatter.py (cross-fold) |
| detrended_scatter.png | run_dir/reports/ | plots/fns/detrended_scatter.py (cross-fold) |
| pdp_pc_space.png | run_dir/reports/ | plots/fns/pdp.py::plot_pdp_pc_space (cross-fold) |
| stage7a_delivery_forecast.png | run_dir/reports/ | plots/fns/delivery.py (cross-fold) |
| stage7b_delivery_uncertainty.png | run_dir/reports/ | plots/fns/delivery.py (cross-fold) |
| stage7c_delivery_reference_comparison.png | run_dir/reports/ | plots/fns/delivery.py (cross-fold) |
| stage7d_delivery_weather_correction.png | run_dir/reports/ | plots/fns/delivery.py (cross-fold) |
| {fold_label}_detrend_quality.png | run_dir/reports/hindcast/ | plots/fns/detrend.py (per-fold) |
| {fold_label}_trend_fit_grid.png | run_dir/reports/hindcast/ | plots/fns/trend_evolution.py (per-fold) |
| {fold_label}_trend_fit_detrended.png | run_dir/reports/hindcast/ | plots/fns/trend_evolution.py (per-fold) |
| {fold_label}_residual_predictability.png | run_dir/reports/hindcast/ | plots/fns/residual_predictability.py (per-fold) |
| {fold_label}_pdp_feature_space_*.png | run_dir/reports/hindcast/ | plots/fns/pdp.py::plot_pdp_feature_space (per-fold, chunked) |

Step-by-step

1. Entry — evaluate_experiment (run_diagnostics.py:12)

run_hindcast.run() calls evaluate_experiment(run_root) as step 10 after POSTPROCESS. The function can also be invoked independently via the CLI for incremental re-runs.

2. compute_metrics (metrics.py:239)

Iterates over all hindcast_slices (one per OOS fold). For each fold:

  • Area-weights sim_yield_kg_ha to national via aggregate_weighted_frame.
  • Loads the fold's persisted AbstractBiasCorrector and calls apply_national.
  • Computes per-spec MAE against each ReferenceYieldLoader (WASDE, CONAB, etc.).
  • Computes county-level MAE/RMSE vs NASS prod/area and survey yield.
  • All values remain in kg/ha here; conversion is deferred.

Returns list[dict[str, float]]; also persists metrics/raw_fold_metrics.json.
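The area-weighting and error-scoring steps above can be sketched as follows. This is a minimal illustration, not the real implementation: the column name harvested_area_ha and the helper names are assumptions, and aggregate_weighted_frame's actual signature may differ.

```python
import pandas as pd

def aggregate_weighted(df: pd.DataFrame,
                       value_col: str = "sim_yield_kg_ha",
                       weight_col: str = "harvested_area_ha") -> float:
    # National yield as the harvested-area-weighted mean of county yields
    return float((df[value_col] * df[weight_col]).sum() / df[weight_col].sum())

def mae(pred: pd.Series, ref: pd.Series) -> float:
    # Mean absolute error; rows with a missing reference value are dropped
    diff = (pred - ref).dropna()
    return float(diff.abs().mean())
```

All values stay in kg/ha here, matching the invariant that unit conversion is deferred to the reporting boundary.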

3. write_metrics_artefacts (metrics.py:407)

Only executed when skip_plots=False (the default). For each fold:

  • Re-calls gen_metrics for the raw kg/ha dict.
  • Calls add_rolling_forecast_metrics_for_reporting (runners.py:148) which populates rolling_forecast_data, rolling_forecast_adm1_oos, rolling_forecast_adm2_oos in the dict.

Then:

  • Calls write_rolling_forecast_metrics_files → stage5_metrics*.txt (ISO weeks 19+, bu/ac).
  • Calls _convert_metrics_to_bu_acre in-place (renames _kg_ha keys to _bu_ac).
  • Calls _write_metrics_table_csv → metrics_table.csv with per-fold rows plus a trailing mean row.
  • Calls _log_metrics_table_to_mlflow when an active MLflow run exists.
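The key-renaming conversion can be sketched like this (a hedged illustration, assuming corn's 56 lb/bu test weight; the factor in _convert_metrics_to_bu_acre may be commodity-specific):

```python
KG_PER_BUSHEL_CORN = 25.4012  # 56 lb test weight (assumption)
HA_PER_ACRE = 0.404686

def kg_ha_to_bu_ac(value_kg_ha: float) -> float:
    # 62.77 kg/ha is roughly 1 bu/ac for corn
    return value_kg_ha * HA_PER_ACRE / KG_PER_BUSHEL_CORN

def convert_metrics(metrics: dict) -> dict:
    # Rename *_kg_ha keys to *_bu_ac, converting values; pass others through
    out = {}
    for key, value in metrics.items():
        if key.endswith("_kg_ha"):
            out[key[:-len("_kg_ha")] + "_bu_ac"] = kg_ha_to_bu_ac(value)
        else:
            out[key] = value
    return out
```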

4. generate_plots (plots/__init__.py:21)

Constructs PlotRunner(result, output_dir) and calls runner.run().

The registry (registry.py:77) defines eight PlotGroup instances in two scopes:

Cross-fold groups (run once on the full ExperimentResult): rolling_forecast, improvement_heatmap, information_advantage, benchmark_grid, scatter, detrended_scatter, pdp_pc_space, delivery (four specs).

Per-fold groups (run once per HindcastSlice): detrend, trend_evolution, residual_predictability, pdp.
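The two-scope registry shape can be pictured with a hypothetical sketch (field names here are illustrative; the real PlotSpec/PlotGroup definitions in registry.py may differ):

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass(frozen=True)
class PlotSpec:
    filename: str
    plot_fn: Callable            # pure: (df, **kwargs) -> matplotlib Figure

@dataclass(frozen=True)
class PlotGroup:
    name: str
    scope: str                   # "cross_fold" or "per_fold"
    prepare_data: Callable       # (result[, fold]) -> DataFrame
    specs: List[PlotSpec]

# Illustrative cross-fold group with a single spec
scatter_group = PlotGroup(
    name="scatter",
    scope="cross_fold",
    prepare_data=lambda result: result,            # placeholder
    specs=[PlotSpec("scatter.png", lambda df: None)],
)
```

The split matters for cost: cross-fold groups prepare their DataFrame once, while per-fold groups repeat preparation for every HindcastSlice.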

For each group, runner.run() calls group.prepare_data(result[, fold]) once, then _run_spec(spec, df, ...) per spec. Each spec calls the pure plot function and passes the returned Figure to _save_png. Local saves are atomic (write to .tmp.png, then rename). Cloud saves use CloudPath.write_bytes.
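The local atomic-save pattern can be sketched as follows (a minimal sketch; the real _save_png also handles Figure serialisation and cloud destinations):

```python
from pathlib import Path

def save_png_atomic(png_bytes: bytes, dest: Path) -> None:
    # Write the full payload to a temporary sibling first, then rename;
    # a reader listing run_dir/reports/ never observes a partial PNG
    tmp = dest.with_name(dest.name + ".tmp.png")
    tmp.write_bytes(png_bytes)
    tmp.rename(dest)
```

The rename is atomic on POSIX filesystems, which is why the cloud path needs a different mechanism (CloudPath.write_bytes) rather than the same tmp-and-rename dance.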

Mermaid Flow

```mermaid
flowchart TD
    A["evaluate_experiment(run_dir)\nrun_diagnostics.py:12"]
    B["compute_metrics(run_dir)\nmetrics.py:239"]
    C["write_metrics_artefacts(run_dir)\nmetrics.py:407"]
    D["gen_metrics per fold\nmetrics.py:109"]
    E["add_rolling_forecast_metrics_for_reporting\nrunners.py:148"]
    F["write_rolling_forecast_metrics_files\nrunners.py:676\n→ stage5_metrics*.txt"]
    G["_convert_metrics_to_bu_acre\n→ metrics_table.csv + MLflow"]
    H["generate_plots(run_dir)\nplots/__init__.py:21"]
    I["PlotRunner.run()\nget_plot_registry()"]
    J["cross_fold groups\nprepare_data(result)\n→ reports/*.png"]
    K["per_fold groups\nprepare_data(result, fold)\n→ reports/hindcast/*.png"]

    A --> B
    A --> C
    B --> D
    C --> D
    C --> E
    E --> F
    C --> G
    A --> H
    H --> I
    I --> J
    I --> K
```

Invariants

  • All scoring in gen_metrics uses kg/ha throughout; bu/ac conversion happens only in _convert_metrics_to_bu_acre at the metrics_table.csv boundary (metrics.py:336–354).
  • _METRICS_MIN_WEEK = 19 (runners.py:208) filters ISO weeks 1–18 from the text reports, excluding pre-WASDE rows from rolling metrics.
  • NaN reference values (no release before init_date) are kept in the rolling DataFrame for plot completeness but excluded from numeric error averages.
  • Plot functions in fns/ are pure: they accept a DataFrame and typed kwargs and return a Figure. No disk access, no config loading inside a plot function.
  • skip_plots=True allows the CLI to compute metrics without the heavy matplotlib import chain.
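The ISO-week cutoff invariant can be sketched as a standalone predicate (an illustration mirroring _METRICS_MIN_WEEK; the real filter in runners.py operates on the rolling DataFrame, not row by row):

```python
import datetime

METRICS_MIN_WEEK = 19  # mirrors _METRICS_MIN_WEEK in runners.py

def in_rolling_window(init_date: datetime.date) -> bool:
    # ISO weeks 1-18 fall before the first WASDE print and are excluded
    # from the stage5_metrics*.txt reports
    return init_date.isocalendar()[1] >= METRICS_MIN_WEEK
```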

Failure Modes

  • FileNotFoundError on bias_corrector.pkl: POSTPROCESS did not complete or was interrupted. Re-run cli run postprocess.
  • Empty hindcast_slices: no walk-forward CV folds exist (only a fit_production run_dir). compute_metrics returns an empty list; write_metrics_artefacts is a no-op.
  • Cloud PNG write failure: CloudPath.write_bytes requires network access. On dev EC2 with $INPUT_DATA_DIR pointing at S3, plots will fail if credentials are not configured.
  • prep/pdp.py calling .predict(): this is the only prep module that re-invokes a model method (on a static grid, not live data). If the regressor's predict signature changes, PDPs fail at plot time, not at fit time.
  • MLflow DB locking (tracked in project MEMORY.md): concurrent evaluate calls for the same commodity can cause OperationalError. Run evaluate sequentially per commodity.

Cross-references

  • stages.md — run_diagnostics.py function signatures
  • diagnostics.md — full module-level detail for metrics.py and runners.py
  • plots.md — full plot inventory, PlotSpec/PlotGroup registry
  • deliver.md — downstream stage that consumes the same walk_forward_preds.parquet
  • forecast.md — forecast pipeline which runs its own postprocess but reuses metrics concepts

PRs

No dedicated PR for the evaluate pipeline in the tracked window. Plot subsystem introduced progressively; PR #340 added window-aware MAPE (dashboard side only). See dashboard.md for window-aware scoring context.