Source: diagnostics (metrics + reports)

Overview

The diagnostics/ package is a pure consumer of run-directory artefacts. It never re-invokes models, never re-fits detrenders, and never re-imputes data. Its sole job is to load saved predictions from disk and produce human-readable reports plus JSON metric caches.

The subsystem is split into two modules (plots are covered separately):

Module Lines Responsibility
diagnostics/metrics.py ~464 Per-fold benchmark scoring, metric conversion, metrics_table.csv, MLflow logging
diagnostics/runners.py ~697 Rolling national vs reference text reports; ADM1/ADM2 OOS error tables
diagnostics/__init__.py ~55 Package docstring only — no public re-exports

All outputs land under <run_dir>/reports/.

Modules

diagnostics/metrics.py

Public functions

gen_metrics(test_data, cfg, *, fold, nass_obs=None) → dict[str, float] (metrics.py:109)

Core per-fold scorer. Called once per hindcast fold. Accepts one fold's test rows (post inverse-transform) and returns a flat dict of error scalars. All internal comparisons are in kg/ha; bu/ac conversion is deferred to _convert_metrics_to_bu_acre at the metrics-table boundary.

Steps performed:

  1. Area-weighted national aggregation of sim_yield_kg_ha → sim_nat_kg_ha.
  2. Load the fold's persisted AbstractBiasCorrector via AbstractBiasCorrector.load(fold.bias_corrector_path) and call apply_national(sim_nat_kg_ha) → sim_nat_adj_kg_ha (metrics.py:156–157).
  3. For each ReferenceYieldLoader in build_loaders(cfg): fetch yield_final(year) (kg/ha), compute |ref − sim_nat_adj|, store as f"{spec.name}_mae" and f"{spec.name}_final_kg_ha" (metrics.py:165–177).
  4. NASS national prod/area MAE: |nass_national_prod_div_area_kg_ha − sim_nat_adj| (metrics.py:181–189).
  5. NASS county prod/area MAE + RMSE: county-level join on geo_identifier (metrics.py:193–204).
  6. NASS county survey yield MAE + RMSE: same join pattern using target_col (metrics.py:207–217).
  7. NASS national survey yield MAE: area-weighted scalar from nass_national_survey_yield_area_weighted_kg_ha (metrics.py:220–230).
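
Steps 1 and 3 above can be sketched as follows — a minimal illustration with assumed column names (`sim_yield_kg_ha`, `area`), omitting the bias-correction load of step 2:

```python
import numpy as np
import pandas as pd

def national_area_weighted_yield(test_data: pd.DataFrame,
                                 yield_col: str = "sim_yield_kg_ha",
                                 area_col: str = "area") -> float:
    """Step 1: area-weighted national mean yield in kg/ha (column names assumed)."""
    valid = test_data[np.isfinite(test_data[yield_col]) & (test_data[area_col] > 0)]
    return float(np.average(valid[yield_col], weights=valid[area_col]))

# Toy fold with two counties: national = (7000*1 + 9000*3) / 4 = 8500 kg/ha
fold = pd.DataFrame({"sim_yield_kg_ha": [7000.0, 9000.0], "area": [1.0, 3.0]})
sim_nat_kg_ha = national_area_weighted_yield(fold)

# Step 3: absolute error against one reference spec's final yield
ref_final_kg_ha = 8600.0
spec_mae = abs(ref_final_kg_ha - sim_nat_kg_ha)  # stored as f"{spec.name}_mae"
```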

Metric keys emitted (all in kg/ha unless labelled _bu_ac_obs):

Key Description
bias_correction_kg_ha AbstractBiasCorrector.bias_kg_ha scalar
{spec.name}_mae |ref_final_kg_ha − sim_nat_adj_kg_ha| per reference spec
{spec.name}_final_kg_ha Reference final yield (renamed to _bu_ac after conversion)
nass_national_prod_div_area_mae National prod/area absolute error (kg/ha)
nass_national_prod_div_area_bu_ac_obs NASS national prod/area obs converted to bu/ac
nass_county_prod_div_area_mae County-level prod/area MAE (kg/ha)
nass_county_prod_div_area_rmse County-level prod/area RMSE (kg/ha)
nass_county_survey_yield_mae County-level survey yield MAE (kg/ha)
nass_county_survey_yield_rmse County-level survey yield RMSE (kg/ha)
nass_national_survey_yield_mae National survey yield absolute error (kg/ha)
nass_national_survey_bu_ac_obs NASS national survey obs in bu/ac
nass_national_survey_yield_area_weighted_bu_ac_obs Alias for above

compute_metrics(run_dir, *, postprocessed=False) → list[dict[str, float]] (metrics.py:239)

Orchestrator: loads ExperimentResult from disk, iterates over all hindcast_slices, calls gen_metrics per fold, appends a "label" key (= fold.fold_label), and persists the list to <run_dir>/metrics/raw_fold_metrics.json (or postprocessed_fold_metrics.json when postprocessed=True). Does not trigger bias re-fitting; reads the already-persisted corrector (metrics.py:256–288).

load_metrics(run_dir, *, postprocessed=False) → list[dict[str, float]] (metrics.py:292)

Read-only companion: deserialises the JSON written by compute_metrics. Raises FileNotFoundError if the JSON is absent.
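
A minimal sketch of the read path — the file names follow the documented layout, everything else is illustrative:

```python
import json
from pathlib import Path

def load_fold_metrics(run_dir: Path, *, postprocessed: bool = False) -> list[dict]:
    """Deserialise the per-fold metrics JSON written by compute_metrics;
    raise FileNotFoundError when it has not been computed yet."""
    name = "postprocessed_fold_metrics.json" if postprocessed else "raw_fold_metrics.json"
    path = run_dir / "metrics" / name
    if not path.exists():
        raise FileNotFoundError(f"metrics JSON not found: {path}")
    return json.loads(path.read_text())
```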

write_metrics_artefacts(run_dir) → None (metrics.py:407)

High-level orchestrator called by stages/run_diagnostics.py:45. For each fold:

  1. Calls gen_metrics (raw kg/ha).
  2. Calls add_rolling_forecast_metrics_for_reporting (from runners.py) to attach rolling_forecast_data, rolling_forecast_adm1_oos, rolling_forecast_adm2_oos to the fold's dict.

Then:

  1. Calls write_rolling_forecast_metrics_files (raw, pre-conversion) → stage5_metrics.txt, stage5_metrics_ADM1.txt, stage5_metrics_ADM2.txt.
  2. Calls _convert_metrics_to_bu_acre in place.
  3. Calls _write_metrics_table_csv → metrics_table.csv with per-fold rows + a trailing mean row.
  4. Calls _log_metrics_table_to_mlflow when an MLflow run is active.

Unit conversion boundary (metrics.py:336–354)

_convert_metrics_to_bu_acre iterates over every float in each fold's metric dict. Keys ending in _kg_ha are renamed to _bu_ac and converted using kg_ha_to_bu_acre(v, bw) from lib/unit_utils.py:33. Keys already in bu/ac (listed in _METRIC_KEYS_ALREADY_BU_ACRE, metrics.py:310) are skipped to avoid double-conversion. The bushel_weight_lbs constant comes from ExperimentConfig.commodity.bushel_weight_lbs.
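
The rename-and-convert pass can be sketched like this — a standalone illustration using the documented constants; the skip-set contents are assumptions:

```python
HA_PER_ACRE = 0.404686
KG_PER_LB = 0.453592

# Assumed contents — the real list lives in _METRIC_KEYS_ALREADY_BU_ACRE
ALREADY_BU_ACRE = {"nass_national_prod_div_area_bu_ac_obs",
                   "nass_national_survey_bu_ac_obs"}

def kg_ha_to_bu_acre(v: float, bushel_weight_lbs: float) -> float:
    return v * HA_PER_ACRE / (bushel_weight_lbs * KG_PER_LB)

def convert_fold_to_bu_acre(metrics: dict[str, float],
                            bushel_weight_lbs: float) -> dict[str, float]:
    """Rename *_kg_ha keys to *_bu_ac and convert; skip keys already in bu/ac."""
    out: dict[str, float] = {}
    for key, val in metrics.items():
        if key not in ALREADY_BU_ACRE and key.endswith("_kg_ha"):
            out[key[: -len("_kg_ha")] + "_bu_ac"] = kg_ha_to_bu_acre(val, bushel_weight_lbs)
        else:
            out[key] = val  # already bu/ac (or unitless): copy untouched
    return out

# 56 lb/bu (corn): 1000 kg/ha is roughly 15.93 bu/ac
converted = convert_fold_to_bu_acre({"wasde_final_kg_ha": 1000.0}, bushel_weight_lbs=56.0)
```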

MLflow logging (metrics.py:357–380)

_log_metrics_table_to_mlflow logs every cell of metrics_table.csv as a flat {row_label}/{col_name}: float MLflow metric when has_active_run() returns True. Reference-level columns ({spec.name}_final_bu_ac, NASS obs constants, bias_correction_bu_ac) are excluded via _mlflow_excluded_metric_columns (metrics.py:318), because they are reference levels not error metrics.

Helper: _county (metrics.py:65)

NaN-safe MAE + RMSE over county distributions. Filters to np.isfinite(obs) & np.isfinite(sim) before calling sklearn.metrics.mean_absolute_error / root_mean_squared_error.
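
An equivalent numpy-only sketch of that behaviour (the real helper delegates to the sklearn functions):

```python
import numpy as np

def county_errors(obs, sim):
    """NaN-safe MAE + RMSE over paired county values: rows where either side
    is non-finite are dropped before scoring."""
    obs = np.asarray(obs, dtype=float)
    sim = np.asarray(sim, dtype=float)
    mask = np.isfinite(obs) & np.isfinite(sim)
    err = obs[mask] - sim[mask]
    return float(np.mean(np.abs(err))), float(np.sqrt(np.mean(err ** 2)))

mae, rmse = county_errors([1.0, 2.0, np.nan], [2.0, 2.0, 5.0])  # NaN row ignored
```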

diagnostics/runners.py

Follows the DESIGN.md §71 rule: "reporting functions SHALL FOLLOW READ-then-PLOT — never predict()." All functions load walk_forward_preds.parquet via HindcastSlice.load_walk_forward_preds().

Constants

METRICS_KEY_ADM1_OOS = "rolling_forecast_adm1_oos"   # runners.py:40
METRICS_KEY_ADM2_OOS = "rolling_forecast_adm2_oos"   # runners.py:41
ARTIFACT_NATIONAL    = "stage5_metrics.txt"           # runners.py:43
ARTIFACT_ADM1        = "stage5_metrics_ADM1.txt"      # runners.py:44
ARTIFACT_ADM2        = "stage5_metrics_ADM2.txt"      # runners.py:45
_METRICS_MIN_WEEK    = 19                             # runners.py:208

_METRICS_MIN_WEEK = 19 filters out ISO weeks 1–18, which precede the first WASDE estimate of the crop year (~May). Rows with NaN reference-as-of values are kept in the raw data frame (so plots remain complete for cross-year crops like wheat) but excluded from the metrics tables.
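
The keep-for-plots / drop-for-metrics split described above might look like this — a hedged sketch with assumed column names (`init_date` plus one as-of reference column):

```python
import numpy as np
import pandas as pd

_METRICS_MIN_WEEK = 19  # weeks 1–18 precede the first WASDE of the crop year

def metrics_view(df: pd.DataFrame, ref_col: str) -> pd.DataFrame:
    """Rows used for the metrics tables: ISO week >= 19 and a non-NaN as-of
    reference. The unfiltered frame still feeds the plots."""
    weeks = pd.to_datetime(df["init_date"]).dt.isocalendar().week
    return df[(weeks >= _METRICS_MIN_WEEK) & df[ref_col].notna()]

df = pd.DataFrame({
    "init_date": ["2023-04-01", "2023-06-01", "2023-06-08"],  # ISO weeks 13, 22, 23
    "wasde_asof": [1.0, np.nan, 2.0],
})
kept = metrics_view(df, "wasde_asof")  # only the 2023-06-08 row survives
```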

aggregate_rolling_forecast_data_from_experiment (runners.py:55)

Loads the fold's wide rolling predictions, filters to scoreable rows (obs present, area > 0, _scoreable_yield_rows at runners.py:48), area-weights to national per init_date, and adds one f"{spec.name}_yield_kg_ha_asof_before_init" column per reference spec. Reference values are the latest release strictly before each init_date (see yield_asof_array_from_releases, runners.py:114). Returns a DataFrame indexed by init_date. Reference columns may carry NaN when no release exists yet.

Consumer responsibility (runners.py:76–77): any scorer that compares Treefera vs a reference MUST skip NaN reference rows explicitly.
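
A self-contained sketch of that as-of semantics (function name and array layout are assumptions, not the real signature):

```python
import numpy as np

def yield_asof(release_dates, release_values, init_dates):
    """Latest release value strictly before each init_date; NaN where no
    release exists yet — consumers must skip those rows explicitly."""
    order = np.argsort(release_dates)
    dates = np.asarray(release_dates)[order]
    values = np.asarray(release_values, dtype=float)[order]
    # side="left": a release dated exactly on the init_date does NOT count
    idx = np.searchsorted(dates, init_dates, side="left") - 1
    return np.where(idx >= 0, values[np.clip(idx, 0, None)], np.nan)

releases = np.array(["2023-05-10", "2023-06-10"], dtype="datetime64[D]")
values = [100.0, 110.0]
inits = np.array(["2023-05-01", "2023-05-10", "2023-07-01"], dtype="datetime64[D]")
asof = yield_asof(releases, values, inits)  # [nan, nan, 110.0]
```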

add_rolling_forecast_metrics_for_reporting (runners.py:148)

Populates three keys on the passed metrics dict in place:

  • rolling_forecast_data — bias-corrected national rolling DataFrame (applies AbstractBiasCorrector.apply_frame to area_weighted_mean_forecast_yield).
  • rolling_forecast_adm1_oos — ADM1 per-init OOS table.
  • rolling_forecast_adm2_oos — ADM2 per-init OOS table.

ADM table generation is wrapped in a try/except; failures are logged as warnings rather than propagated (runners.py:164).

build_rolling_forecast_metrics_national_txt (runners.py:219)

Produces stage5_metrics.txt. One row per ISO calendar week (w19–w53). For each week and each OOS year, all init_date rows falling in that week are averaged to a single error value; those per-year errors are then averaged across OOS years. Columns (all in bu/ac):

  • Model vs_<ACTUALS> — mean |fore_bu − nass_bu|
  • Model vs_<REF> — mean |fore_bu − ref_final_bu|
  • <REF> vs_<ACTUALS> — mean |ref_bu − nass_bu|
  • <REF> vs_<REF> — mean |ref_bu − ref_final_bu|
  • Improv% — (REF_vs_NASS − Model_vs_NASS) / REF_vs_NASS × 100
  • Win — count of OOS years where weekly-mean |Model−NASS| < |REF−NASS|, denominator = total OOS years with valid NASS data (constant across rows).

<ACTUALS> = config.commodity.actuals_source_short; <REF> = first cfg.reference_data spec name (e.g. wasde, conab_final).
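
The last two columns reduce to simple arithmetic; a sketch under the documented definitions (function names hypothetical):

```python
def improvement_pct(model_vs_nass: float, ref_vs_nass: float) -> float:
    """Improv% column: positive when the model's weekly error beats the reference's."""
    return (ref_vs_nass - model_vs_nass) / ref_vs_nass * 100.0

def win_tally(model_err_by_year: dict[int, float],
              ref_err_by_year: dict[int, float]) -> tuple[int, int]:
    """Win column: count of OOS years where weekly-mean |Model−NASS| beats
    |REF−NASS|, over the years with valid NASS data (the denominator)."""
    years = sorted(set(model_err_by_year) & set(ref_err_by_year))
    wins = sum(model_err_by_year[y] < ref_err_by_year[y] for y in years)
    return wins, len(years)
```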

compute_rolling_forecast_adm_oos_tables (runners.py:414)

Returns a tuple (adm1_df, adm2_df). For each init_date in the fold:

  • ADM2 — raw county-level errors, bu/ac. Columns: fold (ISO week label), MAE, MdAE, RMSE, %Err, N (counties × years).
  • ADM1 — state-level area-weighted aggregation via aggregate_weighted_frame(..., level="ADM1"), then same error columns. N = states × years.

Uses kg_ha_to_bu_acre_array from lib/unit_utils.py:64 for vectorised conversion.

write_rolling_forecast_metrics_files (runners.py:676)

Writes all three text artefacts under output_dir using AnyPath (S3-safe). Returns a tuple of three paths. Calls build_rolling_forecast_metrics_national_txt, build_rolling_forecast_metrics_adm1_txt, build_rolling_forecast_metrics_adm2_txt in sequence.

build_rolling_forecast_metrics_adm1_txt / adm2_txt (runners.py:626, runners.py:651)

Format wrappers. Aggregate per-fold ADM tables across OOS years via _agg_adm_across_years (mean of MAE/MdAE/RMSE/%Err; sum of N), then call _format_adm_txt to produce a fixed-width text table with an OVERALL row computed by _adm_overall_row (N-weighted averages, runners.py:543).
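
The N-weighted OVERALL row reduces to a weighted average per error column plus a summed N; a minimal sketch using the documented column names:

```python
import numpy as np
import pandas as pd

def overall_row(adm: pd.DataFrame) -> dict[str, float]:
    """OVERALL row: error columns averaged with N as weights, N itself summed."""
    w = adm["N"].to_numpy(dtype=float)
    row = {c: float(np.average(adm[c], weights=w)) for c in ("MAE", "MdAE", "RMSE", "%Err")}
    row["N"] = int(adm["N"].sum())
    return row

adm = pd.DataFrame({"MAE": [2.0, 4.0], "MdAE": [1.0, 3.0],
                    "RMSE": [3.0, 5.0], "%Err": [10.0, 20.0], "N": [1, 3]})
overall = overall_row(adm)  # MAE = (2*1 + 4*3) / 4 = 3.5, N = 4
```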

diagnostics/__init__.py

Module-level docstring only (__init__.py:1–8). No public symbols are exported; callers import directly from metrics or runners.

Output artefacts

All outputs written under <run_dir>/reports/:

File Writer Content
metrics_table.csv _write_metrics_table_csv Per-fold rows (bu/ac) + trailing mean row
stage5_metrics.txt write_rolling_forecast_metrics_files National rolling MAE table (bu/ac, w19+)
stage5_metrics_ADM1.txt same State-level OOS MAE/MdAE/RMSE/%Err
stage5_metrics_ADM2.txt same County-level OOS MAE/MdAE/RMSE/%Err

Intermediate JSON is written to <run_dir>/metrics/raw_fold_metrics.json (or postprocessed_fold_metrics.json) by compute_metrics.

Unit conversion: lib/unit_utils.py

Scalar formula (unit_utils.py:33):

bu_ac = kg_ha × HA_PER_ACRE / (bushel_weight_lbs × KG_PER_LB)

Constants: HA_PER_ACRE = 0.404686, KG_PER_LB = 0.453592 (unit_utils.py:24–25). The module note acknowledges a ~0.00017 % systematic bias vs QUBE's higher-precision constants.

Vectorised variants (unit_utils.py:64, unit_utils.py:78): kg_ha_to_bu_acre_array (numpy) and kg_ha_to_bu_acre_series (pandas) apply the same scale factor. diagnostics/runners.py uses the array variant throughout ADM table construction.

Stage integration

stages/run_diagnostics.py:evaluate_experiment is the primary entry point called by the CLI. It delegates:

  1. compute_metrics(run_dir) — always.
  2. write_metrics_artefacts(run_dir) + generate_plots(run_dir) — unless skip_plots=True.

enrich_fold_metrics_for_reporting (run_diagnostics.py:51) is a thin wrapper that iterates paired (metrics_dict, reporting_context) tuples and calls add_rolling_forecast_metrics_for_reporting per fold.

Cross-references

  • lib/unit_utils.py — bu/ac conversion constants and vectorised helpers
  • diagnostics/plots/ — separate actor; consumes rolling_forecast_data key produced here
  • delivery/conversions.py — also imports aggregate_rolling_forecast_data_from_experiment (runners.py:23)
  • lib/results/results_slice.py — HindcastSlice / AbstractSlice fold handles
  • models/meta_models/bias_correction.py — AbstractBiasCorrector.load + apply_national / apply_frame
  • lib/reference_data/loader.py — build_loaders, build_references_by_harvest_year, ReferenceYieldLoader

Relationships

stages/run_diagnostics.py
  └─ evaluate_experiment()
       ├─ diagnostics/metrics.py :: compute_metrics()
       └─ diagnostics/metrics.py :: write_metrics_artefacts()
            ├─ gen_metrics()  [per fold, kg/ha]
            ├─ diagnostics/runners.py :: add_rolling_forecast_metrics_for_reporting()
            │    ├─ aggregate_rolling_forecast_data_from_experiment()
            │    └─ compute_rolling_forecast_adm_oos_tables()
            ├─ runners :: write_rolling_forecast_metrics_files()
            │    → reports/stage5_metrics*.txt
            ├─ _convert_metrics_to_bu_acre()  [lib/unit_utils.py]
            └─ _write_metrics_table_csv()
                 ├─ → reports/metrics_table.csv
                 └─ _log_metrics_table_to_mlflow()