# Source: diagnostics (metrics + reports)

## Overview
The diagnostics/ package is a pure consumer of run-directory artefacts. It never re-invokes models, never re-fits detrenders, and never re-imputes data. Its sole job is to load saved predictions from disk and produce human-readable reports plus JSON metric caches.
The subsystem is split into two modules (plots are covered separately):
| Module | Lines | Responsibility |
|---|---|---|
| diagnostics/metrics.py | ~464 | Per-fold benchmark scoring, metric conversion, metrics_table.csv, MLflow logging |
| diagnostics/runners.py | ~697 | Rolling national vs reference text reports; ADM1/ADM2 OOS error tables |
| diagnostics/__init__.py | ~55 | Package docstring only; no public re-exports |
All outputs land under <run_dir>/reports/.
## Modules

### diagnostics/metrics.py

#### Public functions
gen_metrics(test_data, cfg, *, fold, nass_obs=None) → dict[str, float] (metrics.py:109)
Core per-fold scorer. Called once per hindcast fold. Accepts one fold's test rows (post inverse-transform) and returns a flat dict of error scalars. All internal comparisons are in kg/ha; bu/ac conversion is deferred to _convert_metrics_to_bu_acre at the metrics-table boundary.
Steps performed:
- Area-weighted national aggregation of `sim_yield_kg_ha` → `sim_nat_kg_ha`.
- Load the fold's persisted `AbstractBiasCorrector` via `AbstractBiasCorrector.load(fold.bias_corrector_path)` and call `apply_national(sim_nat_kg_ha)` → `sim_nat_adj_kg_ha` (metrics.py:156–157).
- For each `ReferenceYieldLoader` in `build_loaders(cfg)`: fetch `yield_final(year)` (kg/ha), compute `|ref − sim_nat_adj|`, store as `f"{spec.name}_mae"` and `f"{spec.name}_final_kg_ha"` (metrics.py:165–177).
- NASS national prod/area MAE: `|nass_national_prod_div_area_kg_ha − sim_nat_adj|` (metrics.py:181–189).
- NASS county prod/area MAE + RMSE: county-level join on `geo_identifier` (metrics.py:193–204).
- NASS county survey yield MAE + RMSE: same join pattern using `target_col` (metrics.py:207–217).
- NASS national survey yield MAE: area-weighted scalar from `nass_national_survey_yield_area_weighted_kg_ha` (metrics.py:220–230).
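The first step (area-weighted national aggregation) can be sketched as follows; the column names come from this doc, but the helper itself is illustrative rather than the exact metrics.py code:

```python
import pandas as pd

def national_area_weighted_yield(test_data: pd.DataFrame) -> float:
    """Collapse county rows into one national kg/ha scalar, weighting
    each county's simulated yield by its area (illustrative sketch)."""
    weights = test_data["area"]
    return float((test_data["sim_yield_kg_ha"] * weights).sum() / weights.sum())

counties = pd.DataFrame({
    "sim_yield_kg_ha": [6000.0, 8000.0],
    "area": [1.0, 3.0],  # the larger county dominates the national number
})
sim_nat_kg_ha = national_area_weighted_yield(counties)  # (6000*1 + 8000*3) / 4
```

The same weighted-mean pattern recurs in runners.py's ADM1 state-level aggregation.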
Metric keys emitted (all in kg/ha unless labelled _bu_ac_obs):
| Key | Description |
|---|---|
| bias_correction_kg_ha | AbstractBiasCorrector.bias_kg_ha scalar |
| {spec.name}_mae | \|ref_final_kg_ha − sim_nat_adj_kg_ha\| per reference spec |
| {spec.name}_final_kg_ha | Reference final yield (renamed to _bu_ac after conversion) |
| nass_national_prod_div_area_mae | National prod/area absolute error (kg/ha) |
| nass_national_prod_div_area_bu_ac_obs | NASS national prod/area obs converted to bu/ac |
| nass_county_prod_div_area_mae | County-level prod/area MAE (kg/ha) |
| nass_county_prod_div_area_rmse | County-level prod/area RMSE (kg/ha) |
| nass_county_survey_yield_mae | County-level survey yield MAE (kg/ha) |
| nass_county_survey_yield_rmse | County-level survey yield RMSE (kg/ha) |
| nass_national_survey_yield_mae | National survey yield absolute error (kg/ha) |
| nass_national_survey_bu_ac_obs | NASS national survey obs in bu/ac |
| nass_national_survey_yield_area_weighted_bu_ac_obs | Alias for above |
compute_metrics(run_dir, *, postprocessed=False) → list[dict[str, float]] (metrics.py:239)
Orchestrator: loads ExperimentResult from disk, iterates over all hindcast_slices, calls gen_metrics per fold, appends a "label" key (= fold.fold_label), and persists the list to <run_dir>/metrics/raw_fold_metrics.json (or postprocessed_fold_metrics.json when postprocessed=True). Does not trigger bias re-fitting; reads the already-persisted corrector (metrics.py:256–288).
load_metrics(run_dir, *, postprocessed=False) → list[dict[str, float]] (metrics.py:292)
Read-only companion: deserialises the JSON written by compute_metrics. Raises FileNotFoundError if the JSON is absent.
write_metrics_artefacts(run_dir) → None (metrics.py:407)
High-level orchestrator called by stages/run_diagnostics.py:45. For each fold:
- Calls gen_metrics (raw kg/ha).
- Calls add_rolling_forecast_metrics_for_reporting (from runners.py) to attach rolling_forecast_data, rolling_forecast_adm1_oos, rolling_forecast_adm2_oos to the fold's dict.

Then:
- Calls write_rolling_forecast_metrics_files (raw, pre-conversion) → stage5_metrics.txt, stage5_metrics_ADM1.txt, stage5_metrics_ADM2.txt.
- Calls _convert_metrics_to_bu_acre in place.
- Calls _write_metrics_table_csv → metrics_table.csv with per-fold rows plus a trailing mean row.
- Calls _log_metrics_table_to_mlflow when an MLflow run is active.
#### Unit conversion boundary (metrics.py:336–354)
_convert_metrics_to_bu_acre iterates over every float in each fold's metric dict. Keys ending in _kg_ha are renamed to _bu_ac and converted using kg_ha_to_bu_acre(v, bw) from lib/unit_utils.py:33. Keys already in bu/ac (listed in _METRIC_KEYS_ALREADY_BU_ACRE, metrics.py:310) are skipped to avoid double-conversion. The bushel_weight_lbs constant comes from ExperimentConfig.commodity.bushel_weight_lbs.
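A minimal sketch of that boundary, assuming a flat per-fold dict (the real helper also consults _METRIC_KEYS_ALREADY_BU_ACRE; the sample values and skip-set contents here are invented):

```python
# Sketch of the _convert_metrics_to_bu_acre renaming convention.
HA_PER_ACRE = 0.404686
KG_PER_LB = 0.453592

def kg_ha_to_bu_acre(v: float, bushel_weight_lbs: float) -> float:
    return v * HA_PER_ACRE / (KG_PER_LB * bushel_weight_lbs)

def convert_fold_metrics(metrics: dict, bw: float, skip: frozenset) -> dict:
    out = {}
    for key, value in metrics.items():
        if key in skip or not isinstance(value, float):
            out[key] = value  # already bu/ac, or a non-numeric field like "label"
        elif key.endswith("_kg_ha"):
            out[key.removesuffix("_kg_ha") + "_bu_ac"] = kg_ha_to_bu_acre(value, bw)
        else:
            out[key] = kg_ha_to_bu_acre(value, bw)  # *_mae / *_rmse values are kg/ha too
    return out

fold = {"label": "2021", "bias_correction_kg_ha": 62.77,
        "nass_national_survey_bu_ac_obs": 170.0}
converted = convert_fold_metrics(fold, bw=56.0,
                                 skip=frozenset({"nass_national_survey_bu_ac_obs"}))
# bias_correction_kg_ha becomes bias_correction_bu_ac (62.77 kg/ha ≈ 1 bu/ac for corn)
```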
#### MLflow logging (metrics.py:357–380)
_log_metrics_table_to_mlflow logs every cell of metrics_table.csv as a flat {row_label}/{col_name}: float MLflow metric when has_active_run() returns True. Reference-level columns ({spec.name}_final_bu_ac, NASS obs constants, bias_correction_bu_ac) are excluded via _mlflow_excluded_metric_columns (metrics.py:318), because they are reference levels not error metrics.
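The flattening convention ({row_label}/{col_name}) can be illustrated without a live MLflow run; flatten_metrics_table and the sample CSV below are hypothetical:

```python
import csv
import io

def flatten_metrics_table(csv_text: str, excluded: set[str]) -> dict[str, float]:
    """Turn each non-excluded table cell into a '{row_label}/{col_name}'
    metric name, mirroring the logging scheme the doc describes."""
    out = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        label = row.pop("label")
        for col, cell in row.items():
            if col not in excluded:
                out[f"{label}/{col}"] = float(cell)
    return out

table = ("label,nass_county_survey_yield_mae,bias_correction_bu_ac\n"
         "2021,8.4,1.2\n")
flat = flatten_metrics_table(table, excluded={"bias_correction_bu_ac"})
# The real helper would then call mlflow.log_metric(name, value) per entry,
# only when an MLflow run is active (has_active_run()).
```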
#### Helper: _county (metrics.py:65)
NaN-safe MAE + RMSE over county distributions. Filters to np.isfinite(obs) & np.isfinite(sim) before calling sklearn.metrics.mean_absolute_error / root_mean_squared_error.
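A NaN-safe equivalent in plain NumPy (the real helper calls sklearn's mean_absolute_error / root_mean_squared_error, but the masking logic is the point):

```python
import numpy as np

def county_mae_rmse(obs: np.ndarray, sim: np.ndarray) -> tuple[float, float]:
    """NaN-safe MAE/RMSE over paired county arrays: drop any pair where
    either side is non-finite, then score the rest."""
    mask = np.isfinite(obs) & np.isfinite(sim)
    err = obs[mask] - sim[mask]
    return float(np.mean(np.abs(err))), float(np.sqrt(np.mean(err ** 2)))

obs = np.array([100.0, 110.0, np.nan])  # third county has no observation
sim = np.array([90.0, 120.0, 130.0])
mae, rmse = county_mae_rmse(obs, sim)   # NaN pair dropped before scoring
```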
### diagnostics/runners.py
Follows the DESIGN.md §71 rule: "reporting functions SHALL FOLLOW READ-then-PLOT — never predict()." All functions load walk_forward_preds.parquet via HindcastSlice.load_walk_forward_preds().
#### Constants
METRICS_KEY_ADM1_OOS = "rolling_forecast_adm1_oos" # runners.py:40
METRICS_KEY_ADM2_OOS = "rolling_forecast_adm2_oos" # runners.py:41
ARTIFACT_NATIONAL = "stage5_metrics.txt" # runners.py:43
ARTIFACT_ADM1 = "stage5_metrics_ADM1.txt" # runners.py:44
ARTIFACT_ADM2 = "stage5_metrics_ADM2.txt" # runners.py:45
_METRICS_MIN_WEEK = 19 # runners.py:208
_METRICS_MIN_WEEK = 19 filters out ISO weeks 1–18, which precede the first WASDE estimate of the crop year (~May). Rows with NaN reference-as-of values are kept in the raw data frame (so plots remain complete for cross-year crops like wheat) but excluded from the metrics tables.
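The week gate reduces to a one-line ISO-week check; in_metrics_window is a hypothetical name for illustration:

```python
from datetime import date

_METRICS_MIN_WEEK = 19  # first WASDE estimate lands around May

def in_metrics_window(init_date: date) -> bool:
    """True once init_date reaches ISO week 19; earlier rows stay in the
    data frame for plotting but are excluded from the metrics tables."""
    return init_date.isocalendar().week >= _METRICS_MIN_WEEK

# In 2024, ISO week 19 starts on Monday 6 May.
```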
#### aggregate_rolling_forecast_data_from_experiment (runners.py:55)
Loads the fold's wide rolling predictions, filters to scoreable rows (obs present, area > 0, _scoreable_yield_rows at runners.py:48), area-weights to national per init_date, and adds one f"{spec.name}_yield_kg_ha_asof_before_init" column per reference spec. Reference values are the latest release strictly before each init_date (see yield_asof_array_from_releases, runners.py:114). Returns a DataFrame indexed by init_date. Reference columns may carry NaN when no release exists yet.
Consumer responsibility (runners.py:76–77): any scorer that compares Treefera vs a reference MUST skip NaN reference rows explicitly.
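The as-of rule ("latest release strictly before each init_date, NaN when none exists yet") can be sketched with a sorted-release lookup; the function below is illustrative, not the yield_asof_array_from_releases signature:

```python
import numpy as np

def yield_asof(release_dates: np.ndarray, release_values: np.ndarray,
               init_dates: np.ndarray) -> np.ndarray:
    """For each init date, pick the latest release strictly before it;
    NaN where no release has been published yet. Releases must be sorted."""
    # side="left": a release on the init date itself is NOT "before" it
    idx = np.searchsorted(release_dates, init_dates, side="left") - 1
    return np.where(idx >= 0, release_values[np.maximum(idx, 0)], np.nan)

releases = np.array(["2021-05-12", "2021-06-10"], dtype="datetime64[D]")
values = np.array([11200.0, 11350.0])
inits = np.array(["2021-05-01", "2021-05-13", "2021-06-15"], dtype="datetime64[D]")
asof = yield_asof(releases, values, inits)  # [nan, 11200.0, 11350.0]
```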
#### add_rolling_forecast_metrics_for_reporting (runners.py:148)
Populates three keys on the passed metrics dict in place:
- rolling_forecast_data: bias-corrected national rolling DataFrame (applies AbstractBiasCorrector.apply_frame to area_weighted_mean_forecast_yield).
- rolling_forecast_adm1_oos: ADM1 per-init OOS table.
- rolling_forecast_adm2_oos: ADM2 per-init OOS table.
ADM table generation is wrapped in a try/except; failures are logged as warnings rather than propagated (runners.py:164).
#### build_rolling_forecast_metrics_national_txt (runners.py:219)
Produces stage5_metrics.txt. One row per ISO calendar week (w19–w53). For each week and each OOS year, all init_date rows falling in that week are averaged to a single error value; those per-year errors are then averaged across OOS years. Columns (all in bu/ac):
- Model vs_<ACTUALS>: mean |fore_bu − nass_bu|
- Model vs_<REF>: mean |fore_bu − ref_final_bu|
- <REF> vs_<ACTUALS>: mean |ref_bu − nass_bu|
- <REF> vs_<REF>: mean |ref_bu − ref_final_bu|
- Improv%: (REF_vs_NASS − Model_vs_NASS) / REF_vs_NASS × 100
- Win: count of OOS years where the weekly-mean |Model − NASS| < |REF − NASS|; denominator = total OOS years with valid NASS data (constant across rows)
<ACTUALS> = config.commodity.actuals_source_short; <REF> = first cfg.reference_data spec name (e.g. wasde, conab_final).
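Worked through for one ISO week with three OOS years (numbers invented), the Improv% and Win cells are computed like this:

```python
# One weekly-mean absolute error per OOS year, all in bu/ac (invented data).
model_vs_nass = [5.2, 4.1, 6.0]  # |Model − NASS| per year
ref_vs_nass = [6.0, 5.0, 5.5]    # |REF − NASS| per year

model_err = sum(model_vs_nass) / len(model_vs_nass)  # average across OOS years
ref_err = sum(ref_vs_nass) / len(ref_vs_nass)

# Improv%: relative shrinkage of the error vs the reference forecast
improv_pct = (ref_err - model_err) / ref_err * 100

# Win: years where the model beat the reference, over all scoreable years
wins = sum(m < r for m, r in zip(model_vs_nass, ref_vs_nass))
win_cell = f"{wins}/{len(ref_vs_nass)}"
```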
#### compute_rolling_forecast_adm_oos_tables (runners.py:414)
Returns a tuple (adm1_df, adm2_df). For each init_date in the fold:
- ADM2: raw county-level errors, bu/ac. Columns: fold (ISO week label), MAE, MdAE, RMSE, %Err, N (counties × years).
- ADM1: state-level area-weighted aggregation via aggregate_weighted_frame(..., level="ADM1"), then the same error columns; N = states × years.
Uses kg_ha_to_bu_acre_array from lib/unit_utils.py:64 for vectorised conversion.
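One row of the ADM tables amounts to standard error summaries over the bu/ac errors for a single init_date; a sketch with invented numbers:

```python
import numpy as np

def adm_error_row(obs_bu: np.ndarray, sim_bu: np.ndarray) -> dict:
    """MAE / MdAE / RMSE / %Err / N over one init_date's county (or state)
    yields in bu/ac; a sketch of the table columns, not the exact code."""
    err = sim_bu - obs_bu
    abs_err = np.abs(err)
    return {
        "MAE": float(np.mean(abs_err)),
        "MdAE": float(np.median(abs_err)),
        "RMSE": float(np.sqrt(np.mean(err ** 2))),
        "%Err": float(np.mean(abs_err) / np.mean(obs_bu) * 100),
        "N": int(err.size),
    }

row = adm_error_row(np.array([150.0, 180.0, 200.0]),   # observed bu/ac
                    np.array([160.0, 175.0, 190.0]))   # simulated bu/ac
```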
#### write_rolling_forecast_metrics_files (runners.py:676)
Writes all three text artefacts under output_dir using AnyPath (S3-safe). Returns a tuple of three paths. Calls build_rolling_forecast_metrics_national_txt, build_rolling_forecast_metrics_adm1_txt, build_rolling_forecast_metrics_adm2_txt in sequence.
#### build_rolling_forecast_metrics_adm1_txt / adm2_txt (runners.py:626, runners.py:651)
Format wrappers. Aggregate per-fold ADM tables across OOS years via _agg_adm_across_years (mean of MAE/MdAE/RMSE/%Err; sum of N), then call _format_adm_txt to produce a fixed-width text table with an OVERALL row computed by _adm_overall_row (N-weighted averages, runners.py:543).
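The N-weighted OVERALL row is a weighted average in which each week's error contributes in proportion to its sample count; a two-row sketch with invented values:

```python
# Per-week rows after cross-year aggregation (invented numbers).
rows = [
    {"MAE": 8.0, "N": 100},  # e.g. week w20: few scoreable counties yet
    {"MAE": 6.0, "N": 300},  # e.g. week w30: many more counties reporting
]

total_n = sum(r["N"] for r in rows)
# The week with more observations pulls the overall error toward itself.
overall_mae = sum(r["MAE"] * r["N"] for r in rows) / total_n
```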
### diagnostics/__init__.py
Module-level docstring only (__init__.py:1–8). No public symbols are exported; callers import directly from metrics or runners.
## Output artefacts
All outputs written under <run_dir>/reports/:
| File | Writer | Content |
|---|---|---|
| metrics_table.csv | _write_metrics_table_csv | Per-fold rows (bu/ac) + trailing mean row |
| stage5_metrics.txt | write_rolling_forecast_metrics_files | National rolling MAE table (bu/ac, w19+) |
| stage5_metrics_ADM1.txt | same | State-level OOS MAE/MdAE/RMSE/%Err |
| stage5_metrics_ADM2.txt | same | County-level OOS MAE/MdAE/RMSE/%Err |
Intermediate JSON is written to <run_dir>/metrics/raw_fold_metrics.json (or postprocessed_fold_metrics.json) by compute_metrics.
## Unit conversion: lib/unit_utils.py
Scalar formula (unit_utils.py:33): bu_ac = kg_ha × HA_PER_ACRE / (bushel_weight_lbs × KG_PER_LB).
Constants: HA_PER_ACRE = 0.404686, KG_PER_LB = 0.453592 (unit_utils.py:24–25). The module note acknowledges a ~0.00017 % systematic bias vs QUBE's higher-precision constants.
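Putting the constants together, the scalar conversion reduces to one scale factor per commodity (for 56 lb/bu corn the factor is ≈ 0.01593, i.e. 1 bu/ac ≈ 62.77 kg/ha); a minimal sketch:

```python
HA_PER_ACRE = 0.404686  # unit_utils.py:24
KG_PER_LB = 0.453592    # unit_utils.py:25

def kg_ha_to_bu_acre(v: float, bushel_weight_lbs: float) -> float:
    # kg/ha × (ha/acre) → kg/acre; then ÷ (lb/bu × kg/lb) → bu/acre
    return v * HA_PER_ACRE / (bushel_weight_lbs * KG_PER_LB)

corn_factor = kg_ha_to_bu_acre(1.0, 56.0)  # ≈ 0.015932
```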
Vectorised variants (unit_utils.py:64, unit_utils.py:78): kg_ha_to_bu_acre_array (numpy) and kg_ha_to_bu_acre_series (pandas) apply the same scale factor. diagnostics/runners.py uses the array variant throughout ADM table construction.
## Stage integration
stages/run_diagnostics.py:evaluate_experiment is the primary entry point called by the CLI. It delegates:
- compute_metrics(run_dir): always.
- write_metrics_artefacts(run_dir) + generate_plots(run_dir): unless skip_plots=True.
enrich_fold_metrics_for_reporting (run_diagnostics.py:51) is a thin wrapper that iterates paired (metrics_dict, reporting_context) tuples and calls add_rolling_forecast_metrics_for_reporting per fold.
## Cross-references
- lib/unit_utils.py: bu/ac conversion constants and vectorised helpers
- diagnostics/plots/: separate actor; consumes the rolling_forecast_data key produced here
- delivery/conversions.py: also imports aggregate_rolling_forecast_data_from_experiment (runners.py:23)
- lib/results/results_slice.py: HindcastSlice / AbstractSlice fold handles
- models/meta_models/bias_correction.py: AbstractBiasCorrector.load + apply_national / apply_frame
- lib/reference_data/loader.py: build_loaders, build_references_by_harvest_year, ReferenceYieldLoader
## Relationships
stages/run_diagnostics.py
└─ evaluate_experiment()
├─ diagnostics/metrics.py :: compute_metrics()
└─ diagnostics/metrics.py :: write_metrics_artefacts()
├─ gen_metrics() [per fold, kg/ha]
├─ diagnostics/runners.py :: add_rolling_forecast_metrics_for_reporting()
│ ├─ aggregate_rolling_forecast_data_from_experiment()
│ └─ compute_rolling_forecast_adm_oos_tables()
├─ runners :: write_rolling_forecast_metrics_files()
│ → reports/stage5_metrics*.txt
├─ _convert_metrics_to_bu_acre() [lib/unit_utils.py]
└─ _write_metrics_table_csv()
├─ → reports/metrics_table.csv
└─ _log_metrics_table_to_mlflow()