# Source: diagnostics (metrics + reports)

## Overview
The diagnostics/ package is a pure consumer of run-directory artefacts. It never re-invokes models, never re-fits detrenders, and never re-imputes data. Its sole job is to load saved predictions from disk and produce human-readable reports plus JSON metric caches.
The subsystem is split into two modules (plots are covered separately):
| Module | Lines | Responsibility |
|---|---|---|
| diagnostics/metrics.py | ~464 | Per-fold benchmark scoring, metric conversion, metrics_table.csv, MLflow logging |
| diagnostics/runners.py | ~697 | Rolling national vs reference text reports; ADM1/ADM2 OOS error tables |
| diagnostics/__init__.py | ~55 | Package docstring only; no public re-exports |
All outputs land under <run_dir>/reports/.
## Modules

### diagnostics/metrics.py

#### Public functions
gen_metrics(test_data, cfg, *, fold, nass_obs=None) → dict[str, float] (metrics.py:109)
Core per-fold scorer. Called once per hindcast fold. Accepts one fold's test rows (post inverse-transform) and returns a flat dict of error scalars. All internal comparisons are in kg/ha; bu/ac conversion is deferred to _convert_metrics_to_bu_acre at the metrics-table boundary.
Steps performed:
- Area-weighted national aggregation of `sim_yield_kg_ha` → `sim_nat_kg_ha`.
- Load the fold's persisted `AbstractBiasCorrector` via `AbstractBiasCorrector.load(fold.bias_corrector_path)` and call `apply_national(sim_nat_kg_ha)` → `sim_nat_adj_kg_ha` (metrics.py:156–157).
- For each `ReferenceYieldLoader` in `build_loaders(cfg)`: fetch `yield_final(year)` (kg/ha), compute `|ref − sim_nat_adj|`, store as `f"{spec.name}_mae"` and `f"{spec.name}_final_kg_ha"` (metrics.py:165–177).
- NASS national prod/area MAE: `|nass_national_prod_div_area_kg_ha − sim_nat_adj|` (metrics.py:181–189).
- NASS county prod/area MAE + RMSE: county-level join on `geo_identifier` (metrics.py:193–204).
- NASS county survey yield MAE + RMSE: same join pattern using `target_col` (metrics.py:207–217).
- NASS national survey yield MAE: area-weighted scalar from `nass_national_survey_yield_area_weighted_kg_ha` (metrics.py:220–230).
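The first step (area-weighted national aggregation) can be sketched as follows; the column names come from this doc, but the helper itself is illustrative rather than the exact metrics.py code:

```python
import pandas as pd

def national_area_weighted_yield(test_data: pd.DataFrame) -> float:
    """Collapse county rows into one national kg/ha scalar, weighting
    each county's simulated yield by its area (illustrative sketch)."""
    weights = test_data["area"]
    return float((test_data["sim_yield_kg_ha"] * weights).sum() / weights.sum())

counties = pd.DataFrame({
    "sim_yield_kg_ha": [6000.0, 8000.0],
    "area": [1.0, 3.0],  # the larger county dominates the national number
})
sim_nat_kg_ha = national_area_weighted_yield(counties)  # (6000*1 + 8000*3) / 4
```

The same weighted-mean pattern recurs in runners.py's ADM1 state-level aggregation.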
Metric keys emitted (all in kg/ha unless labelled _bu_ac_obs):
| Key | Description |
|---|---|
| bias_correction_kg_ha | AbstractBiasCorrector.bias_kg_ha scalar |
| {spec.name}_mae | \|ref_final_kg_ha − sim_nat_adj_kg_ha\| per reference spec |
| {spec.name}_final_kg_ha | Reference final yield (renamed to _bu_ac after conversion) |
| nass_national_prod_div_area_mae | National prod/area absolute error (kg/ha) |
| nass_national_prod_div_area_bu_ac_obs | NASS national prod/area obs converted to bu/ac |
| nass_county_prod_div_area_mae | County-level prod/area MAE (kg/ha) |
| nass_county_prod_div_area_rmse | County-level prod/area RMSE (kg/ha) |
| nass_county_survey_yield_mae | County-level survey yield MAE (kg/ha) |
| nass_county_survey_yield_rmse | County-level survey yield RMSE (kg/ha) |
| nass_national_survey_yield_mae | National survey yield absolute error (kg/ha) |
| nass_national_survey_bu_ac_obs | NASS national survey obs in bu/ac |
| nass_national_survey_yield_area_weighted_bu_ac_obs | Alias for above |
compute_metrics(run_dir, *, postprocessed=False) → list[dict[str, float]] (metrics.py:239)
Orchestrator: loads ExperimentResult from disk, iterates over all hindcast_slices, calls gen_metrics per fold, appends a "label" key (= fold.fold_label), and persists the list to <run_dir>/metrics/raw_fold_metrics.json (or postprocessed_fold_metrics.json when postprocessed=True). Does not trigger bias re-fitting; reads the already-persisted corrector (metrics.py:256–288).
load_metrics(run_dir, *, postprocessed=False) → list[dict[str, float]] (metrics.py:292)
Read-only companion: deserialises the JSON written by compute_metrics. Raises FileNotFoundError if the JSON is absent.
write_metrics_artefacts(run_dir) → None (metrics.py:407)
High-level orchestrator called by stages/run_diagnostics.py:45. For each fold:
- Calls gen_metrics (raw kg/ha).
- Calls add_rolling_forecast_metrics_for_reporting (from runners.py) to attach rolling_forecast_data, rolling_forecast_adm1_oos, rolling_forecast_adm2_oos to the fold's dict.

Then:
- Calls write_rolling_forecast_metrics_files (raw, pre-conversion) → stage5_metrics.txt, stage5_metrics_ADM1.txt, stage5_metrics_ADM2.txt.
- Calls _convert_metrics_to_bu_acre in place.
- Calls _write_metrics_table_csv → metrics_table.csv with per-fold rows plus a trailing mean row.
- Calls _log_metrics_table_to_mlflow when an MLflow run is active.
#### Unit conversion boundary (metrics.py:336–354)
_convert_metrics_to_bu_acre iterates over every float in each fold's metric dict. Keys ending in _kg_ha are renamed to _bu_ac and converted using kg_ha_to_bu_acre(v, bw) from lib/unit_utils.py:33. Keys already in bu/ac (listed in _METRIC_KEYS_ALREADY_BU_ACRE, metrics.py:310) are skipped to avoid double-conversion. The bushel_weight_lbs constant comes from ExperimentConfig.commodity.bushel_weight_lbs.
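A minimal sketch of that boundary, assuming a flat per-fold dict (the real helper also consults _METRIC_KEYS_ALREADY_BU_ACRE; the sample values and skip-set contents here are invented):

```python
# Sketch of the _convert_metrics_to_bu_acre renaming convention.
HA_PER_ACRE = 0.404686
KG_PER_LB = 0.453592

def kg_ha_to_bu_acre(v: float, bushel_weight_lbs: float) -> float:
    return v * HA_PER_ACRE / (KG_PER_LB * bushel_weight_lbs)

def convert_fold_metrics(metrics: dict, bw: float, skip: frozenset) -> dict:
    out = {}
    for key, value in metrics.items():
        if key in skip or not isinstance(value, float):
            out[key] = value  # already bu/ac, or a non-numeric field like "label"
        elif key.endswith("_kg_ha"):
            out[key.removesuffix("_kg_ha") + "_bu_ac"] = kg_ha_to_bu_acre(value, bw)
        else:
            out[key] = kg_ha_to_bu_acre(value, bw)  # *_mae / *_rmse values are kg/ha too
    return out

fold = {"label": "2021", "bias_correction_kg_ha": 62.77,
        "nass_national_survey_bu_ac_obs": 170.0}
converted = convert_fold_metrics(fold, bw=56.0,
                                 skip=frozenset({"nass_national_survey_bu_ac_obs"}))
# bias_correction_kg_ha becomes bias_correction_bu_ac (62.77 kg/ha ≈ 1 bu/ac for corn)
```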
#### MLflow logging (metrics.py:357–380)
_log_metrics_table_to_mlflow logs every cell of metrics_table.csv as a flat {row_label}/{col_name}: float MLflow metric when has_active_run() returns True. Reference-level columns ({spec.name}_final_bu_ac, NASS obs constants, bias_correction_bu_ac) are excluded via _mlflow_excluded_metric_columns (metrics.py:318), because they are reference levels not error metrics.
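The flattening convention ({row_label}/{col_name}) can be illustrated without a live MLflow run; flatten_metrics_table and the sample CSV below are hypothetical:

```python
import csv
import io

def flatten_metrics_table(csv_text: str, excluded: set[str]) -> dict[str, float]:
    """Turn each non-excluded table cell into a '{row_label}/{col_name}'
    metric name, mirroring the logging scheme the doc describes."""
    out = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        label = row.pop("label")
        for col, cell in row.items():
            if col not in excluded:
                out[f"{label}/{col}"] = float(cell)
    return out

table = ("label,nass_county_survey_yield_mae,bias_correction_bu_ac\n"
         "2021,8.4,1.2\n")
flat = flatten_metrics_table(table, excluded={"bias_correction_bu_ac"})
# The real helper would then call mlflow.log_metric(name, value) per entry,
# only when an MLflow run is active (has_active_run()).
```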
#### Helper: _county (metrics.py:65)
NaN-safe MAE + RMSE over county distributions. Filters to np.isfinite(obs) & np.isfinite(sim) before calling sklearn.metrics.mean_absolute_error / root_mean_squared_error.
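A NaN-safe equivalent in plain NumPy (the real helper calls sklearn's mean_absolute_error / root_mean_squared_error, but the masking logic is the point):

```python
import numpy as np

def county_mae_rmse(obs: np.ndarray, sim: np.ndarray) -> tuple[float, float]:
    """NaN-safe MAE/RMSE over paired county arrays: drop any pair where
    either side is non-finite, then score the rest."""
    mask = np.isfinite(obs) & np.isfinite(sim)
    err = obs[mask] - sim[mask]
    return float(np.mean(np.abs(err))), float(np.sqrt(np.mean(err ** 2)))

obs = np.array([100.0, 110.0, np.nan])  # third county has no observation
sim = np.array([90.0, 120.0, 130.0])
mae, rmse = county_mae_rmse(obs, sim)   # NaN pair dropped before scoring
```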
### diagnostics/runners.py
Follows the DESIGN.md §71 rule: "reporting functions SHALL FOLLOW READ-then-PLOT — never predict()." All functions load walk_forward_preds.parquet via HindcastSlice.load_walk_forward_preds().
#### Constants
METRICS_KEY_ADM1_OOS = "rolling_forecast_adm1_oos" # runners.py:40
METRICS_KEY_ADM2_OOS = "rolling_forecast_adm2_oos" # runners.py:41
ARTIFACT_NATIONAL = "stage5_metrics.txt" # runners.py:43
ARTIFACT_ADM1 = "stage5_metrics_ADM1.txt" # runners.py:44
ARTIFACT_ADM2 = "stage5_metrics_ADM2.txt" # runners.py:45
_METRICS_MIN_WEEK = 19 # runners.py:208
_METRICS_MIN_WEEK = 19 filters out ISO weeks 1–18, which precede the first WASDE estimate of the crop year (~May). Rows with NaN reference-as-of values are kept in the raw data frame (so plots remain complete for cross-year crops like wheat) but excluded from the metrics tables.
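The week gate reduces to a one-line ISO-week check; in_metrics_window is a hypothetical name for illustration:

```python
from datetime import date

_METRICS_MIN_WEEK = 19  # first WASDE estimate lands around May

def in_metrics_window(init_date: date) -> bool:
    """True once init_date reaches ISO week 19; earlier rows stay in the
    data frame for plotting but are excluded from the metrics tables."""
    return init_date.isocalendar().week >= _METRICS_MIN_WEEK

# In 2024, ISO week 19 starts on Monday 6 May.
```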
#### aggregate_rolling_forecast_data_from_experiment (runners.py:55)
Loads the fold's wide rolling predictions, filters to scoreable rows (obs present, area > 0, _scoreable_yield_rows at runners.py:48), area-weights to national per init_date, and adds one f"{spec.name}_yield_kg_ha_asof_before_init" column per reference spec. Reference values are the latest release strictly before each init_date (see yield_asof_array_from_releases, runners.py:114). Returns a DataFrame indexed by init_date. Reference columns may carry NaN when no release exists yet.
Consumer responsibility (runners.py:76–77): any scorer that compares Treefera vs a reference MUST skip NaN reference rows explicitly.
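The as-of rule ("latest release strictly before each init_date, NaN when none exists yet") can be sketched with a sorted-release lookup; the function below is illustrative, not the yield_asof_array_from_releases signature:

```python
import numpy as np

def yield_asof(release_dates: np.ndarray, release_values: np.ndarray,
               init_dates: np.ndarray) -> np.ndarray:
    """For each init date, pick the latest release strictly before it;
    NaN where no release has been published yet. Releases must be sorted."""
    # side="left": a release on the init date itself is NOT "before" it
    idx = np.searchsorted(release_dates, init_dates, side="left") - 1
    return np.where(idx >= 0, release_values[np.maximum(idx, 0)], np.nan)

releases = np.array(["2021-05-12", "2021-06-10"], dtype="datetime64[D]")
values = np.array([11200.0, 11350.0])
inits = np.array(["2021-05-01", "2021-05-13", "2021-06-15"], dtype="datetime64[D]")
asof = yield_asof(releases, values, inits)  # [nan, 11200.0, 11350.0]
```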
#### add_rolling_forecast_metrics_for_reporting (runners.py:148)
Populates three keys on the passed metrics dict in place:
- rolling_forecast_data: bias-corrected national rolling DataFrame (applies AbstractBiasCorrector.apply_frame to area_weighted_mean_forecast_yield).
- rolling_forecast_adm1_oos: ADM1 per-init OOS table.
- rolling_forecast_adm2_oos: ADM2 per-init OOS table.
ADM table generation is wrapped in a try/except; failures are logged as warnings rather than propagated (runners.py:164).
#### build_rolling_forecast_metrics_national_txt (runners.py:219)
Produces stage5_metrics.txt. One row per ISO calendar week (w19–w53). For each week and each OOS year, all init_date rows falling in that week are averaged to a single error value; those per-year errors are then averaged across OOS years. Columns (all in bu/ac):
- Model vs_<ACTUALS>: mean |fore_bu − nass_bu|
- Model vs_<REF>: mean |fore_bu − ref_final_bu|
- <REF> vs_<ACTUALS>: mean |ref_bu − nass_bu|
- <REF> vs_<REF>: mean |ref_bu − ref_final_bu|
- Improv%: (REF_vs_NASS − Model_vs_NASS) / REF_vs_NASS × 100
- Win: count of OOS years where the weekly-mean |Model − NASS| < |REF − NASS|; denominator = total OOS years with valid NASS data (constant across rows)
<ACTUALS> = config.commodity.actuals_source_short; <REF> = first cfg.reference_data spec name (e.g. wasde, conab_final).
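Worked through for one ISO week with three OOS years (numbers invented), the Improv% and Win cells are computed like this:

```python
# One weekly-mean absolute error per OOS year, all in bu/ac (invented data).
model_vs_nass = [5.2, 4.1, 6.0]  # |Model − NASS| per year
ref_vs_nass = [6.0, 5.0, 5.5]    # |REF − NASS| per year

model_err = sum(model_vs_nass) / len(model_vs_nass)  # average across OOS years
ref_err = sum(ref_vs_nass) / len(ref_vs_nass)

# Improv%: relative shrinkage of the error vs the reference forecast
improv_pct = (ref_err - model_err) / ref_err * 100

# Win: years where the model beat the reference, over all scoreable years
wins = sum(m < r for m, r in zip(model_vs_nass, ref_vs_nass))
win_cell = f"{wins}/{len(ref_vs_nass)}"
```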
#### compute_rolling_forecast_adm_oos_tables (runners.py:414)
Returns a tuple (adm1_df, adm2_df). For each init_date in the fold:
- ADM2: raw county-level errors, bu/ac. Columns: fold (ISO week label), MAE, MdAE, RMSE, %Err, N (counties × years).
- ADM1: state-level area-weighted aggregation via aggregate_weighted_frame(..., level="ADM1"), then the same error columns; N = states × years.
Uses kg_ha_to_bu_acre_array from lib/unit_utils.py:64 for vectorised conversion.
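One row of the ADM tables amounts to standard error summaries over the bu/ac errors for a single init_date; a sketch with invented numbers:

```python
import numpy as np

def adm_error_row(obs_bu: np.ndarray, sim_bu: np.ndarray) -> dict:
    """MAE / MdAE / RMSE / %Err / N over one init_date's county (or state)
    yields in bu/ac; a sketch of the table columns, not the exact code."""
    err = sim_bu - obs_bu
    abs_err = np.abs(err)
    return {
        "MAE": float(np.mean(abs_err)),
        "MdAE": float(np.median(abs_err)),
        "RMSE": float(np.sqrt(np.mean(err ** 2))),
        "%Err": float(np.mean(abs_err) / np.mean(obs_bu) * 100),
        "N": int(err.size),
    }

row = adm_error_row(np.array([150.0, 180.0, 200.0]),   # observed bu/ac
                    np.array([160.0, 175.0, 190.0]))   # simulated bu/ac
```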
#### write_rolling_forecast_metrics_files (runners.py:676)
Writes all three text artefacts under output_dir using AnyPath (S3-safe). Returns a tuple of three paths. Calls build_rolling_forecast_metrics_national_txt, build_rolling_forecast_metrics_adm1_txt, build_rolling_forecast_metrics_adm2_txt in sequence.
#### build_rolling_forecast_metrics_adm1_txt / adm2_txt (runners.py:626, runners.py:651)
Format wrappers. Aggregate per-fold ADM tables across OOS years via _agg_adm_across_years (mean of MAE/MdAE/RMSE/%Err; sum of N), then call _format_adm_txt to produce a fixed-width text table with an OVERALL row computed by _adm_overall_row (N-weighted averages, runners.py:543).
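The N-weighted OVERALL row is a weighted average in which each week's error contributes in proportion to its sample count; a two-row sketch with invented values:

```python
# Per-week rows after cross-year aggregation (invented numbers).
rows = [
    {"MAE": 8.0, "N": 100},  # e.g. week w20: few scoreable counties yet
    {"MAE": 6.0, "N": 300},  # e.g. week w30: many more counties reporting
]

total_n = sum(r["N"] for r in rows)
# The week with more observations pulls the overall error toward itself.
overall_mae = sum(r["MAE"] * r["N"] for r in rows) / total_n
```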
### diagnostics/__init__.py
Module-level docstring only (__init__.py:1–8). No public symbols are exported; callers import directly from metrics or runners.
## Output artefacts
All outputs written under <run_dir>/reports/:
| File | Writer | Content |
|---|---|---|
| metrics_table.csv | _write_metrics_table_csv | Per-fold rows (bu/ac) + trailing mean row |
| stage5_metrics.txt | write_rolling_forecast_metrics_files | National rolling MAE table (bu/ac, w19+) |
| stage5_metrics_ADM1.txt | same | State-level OOS MAE/MdAE/RMSE/%Err |
| stage5_metrics_ADM2.txt | same | County-level OOS MAE/MdAE/RMSE/%Err |
Intermediate JSON is written to <run_dir>/metrics/raw_fold_metrics.json (or postprocessed_fold_metrics.json) by compute_metrics.
## Unit conversion: lib/unit_utils.py
Scalar formula (unit_utils.py:33): bu_ac = kg_ha × HA_PER_ACRE / (bushel_weight_lbs × KG_PER_LB).
Constants: HA_PER_ACRE = 0.404686, KG_PER_LB = 0.453592 (unit_utils.py:24–25). The module note acknowledges a ~0.00017 % systematic bias vs QUBE's higher-precision constants.
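Putting the constants together, the scalar conversion reduces to one scale factor per commodity (for 56 lb/bu corn the factor is ≈ 0.01593, i.e. 1 bu/ac ≈ 62.77 kg/ha); a minimal sketch:

```python
HA_PER_ACRE = 0.404686  # unit_utils.py:24
KG_PER_LB = 0.453592    # unit_utils.py:25

def kg_ha_to_bu_acre(v: float, bushel_weight_lbs: float) -> float:
    # kg/ha × (ha/acre) → kg/acre; then ÷ (lb/bu × kg/lb) → bu/acre
    return v * HA_PER_ACRE / (bushel_weight_lbs * KG_PER_LB)

corn_factor = kg_ha_to_bu_acre(1.0, 56.0)  # ≈ 0.015932
```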
Vectorised variants (unit_utils.py:64, unit_utils.py:78): kg_ha_to_bu_acre_array (numpy) and kg_ha_to_bu_acre_series (pandas) apply the same scale factor. diagnostics/runners.py uses the array variant throughout ADM table construction.
## Stage integration
stages/run_diagnostics.py:evaluate_experiment is the primary entry point called by the CLI. It delegates:
- compute_metrics(run_dir): always.
- write_metrics_artefacts(run_dir) + generate_plots(run_dir): unless skip_plots=True.
enrich_fold_metrics_for_reporting (run_diagnostics.py:51) is a thin wrapper that iterates paired (metrics_dict, reporting_context) tuples and calls add_rolling_forecast_metrics_for_reporting per fold.
## Cross-references
- lib/unit_utils.py: bu/ac conversion constants and vectorised helpers
- diagnostics/plots/: separate actor; consumes the rolling_forecast_data key produced here
- delivery/conversions.py: also imports aggregate_rolling_forecast_data_from_experiment (runners.py:23)
- lib/results/results_slice.py: HindcastSlice / AbstractSlice fold handles
- models/meta_models/bias_correction.py: AbstractBiasCorrector.load + apply_national / apply_frame
- lib/reference_data/loader.py: build_loaders, build_references_by_harvest_year, ReferenceYieldLoader
## Relationships
stages/run_diagnostics.py
└─ evaluate_experiment()
├─ diagnostics/metrics.py :: compute_metrics()
└─ diagnostics/metrics.py :: write_metrics_artefacts()
├─ gen_metrics() [per fold, kg/ha]
├─ diagnostics/runners.py :: add_rolling_forecast_metrics_for_reporting()
│ ├─ aggregate_rolling_forecast_data_from_experiment()
│ └─ compute_rolling_forecast_adm_oos_tables()
├─ runners :: write_rolling_forecast_metrics_files()
│ → reports/stage5_metrics*.txt
├─ _convert_metrics_to_bu_acre() [lib/unit_utils.py]
└─ _write_metrics_table_csv()
├─ → reports/metrics_table.csv
└─ _log_metrics_table_to_mlflow()