ADR-001: Walk-forward CV bypasses stages.run_predict

Status: Accepted (retroactively documented 2026-05-08)
Date: 2026-04-28 (commit b79e74c9, "Phase 15 — accumulate fold rolling_preds in memory, write once"); landed on main via PR #339 on 2026-04-29 (commit 06b47bf7)
Authors: ai-tommytf (per git log on runner.py)

Context

The hindcast pipeline must, per fold, score every init_date on the hindcast grid and persist the union of those predictions as a single walk_forward_preds.parquet under {preds_dir}/{experiment_key}/{fold_label}/ (wiki/commodity_hindcast/pipelines/predict.md, Outputs table). Downstream postprocess and evaluation read that one file per fold and assume every init_date is present.

The single-init compute kernel lives in stages/run_predict.py: predict() performs the four-step inverse pipeline (detrend → score → weather-correct → retrend) and returns a wide prediction frame; write_walk_forward_outputs() persists it. The writer has explicit blind-overwrite semantics — its docstring states "each call replaces any pre-existing file at the destination. Multi-init callers (walk-forward CV) must accumulate wide_prediction_frame across init_dates in memory and emit ONE writer call per fold; K sequential calls would destroy K-1 init_dates on disk" (market_insights_models/src/commodity_hindcast/stages/run_predict.py:330-333). The thin wrapper run_predict() composes the two and is therefore unsafe to call K times in sequence for the same fold (stages/run_predict.py:346-357).

A naive walk-forward implementation that called run_predict() once per (fold, init_date) pair would therefore leave only the last init_date's rows on disk for each fold (roughly 1/K of the expected rows) and silently break every downstream consumer.
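
The hazard can be reproduced in miniature. The sketch below is illustrative only: write_blind is a hypothetical stand-in for a blind-overwrite writer, JSON stands in for parquet, and the county and init_date values are invented.

```python
import json
import tempfile
from pathlib import Path


def write_blind(path: Path, rows: list[dict]) -> None:
    """Hypothetical blind-overwrite writer: replaces whatever is at path."""
    path.write_text(json.dumps(rows))


def read_rows(path: Path) -> list[dict]:
    return json.loads(path.read_text())


init_dates = ["2020-05-01", "2020-06-01", "2020-07-01"]  # K = 3
counties = ["A", "B"]

with tempfile.TemporaryDirectory() as tmp:
    # Naive path: one writer call per init_date; each call clobbers the last.
    naive_path = Path(tmp) / "naive.json"
    for init in init_dates:
        write_blind(naive_path, [{"county": c, "init_date": init} for c in counties])
    naive_rows = read_rows(naive_path)

    # Accumulate every init_date in memory, then issue ONE writer call.
    once_path = Path(tmp) / "once.json"
    accumulated = [
        {"county": c, "init_date": init} for init in init_dates for c in counties
    ]
    write_blind(once_path, accumulated)
    once_rows = read_rows(once_path)

print(len(naive_rows), len(once_rows))  # 2 6
```

With K = 3 the naive path keeps 2 of the 6 expected rows, all belonging to the final init_date, and nothing raises an error along the way.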

Decision

Walk-forward CV reuses the shared regression kernel (models.regression.runtime.predict) directly inside an in-memory loop rather than routing through stages.run_predict. The orchestrator run_walk_forward() (run/runner.py:27) calls run_experiment() per fold to fit, then delegates to _predict_fold_rolling() (run/runner.py:86), which:

  1. loads the just-fitted fold's artefacts via HindcastSlice.from_config (run/runner.py:101-105);
  2. preallocates pred[] and pred_detrended[] numpy arrays sized to len(year_data) (run/runner.py:107-108);
  3. iterates sorted(year_data['init_date'].unique()), calling predict_kernel(...) per slice and writing results into the preallocated arrays via positional boolean masks (run/runner.py:111-127);
  4. assembles the wide frame via build_wide_prediction_frame and returns it (run/runner.py:133-138).
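
Steps 2 and 3 can be sketched as follows. This is a minimal illustration, not the code in run/runner.py: toy_kernel, the column arrays, and all values are assumptions standing in for predict_kernel and year_data.

```python
import numpy as np

# Toy stand-in for year_data: rows in county-major order, two init_dates each.
county = np.array(["A", "A", "B", "B", "C", "C"])
init_date = np.array(["2020-06-01", "2020-05-01"] * 3)
feature = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])


def toy_kernel(x: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Hypothetical kernel returning (prediction, detrended prediction)."""
    detrended = 2.0 * x
    return detrended + 10.0, detrended


# Preallocate one slot per input row; NaN makes an unfilled row easy to spot.
pred = np.full(len(feature), np.nan)
pred_detrended = np.full(len(feature), np.nan)

for init in sorted(np.unique(init_date)):
    mask = init_date == init          # positional boolean mask for this slice
    p, d = toy_kernel(feature[mask])  # the kernel sees one init_date at a time
    pred[mask] = p                    # results land at the original positions
    pred_detrended[mask] = d

assert not np.isnan(pred).any()  # every row was filled
print(pred.tolist())  # [12.0, 14.0, 16.0, 18.0, 20.0, 22.0]
```

Because results are written back through the mask, the output keeps the input row order even though the loop visits init_dates in sorted order rather than row order.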

run_walk_forward then issues a single write_dataframe(rolling_wide, ...) per fold (run/runner.py:74), never touching write_walk_forward_outputs. The county-major row order falls out of preserving the upstream pred.parquet order verbatim (run/runner.py:38-49).

Consequences

Positive

  • One write per fold; the K-init blind-overwrite footgun is structurally unreachable on the walk-forward path.
  • The compute kernel is shared with the single-init path (models.regression.runtime.predict), so the four-step inverse pipeline is not duplicated — only the orchestration around it is (run/runner.py:117-125 vs stages/run_predict.py:255-318).
  • County-major ordering is preserved, which the manifest oracle relies on because it hashes raw bytes (run/runner.py:46-49).
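
Why row order matters to a byte-level hash can be shown with a toy example. Everything here is an assumption for illustration: JSON stands in for parquet, SHA-256 stands in for whatever digest the manifest oracle uses, and the rows are invented.

```python
import hashlib
import json

rows = [
    {"county": "A", "init_date": "2020-05-01", "pred": 1.0},
    {"county": "A", "init_date": "2020-06-01", "pred": 2.0},
    {"county": "B", "init_date": "2020-05-01", "pred": 3.0},
    {"county": "B", "init_date": "2020-06-01", "pred": 4.0},
]


def byte_hash(records: list[dict]) -> str:
    """Hash the serialized bytes exactly as written, with no canonical sort."""
    return hashlib.sha256(json.dumps(records).encode("utf-8")).hexdigest()


def canonical(records: list[dict]) -> list[dict]:
    """Sort by (county, init_date) so that only the values are compared."""
    return sorted(records, key=lambda r: (r["county"], r["init_date"]))


county_major = rows
init_major = sorted(rows, key=lambda r: (r["init_date"], r["county"]))

same_values = canonical(county_major) == canonical(init_major)
same_bytes = byte_hash(county_major) == byte_hash(init_major)
print(same_values, same_bytes)  # True False
```

The two orderings hold identical values yet hash differently, which is why a reordering alone is enough to fail a raw-byte comparison.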

Negative

  • Two prediction entry points exist in the codebase: run_walk_forward for hindcast and run_predict for CLI predict and the forecast pipeline (stages/run_predict.py:354-357). Future readers must learn why before refactoring.
  • _predict_fold_rolling reloads the fold's artefacts from disk via HindcastSlice.from_config once per fold (run/runner.py:101-105), even though run_experiment has just produced them in the same process; this is intentional symmetry with the single-init path but adds a pickle round-trip per fold.

Neutral

  • train_preds.parquet is still emitted by run_experiment per fold (run/experiment_protocol.py:22); only the rolling/test-side parquet is bypassed.

Alternatives considered

Make write_walk_forward_outputs append-safe. Rejected. Each init_date's wide frame must coexist as one row per geography per init_date; turning the writer into a read-merge-write accumulator would introduce concurrent-writer hazards on S3 data_root deployments (append semantics on parquet are not atomic) and would silently mask the case where the same (fold, init_date) is computed twice. Keeping the writer "blind overwrite, single-shot" preserves a clean invariant and pushes the accumulation responsibility to the caller that has the in-memory data anyway.

Verification

  • The hindcast manifest oracle hashes raw walk_forward_preds.parquet bytes, so any reordering or row loss fails the gate even when the values are identical post-canonical-sort (run/runner.py:46-49).
  • Module docstring stages/run_predict.py:13-25 documents the contract the bypass enforces; the wiki entry wiki/commodity_hindcast/pipelines/predict.md (Failure modes table, row "walk_forward_preds.parquet has only one init-date after CV") records the symptom seen if the bypass is removed.
  • No dedicated unit test for run_walk_forward / _predict_fold_rolling was found via grep over /data/processing/github/treefera-market-insights/tests; regression coverage is end-to-end via the hindcast manifest only. [PLACEHOLDER: add a focused unit test that runs two folds with K>=2 init_dates each and asserts every (fold, init_date) row survives.]
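
The core assertion of such a test might look roughly like the sketch below. It is not wired to run_walk_forward, whose signature is not reproduced here: coverage_problems is a hypothetical helper, and the fold labels, counties, and init_dates are invented. In a real test, rows_by_fold would be read back from each fold's walk_forward_preds.parquet.

```python
from collections import Counter


def coverage_problems(rows_by_fold, init_dates_by_fold, counties):
    """Return (fold, init_date, county, count) for keys not present exactly once."""
    problems = []
    for fold, init_dates in init_dates_by_fold.items():
        counts = Counter(
            (row["init_date"], row["county"]) for row in rows_by_fold.get(fold, [])
        )
        for init in init_dates:
            for county in counties:
                n = counts[(init, county)]
                if n != 1:  # 0 means lost, >1 means duplicated
                    problems.append((fold, init, county, n))
    return problems


counties = ["A", "B"]
init_dates_by_fold = {
    "fold_2019": ["2019-05-01", "2019-06-01"],  # K = 2 per fold
    "fold_2020": ["2020-05-01", "2020-06-01"],
}


def rows_for(init_dates):
    # County-major order: every init_date for county A, then county B.
    return [{"county": c, "init_date": d} for c in counties for d in init_dates]


# Expected output: every (fold, init_date, county) present once.
complete = {fold: rows_for(dates) for fold, dates in init_dates_by_fold.items()}
# Simulated regression: only the last init_date of each fold survived.
clobbered = {fold: rows_for(dates[-1:]) for fold, dates in init_dates_by_fold.items()}

assert coverage_problems(complete, init_dates_by_fold, counties) == []
lost = coverage_problems(clobbered, init_dates_by_fold, counties)
assert len(lost) == 4  # 2 folds x 1 lost init_date x 2 counties
```

Checking for exactly one row per key, rather than at least one, also catches the duplicate-compute case that an append-style writer would hide.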

References

  • market_insights_models/src/commodity_hindcast/run/runner.py:27 (run_walk_forward entrypoint)
  • market_insights_models/src/commodity_hindcast/run/runner.py:38-49 (docstring rationale)
  • market_insights_models/src/commodity_hindcast/run/runner.py:86 (_predict_fold_rolling kernel-reuse function)
  • market_insights_models/src/commodity_hindcast/run/runner.py:107-127 (positional-mask accumulation)
  • market_insights_models/src/commodity_hindcast/stages/run_predict.py:322-343 (write_walk_forward_outputs blind-overwrite semantics)
  • market_insights_models/src/commodity_hindcast/stages/run_predict.py:346-357 (run_predict wrapper, "NOT used by walk-forward CV")
  • market_insights_models/src/commodity_hindcast/run/experiment_protocol.py:22 (run_experiment per-fold fit)
  • wiki/commodity_hindcast/sources/code/orchestration.md (run/runner.py section, "Key design note")
  • wiki/commodity_hindcast/pipelines/predict.md (Step-by-step flow; Failure modes table)
  • PR #339 (commit 06b47bf7, "9-phase restructure to match SYNTHESIS"); the in-memory accumulator was introduced earlier in the same series at commit b79e74c9 ("Phase 15 — accumulate fold rolling_preds in memory, write once")