ADR-001: Walk-forward CV bypasses stages.run_predict
Status: Accepted (retroactively documented 2026-05-08)
Date: 2026-04-28 (commit b79e74c9, "Phase 15 — accumulate fold rolling_preds in memory, write once"); landed on main via PR #339 on 2026-04-29 (commit 06b47bf7)
Authors: ai-tommytf (per git log on runner.py)
Context
The hindcast pipeline must, per fold, score every init_date on the
hindcast grid and persist the union of those predictions as a single
walk_forward_preds.parquet under
{preds_dir}/{experiment_key}/{fold_label}/
(wiki/commodity_hindcast/pipelines/predict.md, Outputs table). Downstream
postprocess and evaluation read that one file per fold and assume every
init_date is present.
The single-init compute kernel lives in
stages/run_predict.py: predict() performs the four-step inverse
pipeline (detrend → score → weather-correct → retrend) and returns a wide
prediction frame; write_walk_forward_outputs() persists it. The writer
has explicit blind-overwrite semantics — its docstring states
"each call replaces any pre-existing file at the destination. Multi-init
callers (walk-forward CV) must accumulate wide_prediction_frame across
init_dates in memory and emit ONE writer call per fold; K sequential
calls would destroy K-1 init_dates on disk"
(market_insights_models/src/commodity_hindcast/stages/run_predict.py:330-333).
The thin wrapper run_predict() composes the two and is therefore
unsafe to call K times in sequence for the same fold
(stages/run_predict.py:346-357).
A naive walk-forward implementation that called run_predict() once per
(fold, init_date) pair would leave only the last-written init_date's
rows (roughly 1/K of the fold) on disk and silently break every
downstream consumer.
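The hazard described above can be reproduced in a few lines. This is a minimal sketch, not the project's code: blind_write and predict_one are hypothetical stand-ins for write_walk_forward_outputs() and predict(), and a JSON file stands in for the parquet output so the example needs only the standard library.

```python
import json
import tempfile
from pathlib import Path


def blind_write(rows, path):
    """Hypothetical blind-overwrite writer: replaces any existing file."""
    path.write_text(json.dumps(rows))


def predict_one(init_date):
    """Hypothetical single-init kernel: one row per geography."""
    return [{"init_date": init_date, "geo": g, "pred": 1.0} for g in ("A", "B")]


init_dates = ["2020-05-01", "2020-06-01", "2020-07-01"]  # K = 3

with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "walk_forward_preds.json"

    # Naive path: one writer call per init_date, each replacing the last.
    for d in init_dates:
        blind_write(predict_one(d), out)
    naive_dates = sorted({r["init_date"] for r in json.loads(out.read_text())})

    # Safe path: accumulate in memory, then issue one writer call per fold.
    accumulated = []
    for d in init_dates:
        accumulated.extend(predict_one(d))
    blind_write(accumulated, out)
    safe_dates = sorted({r["init_date"] for r in json.loads(out.read_text())})

print(naive_dates)  # a single init_date is left on disk
print(safe_dates)   # every init_date is present
```

With K sequential calls only the final init_date remains, which is the K-1 loss the writer's docstring warns about.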
Decision
Walk-forward CV reuses the shared regression kernel
(models.regression.runtime.predict) directly inside an in-memory loop
rather than routing through stages.run_predict. The orchestrator
run_walk_forward() (run/runner.py:27) calls run_experiment() per
fold to fit, then delegates to _predict_fold_rolling()
(run/runner.py:86), which:
- loads the just-fitted fold's artefacts via HindcastSlice.from_config (run/runner.py:101-105);
- preallocates pred[] and pred_detrended[] numpy arrays sized to len(year_data) (run/runner.py:107-108);
- iterates sorted(year_data['init_date'].unique()), calling predict_kernel(...) per slice and writing results into the preallocated arrays via positional boolean masks (run/runner.py:111-127);
- assembles the wide frame via build_wide_prediction_frame and returns it (run/runner.py:133-138).
run_walk_forward then issues a single
write_dataframe(rolling_wide, ...) per fold (run/runner.py:74), never
touching write_walk_forward_outputs. The county-major row order falls
out of preserving the upstream pred.parquet order verbatim
(run/runner.py:38-49).
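The accumulation pattern in this decision can be sketched as follows. This is an illustration under stated assumptions rather than the contents of run/runner.py: plain Python lists stand in for the preallocated numpy arrays, predict_kernel here is a toy function, the column names are invented, and write_once is a hypothetical stand-in for the single write_dataframe call.

```python
def predict_kernel(rows):
    """Toy kernel (assumption): returns (pred, pred_detrended) per input row."""
    pred = [r["x"] * 2.0 for r in rows]
    detrended = [r["x"] * 2.0 - 1.0 for r in rows]
    return pred, detrended


def write_once(rows, sink):
    """Hypothetical single writer call; records each call in `sink`."""
    sink.append(list(rows))


# Upstream rows in county-major order (assumed layout for illustration).
year_data = [
    {"county": "A", "init_date": "2020-05-01", "x": 1.0},
    {"county": "A", "init_date": "2020-06-01", "x": 2.0},
    {"county": "B", "init_date": "2020-05-01", "x": 3.0},
    {"county": "B", "init_date": "2020-06-01", "x": 4.0},
]

# Preallocate output buffers sized to the input.
pred = [None] * len(year_data)
pred_detrended = [None] * len(year_data)

for init_date in sorted({r["init_date"] for r in year_data}):
    # Boolean mask over row positions belonging to this init_date.
    mask = [r["init_date"] == init_date for r in year_data]
    positions = [i for i, m in enumerate(mask) if m]
    p, d = predict_kernel([year_data[i] for i in positions])
    # Write results back at the original positions, so row order is untouched.
    for pos, pv, dv in zip(positions, p, d):
        pred[pos] = pv
        pred_detrended[pos] = dv

wide = [
    {"county": r["county"], "init_date": r["init_date"],
     "pred": pv, "pred_detrended": dv}
    for r, pv, dv in zip(year_data, pred, pred_detrended)
]

writes = []
write_once(wide, writes)  # exactly one write per fold
```

Because results are written back by position instead of being concatenated per init_date, the output keeps the upstream row order without any re-sort.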
Consequences
Positive
- One write per fold; the K-init blind-overwrite footgun is structurally unreachable on the walk-forward path.
- The compute kernel is shared with the single-init path (models.regression.runtime.predict), so the four-step inverse pipeline is not duplicated; only the orchestration around it is (run/runner.py:117-125 vs stages/run_predict.py:255-318).
- County-major ordering is preserved, which the manifest oracle relies on because it hashes raw bytes (run/runner.py:46-49).
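To see why row order matters to a raw-byte hash, consider the sketch below. It is illustrative only: JSON serialisation stands in for parquet, SHA-256 is an assumed hash (the source does not name the algorithm), and the rows are made up.

```python
import hashlib
import json

rows = [
    {"county": "A", "init_date": "2020-05-01", "pred": 1.0},
    {"county": "B", "init_date": "2020-05-01", "pred": 2.0},
]
reordered = list(reversed(rows))  # same rows, different order


def raw_digest(rs):
    """Hash the serialised bytes exactly as written, with no sorting."""
    return hashlib.sha256(json.dumps(rs).encode("utf-8")).hexdigest()


def canonical(rs):
    """Canonical sort by (county, init_date) before comparing values."""
    return sorted(rs, key=lambda r: (r["county"], r["init_date"]))


same_values = canonical(rows) == canonical(reordered)
same_bytes = raw_digest(rows) == raw_digest(reordered)

print(same_values)  # values agree after a canonical sort
print(same_bytes)   # raw-byte digests differ
```

A value-level comparison would accept the reordered output, whereas a raw-byte digest rejects it, so preserving the upstream order is part of the contract.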
Negative
- Two prediction entry points exist in the codebase: run_walk_forward for hindcast and run_predict for CLI predict and the forecast pipeline (stages/run_predict.py:354-357). Future readers must learn why before refactoring.
- The bypass kernel reloads fold artefacts from disk via HindcastSlice inside the loop (run/runner.py:101-105) even though run_experiment has just produced them in the same process; this is intentional symmetry with the single-init path but adds a pickle round-trip per fold.
Neutral
- train_preds.parquet is still emitted by run_experiment per fold (run/experiment_protocol.py:22); only the rolling/test-side parquet is bypassed.
Alternatives considered
Make write_walk_forward_outputs append-safe. Rejected. Each
init_date's wide frame must coexist as one row per geography per
init_date; turning the writer into a read-merge-write accumulator would
introduce concurrent-writer hazards on S3 data_root deployments
(append semantics on parquet are not atomic) and would silently mask
the case where the same (fold, init_date) is computed twice. Keeping
the writer "blind overwrite, single-shot" preserves a clean invariant
and pushes the accumulation responsibility to the caller that has the
in-memory data anyway.
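The duplicate-masking concern can be illustrated with a toy read-merge-write accumulator. This is a sketch of the rejected design, not code that exists in the repository: merge_write and the dictionary store are hypothetical, and the fold label is invented.

```python
def merge_write(store, fold, init_date, rows):
    """Hypothetical read-merge-write writer keyed by (fold, init_date)."""
    merged = dict(store)               # "read" the existing contents
    merged[(fold, init_date)] = rows   # "merge": last write wins per key
    return merged                      # "write" the merged result back


# Clean run: each init_date computed exactly once.
clean = {}
for d in ("2020-05-01", "2020-06-01"):
    clean = merge_write(clean, "fold_2020", d, [f"rows for {d}"])

# Faulty run: the first init_date is computed and written twice.
duplicated = {}
for d in ("2020-05-01", "2020-06-01", "2020-05-01"):
    duplicated = merge_write(duplicated, "fold_2020", d, [f"rows for {d}"])

# The final state gives no hint that a pair was computed twice.
indistinguishable = clean == duplicated
print(indistinguishable)
```

Under this design the stored output is identical in both runs, so the repeated computation leaves no trace for downstream checks to catch.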
Verification
- The hindcast manifest oracle hashes raw walk_forward_preds.parquet bytes, so any reordering or row loss fails the gate even when the values are identical post-canonical-sort (run/runner.py:46-49).
- Module docstring stages/run_predict.py:13-25 documents the contract the bypass enforces; the wiki entry wiki/commodity_hindcast/pipelines/predict.md (Failure modes table, row "walk_forward_preds.parquet has only one init-date after CV") records the symptom seen if the bypass is removed.
- No dedicated unit test for run_walk_forward / _predict_fold_rolling was found via grep over /data/processing/github/treefera-market-insights/tests; regression coverage is end-to-end via the hindcast manifest only. [PLACEHOLDER: add a focused unit test that runs two folds with K>=2 init_dates each and asserts every (fold, init_date) row survives.]
References
- market_insights_models/src/commodity_hindcast/run/runner.py:27: run_walk_forward entrypoint
- market_insights_models/src/commodity_hindcast/run/runner.py:38-49: docstring rationale
- market_insights_models/src/commodity_hindcast/run/runner.py:86: _predict_fold_rolling kernel-reuse function
- market_insights_models/src/commodity_hindcast/run/runner.py:107-127: positional-mask accumulation
- market_insights_models/src/commodity_hindcast/stages/run_predict.py:322-343: write_walk_forward_outputs blind-overwrite semantics
- market_insights_models/src/commodity_hindcast/stages/run_predict.py:346-357: run_predict wrapper, "NOT used by walk-forward CV"
- market_insights_models/src/commodity_hindcast/run/experiment_protocol.py:22: run_experiment per-fold fit
- wiki/commodity_hindcast/sources/code/orchestration.md (run/runner.py section, "Key design note")
- wiki/commodity_hindcast/pipelines/predict.md (Step-by-step flow; Failure modes table)
- PR #339 (commit 06b47bf7, "9-phase restructure to match SYNTHESIS"); the in-memory accumulator was introduced earlier in the same series at commit b79e74c9 ("Phase 15 — accumulate fold rolling_preds in memory, write once")