RunDir¶
Definition¶
RunDir is the filesystem directory that serves as the sole cross-stage hand-off contract in the commodity_hindcast pipeline. Every pipeline stage reads its inputs from, and writes its outputs to, a single run_dir path on disk; no in-memory objects cross a stage boundary (DESIGN.md Clause 34). The directory is created once at the start of run_hindcast.run() and never relocated.
Kind: Persistence aggregate root. Not a Python class — run_dir is a Path | CloudPath field on ExperimentResult and on every slice. The owning aggregate is ExperimentResult.
Source of truth: market_insights_models/src/commodity_hindcast/stages/run_hindcast.py:77–98 (_create_run_root).
Naming convention¶
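Run directories are named `<stamp>_<experiment_key>`. The stamp is `utcnow().strftime("%Y%m%d_%H%M%S")` (run_hindcast.py:82); `experiment_key` is `CommodityConfig.experiment_key` (e.g. `corn_usa`, `soybeans_bra`). Example: a `corn_usa` run started at 2024-03-01 14:30:05 UTC is named `20240301_143005_corn_usa`.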
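The naming scheme is human-readable without tool support and keeps runs of the same commodity distinct as long as they do not start within the same UTC second. It does not strictly guarantee uniqueness: the stamp has one-second resolution, and `_create_run_root` calls `mkdir(parents=True, exist_ok=True)`, so two such runs would silently share one directory.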
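The name can be built and parsed with the standard library alone. The following is a minimal sketch, assuming a timezone-aware `datetime` is passed in rather than the codebase's `utcnow()` helper; `run_dir_name` and `split_run_dir_name` are hypothetical names used for illustration and are not part of commodity_hindcast.

```python
from datetime import datetime, timezone

STAMP_FORMAT = "%Y%m%d_%H%M%S"
STAMP_LENGTH = 15  # 8 date digits + "_" + 6 time digits


def run_dir_name(experiment_key: str, now: datetime) -> str:
    """Build '<stamp>_<experiment_key>' from a timezone-aware datetime."""
    stamp = now.astimezone(timezone.utc).strftime(STAMP_FORMAT)
    return f"{stamp}_{experiment_key}"


def split_run_dir_name(name: str) -> tuple[datetime, str]:
    """Recover (UTC start time, experiment_key) from a run directory name."""
    stamp, key = name[:STAMP_LENGTH], name[STAMP_LENGTH + 1:]
    started = datetime.strptime(stamp, STAMP_FORMAT).replace(tzinfo=timezone.utc)
    return started, key


start = datetime(2024, 3, 1, 14, 30, 5, tzinfo=timezone.utc)
name = run_dir_name("corn_usa", start)

# Two starts inside the same second map to the same name.
same_second = run_dir_name("corn_usa", start.replace(microsecond=999_999))
```

Because the key may itself contain underscores (`corn_usa`), the parser slices on the fixed stamp width instead of splitting on `_`.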
Canonical directory layout¶
The README.md is the authoritative description of the on-disk layout (source: sources/docs/README.md):
```text
run_dir/
├── config_resolved.yaml
├── models/<experiment_key>/<fold|production>/
│   ├── detrender.pkl
│   ├── feature_fill_values.parquet
│   └── model.*
├── preds/<experiment_key>/<fold>/
│   ├── train_preds.parquet
│   ├── walk_forward_preds.parquet
│   └── year_data.parquet
├── postprocessed/<experiment_key>_national.parquet
├── reports/
└── delivery/Treefera_*_Hindcast_*.csv
```
After cli run forecast the tree extends with the per-(season_year, init_date) subtree introduced in PR #369:
```text
run_dir/forecast/<season_year>/<init_date>/
├── indices.zarr
├── features/pred.parquet
├── preds/walk_forward_preds.parquet
├── postprocessed/national.parquet
└── delivery/Treefera_*_Forecast_*.csv
```
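The forecast subtree can be addressed purely by joining path segments onto `run_dir`. The sketch below is illustrative only: `forecast_paths` is a hypothetical helper, and the `init_date` string is passed through verbatim because the source does not state its on-disk format (the `2024-06-01` value is an assumption for the demo).

```python
from pathlib import PurePosixPath


def forecast_paths(run_dir, season_year: int, init_date: str) -> dict:
    """Return the artifact paths under forecast/<season_year>/<init_date>/.

    `run_dir` can be any path-like object that supports the `/` operator;
    it is never re-wrapped, so a cloud path type would pass through as-is.
    """
    root = run_dir / "forecast" / str(season_year) / init_date
    return {
        "indices": root / "indices.zarr",
        "features": root / "features" / "pred.parquet",
        "preds": root / "preds" / "walk_forward_preds.parquet",
        "postprocessed": root / "postprocessed" / "national.parquet",
        "delivery_dir": root / "delivery",
    }


paths = forecast_paths(PurePosixPath("runs/20240301_143005_corn_usa"), 2024, "2024-06-01")
```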
Additionally, conformal calibration sidecars (added in PR #361) land at `run_dir/conformal/{mode}.parquet`, and the county selection file at `run_dir/included_geo_identifiers.txt`.
Key attributes¶
| Path | Written by | Consumed by |
|---|---|---|
| `config_resolved.yaml` | `_create_run_root` (run_hindcast.py:91) | All stages (via `_load_config`) |
| `included_geo_identifiers.txt` | FIT phase (run_hindcast.py:218) | PREDICT, POSTPROCESS |
| `models/{key}/{fold}/` | FIT stage (run_fit.py) | PREDICT, POSTPROCESS |
| `preds/{key}/{fold}/walk_forward_preds.parquet` | PREDICT stage (run_predict.py) | POSTPROCESS, DELIVER, EVALUATE |
| `conformal/{mode}.parquet` | POSTPROCESS stage (run_meta_models.py:67) | FORECAST, DELIVER |
| `postprocessed/national.parquet` | POSTPROCESS stage | DELIVER, EVALUATE |
| `delivery/Treefera_*.csv` | DELIVER stage (run_deliver.py) | Client |
| `forecast/{season_year}/{init_date}/` | FORECAST pipeline (run_forecast.py) | Client |
| `reports/` | EVALUATE stage (run_diagnostics.py) | MLflow, Dashboard |
_create_run_root — the creation function¶
The run root is created in stages/run_hindcast.py:77–98, not in run/runner.py as the orchestrator seed prompt incorrectly stated. The correction is documented in wiki/sources/code/orchestration.md.
```python
def _create_run_root(config: ExperimentConfig) -> tuple[AnyPath, str]:
    stamp = utcnow().strftime("%Y%m%d_%H%M%S")
    run_root = config.run_dir_base / f"{stamp}_{config.commodity.experiment_key}"
    run_root.mkdir(parents=True, exist_ok=True)
    config.models_dir = run_root / "models"
    config.preds_dir = run_root / "preds"
    # ... writes config_resolved.yaml
    return run_root, stamp
```
config.run_dir_base resolves from INPUT_DATA_DIR/runs/ (env var anchoring — DESIGN.md Clause 6).
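To make the environment anchoring concrete, here is a self-contained, local-filesystem-only sketch of the same flow. It is not the project's implementation: `resolve_run_dir_base` and `create_run_root` are hypothetical stand-ins, the environment is passed in as a plain mapping, and `pathlib.Path` is used only because the demo runs against a temporary directory (a deployment that allows `s3://` bases would need a cloud-aware path type instead). The sketch deliberately uses `exist_ok=False`, which differs from the excerpt above, to show how a same-second collision could be made to fail loudly.

```python
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def resolve_run_dir_base(env) -> Path:
    """Return INPUT_DATA_DIR/runs, failing loudly if the variable is unset."""
    try:
        return Path(env["INPUT_DATA_DIR"]) / "runs"
    except KeyError:
        raise RuntimeError("INPUT_DATA_DIR must be set") from None


def create_run_root(base: Path, experiment_key: str, now: datetime) -> Path:
    """Create <base>/<stamp>_<experiment_key>; raise if it already exists."""
    stamp = now.strftime("%Y%m%d_%H%M%S")
    run_root = base / f"{stamp}_{experiment_key}"
    run_root.mkdir(parents=True, exist_ok=False)  # assumption: collisions are errors
    return run_root


with tempfile.TemporaryDirectory() as tmp:
    base = resolve_run_dir_base({"INPUT_DATA_DIR": tmp})
    now = datetime(2024, 3, 1, 14, 30, 5, tzinfo=timezone.utc)
    created_name = create_run_root(base, "corn_usa", now).name
    try:
        create_run_root(base, "corn_usa", now)  # same second, same key
        collided = False
    except FileExistsError:
        collided = True
```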
Lifecycle¶
Created: By _create_run_root at the start of run_hindcast.run() (run_hindcast.py:203), or implicitly on the first ForecastSlice write during cli run forecast --run-dir D.
Populated (stage by stage):
1. config_resolved.yaml — written at creation; never overwritten (resume mode writes config_train.yaml to avoid clobber).
2. FIT phase — fills models/ and preds/ subtrees per fold, writes included_geo_identifiers.txt.
3. POSTPROCESS — fills conformal/, postprocessed/, per-fold bias_corrector.pkl.
4. EVALUATE — fills reports/.
5. DELIVER — fills delivery/.
6. FORECAST — fills forecast/{season_year}/{init_date}/ per invocation.
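Because each stage leaves a recognizable artifact behind, the population order above can be inspected from the path alone. The following sketch assumes one marker entry per stage, taken from the list above; `STAGE_MARKERS` and `populated_stages` are hypothetical names, and the presence of a marker only shows that a stage wrote something, not that it completed successfully.

```python
import tempfile
from pathlib import Path

# One marker per lifecycle step, in population order (assumed from the list above).
STAGE_MARKERS = [
    ("CREATE", "config_resolved.yaml"),
    ("FIT", "models"),
    ("POSTPROCESS", "postprocessed"),
    ("EVALUATE", "reports"),
    ("DELIVER", "delivery"),
    ("FORECAST", "forecast"),
]


def populated_stages(run_dir) -> list[str]:
    """List stages whose marker exists under run_dir.

    `run_dir` may be any path-like object supporting `/` and `.exists()`;
    it is used as given and never wrapped in pathlib.Path.
    """
    return [stage for stage, marker in STAGE_MARKERS if (run_dir / marker).exists()]


with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp) / "20240301_143005_corn_usa"
    (run_dir / "models").mkdir(parents=True)
    (run_dir / "config_resolved.yaml").write_text("")
    done = populated_stages(run_dir)
```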
Consumed: All downstream stages address each other exclusively via run_dir. ExperimentResult.from_run_dir(run_dir) reconstructs the full domain context from the path alone (run_result.py:40).
Destroyed / archived: No automatic cleanup. Run directories accumulate; operators prune by timestamp. The delivery/export.py stage optionally copies delivery CSVs to MODEL_INGESTION_PATH.
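Since the start time is encoded in the directory name, pruning by timestamp needs no metadata beyond a directory listing. The sketch below only selects candidates and deletes nothing; `stale_run_dirs` is a hypothetical helper, and the 30-day cutoff is an arbitrary value for the demo, not a project policy.

```python
from datetime import datetime, timedelta, timezone


def stale_run_dirs(names, now: datetime, max_age: timedelta) -> list[str]:
    """Return run directory names whose embedded UTC stamp is older than max_age."""
    stale = []
    for name in names:
        try:
            started = datetime.strptime(name[:15], "%Y%m%d_%H%M%S")
        except ValueError:
            continue  # not a run directory name; leave it alone
        if now - started.replace(tzinfo=timezone.utc) > max_age:
            stale.append(name)
    return sorted(stale)


candidates = stale_run_dirs(
    ["20240301_143005_soybeans_bra", "20240101_000000_corn_usa", "notes"],
    now=datetime(2024, 3, 2, tzinfo=timezone.utc),
    max_age=timedelta(days=30),
)
```

Entries that do not start with a valid stamp are skipped, so unrelated files under the runs directory are never proposed for deletion.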
Relationships¶
| Relationship | Entity | Notes |
|---|---|---|
| Owned by | ExperimentResult | `run_dir` is the primary identity field |
| Contains | HindcastSlice × N | One per CV fold + one production fold |
| Contains | ForecastSlice × M | One per (season_year, init_date) |
| Contains | CalibrationResult × K | One parquet sidecar per residual_mode |
| Anchored by | ExperimentConfig | `config_resolved.yaml` snapshot; loaded lazily via `_load_config` |
Concepts and pipelines (forward refs to P5)¶
- Concept: Walk-forward CV (fold subdirectory naming)
- Concept: S3 path safety (`run_dir` may be an `s3://` URI; never wrap in `pathlib.Path`)
- Concept: Stage isolation (DESIGN.md Clause 34)
- Pipeline: Hindcast pipeline
- Pipeline: Forecast pipeline
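The S3 path-safety concept listed above comes down to a standard-library behavior: `pathlib` collapses repeated slashes, which corrupts the `s3://` scheme. A short demonstration follows; the bucket name is made up, and `join_uri` is a hypothetical string-based helper shown only as one way to avoid the problem when no cloud-aware path type is at hand.

```python
from pathlib import PurePosixPath

uri = "s3://bucket/runs/20240301_143005_corn_usa"

# pathlib treats the URI as a relative path and drops the empty segment
# between the two slashes, so the scheme separator is lost.
mangled = str(PurePosixPath(uri))


def join_uri(base: str, *parts: str) -> str:
    """Join segments onto a URI with plain string handling, keeping '://' intact."""
    return "/".join([base.rstrip("/"), *parts])


models_uri = join_uri(uri, "models", "corn_usa")
```

Here `mangled` is `s3:/bucket/runs/20240301_143005_corn_usa`, which is no longer a valid S3 URI, while `models_uri` keeps the double slash.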
PRs and commits¶
| PR / commit | Relevance |
|---|---|
| PR-339 | 9-phase restructure that established stages/run_hindcast.py as the sole owner of _create_run_root; deleted main.py and forecast/ |
| PR-361 | Added conformal/{mode}.parquet sidecars under run_dir; dropped commodity_ prefix from postprocessed/national.parquet |
| PR-369 | Restructured forecast sub-tree from forecast/{init_date}/ to forecast/{season_year}/{init_date}/ |
Open questions¶
- The DESIGN.md Clause 5 mentions `YYYYMMDD_HHMMSS` without the `experiment_key` suffix; the actual implementation includes the key. The design doc should be updated.
- Resume mode (`--run-dir` set on `cli run hindcast`) is described in DESIGN.md but the current `run_hindcast.run()` always creates a fresh directory; resume is handled only at the stage level (e.g., `cli run forecast --run-dir D`).
- There is no automated cleanup or retention policy for stale `run_dir`s; long-running ECS deployments accumulate directories on the attached volume.