RunDir

Definition

RunDir is the filesystem directory that serves as the sole cross-stage hand-off contract in the commodity_hindcast pipeline. Every pipeline stage reads its inputs from, and writes its outputs to, a single run_dir path on disk; no in-memory objects cross a stage boundary (DESIGN.md Clause 34). The directory is created once at the start of run_hindcast.run() and never relocated.
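As a rough illustration of this contract, the sketch below runs two hypothetical stages that share nothing but a run_dir path. The stage functions, file names, and JSON payloads are assumptions chosen to keep the example dependency-free; the real pipeline writes parquet and pickle artefacts.

```python
import json
import tempfile
from pathlib import Path


def fit_stage(run_dir: Path) -> None:
    """Hypothetical stage: writes its output under run_dir and returns nothing."""
    out = run_dir / "models" / "demo_model.json"
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps({"slope": 2.0}))


def predict_stage(run_dir: Path) -> None:
    """Hypothetical stage: reads the previous stage's file, never its Python objects."""
    model = json.loads((run_dir / "models" / "demo_model.json").read_text())
    out = run_dir / "preds" / "demo_preds.json"
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps([model["slope"] * x for x in (1, 2, 3)]))


with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp)
    fit_stage(run_dir)      # the only value crossing the stage boundary is the path
    predict_stage(run_dir)
    preds = json.loads((run_dir / "preds" / "demo_preds.json").read_text())
```

Because neither function returns anything, each stage could equally run in a separate process or on a separate day, provided it is given the same run_dir.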

Kind: Persistence aggregate root. Not a Python class — run_dir is a Path | CloudPath field on ExperimentResult and on every slice. The owning aggregate is ExperimentResult.

Source of truth: market_insights_models/src/commodity_hindcast/stages/run_hindcast.py:77–98 (_create_run_root).

Naming convention

<INPUT_DATA_DIR>/runs/<YYYYMMDD_HHMMSS>_<experiment_key>/

The timestamp is utcnow().strftime("%Y%m%d_%H%M%S") (run_hindcast.py:82). experiment_key is CommodityConfig.experiment_key (e.g. corn_usa, soybeans_bra). Example:

/data/processing/.../runs/20260505_143022_corn_usa/

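The timestamp prefix makes names sort chronologically and keeps them human-readable without tool support. Uniqueness holds only to one-second resolution: two runs of the same experiment_key started within the same UTC second map to the same directory name, and because _create_run_root calls mkdir(parents=True, exist_ok=True), the second run would reuse that directory without raising.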
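A minimal sketch of the convention, assuming nothing beyond the format string and the example quoted above. The helper names run_dir_name and parse_run_dir_name are hypothetical and do not exist in the codebase.

```python
from datetime import datetime, timezone

STAMP_FORMAT = "%Y%m%d_%H%M%S"
STAMP_LENGTH = 15  # len("YYYYMMDD_HHMMSS")


def run_dir_name(experiment_key: str, now: datetime) -> str:
    """Build '<YYYYMMDD_HHMMSS>_<experiment_key>' from a UTC datetime."""
    return f"{now.strftime(STAMP_FORMAT)}_{experiment_key}"


def parse_run_dir_name(name: str) -> tuple[datetime, str]:
    """Split a run directory name back into (UTC timestamp, experiment_key)."""
    stamp = name[:STAMP_LENGTH]
    separator = name[STAMP_LENGTH:STAMP_LENGTH + 1]
    key = name[STAMP_LENGTH + 1:]
    if separator != "_" or not key:
        raise ValueError(f"not a run directory name: {name!r}")
    created_at = datetime.strptime(stamp, STAMP_FORMAT).replace(tzinfo=timezone.utc)
    return created_at, key


name = run_dir_name("corn_usa", datetime(2026, 5, 5, 14, 30, 22, tzinfo=timezone.utc))
created_at, key = parse_run_dir_name(name)
```

Splitting at a fixed offset rather than on the first underscore matters here, because experiment keys such as corn_usa contain underscores themselves.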

Canonical directory layout

The README.md is the authoritative description of the on-disk layout (source: sources/docs/README.md):

run_dir/
├── config_resolved.yaml
├── models/<experiment_key>/<fold|production>/
│     ├── detrender.pkl
│     ├── feature_fill_values.parquet
│     └── model.*
├── preds/<experiment_key>/<fold>/
│     ├── train_preds.parquet
│     ├── walk_forward_preds.parquet
│     └── year_data.parquet
├── postprocessed/<experiment_key>_national.parquet
├── reports/
└── delivery/Treefera_*_Hindcast_*.csv
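The layout can be turned into a simple completeness check. The sketch below is illustrative only: expected_fold_artifacts and missing_artifacts are hypothetical helpers, the fold name "2020" is an assumed example, and model.* is left out because its extension is not specified in the listing.

```python
import tempfile
from pathlib import Path


def expected_fold_artifacts(run_dir: Path, experiment_key: str, fold: str) -> list[Path]:
    """Per-fold files named in the layout listing (model.* omitted: extension unknown)."""
    models = run_dir / "models" / experiment_key / fold
    preds = run_dir / "preds" / experiment_key / fold
    return [
        run_dir / "config_resolved.yaml",
        models / "detrender.pkl",
        models / "feature_fill_values.parquet",
        preds / "train_preds.parquet",
        preds / "walk_forward_preds.parquet",
        preds / "year_data.parquet",
    ]


def missing_artifacts(run_dir: Path, experiment_key: str, fold: str) -> list[Path]:
    """Return the expected files that are not yet on disk."""
    return [p for p in expected_fold_artifacts(run_dir, experiment_key, fold) if not p.exists()]


with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp)
    (run_dir / "config_resolved.yaml").touch()
    fold_models = run_dir / "models" / "corn_usa" / "2020"
    fold_models.mkdir(parents=True)
    (fold_models / "detrender.pkl").touch()
    missing = [
        p.relative_to(run_dir).as_posix()
        for p in missing_artifacts(run_dir, "corn_usa", "2020")
    ]
```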

After cli run forecast, the tree gains the per-(season_year, init_date) subtree introduced in PR #369:

run_dir/forecast/<season_year>/<init_date>/
├── indices.zarr
├── features/pred.parquet
├── preds/walk_forward_preds.parquet
├── postprocessed/national.parquet
└── delivery/Treefera_*_Forecast_*.csv
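Since each forecast invocation gets its own two-level directory, the set of existing slices can be recovered by walking forecast/. This is a sketch under assumptions: list_forecast_slices is a hypothetical helper, and the directory names used in the demo (including the date format) are invented for illustration and treated as opaque strings.

```python
import tempfile
from pathlib import Path


def list_forecast_slices(run_dir: Path) -> list[tuple[str, str]]:
    """Return sorted (season_year, init_date) pairs found under run_dir/forecast/."""
    root = run_dir / "forecast"
    if not root.is_dir():
        return []  # no forecast has been run against this run_dir yet
    return sorted(
        (season.name, init.name)
        for season in root.iterdir() if season.is_dir()
        for init in season.iterdir() if init.is_dir()
    )


with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp)
    empty = list_forecast_slices(run_dir)
    # Assumed example names; the real init_date format is not stated in this page.
    for season_year, init_date in [("2026", "2026-06-01"), ("2025", "2025-09-01"), ("2026", "2026-05-01")]:
        (run_dir / "forecast" / season_year / init_date).mkdir(parents=True)
    slices = list_forecast_slices(run_dir)
```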

Additionally, conformal calibration sidecars (added in PR #361) land at:

run_dir/conformal/<residual_mode>.parquet

and the county selection file at:

run_dir/included_geo_identifiers.txt

Key attributes

Path	Written by	Consumed by
`config_resolved.yaml`	`_create_run_root` (`run_hindcast.py:91`)	All stages (via `_load_config`)
`included_geo_identifiers.txt`	FIT phase (`run_hindcast.py:218`)	PREDICT, POSTPROCESS
`models/{key}/{fold}/`	FIT stage (`run_fit.py`)	PREDICT, POSTPROCESS
`preds/{key}/{fold}/walk_forward_preds.parquet`	PREDICT stage (`run_predict.py`)	POSTPROCESS, DELIVER, EVALUATE
`conformal/{mode}.parquet`	POSTPROCESS stage (`run_meta_models.py:67`)	FORECAST, DELIVER
`postprocessed/national.parquet`	POSTPROCESS stage	DELIVER, EVALUATE
`delivery/Treefera_*.csv`	DELIVER stage (`run_deliver.py`)	Client
`forecast/{season_year}/{init_date}/`	FORECAST pipeline (`run_forecast.py`)	Client
`reports/`	EVALUATE stage (`run_diagnostics.py`)	MLflow, Dashboard
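Read as a dependency graph, the table above implies a partial order between stages. The sketch below encodes only the stage-to-stage rows (external consumers such as Client, MLflow, and Dashboard are omitted) and derives one valid ordering with the standard-library graphlib module. The CONSUMES mapping is a hand transcription made for illustration, not something the pipeline defines.

```python
from graphlib import TopologicalSorter

# Stage -> set of stages whose artefacts it consumes, transcribed from the table.
CONSUMES = {
    "PREDICT": {"FIT"},                     # models/, included_geo_identifiers.txt
    "POSTPROCESS": {"FIT", "PREDICT"},      # models/, walk_forward_preds.parquet
    "EVALUATE": {"PREDICT", "POSTPROCESS"},
    "DELIVER": {"PREDICT", "POSTPROCESS"},
    "FORECAST": {"POSTPROCESS"},            # conformal/{mode}.parquet
}

# static_order() yields every stage after all of the stages it depends on.
order = list(TopologicalSorter(CONSUMES).static_order())
position = {stage: i for i, stage in enumerate(order)}
```

EVALUATE, DELIVER, and FORECAST have no dependencies on one another in the table, so their relative order in the result is not determined by it.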

`_create_run_root`: the creation function

The run root is created in stages/run_hindcast.py:77–98, not in run/runner.py as the orchestrator seed prompt incorrectly stated. The correction is documented in wiki/sources/code/orchestration.md.

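def _create_run_root(config: ExperimentConfig) -> tuple[AnyPath, str]:
    # UTC timestamp at one-second resolution, e.g. "20260505_143022".
    stamp = utcnow().strftime("%Y%m%d_%H%M%S")
    run_root = config.run_dir_base / f"{stamp}_{config.commodity.experiment_key}"
    # exist_ok=True: an existing directory with the same name is reused, not rejected.
    run_root.mkdir(parents=True, exist_ok=True)
    # Point the config at the per-run subtrees under run_root.
    config.models_dir = run_root / "models"
    config.preds_dir = run_root / "preds"
    # ... writes config_resolved.yaml
    return run_root, stamp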

config.run_dir_base resolves from INPUT_DATA_DIR/runs/ (env var anchoring — DESIGN.md Clause 6).

Lifecycle

Created: By _create_run_root at the start of run_hindcast.run() (run_hindcast.py:203), or implicitly on the first ForecastSlice write during cli run forecast --run-dir D.

Populated (stage by stage):

1. config_resolved.yaml: written at creation; never overwritten (resume mode writes config_train.yaml to avoid clobber).
2. FIT phase: fills models/ and preds/ subtrees per fold, writes included_geo_identifiers.txt.
3. POSTPROCESS: fills conformal/, postprocessed/, per-fold bias_corrector.pkl.
4. EVALUATE: fills reports/.
5. DELIVER: fills delivery/.
6. FORECAST: fills forecast/{season_year}/{init_date}/ per invocation.

Consumed: All downstream stages address each other exclusively via run_dir. ExperimentResult.from_run_dir(run_dir) reconstructs the full domain context from the path alone (run_result.py:40).

Destroyed / archived: No automatic cleanup. Run directories accumulate; operators prune by timestamp. The delivery/export.py stage optionally copies delivery CSVs to MODEL_INGESTION_PATH.
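Pruning by timestamp can be done from the directory names alone. The following is a hypothetical dry-run helper, not an existing tool: stale_run_dirs only lists candidates and leaves deletion to the operator. It assumes local paths and treats both the name stamp and now as naive UTC datetimes.

```python
import tempfile
from datetime import datetime, timedelta
from pathlib import Path


def stale_run_dirs(runs_root: Path, now: datetime, max_age: timedelta) -> list[Path]:
    """List run directories whose name timestamp is older than max_age."""
    stale = []
    for child in sorted(runs_root.iterdir()):
        if not child.is_dir():
            continue
        try:
            created = datetime.strptime(child.name[:15], "%Y%m%d_%H%M%S")
        except ValueError:
            continue  # name does not start with a run stamp; leave it alone
        if now - created > max_age:
            stale.append(child)
    return stale


with tempfile.TemporaryDirectory() as tmp:
    runs_root = Path(tmp)
    for name in ("20260101_000000_corn_usa", "20260505_143022_corn_usa", "notes"):
        (runs_root / name).mkdir()
    candidates = stale_run_dirs(runs_root, now=datetime(2026, 5, 6), max_age=timedelta(days=30))
    candidate_names = [p.name for p in candidates]
```

Using the name rather than the filesystem modification time keeps the result stable when later stages, such as a forecast run, write into an old run directory.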

Relationships

Relationship	Entity	Notes
Owned by	`ExperimentResult`	`run_dir` is the primary identity field
Contains	`HindcastSlice` × N	One per CV fold + one production fold
Contains	`ForecastSlice` × M	One per `(season_year, init_date)`
Contains	`CalibrationResult` × K	One parquet sidecar per `residual_mode`
Anchored by	`ExperimentConfig`	`config_resolved.yaml` snapshot; loaded lazily via `_load_config`

Concepts and pipelines (forward refs to P5)

Concept: Walk-forward CV — fold subdirectory naming
Concept: S3 path safety — run_dir may be an s3:// URI; never wrap in pathlib.Path
Concept: Stage isolation — DESIGN.md Clause 34
Pipeline: Hindcast pipeline
Pipeline: Forecast pipeline
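The S3 path safety concept can be demonstrated with the standard library alone. pathlib normalises repeated slashes, so passing an s3:// URI through it silently corrupts the scheme separator. The bucket name below is a placeholder and join_uri is a hypothetical helper; the project itself types run_dir as Path | CloudPath, and plain strings are used here only to keep the sketch dependency-free.

```python
from pathlib import PurePosixPath

uri = "s3://bucket/runs/20260505_143022_corn_usa"  # placeholder bucket name

# pathlib collapses the "//" after the scheme, yielding a string that is no longer a URI.
mangled = str(PurePosixPath(uri))


def join_uri(base: str, *parts: str) -> str:
    """Append path segments to a URI without routing it through pathlib."""
    return "/".join([base.rstrip("/"), *parts])


preds_uri = join_uri(uri, "preds", "corn_usa")
```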

PRs and commits

PR / commit	Relevance
PR-339	9-phase restructure that established `stages/run_hindcast.py` as the sole owner of `_create_run_root`; deleted `main.py` and `forecast/`
PR-361	Added `conformal/{mode}.parquet` sidecars under `run_dir`; dropped `commodity_` prefix from `postprocessed/national.parquet`
PR-369	Restructured forecast sub-tree from `forecast/{init_date}/` to `forecast/{season_year}/{init_date}/`

Open questions

DESIGN.md Clause 5 specifies YYYYMMDD_HHMMSS without the experiment_key suffix, but the implementation appends the key. The design doc should be updated.
Resume mode (--run-dir set on cli run hindcast) is described in DESIGN.md but the current run_hindcast.run() always creates a fresh directory; resume is handled only at the stage level (e.g., cli run forecast --run-dir D).
There is no automated cleanup or retention policy for stale run_dirs; long-running ECS deployments accumulate directories on the attached volume.