INPUT_DATA_DIR Contract¶
What it is¶
INPUT_DATA_DIR is an environment variable that every commodity_hindcast entry point
reads before any filesystem operation. It is the sole source of truth for where
the pipeline finds its input data, writes feature parquets, creates run directories,
and anchors its MLflow SQLite database. No YAML config field named data_root may be
supplied by users; no cwd-relative fallback exists.
This contract is specified in DESIGN.md as Clause 6, quoted verbatim:
"WHEN resolving the experiment data root, the system SHALL require
`INPUT_DATA_DIR` to be set in the environment as the single source of truth — user-facing YAML configs SHALL NOT carry a `data_root` field, and the system SHALL NOT fall back to a cwd-relative default."
Where it lives¶
| Symbol | File | Line |
|---|---|---|
| `require_input_data_dir()` | `config.py` | 50 |
| `ExperimentConfig.data_root` default | `config.py` | 643 |
| `validation_alias` (`INPUT_DATA_DIR` / `data_root`) | `config.py` | 644 |
require_input_data_dir()¶
The sole resolver. Its implementation:
```python
def require_input_data_dir() -> AnyPath:
    value = os.environ.get("INPUT_DATA_DIR", "").strip()
    if not value:
        raise RuntimeError(
            "INPUT_DATA_DIR is not set. Export it (dev: any absolute path "
            "where features/runs/mlruns may live; ECS: the container "
            "mount point) before invoking the pipeline."
        )
    return AnyPath(value)
```
Key properties:
- Returns `AnyPath`, so the result is an `S3Path` when `INPUT_DATA_DIR` is set to `s3://bucket/prefix` (as in the QA ECS environment).
- Raises `RuntimeError` immediately at config-load time, not at the first filesystem call. The error message includes an actionable next step.
- The empty-string check (`.strip()`) guards against `export INPUT_DATA_DIR=""`.
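The properties listed above can be exercised with a small stand-alone sketch. It is not the project's code: `pathlib.PurePosixPath` stands in for `AnyPath` so the snippet runs with the standard library only, the error message is shortened, and `try_resolve` is a hypothetical helper added for illustration.

```python
import os
from pathlib import PurePosixPath


def require_input_data_dir() -> PurePosixPath:
    # Stand-in for the real resolver: PurePosixPath replaces AnyPath (assumption).
    value = os.environ.get("INPUT_DATA_DIR", "").strip()
    if not value:
        raise RuntimeError("INPUT_DATA_DIR is not set.")
    return PurePosixPath(value)


def try_resolve(raw):
    """Hypothetical helper: set (or unset) the variable, then call the resolver."""
    if raw is None:
        os.environ.pop("INPUT_DATA_DIR", None)
    else:
        os.environ["INPUT_DATA_DIR"] = raw
    try:
        return str(require_input_data_dir())
    except RuntimeError as exc:
        return f"RuntimeError: {exc}"


results = {
    "unset": try_resolve(None),
    "empty": try_resolve(""),
    "whitespace": try_resolve("   "),
    "valid": try_resolve("/data/processing/yield_forecast"),
}
for case, outcome in results.items():
    print(f"{case}: {outcome}")
```

In this sketch the unset, empty, and whitespace-only cases all end in the same `RuntimeError`, because the value is stripped before the truthiness check.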
DESIGN.md Clause 6 continues:
"Every entry point (CLI, dashboard, eval shim) shall use this helper. The
`validation_alias = AliasChoices("INPUT_DATA_DIR", "data_root")` is retained only so `config_resolved.yaml` round-trips on reload — it is run state, not a user-facing config source."
The AliasChoices alias means that when config_resolved.yaml is re-loaded by a
downstream stage, the serialised data_root value (written by prepare_hindcast_mlflow)
is accepted as-is without the env var being set again. This is a round-trip mechanism,
not a user-facing override path.
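The round-trip can be illustrated with a minimal Pydantic v2 sketch. `SketchConfig` is a hypothetical, cut-down model (the real `ExperimentConfig` is not reproduced here), `pathlib.Path` stands in for `AnyPath`, and a plain dict stands in for the contents of `config_resolved.yaml`.

```python
import os
from pathlib import Path

from pydantic import AliasChoices, BaseModel, Field


def require_input_data_dir() -> Path:
    # Simplified stand-in for the real resolver; Path replaces AnyPath here.
    value = os.environ.get("INPUT_DATA_DIR", "").strip()
    if not value:
        raise RuntimeError("INPUT_DATA_DIR is not set.")
    return Path(value)


class SketchConfig(BaseModel):
    # Hypothetical model with only the field under discussion.
    data_root: Path = Field(
        default_factory=require_input_data_dir,
        validation_alias=AliasChoices("INPUT_DATA_DIR", "data_root"),
    )


# First load: no data_root is supplied, so the default_factory reads the env var.
os.environ["INPUT_DATA_DIR"] = "/tmp/hindcast"
first = SketchConfig()

# Serialise the resolved state, roughly as a resolved-config file would record it.
resolved = first.model_dump(mode="json")

# Reload without the env var: the serialised data_root key is accepted via the alias,
# so the default_factory is never called.
del os.environ["INPUT_DATA_DIR"]
reloaded = SketchConfig.model_validate(resolved)

print(resolved)
print(reloaded.data_root)
```

The same mechanism is why a `data_root` key in any input mapping would be accepted by the model: the alias cannot tell run state from a hand-written value.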
Per-pipeline values¶
The value of INPUT_DATA_DIR differs by pipeline:
| Pipeline | Typical INPUT_DATA_DIR value |
|---|---|
| `crop_yield` | `/data/processing/yield_forecast` |
| `commodity_hindcast` | The repo root (e.g. `/data/processing/github/treefera-market-insights`) |
For commodity_hindcast, configs reference data using relative prefixes such as
data/nass/..., data/wasde/..., and data/weather/.... These resolve under the
repo root via a data symlink pointing to
treefera-market-insights-commodity-hindcast/data. Feature outputs default to
repo_root/features/ (a real directory, not the data/features symlink).
When running via make, the Makefile runs `cd $(REPO_ROOT)` before invoking the CLI, so
relative paths anchor correctly regardless of the caller's working directory.
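As a rough illustration of how relative prefixes anchor under the data root, the sketch below joins the three top-level prefixes named above onto the example repo root. `PurePosixPath` stands in for `AnyPath`, and the variable names are illustrative rather than taken from the codebase; the project's own resolution logic is not shown here.

```python
from pathlib import PurePosixPath

# Example repo root from the per-pipeline table; PurePosixPath stands in for AnyPath.
data_root = PurePosixPath("/data/processing/github/treefera-market-insights")

# Relative prefixes as they might appear in a config. Anything below these
# top-level folders is deliberately left out.
relative_prefixes = ["data/nass", "data/wasde", "data/weather"]

# Join against data_root, never against the current working directory.
resolved = {prefix: data_root / prefix for prefix in relative_prefixes}

# Feature outputs default to a real directory directly under the root.
features_dir = data_root / "features"

for prefix, path in resolved.items():
    print(f"{prefix} -> {path}")
print(f"features -> {features_dir}")
```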
Run directory anchoring¶
Clause 5 of DESIGN.md specifies:
"WHEN
`RunRunner.run()` is called, the system SHALL resolve a run directory: if `--run_dir` is set, the existing directory is reused (resume mode); otherwise a new timestamped `$INPUT_DATA_DIR/runs/YYYYMMDD_HHMMSS/` directory is created."
In practice the directory name also carries the experiment key as a suffix.
Two concurrent runs of different commodities therefore never collide. Two runs of the same commodity started within the same second would collide, but that case is extremely unlikely and would only matter for stage-level resume, not for a full pipeline run.
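The collision behaviour can be sketched as follows. The exact directory format is an assumption: the snippet uses `<YYYYMMDD_HHMMSS>_<experiment_key>` with an underscore separator, and `run_dir_for`, `corn`, and `soy` are hypothetical names chosen for illustration.

```python
from datetime import datetime
from pathlib import PurePosixPath


def run_dir_for(data_root, experiment_key, now):
    # Assumed layout: <data_root>/runs/<YYYYMMDD_HHMMSS>_<experiment_key>
    stamp = now.strftime("%Y%m%d_%H%M%S")
    return data_root / "runs" / f"{stamp}_{experiment_key}"


root = PurePosixPath("/data/processing/yield_forecast")
start = datetime(2024, 1, 2, 3, 4, 5)

# Different experiment keys in the same second: distinct directories.
corn = run_dir_for(root, "corn", start)
soy = run_dir_for(root, "soy", start)

# Same key, same second (only the microseconds differ): identical directory.
corn_again = run_dir_for(root, "corn", start.replace(microsecond=999_000))

print(corn)
print(soy)
print(corn == corn_again)
```

Second-level timestamp resolution is what makes the same-key, same-second case collide in this sketch.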
Key invariants¶
- `require_input_data_dir()` is the only place that reads `INPUT_DATA_DIR` from the environment. All other code receives it as `config.data_root`.
- User-facing YAML files must never contain a `data_root` key. The only legitimate source of `data_root` is the `default_factory=require_input_data_dir` on `ExperimentConfig.data_root` (`config.py:643`).
- If `INPUT_DATA_DIR` is an `s3://` URI (QA environment), `data_root` is a `CloudPath`; every path-handling call must respect the S3-path-safety rules (see s3_path_safety).
How it interacts with the pipeline¶
At config-load time (the moment any ExperimentConfig is instantiated), Pydantic
calls require_input_data_dir() as the default_factory for data_root. If the env
var is missing the exception surfaces before any stage code runs. The resolved
AnyPath becomes the anchor for all ResolvablePath fields via
_iter_resolvable_fields and _resolve_data_paths. MLflow tracking URI anchoring
(tracking_uri_anchored, lib/tracking/decorators.py:43) also reads config.data_root
to pin mlruns.db next to runs/ and features/.
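To make the anchoring concrete, the sketch below derives the sibling locations from a single data root. It does not reproduce `tracking_uri_anchored`: the `anchored_layout` helper is hypothetical, `PurePosixPath` stands in for `AnyPath`, and the `sqlite:///<path>` form of the tracking URI is an assumption based on the usual SQLAlchemy-style SQLite URL.

```python
from pathlib import PurePosixPath


def anchored_layout(data_root):
    """Hypothetical helper: derive every anchored location from one root."""
    db_path = data_root / "mlruns.db"
    return {
        "features": data_root / "features",
        "runs": data_root / "runs",
        "mlruns_db": db_path,
        # Assumed URI shape; an absolute path yields four slashes in total.
        "tracking_uri": f"sqlite:///{db_path}",
    }


layout = anchored_layout(PurePosixPath("/data/processing/yield_forecast"))
for name, location in layout.items():
    print(f"{name}: {location}")
```

Because everything hangs off one root, changing `INPUT_DATA_DIR` moves features, run directories, and the tracking database together.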
Pitfalls¶
- Setting `INPUT_DATA_DIR` to a path that ends with a trailing slash does not cause a bug (AnyPath normalises it), but it produces double-slash run dirs in log messages.
- In development, accidentally running a script from outside the repo root without exporting `INPUT_DATA_DIR` produces an immediate `RuntimeError`. This is by design.
- In ECS, if the volume is mounted at a different path from local dev, feature parquets and run dirs land at different absolute paths. `config_resolved.yaml` records the actual resolved path, so it is always self-describing even if the mount point changes between environments.
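The trailing-slash pitfall can be reproduced in isolation. The sketch uses `PurePosixPath` as a stand-in for `AnyPath` and assumes the double slash comes from formatting the raw environment string directly; where the pipeline's log messages actually do this is not shown in this document.

```python
from pathlib import PurePosixPath

raw = "/data/processing/yield_forecast/"  # note the trailing slash

# Path objects drop the trailing slash before joining.
as_path = PurePosixPath(raw) / "runs"

# Naive string formatting keeps it, producing a double slash.
as_string = f"{raw}/runs"

print(as_path)
print(as_string)
```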
Related entities and concepts¶
- RunDir: created under `INPUT_DATA_DIR/runs/`
- s3_path_safety: how S3 URIs are handled when `INPUT_DATA_DIR=s3://...`
- DESIGN.md: Clause 6 is the canonical requirement
Open questions¶
- The `AliasChoices("INPUT_DATA_DIR", "data_root")` alias on `ExperimentConfig.data_root` means that setting `data_root=...` in a YAML config technically works (pydantic accepts it). The design intent is that this path is only used by `config_resolved.yaml` round-trips; there is no runtime guard preventing a user from abusing it to set an arbitrary data root via YAML.
- The `_apply_input_data_dir_as_data_root_for_run_all` path in README.md suggests there is a separate override path for `run all`; its interaction with `require_input_data_dir` is not fully documented in DESIGN.md.