INPUT_DATA_DIR Contract

What it is

INPUT_DATA_DIR is an environment variable that every commodity_hindcast entry point reads before any filesystem operation. It is the sole source of truth for where the pipeline finds its input data, writes feature parquets, creates run directories, and anchors its MLflow SQLite database. No YAML config field named data_root may be supplied by users; no cwd-relative fallback exists.

This contract is specified in DESIGN.md as Clause 6, quoted verbatim:

"WHEN resolving the experiment data root, the system SHALL require INPUT_DATA_DIR to be set in the environment as the single source of truth — user-facing YAML configs SHALL NOT carry a data_root field, and the system SHALL NOT fall back to a cwd-relative default."

Where it lives

Symbol                                         File       Line
require_input_data_dir()                       config.py  50
ExperimentConfig.data_root default             config.py  643
validation_alias (INPUT_DATA_DIR / data_root)  config.py  644

require_input_data_dir()

The sole resolver. Its implementation:

def require_input_data_dir() -> AnyPath:
    value = os.environ.get("INPUT_DATA_DIR", "").strip()
    if not value:
        raise RuntimeError(
            "INPUT_DATA_DIR is not set. Export it (dev: any absolute path "
            "where features/runs/mlruns may live; ECS: the container "
            "mount point) before invoking the pipeline."
        )
    return AnyPath(value)

Key properties:

  • Returns AnyPath, so the result is an S3Path when INPUT_DATA_DIR is set to s3://bucket/prefix (as in the QA ECS environment).
  • Raises RuntimeError immediately at config-load time, not at the first filesystem call. The error message includes an actionable next step.
  • The .strip() call followed by the "if not value" check treats an empty or whitespace-only value (for example export INPUT_DATA_DIR="") the same as an unset variable.
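
The properties above can be exercised with a minimal sketch. It is not the project's code: pathlib.Path stands in for AnyPath so that it runs with the standard library only, the error message is shortened, and try_resolve is a hypothetical helper added for the demonstration.

```python
import os
from pathlib import Path


def require_input_data_dir() -> Path:
    # Stand-in for the real helper: Path replaces AnyPath (assumption).
    value = os.environ.get("INPUT_DATA_DIR", "").strip()
    if not value:
        raise RuntimeError("INPUT_DATA_DIR is not set.")
    return Path(value)


def try_resolve(raw):
    """Hypothetical demo helper: set (or unset) the variable, then resolve it."""
    if raw is None:
        os.environ.pop("INPUT_DATA_DIR", None)
    else:
        os.environ["INPUT_DATA_DIR"] = raw
    try:
        return str(require_input_data_dir())
    except RuntimeError as exc:
        return f"RuntimeError: {exc}"


# Unset, empty, and whitespace-only values all fail; a real path resolves.
for raw in (None, "", "   ", "/tmp/hindcast"):
    print(repr(raw), "->", try_resolve(raw))
```

Only the last value resolves; the first three all take the RuntimeError branch.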

DESIGN.md Clause 6 continues:

"Every entry point (CLI, dashboard, eval shim) shall use this helper. The validation_alias = AliasChoices("INPUT_DATA_DIR", "data_root") is retained only so config_resolved.yaml round-trips on reload — it is run state, not a user-facing config source."

The AliasChoices alias means that when config_resolved.yaml is re-loaded by a downstream stage, the serialised data_root value (written by prepare_hindcast_mlflow) is accepted as-is without the env var being set again. This is a round-trip mechanism, not a user-facing override path.
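
The round-trip can be illustrated without Pydantic. The sketch below only emulates the lookup order implied by AliasChoices("INPUT_DATA_DIR", "data_root") combined with the default factory; load_data_root is a hypothetical function, the plain dict stands in for a parsed config_resolved.yaml, and pathlib.Path stands in for AnyPath.

```python
import os
from pathlib import Path

# Same order as the AliasChoices on ExperimentConfig.data_root.
ALIASES = ("INPUT_DATA_DIR", "data_root")


def require_input_data_dir() -> Path:
    # Stand-in for the real helper: Path replaces AnyPath (assumption).
    value = os.environ.get("INPUT_DATA_DIR", "").strip()
    if not value:
        raise RuntimeError("INPUT_DATA_DIR is not set.")
    return Path(value)


def load_data_root(serialised: dict) -> Path:
    """Hypothetical emulation: use an alias key if present, else the env helper."""
    for key in ALIASES:
        if key in serialised:
            return Path(serialised[key])
    return require_input_data_dir()


# First load: the mapping has no alias key, so the environment supplies the value.
os.environ["INPUT_DATA_DIR"] = "/tmp/hindcast"
first = load_data_root({})

# Reload of a resolved config: the serialised value is used, no env var needed.
del os.environ["INPUT_DATA_DIR"]
reloaded = load_data_root({"data_root": str(first)})
print(first == reloaded)
```

The same lookup is what makes a data_root key in a hand-written YAML file technically effective, which is why the design treats the alias as run state only.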

Per-pipeline values

The value of INPUT_DATA_DIR differs by pipeline:

Pipeline            Typical INPUT_DATA_DIR value
crop_yield          /data/processing/yield_forecast
commodity_hindcast  The repo root (e.g. /data/processing/github/treefera-market-insights)

For commodity_hindcast, configs reference data using relative prefixes such as data/nass/..., data/wasde/..., and data/weather/.... These resolve under the repo root via a data symlink pointing to treefera-market-insights-commodity-hindcast/data. Feature outputs default to repo_root/features/ (a real directory, not the data/features symlink).

When running via make, the Makefile runs cd $(REPO_ROOT) before invoking the CLI, so relative paths anchor correctly regardless of the caller's working directory.

Run directory anchoring

Clause 5 of DESIGN.md specifies:

"WHEN RunRunner.run() is called, the system SHALL resolve a run directory: if --run_dir is set, the existing directory is reused (resume mode); otherwise a new timestamped $INPUT_DATA_DIR/runs/YYYYMMDD_HHMMSS/ directory is created."

In practice the naming includes the experiment key as a suffix:

<INPUT_DATA_DIR>/runs/<YYYYMMDD_HHMMSS>_<experiment_key>/

Because the experiment key is part of the name, two concurrent runs of different commodities never collide. Two runs of the same commodity started within the same second would resolve to the same directory. That case is extremely unlikely, and it would only be a problem at stage-level resume, not for a full pipeline run.
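
The naming scheme and its collision behaviour can be sketched as follows. run_dir_for is a hypothetical function, not the RunRunner code, and the experiment keys "corn" and "soy" are made up for the example.

```python
from datetime import datetime
from pathlib import PurePosixPath


def run_dir_for(data_root: str, experiment_key: str, started_at: datetime) -> PurePosixPath:
    """Hypothetical helper: <data_root>/runs/<YYYYMMDD_HHMMSS>_<experiment_key>."""
    stamp = started_at.strftime("%Y%m%d_%H%M%S")  # one-second resolution
    return PurePosixPath(data_root) / "runs" / f"{stamp}_{experiment_key}"


t = datetime(2024, 1, 2, 3, 4, 5)
corn = run_dir_for("/tmp/hindcast", "corn", t)
soy = run_dir_for("/tmp/hindcast", "soy", t)
# Same commodity, same second, different sub-second start time.
corn_again = run_dir_for("/tmp/hindcast", "corn", t.replace(microsecond=999_000))

print(corn)                # /tmp/hindcast/runs/20240102_030405_corn
print(corn != soy)         # True: different keys never collide
print(corn == corn_again)  # True: same key within one second collides
```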

Key invariants

  • require_input_data_dir() is the only place that reads INPUT_DATA_DIR from the environment. All other code receives it as config.data_root.
  • User-facing YAML files must never contain a data_root key. The only legitimate source of data_root is the default_factory=require_input_data_dir on ExperimentConfig.data_root (config.py:643).
  • If INPUT_DATA_DIR is an s3:// URI (QA environment), data_root is a CloudPath; every path-handling call must respect the S3-path-safety rules (see s3_path_safety).
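
The second invariant is a convention rather than something the loader enforces. As an illustration only, a check on a parsed user config could look like the sketch below; reject_user_data_root does not exist in the codebase, and the "crop" key is invented for the example.

```python
FORBIDDEN_KEYS = ("INPUT_DATA_DIR", "data_root")


def reject_user_data_root(user_config: dict, source: str = "<config>") -> dict:
    """Hypothetical guard: refuse a user config that carries a data root."""
    found = sorted(key for key in FORBIDDEN_KEYS if key in user_config)
    if found:
        raise ValueError(
            f"{source}: {', '.join(found)} must not be set in a user-facing "
            "config; export INPUT_DATA_DIR instead."
        )
    return user_config


print(reject_user_data_root({"crop": "corn"}))
try:
    reject_user_data_root({"crop": "corn", "data_root": "/somewhere"}, "my.yaml")
except ValueError as exc:
    print(exc)
```

A check like this would have to be skipped when reloading config_resolved.yaml, where a serialised data_root is expected.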

How it interacts with the pipeline

At config-load time (the moment any ExperimentConfig is instantiated), Pydantic calls require_input_data_dir() as the default_factory for data_root. If the env var is missing the exception surfaces before any stage code runs. The resolved AnyPath becomes the anchor for all ResolvablePath fields via _iter_resolvable_fields and _resolve_data_paths. MLflow tracking URI anchoring (tracking_uri_anchored, lib/tracking/decorators.py:43) also reads config.data_root to pin mlruns.db next to runs/ and features/.
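
The fail-at-instantiation behaviour can be reproduced with a standard-library dataclass, whose default_factory runs at construction time in the same way. ExperimentConfigSketch and its three properties are hypothetical; they only mirror the runs/, features/ and mlruns.db locations named in this page, and pathlib.Path stands in for AnyPath.

```python
import os
from dataclasses import dataclass, field
from pathlib import Path


def require_input_data_dir() -> Path:
    # Stand-in for the real helper: Path replaces AnyPath (assumption).
    value = os.environ.get("INPUT_DATA_DIR", "").strip()
    if not value:
        raise RuntimeError("INPUT_DATA_DIR is not set.")
    return Path(value)


@dataclass
class ExperimentConfigSketch:
    # The factory runs when the object is built, before any stage code.
    data_root: Path = field(default_factory=require_input_data_dir)

    @property
    def runs_dir(self) -> Path:
        return self.data_root / "runs"

    @property
    def features_dir(self) -> Path:
        return self.data_root / "features"

    @property
    def mlruns_db(self) -> Path:
        return self.data_root / "mlruns.db"


os.environ.pop("INPUT_DATA_DIR", None)
try:
    ExperimentConfigSketch()
except RuntimeError as exc:
    print("config load failed:", exc)

os.environ["INPUT_DATA_DIR"] = "/tmp/hindcast"
cfg = ExperimentConfigSketch()
print(cfg.runs_dir, cfg.features_dir, cfg.mlruns_db)
```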

Pitfalls

  • Setting INPUT_DATA_DIR to a path that ends with a trailing slash does not cause a bug (AnyPath normalises it), but it produces double-slash run dirs in log messages.
  • In development, accidentally running a script from outside the repo root without exporting INPUT_DATA_DIR produces an immediate RuntimeError. This is by design.
  • In ECS, if the volume is mounted at a different path from local dev, feature parquets and run dirs land at different absolute paths. config_resolved.yaml records the actual resolved path, so it is always self-describing even if the mount point changes between environments.

See also

  • RunDir: created under INPUT_DATA_DIR/runs/
  • s3_path_safety: how S3 URIs are handled when INPUT_DATA_DIR=s3://...
  • DESIGN.md: Clause 6 is the canonical requirement

Open questions

  • The AliasChoices("INPUT_DATA_DIR", "data_root") alias on ExperimentConfig.data_root means that setting data_root=... in a YAML config technically works, because Pydantic accepts it. The design intent is that this path is used only for config_resolved.yaml round-trips, but no runtime guard prevents a user from abusing it to set an arbitrary data root via YAML.
  • README.md mentions _apply_input_data_dir_as_data_root_for_run_all, which suggests a separate override path for run all. Its interaction with require_input_data_dir is not fully documented in DESIGN.md.