S3 Path Safety

What it is

The commodity_hindcast pipeline can operate against either a local filesystem or an S3 bucket. To do so safely it relies on a three-component scheme:

  1. cloudpathlib.AnyPath — a factory that returns either a pathlib.Path or a cloudpathlib.S3Path depending on whether the input string starts with s3://. Functions that accept paths type-hint as Path | AnyPath.

  2. ResolvablePath (lib/path_utils.py:60) — an Annotated[AnyPath, _ResolveAgainstDataRoot()] type alias. Pydantic fields that carry this annotation are auto-discovered by _iter_resolvable_fields and resolved via resolve_data_path at config-validation time. S3 URIs and absolute paths pass through unchanged; relative paths are anchored at data_root (which is always INPUT_DATA_DIR).

  3. AnyPathParam (in cli.py) — a custom Click ParamType that validates path arguments via cloudpathlib rather than os.path.exists, making any CLI argument that carries a run_dir or file path usable with s3:// URIs.

The formal design authority is DESIGN.md Clauses 27, 28, and 29.
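
The pass-through and anchoring rule described for component 2 can be sketched with plain strings. This is a minimal illustration, not the code in lib/path_utils.py: the function name resolve_data_path_sketch and its string-based signature are assumptions made for the example (the real function is described as working with AnyPath values).

```python
from pathlib import PurePosixPath


def resolve_data_path_sketch(value: str, data_root: str) -> str:
    """Hypothetical sketch: anchor relative paths at data_root, pass the rest through."""
    if value.startswith("s3://"):
        return value  # S3 URIs pass through unchanged
    if PurePosixPath(value).is_absolute():
        return value  # absolute local paths pass through unchanged
    if data_root.startswith("s3://"):
        # Join by hand so the "s3://" scheme is not mangled by path normalisation.
        return data_root.rstrip("/") + "/" + value
    return str(PurePosixPath(data_root) / value)
```

With a local data_root the relative input "prices/a.parquet" becomes "/data/prices/a.parquet"; with an S3 data_root the same input becomes an s3:// URI under that root.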

Where it lives

Symbol                    File               Line
resolve_data_path         lib/path_utils.py  27
_ResolveAgainstDataRoot   lib/path_utils.py  46
ResolvablePath            lib/path_utils.py  60
_iter_resolvable_fields   lib/path_utils.py  63
AnyPathParam              cli.py             (see PR-345)

The three-layer bug stack (PR-345)

PR #345 (tl/fix-path-issues, merged 2026-04-29) fixed three cascading bugs that together prevented cli predict s3://bucket/run_dir ... from working. The PR body describes them as a literal stack: fixing each layer exposed the next.

┌──────────────────────────────────────────────────────────────┐
│ Layer 1 — stages/run_predict.py                              │
│   Path(AnyPath('s3://x')) collapses the URI to a local       │
│   cache path; .is_dir() then queries an empty /tmp dir       │
│   FIX (e45bb6b2): drop the Path() wrapper                    │
└──────────────────────────────────────────────────────────────┘
                           ▼ (fix uncovers the next layer)
┌──────────────────────────────────────────────────────────────┐
│ Layer 2 — stages/run_predict.py                              │
│   pl.scan_parquet(<S3Path>) raises TypeError;                │
│   polars wants a URI string, not a CloudPath object          │
│   FIX (1c11d710): wrap with str()                            │
└──────────────────────────────────────────────────────────────┘
                           ▼ (fix uncovers the next layer)
┌──────────────────────────────────────────────────────────────┐
│ Layer 3 — cli.py                                             │
│   click.Path(exists=True) validates against the local fs,    │
│   rejecting any s3:// URI before the command body runs       │
│   FIX (94daa063): custom AnyPathParam click ParamType        │
└──────────────────────────────────────────────────────────────┘

Layer 1 — Path(AnyPath('s3://...')) collapses to local cache

cloudpathlib.AnyPath('s3://bucket/key') returns an S3Path. Wrapping it in pathlib.Path(...) calls os.fspath(), which returns the local cache path (e.g. /tmp/tmpXXX/bucket/key). The directory has not been populated; .is_dir() queries an empty /tmp directory and returns False. This was the root cause of the QA failure described in DESIGN.md Clause 27. Fix: drop three Path(...) wrappers in stages/run_predict.py; widen type hints to Path | AnyPath.
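
The collapse can be reproduced with the standard library alone, because it comes from the os.fspath() protocol rather than from anything S3-specific. In the sketch below, FakeS3Path is a hypothetical stand-in written for this example; it is not cloudpathlib's S3Path, and the cache layout it uses is an assumption.

```python
import tempfile
from pathlib import Path


class FakeS3Path:
    """Hypothetical stand-in for a cloud path object (not cloudpathlib itself)."""

    def __init__(self, uri: str, cache_dir: str) -> None:
        self.uri = uri
        self.cache_dir = cache_dir

    def __str__(self) -> str:
        return self.uri  # the string form keeps the s3:// URI

    def __fspath__(self) -> str:
        # pathlib.Path(...) and os.fspath(...) call this hook; it yields a local path.
        return f"{self.cache_dir}/{self.uri.removeprefix('s3://')}"


def demo() -> tuple[str, bool, bool]:
    """Return (URI string, collapsed path still looks like a URI, collapsed.is_dir())."""
    with tempfile.TemporaryDirectory() as cache_dir:
        remote = FakeS3Path("s3://bucket/run_dir", cache_dir)
        collapsed = Path(remote)  # the Layer 1 mistake: URI becomes a cache path
        return str(remote), str(collapsed).startswith("s3://"), collapsed.is_dir()
```

Nothing raises here: the wrapped value is a valid local path that simply does not exist, which is why the check fails quietly instead of producing an error.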

Layer 2 — Polars rejects CloudPath objects

Polars accepts str | Path | list | IO[bytes] | bytes and supports s3:// URIs natively via its Rust object_store backend — but the URI must be a string, not a cloudpathlib CloudPath object:

pl.scan_parquet(<S3Path>)  ->  TypeError: Object does not have a .read() method.
pl.scan_parquet(str(...))  ->  OK, returned LazyFrame

The canonical pattern (DESIGN.md Clause 29) is pl.scan_parquet(str(path)).

Layer 3 — click.Path(exists=True) rejects s3:// URIs

Click's built-in click.Path validates via os.path.exists (local filesystem only). It rejected s3:// URIs before the command body ran, even with path_type=AnyPath. The fix replaces 13 click.Path(path_type=AnyPath, ...) call sites with AnyPathParam, which delegates existence checks to cloudpathlib so that S3Path.exists() queries S3 and Path.exists() queries the local filesystem transparently.

The _iter_resolvable_fields auto-discovery mechanism

lib/path_utils.py:63–87 contains the walker that makes preflight coverage self-maintaining. Its signature is:

def _iter_resolvable_fields(model: BaseModel) -> Iterator[tuple[BaseModel, str]]:

It walks the Pydantic model tree by inspecting get_type_hints(type(model), include_extras=True) and yielding every (owner, field_name) pair whose annotation carries _ResolveAgainstDataRoot. It recurses into:

  • Nested BaseModel instances (e.g. ForecastConfig inside ExperimentConfig)
  • dict values that are BaseModel instances
  • list items that are BaseModel instances

None values are skipped. Two consumers iterate the emitted pairs:

  1. ExperimentConfig._resolve_data_paths — resolves each field in place at config validation time.
  2. preflight_paths_for_* — generates the existence-check set for each stage, so that adding a new ResolvablePath field automatically extends preflight coverage without any hand-maintained list (DESIGN.md Clause 31).
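
The walk described above can be approximated with the standard library, using dataclasses in place of Pydantic models. Every name in this sketch (Marker, Resolvable, Forecast, Experiment, iter_resolvable_fields) is a hypothetical stand-in chosen for illustration; only the use of get_type_hints(..., include_extras=True) and the recursion rules are taken from the description.

```python
from dataclasses import dataclass, field, fields, is_dataclass
from typing import Annotated, Any, Iterator, Optional, get_args, get_origin, get_type_hints


class Marker:
    """Hypothetical stand-in for _ResolveAgainstDataRoot."""


Resolvable = Annotated[str, Marker()]  # stand-in for ResolvablePath


def is_marked(hint: Any) -> bool:
    """True if the annotation carries Marker, looking through wrappers like Optional."""
    if get_origin(hint) is Annotated:
        return any(isinstance(meta, Marker) for meta in get_args(hint)[1:])
    return any(is_marked(arg) for arg in get_args(hint))


def iter_resolvable_fields(model: Any) -> Iterator[tuple[Any, str]]:
    """Yield (owner, field_name) for every marked, non-None field in the tree."""
    hints = get_type_hints(type(model), include_extras=True)
    for f in fields(model):
        value = getattr(model, f.name)
        if value is None:
            continue  # None values are skipped
        if is_marked(hints[f.name]):
            yield model, f.name
            continue
        if isinstance(value, dict):
            children = list(value.values())
        elif isinstance(value, list):
            children = value
        else:
            children = [value]
        for child in children:
            if is_dataclass(child) and not isinstance(child, type):
                yield from iter_resolvable_fields(child)


@dataclass
class Forecast:
    weights: Resolvable = "weights.parquet"
    horizon: int = 3


@dataclass
class Experiment:
    prices: Resolvable = "prices.parquet"
    calendar: Optional[Resolvable] = None
    forecast: Forecast = field(default_factory=Forecast)
    by_name: dict[str, Forecast] = field(default_factory=dict)
    stages: list[Forecast] = field(default_factory=list)


config = Experiment(by_name={"a": Forecast("a.parquet")}, stages=[Forecast("s.parquet")])
found = [getattr(owner, name) for owner, name in iter_resolvable_fields(config)]
```

Because discovery is driven by the annotation, adding another Resolvable field to Forecast would extend found with no change to the walker. Like the walker described in this section, the sketch does not descend into tuples.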

Key invariants

  • Never wrap AnyPath / S3Path in pathlib.Path(...): __fspath__ returns the local cache path, not the URI.
  • Always pass str(path) to pl.scan_parquet or pd.read_parquet: polars and pandas reject CloudPath objects.
  • Never call .as_posix() on an AnyPath that may be an S3Path: .as_posix() is a pathlib.Path method and resolves to the cache path on S3Path.
  • Never anchor SQLite/lock files under data_root in QA when INPUT_DATA_DIR is an s3:// URI: SQLite cannot live on object storage. The tracking_uri_anchored function (lib/tracking/decorators.py:43) detects a CloudPath anchor and emits a warning, leaving the URI cwd-relative rather than crashing.
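
The last invariant can be illustrated with a small sketch of the warn-and-fall-back behaviour. This is not the code in lib/tracking/decorators.py: the function name anchor_tracking_uri, its string-based signature, and the exact anchoring rule for sqlite:/// URIs are assumptions made for the example.

```python
import warnings
from pathlib import PurePosixPath

SQLITE_PREFIX = "sqlite:///"


def anchor_tracking_uri(uri: str, data_root: str) -> str:
    """Hypothetical sketch: anchor a relative SQLite URI at a local data_root only."""
    if not uri.startswith(SQLITE_PREFIX):
        return uri  # e.g. an HTTP(S) tracking server; nothing to anchor
    db_path = uri[len(SQLITE_PREFIX):]
    if PurePosixPath(db_path).is_absolute():
        return uri  # already a stable absolute location
    if data_root.startswith("s3://"):
        # SQLite cannot live on object storage: warn and leave the URI cwd-relative.
        warnings.warn(f"cannot anchor {uri} under {data_root}", stacklevel=2)
        return uri
    return SQLITE_PREFIX + str(PurePosixPath(data_root) / db_path)
```

Under these assumptions "sqlite:///mlruns.db" with a local root of /data becomes "sqlite:////data/mlruns.db", while the same URI with an S3 root is returned unchanged after a warning.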

How it interacts with the pipeline

Every stage that reads a config-derived path goes through resolve_data_path (called by ExperimentConfig._resolve_data_paths during Pydantic validation). The stage code itself never calls resolve_data_path directly; it receives already-resolved ResolvablePath values from the config. CLI arguments that accept a run_dir or any filesystem path go through AnyPathParam.convert, which validates existence via the appropriate backend before the command body runs.

Pitfalls

  • Because tracking_uri_anchored falls back instead of failing, in QA (where INPUT_DATA_DIR is an S3 URI) mlruns.db resolves against the process cwd. Set mlflow_tracking_uri to an absolute local path (e.g. sqlite:////tmp/mlruns.db) or an HTTP(S) MLflow server to get a stable location.
  • The cloudpathlib local cache path is a valid pathlib.Path; downstream .exists() calls on it return False but do not raise, making the bug silent.
  • The str(path) requirement is not limited to pl.scan_parquet: pd.read_parquet and pl.read_parquet also need str(path) for S3. The Clause 29 pattern applies to all polars/pandas readers.

PRs

PR Relevance
PR-345 Introduced AnyPathParam; dropped Path(AnyPath(...)) wrappers; added str() to polars calls

Open questions

  • The _iter_resolvable_fields walker does not handle tuple containers of BaseModel; if a future config field is typed tuple[SomeModel, ...] and SomeModel carries ResolvablePath fields, those fields will be silently skipped.
  • AnyPathParam replaces 13 call sites as of PR-345; any new CLI run subcommand that adds a path argument must use it explicitly — there is no enforcement mechanism.