S3 Path Safety

What it is

The commodity_hindcast pipeline can operate against either a local filesystem or an S3 bucket. To do so safely it relies on a three-component scheme:

cloudpathlib.AnyPath: a factory that returns either a pathlib.Path or a cloudpathlib.S3Path depending on whether the input string starts with s3://. Functions that accept paths annotate them as Path | AnyPath.
ResolvablePath (lib/path_utils.py:60): an Annotated[AnyPath, _ResolveAgainstDataRoot()] type alias. Pydantic fields that carry this annotation are auto-discovered by _iter_resolvable_fields and resolved via resolve_data_path at config-validation time. S3 URIs and absolute paths pass through unchanged; relative paths are anchored at data_root (which is always INPUT_DATA_DIR).
AnyPathParam (in cli.py): a custom Click ParamType that validates path arguments via cloudpathlib rather than os.path.exists, so that any CLI argument carrying a run_dir or file path also accepts s3:// URIs.

The formal design authority is DESIGN.md Clauses 27, 28, and 29.
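
The resolution rule stated for ResolvablePath fields can be illustrated with a minimal, stdlib-only sketch. The function name `resolve_against_root` and its string-based signature are assumptions made for illustration; the real `resolve_data_path` works on AnyPath values and may differ in detail.

```python
from pathlib import PurePosixPath


def resolve_against_root(raw: str, data_root: str) -> str:
    """Hypothetical stand-in for resolve_data_path, using plain strings."""
    # S3 URIs are already fully qualified: pass through unchanged.
    if raw.startswith("s3://"):
        return raw
    # Absolute local paths also pass through unchanged.
    if PurePosixPath(raw).is_absolute():
        return raw
    # Relative paths are anchored at data_root, which may itself be an S3 URI.
    return data_root.rstrip("/") + "/" + raw


print(resolve_against_root("s3://bucket/prices.parquet", "/data"))
print(resolve_against_root("/abs/prices.parquet", "/data"))
print(resolve_against_root("prices/2024.parquet", "/data"))
print(resolve_against_root("prices/2024.parquet", "s3://bucket/input"))
```

The last call shows why the anchor itself must never be coerced to a local path: when data_root is an S3 URI, the resolved value has to remain a URI.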

Where it lives

Symbol	File	Line
`resolve_data_path`	`lib/path_utils.py`	27
`_ResolveAgainstDataRoot`	`lib/path_utils.py`	46
`ResolvablePath`	`lib/path_utils.py`	60
`_iter_resolvable_fields`	`lib/path_utils.py`	63
`AnyPathParam`	`cli.py`	(see PR-345)

The three-layer bug stack (PR-345)

PR #345 (tl/fix-path-issues, merged 2026-04-29) fixed three cascading bugs that together prevented cli predict s3://bucket/run_dir ... from working. The PR body describes them as a literal stack: fixing each layer exposed the next.

┌──────────────────────────────────────────────────────────────┐
│ Layer 1 — stages/run_predict.py                              │
│   Path(AnyPath('s3://x')) collapses the URI to a local       │
│   cache path; .is_dir() then queries an empty /tmp dir       │
│   FIX (e45bb6b2): drop the Path() wrapper                    │
└──────────────────────────────────────────────────────────────┘
                           │
                           ▼ (fix uncovers the next layer)
┌──────────────────────────────────────────────────────────────┐
│ Layer 2 — stages/run_predict.py                              │
│   pl.scan_parquet(<S3Path>) raises TypeError;                │
│   polars wants a URI string, not a CloudPath object          │
│   FIX (1c11d710): wrap with str()                            │
└──────────────────────────────────────────────────────────────┘
                           │
                           ▼ (fix uncovers the next layer)
┌──────────────────────────────────────────────────────────────┐
│ Layer 3 — cli.py                                             │
│   click.Path(exists=True) validates against the local fs,    │
│   rejecting any s3:// URI before the command body runs       │
│   FIX (94daa063): custom AnyPathParam click ParamType        │
└──────────────────────────────────────────────────────────────┘

Layer 1: `Path(AnyPath('s3://...'))` collapses to local cache

cloudpathlib.AnyPath('s3://bucket/key') returns an S3Path. Wrapping it in pathlib.Path(...) calls os.fspath(), which returns the local cache path (e.g. /tmp/tmpXXX/bucket/key). The directory has not been populated; .is_dir() queries an empty /tmp directory and returns False. This was the root cause of the QA failure described in DESIGN.md Clause 27. Fix: drop three Path(...) wrappers in stages/run_predict.py; widen type hints to Path | AnyPath.
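
The collapse can be reproduced without cloudpathlib or network access. The sketch below uses a hypothetical stand-in class, `FakeS3Path`, whose `__fspath__` mimics the behaviour described above (returning a path inside a local cache directory); it is not the cloudpathlib implementation.

```python
import os
import tempfile
from pathlib import Path


class FakeS3Path:
    """Hypothetical stand-in for cloudpathlib.S3Path (no network access)."""

    def __init__(self, uri: str, cache_root: str) -> None:
        self.uri = uri
        self._cache_root = cache_root

    def __str__(self) -> str:
        # str() keeps the URI intact.
        return self.uri

    def __fspath__(self) -> str:
        # os.fspath() yields a path in the local cache, not the URI.
        return os.path.join(self._cache_root, self.uri.removeprefix("s3://"))


cache_root = tempfile.mkdtemp()  # an empty local cache directory
run_dir = FakeS3Path("s3://bucket/run_dir", cache_root)

collapsed = Path(run_dir)        # the bug: Path() consumes __fspath__
print(str(run_dir))              # the URI survives str()
print(collapsed)                 # a path under the cache directory
print(collapsed.is_dir())        # nothing was downloaded, so this is False
```

No exception is raised at any point, which is why the failure only surfaced as a missing run directory.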

Layer 2: Polars rejects `CloudPath` objects

Polars accepts str | Path | list | IO[bytes] | bytes and supports s3:// URIs natively via its Rust object_store backend — but the URI must be a string, not a cloudpathlib CloudPath object:

pl.scan_parquet(<S3Path>)  ->  TypeError: Object does not have a .read() method.
pl.scan_parquet(str(...))  ->  OK, returned LazyFrame

The canonical pattern (DESIGN.md Clause 29) is pl.scan_parquet(str(path)).
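
One way to keep the Clause 29 pattern in a single place is a small helper applied at every reader call site. This is only a sketch: `reader_source` and the stand-in class `UriPath` are hypothetical names, not part of the pipeline, and the polars call is shown as a comment so the block runs without polars installed.

```python
from pathlib import Path
from typing import Union


class UriPath:
    """Hypothetical stand-in for a CloudPath: str() returns the s3:// URI."""

    def __init__(self, uri: str) -> None:
        self.uri = uri

    def __str__(self) -> str:
        return self.uri


def reader_source(path: Union[str, Path, UriPath]) -> str:
    """Return the string form that a parquet reader should receive."""
    # str() preserves the URI of a cloud path and the plain path of a local one.
    return str(path)


print(reader_source(UriPath("s3://bucket/run_dir/preds.parquet")))
print(reader_source(Path("run_dir") / "preds.parquet"))

# With polars installed, a call site would then read:
#     pl.scan_parquet(reader_source(path))
```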

Layer 3: `click.Path(exists=True)` rejects `s3://` URIs

Click's built-in click.Path validates via os.path.exists (local filesystem only). It rejected s3:// URIs before the command body ran, even with path_type=AnyPath. The fix replaces 13 click.Path(path_type=AnyPath, ...) call sites with AnyPathParam, which delegates existence checks to cloudpathlib so that S3Path.exists() queries S3 and Path.exists() queries the local filesystem transparently.
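
The backend-aware existence check at the heart of this fix can be sketched with the standard library alone. Everything below is illustrative: `validate_path_argument`, `fake_s3_exists`, and `FAKE_BUCKET` are hypothetical names, and the S3 lookup is simulated with an in-memory set. In a Click ParamType, logic of this kind would sit in `convert(self, value, param, ctx)` and report errors through `self.fail(...)` rather than raising ValueError.

```python
import os
import tempfile
from typing import Callable

FAKE_BUCKET = {"s3://bucket/run_dir"}  # hypothetical stand-in for objects in S3


def fake_s3_exists(uri: str) -> bool:
    """Stand-in for S3Path.exists(); a real check would query S3."""
    return uri in FAKE_BUCKET


def validate_path_argument(
    value: str, s3_exists: Callable[[str], bool] = fake_s3_exists
) -> str:
    """Check existence with the backend that matches the argument's scheme."""
    if value.startswith("s3://"):
        exists = s3_exists(value)       # object-store check
    else:
        exists = os.path.exists(value)  # local filesystem check
    if not exists:
        raise ValueError(f"path does not exist: {value}")
    return value


local_dir = tempfile.mkdtemp()
print(validate_path_argument(local_dir))
print(validate_path_argument("s3://bucket/run_dir"))
```

The point of the dispatch is that an s3:// argument never reaches os.path.exists, which is the check that rejected it before the fix.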

The `_iter_resolvable_fields` auto-discovery mechanism

lib/path_utils.py:63–87 contains the walker that makes preflight coverage self-maintaining. Its signature is:

def _iter_resolvable_fields(model: BaseModel) -> Iterator[tuple[BaseModel, str]]:

It walks the Pydantic model tree by inspecting get_type_hints(type(model), include_extras=True) and yielding every (owner, field_name) pair whose annotation carries _ResolveAgainstDataRoot. It recurses into:

Nested BaseModel instances (e.g. ForecastConfig inside ExperimentConfig)
dict values that are BaseModel instances
list items that are BaseModel instances

None values are skipped. Two consumers iterate the emitted pairs:

ExperimentConfig._resolve_data_paths — resolves each field in place at config validation time.
preflight_paths_for_* — generates the existence-check set for each stage, so that adding a new ResolvablePath field automatically extends preflight coverage without any hand-maintained list (DESIGN.md Clause 31).

Key invariants

Never wrap AnyPath / S3Path in pathlib.Path(...): __fspath__ returns the local cache path, not the URI.
Always call str(path) before passing a path to pl.scan_parquet or pd.read_parquet: polars and pandas reject CloudPath objects.
Never call .as_posix() on an AnyPath that may be an S3Path: .as_posix() is a pathlib.Path method and resolves to the cache path on S3Path.
Never anchor SQLite/lock files under data_root in QA when INPUT_DATA_DIR is an s3:// URI: SQLite cannot live on object storage. The tracking_uri_anchored function (lib/tracking/decorators.py:43) detects a CloudPath anchor and emits a warning, leaving the URI cwd-relative rather than crashing.
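
The warn-and-fall-back behaviour in the last invariant can be sketched as below. This is an assumption-laden illustration: the name `anchor_sqlite_uri`, its string-based signature, and the s3:// prefix test are all hypothetical, since the document does not show the signature of `tracking_uri_anchored` or how it detects a CloudPath anchor.

```python
import warnings
from pathlib import PurePosixPath


def anchor_sqlite_uri(uri: str, anchor: str) -> str:
    """Hypothetical sketch: anchor a relative SQLite URI under a local root."""
    prefix = "sqlite:///"
    if not uri.startswith(prefix):
        return uri  # e.g. an HTTP(S) tracking server: nothing to anchor
    db_path = uri[len(prefix):]
    if db_path.startswith("/"):
        return uri  # already absolute (sqlite:////abs/path.db)
    if anchor.startswith("s3://"):
        # SQLite cannot live on object storage: warn and leave the URI
        # relative, so it resolves against the process cwd.
        warnings.warn(f"cannot anchor {uri!r} under {anchor!r}; leaving it cwd-relative")
        return uri
    return prefix + str(PurePosixPath(anchor) / db_path)


print(anchor_sqlite_uri("sqlite:///mlruns.db", "/data"))
print(anchor_sqlite_uri("sqlite:////tmp/mlruns.db", "s3://bucket/input"))
```

Note the slash count: three slashes followed by a relative path is cwd-relative, while four slashes denote an absolute path on the local filesystem.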

How it interacts with the pipeline

Every stage that reads a config-derived path goes through resolve_data_path (called by ExperimentConfig._resolve_data_paths during Pydantic validation). The stage code itself never calls resolve_data_path directly; it receives already-resolved ResolvablePath values from the config. CLI arguments that accept a run_dir or any filesystem path go through AnyPathParam.convert, which validates existence via the appropriate backend before the command body runs.

Pitfalls

The tracking_uri_anchored silent-fallback behaviour means that in QA (where INPUT_DATA_DIR is an S3 URI), mlruns.db resolves against the process cwd. Set mlflow_tracking_uri to an absolute local path (e.g. sqlite:////tmp/mlruns.db) or an HTTP(S) MLflow server to get a stable location.
The cloudpathlib local cache path is a valid pathlib.Path; downstream .exists() calls on it return False but do not raise, making the bug silent.
pd.read_parquet and pl.read_parquet both require str(path) for S3, not just pl.scan_parquet. The Clause 29 pattern applies to all polars/pandas readers.

See also

RunDir — the on-disk persistence root; may be an S3 URI
PR-345 — the three-layer fix
DESIGN.md Clauses 27–29 — the formal requirements

PRs

PR	Relevance
PR-345	Introduced `AnyPathParam`; dropped `Path(AnyPath(...))` wrappers; added `str()` to polars calls

Open questions

The _iter_resolvable_fields walker does not recurse into tuple containers of BaseModel instances; if a future config field is typed tuple[SomeModel, ...] and SomeModel carries ResolvablePath fields, those fields will be silently skipped by both path resolution and preflight.
AnyPathParam replaces 13 call sites as of PR-345; any new CLI run subcommand that adds a path argument must use it explicitly — there is no enforcement mechanism.