S3 Path Safety
What it is
The commodity_hindcast pipeline can operate against either a local filesystem or an S3 bucket. To do so safely it relies on a three-component scheme:
- cloudpathlib.AnyPath: a factory that returns either a pathlib.Path or a cloudpathlib.S3Path depending on whether the input string starts with s3://. Functions that accept paths type-hint as Path | AnyPath.
- ResolvablePath (lib/path_utils.py:60): an Annotated[AnyPath, _ResolveAgainstDataRoot()] type alias. Pydantic fields that carry this annotation are auto-discovered by _iter_resolvable_fields and resolved via resolve_data_path at config-validation time. S3 URIs and absolute paths pass through unchanged; relative paths are anchored at data_root (which is always INPUT_DATA_DIR).
- AnyPathParam (in cli.py): a custom Click ParamType that validates path arguments via cloudpathlib rather than os.path.exists, making any CLI argument that carries a run_dir or file path usable with s3:// URIs.
The formal design authority is DESIGN.md Clauses 27, 28, and 29.
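The pass-through and anchoring rules described for ResolvablePath can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the contents of lib/path_utils.py: the name resolve_data_path_sketch and its string-based signature are hypothetical, and the real resolve_data_path presumably operates on AnyPath objects rather than strings.

```python
from pathlib import PurePosixPath


def resolve_data_path_sketch(raw: str, data_root: str) -> str:
    """Hypothetical stand-in for resolve_data_path, using plain strings."""
    # S3 URIs and absolute paths pass through unchanged.
    if raw.startswith("s3://") or PurePosixPath(raw).is_absolute():
        return raw
    # Relative paths are anchored at data_root, which may itself be an S3 URI.
    if data_root.startswith("s3://"):
        return data_root.rstrip("/") + "/" + raw
    return str(PurePosixPath(data_root) / raw)


# Example inputs (all paths here are made up for illustration).
unchanged_uri = resolve_data_path_sketch("s3://bucket/prices.parquet", "/data")
unchanged_abs = resolve_data_path_sketch("/mnt/prices.parquet", "/data")
anchored_local = resolve_data_path_sketch("prices/a.parquet", "/data")
anchored_s3 = resolve_data_path_sketch("prices/a.parquet", "s3://bucket/root/")
```

Keeping the S3 branch as string concatenation avoids running a URI through a filesystem path class, which is the same class of mistake the rest of this page warns about.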
Where it lives
| Symbol | File | Line |
|---|---|---|
| resolve_data_path | lib/path_utils.py | 27 |
| _ResolveAgainstDataRoot | lib/path_utils.py | 46 |
| ResolvablePath | lib/path_utils.py | 60 |
| _iter_resolvable_fields | lib/path_utils.py | 63 |
| AnyPathParam | cli.py | (see PR-345) |
The three-layer bug stack (PR-345)
PR #345 (tl/fix-path-issues, merged 2026-04-29) fixed three cascading bugs that
together prevented cli predict s3://bucket/run_dir ... from working. The PR body
describes them as a literal stack: fixing each layer exposed the next.
┌──────────────────────────────────────────────────────────────┐
│ Layer 1 — stages/run_predict.py │
│ Path(AnyPath('s3://x')) collapses the URI to a local │
│ cache path; .is_dir() then queries an empty /tmp dir │
│ FIX (e45bb6b2): drop the Path() wrapper │
└──────────────────────────────────────────────────────────────┘
│
▼ (fix uncovers the next layer)
┌──────────────────────────────────────────────────────────────┐
│ Layer 2 — stages/run_predict.py │
│ pl.scan_parquet(<S3Path>) raises TypeError; │
│ polars wants a URI string, not a CloudPath object │
│ FIX (1c11d710): wrap with str() │
└──────────────────────────────────────────────────────────────┘
│
▼ (fix uncovers the next layer)
┌──────────────────────────────────────────────────────────────┐
│ Layer 3 — cli.py │
│ click.Path(exists=True) validates against the local fs, │
│ rejecting any s3:// URI before the command body runs │
│ FIX (94daa063): custom AnyPathParam click ParamType │
└──────────────────────────────────────────────────────────────┘
Layer 1: Path(AnyPath('s3://...')) collapses to local cache
cloudpathlib.AnyPath('s3://bucket/key') returns an S3Path. Wrapping it in
pathlib.Path(...) calls os.fspath(), which returns the local cache path
(e.g. /tmp/tmpXXX/bucket/key). The directory has not been populated; .is_dir()
queries an empty /tmp directory and returns False. This was the root cause of the
QA failure described in DESIGN.md Clause 27. Fix: drop three Path(...) wrappers in
stages/run_predict.py; widen type hints to Path | AnyPath.
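The collapse can be reproduced without S3 access by using a stand-in object. The sketch below assumes only what the paragraph above states: that the path object's __fspath__ returns a local cache location while str() returns the URI. FakeS3Path is a hypothetical class written for this illustration; it is not cloudpathlib's S3Path.

```python
import os
import tempfile
from pathlib import Path


class FakeS3Path:
    """Hypothetical stand-in for an S3 path object (not cloudpathlib.S3Path)."""

    def __init__(self, uri: str, cache_root: str) -> None:
        self.uri = uri
        self._cache_root = cache_root

    def __str__(self) -> str:
        # str() keeps the URI intact.
        return self.uri

    def __fspath__(self) -> str:
        # os.fspath() yields a location inside the local cache, not the URI.
        return os.path.join(self._cache_root, self.uri.removeprefix("s3://"))


cache_root = tempfile.mkdtemp()  # an empty cache directory
run_dir = FakeS3Path("s3://bucket/run_dir", cache_root)

# Path(...) calls os.fspath() on its argument, so the URI is lost here.
collapsed = Path(run_dir)
looks_like_dir = collapsed.is_dir()  # checks the unpopulated cache, not S3
```

No exception is raised at any point, which is why the original bug surfaced only as a failed directory check further downstream.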
Layer 2: Polars rejects CloudPath objects
Polars accepts str | Path | list | IO[bytes] | bytes and supports s3:// URIs
natively via its Rust object_store backend — but the URI must be a string, not a
cloudpathlib CloudPath object:
pl.scan_parquet(<S3Path>) -> TypeError: Object does not have a .read() method.
pl.scan_parquet(str(...)) -> OK, returned LazyFrame
The canonical pattern (DESIGN.md Clause 29) is pl.scan_parquet(str(path)).
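One way to keep the Clause 29 pattern in a single place is a small wrapper that converts the path before it reaches the reader. The helper names as_reader_arg and scan_parquet_any are hypothetical, and UriPath is a stand-in for a CloudPath-like object whose str() yields the URI. polars is imported lazily so the conversion helper can be exercised on its own.

```python
from pathlib import PurePosixPath
from typing import Any


class UriPath:
    """Hypothetical stand-in for a CloudPath-like object."""

    def __init__(self, uri: str) -> None:
        self._uri = uri

    def __str__(self) -> str:
        return self._uri


def as_reader_arg(path: Any) -> str:
    """Return the string form to hand to polars or pandas readers."""
    return str(path)


def scan_parquet_any(path: Any):
    """Apply the str(path) pattern before calling pl.scan_parquet."""
    import polars as pl  # lazy import: only needed when a scan is requested

    return pl.scan_parquet(as_reader_arg(path))


s3_arg = as_reader_arg(UriPath("s3://bucket/run_dir/preds.parquet"))
local_arg = as_reader_arg(PurePosixPath("data/preds.parquet"))
```

The same conversion applies to any other reader that is given a path from config or the CLI.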
Layer 3: click.Path(exists=True) rejects s3:// URIs
Click's built-in click.Path validates via os.path.exists (local filesystem only).
It rejected s3:// URIs before the command body ran, even with path_type=AnyPath.
The fix replaces 13 click.Path(path_type=AnyPath, ...) call sites with AnyPathParam,
which delegates existence checks to cloudpathlib so that S3Path.exists() queries S3
and Path.exists() queries the local filesystem transparently.
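The backend dispatch behind that existence check can be sketched in isolation. This is not the AnyPathParam implementation: the real class is a Click ParamType that delegates to cloudpathlib, whereas the sketch below uses plain functions with hypothetical names and takes the S3 existence check as an injected callable so it runs without AWS access.

```python
import tempfile
from pathlib import Path
from typing import Callable


def exists_on_backend(value: str, s3_exists: Callable[[str], bool]) -> bool:
    """Pick the existence check from the argument's scheme."""
    if value.startswith("s3://"):
        return s3_exists(value)  # would query S3 in a real setup
    return Path(value).exists()  # local filesystem check


def convert_path_argument(value: str, s3_exists: Callable[[str], bool]) -> str:
    """Validate a CLI path argument before the command body would run."""
    if not exists_on_backend(value, s3_exists):
        raise ValueError(f"path does not exist: {value}")
    return value


# In-memory stand-in for a bucket listing (illustrative data only).
fake_bucket = {"s3://bucket/run_dir"}
local_dir = tempfile.mkdtemp()

accepted_s3 = convert_path_argument("s3://bucket/run_dir", fake_bucket.__contains__)
accepted_local = convert_path_argument(local_dir, fake_bucket.__contains__)
```

In a Click ParamType the failure branch would typically report the error through the parameter type's own failure mechanism rather than raising ValueError.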
The _iter_resolvable_fields auto-discovery mechanism
lib/path_utils.py:63–87 contains the walker that makes preflight coverage
self-maintaining. Its signature is:
It walks the Pydantic model tree by inspecting get_type_hints(type(model), include_extras=True)
and yielding every (owner, field_name) pair whose annotation carries
_ResolveAgainstDataRoot. It recurses into:
- Nested BaseModel instances (e.g. ForecastConfig inside ExperimentConfig)
- dict values that are BaseModel instances
- list items that are BaseModel instances
None values are skipped. Two consumers iterate the emitted pairs:
- ExperimentConfig._resolve_data_paths: resolves each field in place at config validation time.
- preflight_paths_for_*: generates the existence-check set for each stage, so that adding a new ResolvablePath field automatically extends preflight coverage without any hand-maintained list (DESIGN.md Clause 31).
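The walking strategy described above can be illustrated with standard-library dataclasses in place of Pydantic models. Everything named here is hypothetical (ResolveMarker, iter_resolvable_fields_sketch, the Forecast and Experiment classes); the sketch only mirrors the stated behaviour: inspect annotations with include_extras=True, yield marked fields, recurse into nested models and into dict and list containers, and skip None.

```python
from dataclasses import dataclass, field, fields, is_dataclass
from typing import Annotated, Any, Iterator, Optional, get_args, get_origin, get_type_hints


class ResolveMarker:
    """Hypothetical stand-in for _ResolveAgainstDataRoot."""


ResolvableStr = Annotated[str, ResolveMarker()]


def _carries_marker(hint: Any) -> bool:
    # Annotated[T, meta...] exposes its metadata after the first type argument.
    if get_origin(hint) is Annotated:
        return any(isinstance(m, ResolveMarker) for m in get_args(hint)[1:])
    return False


def _is_model(value: Any) -> bool:
    return is_dataclass(value) and not isinstance(value, type)


def iter_resolvable_fields_sketch(model: Any) -> Iterator[tuple[Any, str]]:
    hints = get_type_hints(type(model), include_extras=True)
    for f in fields(model):
        value = getattr(model, f.name)
        if value is None:
            continue  # None values are skipped
        if _carries_marker(hints[f.name]):
            yield model, f.name
        elif _is_model(value):
            yield from iter_resolvable_fields_sketch(value)
        elif isinstance(value, dict):
            for item in value.values():
                if _is_model(item):
                    yield from iter_resolvable_fields_sketch(item)
        elif isinstance(value, list):
            for item in value:
                if _is_model(item):
                    yield from iter_resolvable_fields_sketch(item)


@dataclass
class Forecast:
    prices_file: ResolvableStr
    horizon: int = 7
    holidays_file: Annotated[Optional[str], ResolveMarker()] = None


@dataclass
class Experiment:
    name: str
    forecast: Forecast
    extras: dict[str, Forecast] = field(default_factory=dict)
    ensemble: list[Forecast] = field(default_factory=list)


cfg = Experiment(
    name="demo",
    forecast=Forecast("prices/a.parquet"),
    extras={"alt": Forecast("prices/b.parquet", holidays_file="cal/h.csv")},
    ensemble=[Forecast("prices/c.parquet")],
)
found = [getattr(owner, name) for owner, name in iter_resolvable_fields_sketch(cfg)]
```

Because each consumer receives (owner, field_name) pairs rather than values, it can either read the field (preflight) or assign back to it with setattr (in-place resolution) from the same traversal.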
Key invariants
- Never wrap AnyPath/S3Path in pathlib.Path(...): __fspath__ returns the local cache path, not the URI.
- Always str(path) before pl.scan_parquet or pd.read_parquet: polars and pandas reject CloudPath objects.
- Never call .as_posix() on an AnyPath that may be an S3Path: .as_posix() is a pathlib.Path method and resolves to the cache path on S3Path.
- Never anchor SQLite/lock files under data_root in QA when INPUT_DATA_DIR is an s3:// URI: SQLite cannot live on object storage. The tracking_uri_anchored function (lib/tracking/decorators.py:43) detects a CloudPath anchor and emits a warning, leaving the URI cwd-relative rather than crashing.
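The last invariant can be sketched as a guard around SQLite URI anchoring. This is an assumption-laden illustration, not the code at lib/tracking/decorators.py:43: the function name anchor_tracking_uri_sketch, its string-based signature, and the rule that relative sqlite:/// URIs are anchored under a local data_root are all hypothetical.

```python
import warnings
from pathlib import PurePosixPath

SQLITE_PREFIX = "sqlite:///"


def anchor_tracking_uri_sketch(tracking_uri: str, data_root: str) -> str:
    """Hypothetical stand-in for tracking_uri_anchored, using plain strings."""
    if not tracking_uri.startswith(SQLITE_PREFIX):
        return tracking_uri  # e.g. an HTTP(S) MLflow server: nothing to anchor
    db_path = tracking_uri[len(SQLITE_PREFIX):]
    if PurePosixPath(db_path).is_absolute():
        return tracking_uri  # already a stable absolute location
    if data_root.startswith("s3://"):
        # SQLite cannot live on object storage: warn and leave the URI
        # relative to the process cwd instead of crashing.
        warnings.warn(
            f"data_root {data_root!r} is on S3; {tracking_uri!r} stays cwd-relative"
        )
        return tracking_uri
    return SQLITE_PREFIX + str(PurePosixPath(data_root) / db_path)


local_anchored = anchor_tracking_uri_sketch("sqlite:///mlruns.db", "/data")
absolute_kept = anchor_tracking_uri_sketch("sqlite:////tmp/mlruns.db", "s3://bucket/root")
server_kept = anchor_tracking_uri_sketch("http://mlflow.example:5000", "s3://bucket/root")
```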
How it interacts with the pipeline
Every stage that reads a config-derived path goes through resolve_data_path (called
by ExperimentConfig._resolve_data_paths during Pydantic validation). The stage code
itself never calls resolve_data_path directly; it receives already-resolved
ResolvablePath values from the config. CLI arguments that accept a run_dir or any
filesystem path go through AnyPathParam.convert, which validates existence via the
appropriate backend before the command body runs.
Pitfalls
- The tracking_uri_anchored silent-fallback behaviour means that in QA (where INPUT_DATA_DIR is an S3 URI), mlruns.db resolves against the process cwd. Set mlflow_tracking_uri to an absolute local path (e.g. sqlite:////tmp/mlruns.db) or an HTTP(S) MLflow server to get a stable location.
- The cloudpathlib local cache path is a valid pathlib.Path; downstream .exists() calls on it return False but do not raise, making the bug silent.
- pd.read_parquet and pl.read_parquet both require str(path) for S3, not just pl.scan_parquet. The Clause 29 pattern applies to all polars/pandas readers.
Related entities and concepts
- RunDir — the on-disk persistence root; may be an S3 URI
- PR-345 — the three-layer fix
- DESIGN.md Clauses 27–29 — the formal requirements
PRs
| PR | Relevance |
|---|---|
| PR-345 | Introduced AnyPathParam; dropped Path(AnyPath(...)) wrappers; added str() to polars calls |
Open questions
- The _iter_resolvable_fields walker does not handle tuple containers of BaseModel; if a future config field holds tuple[SomeModel, ...] and SomeModel carries ResolvablePath fields, those fields will be silently skipped.
- AnyPathParam replaces 13 call sites as of PR-345; any new CLI run subcommand that adds a path argument must use it explicitly; there is no enforcement mechanism.