PR #345 — fix(commodity_hindcast): S3 path support across predict stage + CLI¶
At a glance¶
- Author: ai-tommytf
- Merged: 2026-04-29
- Branch:
tl/fix-path-issues - Net effect: Three layered bug fixes that together allow
cli predict s3://bucket/run_dir ...to work end-to-end: (1) dropPath(AnyPath(...))wrappers that collapsed S3 URIs to local cache paths; (2) wrappl.scan_parquet(...)arguments withstr()because polars wants a URI string, not a CloudPath object; (3) replace allclick.Path(path_type=AnyPath)usages with a customAnyPathParamclickParamTypethat validates existence via cloudpathlib rather than the local filesystem. - Why this matters: Debugging the predict stage now takes ~30 seconds against an existing S3 run_dir instead of ~50 minutes per ECS round-trip.
PR body (faithful extract)¶
The problem in one sentence¶
`Path(AnyPath('s3://...'))` silently collapses an S3 URI to an empty `/tmp/tmpXXX/<bucket>/<key>`
local-cache path, so every downstream `.is_dir()` / `.exists()` / `pl.scan_parquet(...)` call
answers a question about the wrong filesystem.
Three-layer bug stack¶
┌──────────────────────────────────────────────────────────────┐
│ Layer 1 — stages/run_predict.py │
│ Path(AnyPath('s3://x')) collapses the URI to a local │
│ cache path; .is_dir() then queries an empty /tmp dir │
│ FIX (e45bb6b2): drop the Path() wrapper │
└──────────────────────────────────────────────────────────────┘
│
▼ (fix uncovers the next layer)
┌──────────────────────────────────────────────────────────────┐
│ Layer 2 — stages/run_predict.py │
│ pl.scan_parquet(<S3Path>) raises TypeError; │
│ polars wants a URI string, not a CloudPath object │
│ FIX (1c11d710): wrap with str() │
└──────────────────────────────────────────────────────────────┘
│
▼ (fix uncovers the next layer)
┌──────────────────────────────────────────────────────────────┐
│ Layer 3 — cli.py │
│ click.Path(exists=True) validates against the local fs, │
│ rejecting any s3:// URI before the command body runs │
│ FIX (94daa063): custom AnyPathParam click ParamType │
└──────────────────────────────────────────────────────────────┘
Layer 1 — Path(AnyPath('s3://...')) collapses to local cache¶
cloudpathlib.AnyPath('s3://bucket/key') returns an S3Path. Wrapping it in pathlib.Path(...) calls os.fspath() which returns the local cache path (e.g. /tmp/tmpXXX/bucket/key), not the URI. The directory has not been populated; .is_dir() queries an empty /tmp dir and returns False.
Fix: drop three Path(...) wrappers in stages/run_predict.py; widen type hints to Path | AnyPath.
Layer 2 — pl.scan_parquet rejects CloudPath objects¶
Polars accepts str | Path | list | IO[bytes] | bytes and supports s3:// URIs natively via its Rust object_store backend — but the URI must be a string, not a cloudpathlib CloudPath object:
pl.scan_parquet(<S3Path>) -> TypeError: argument 'sources': Object does not have a .read() method.
pl.scan_parquet(str(...)) -> OK, returned LazyFrame
Fix: pl.scan_parquet(str(pred_parquet)).
Layer 3 — Custom AnyPathParam click ParamType¶
click.Path(exists=True) validates via os.path.exists (local fs only). It rejected s3:// URIs before the command body ran, even with path_type=AnyPath.
The fix is a custom AnyPathParam that delegates validation to cloudpathlib:
class AnyPathParam(click.ParamType):
"""Click ParamType accepting local paths and cloud URIs (s3://, gs://, az://).
Built-in `click.Path` validates existence via `os.path.exists` (local fs
only), so passing `s3://...` URIs to commands like `cli predict` failed
with "Directory ... does not exist" even though path_type=AnyPath was set.
This type validates via cloudpathlib for cloud URIs and pathlib for local
paths, so the same flag can target either backend transparently.
"""
name = "anypath"
def convert(self, value, param, ctx) -> Path | CloudPath:
path = AnyPath(str(value)) if not isinstance(value, (Path, CloudPath)) else value
# ... validates existence, file, dir via path.exists() / path.is_file() / path.is_dir()
# returns clean click BadParameter on failure
return path
Replaces 13 click.Path(path_type=AnyPath, ...) call sites.
Development velocity impact¶
BEFORE: edit → build image → push to ECR → ECS → ~50 min pipeline → predict ← the bug
AFTER: edit → cli predict s3://...run_dir → ~30 seconds
~100x speedup on the inner debugging loop.
Verification¶
Files / lines touched¶
| Additions | Deletions | File |
|---|---|---|
| +439 | -0 | docs/s3_path_support_demo.md |
| +0 | -361 | market_insights_models/src/commodity_hindcast/DESIGN_ADDENDUM_334.md |
| +0 | -325 | market_insights_models/src/commodity_hindcast/IMPLEMENTATION_PLAN.md |
| +0 | -239 | market_insights_models/src/commodity_hindcast/ORCHESTRATION.md |
| +0 | -170 | market_insights_models/src/commodity_hindcast/SYNTHESIS.md |
| +43 | -51 | market_insights_models/src/commodity_hindcast/TODO.md |
| +55 | -14 | market_insights_models/src/commodity_hindcast/cli.py |
| +19 | -16 | market_insights_models/src/commodity_hindcast/stages/run_predict.py |
| +13 | -5 | market_insights_models/src/commodity_hindcast/lib/tracking/decorators.py |
| +6 | -0 | AGENTS.md |
Cross-references¶
- Related concept pages: S3 path safety
- Related code pages: cli / orchestration, run_predict / stages
- Memory item: S3 path anchoring bugs keep recurring —
AnyPathParamis the pattern to follow
Lessons captured¶
- Never wrap an
AnyPath/S3Pathinpathlib.Path(...)—__fspath__returns the local cache path, not the URI. pl.scan_parquet(...)requires a URI string for S3 paths; wrap withstr(...).- Use
AnyPathParam(a customclick.ParamType) instead ofclick.Path(path_type=AnyPath)for any CLI argument that may be an S3 URI; the built-inclick.Pathvalidates viaos.path.exists(local fs only). AnyPathParamvalidates existence via cloudpathlib soS3Path.exists()queries S3 andPath.exists()queries the local fs — the same flag handles both transparently.- All five top-level stage commands (
predict,postprocess,evaluate,plots,deliver) and fiverunsubcommands now accepts3://URIs asrun_dir.