Skip to content

PR #345 — fix(commodity_hindcast): S3 path support across predict stage + CLI

At a glance

  • Author: ai-tommytf
  • Merged: 2026-04-29
  • Branch: tl/fix-path-issues
  • Net effect: Three layered bug fixes that together allow cli predict s3://bucket/run_dir ... to work end-to-end: (1) drop Path(AnyPath(...)) wrappers that collapsed S3 URIs to local cache paths; (2) wrap pl.scan_parquet(...) arguments with str() because polars wants a URI string, not a CloudPath object; (3) replace all click.Path(path_type=AnyPath) usages with a custom AnyPathParam click ParamType that validates existence via cloudpathlib rather than the local filesystem.
  • Why this matters: Debugging the predict stage now takes ~30 seconds against an existing S3 run_dir instead of ~50 minutes per ECS round-trip.

PR body (faithful extract)

The problem in one sentence

`Path(AnyPath('s3://...'))` silently collapses an S3 URI to an empty `/tmp/tmpXXX/<bucket>/<key>`
local-cache path, so every downstream `.is_dir()` / `.exists()` / `pl.scan_parquet(...)` call
answers a question about the wrong filesystem.

Three-layer bug stack

┌──────────────────────────────────────────────────────────────┐
│ Layer 1 — stages/run_predict.py                              │
│   Path(AnyPath('s3://x')) collapses the URI to a local       │
│   cache path; .is_dir() then queries an empty /tmp dir       │
│   FIX (e45bb6b2): drop the Path() wrapper                    │
└──────────────────────────────────────────────────────────────┘
                           ▼ (fix uncovers the next layer)
┌──────────────────────────────────────────────────────────────┐
│ Layer 2 — stages/run_predict.py                              │
│   pl.scan_parquet(<S3Path>) raises TypeError;                │
│   polars wants a URI string, not a CloudPath object          │
│   FIX (1c11d710): wrap with str()                            │
└──────────────────────────────────────────────────────────────┘
                           ▼ (fix uncovers the next layer)
┌──────────────────────────────────────────────────────────────┐
│ Layer 3 — cli.py                                             │
│   click.Path(exists=True) validates against the local fs,    │
│   rejecting any s3:// URI before the command body runs       │
│   FIX (94daa063): custom AnyPathParam click ParamType        │
└──────────────────────────────────────────────────────────────┘

Layer 1 — Path(AnyPath('s3://...')) collapses to local cache

cloudpathlib.AnyPath('s3://bucket/key') returns an S3Path. Wrapping it in pathlib.Path(...) calls os.fspath() which returns the local cache path (e.g. /tmp/tmpXXX/bucket/key), not the URI. The directory has not been populated; .is_dir() queries an empty /tmp dir and returns False.

Fix: drop three Path(...) wrappers in stages/run_predict.py; widen type hints to Path | AnyPath.

Layer 2 — pl.scan_parquet rejects CloudPath objects

Polars accepts str | Path | list | IO[bytes] | bytes and supports s3:// URIs natively via its Rust object_store backend — but the URI must be a string, not a cloudpathlib CloudPath object:

pl.scan_parquet(<S3Path>)  -> TypeError: argument 'sources': Object does not have a .read() method.
pl.scan_parquet(str(...))  -> OK, returned LazyFrame

Fix: pl.scan_parquet(str(pred_parquet)).

Layer 3 — Custom AnyPathParam click ParamType

click.Path(exists=True) validates via os.path.exists (local fs only). It rejected s3:// URIs before the command body ran, even with path_type=AnyPath.

The fix is a custom AnyPathParam that delegates validation to cloudpathlib:

class AnyPathParam(click.ParamType):
    """Click ParamType accepting local paths and cloud URIs (s3://, gs://, az://).

    Built-in `click.Path` validates existence via `os.path.exists` (local fs
    only), so passing `s3://...` URIs to commands like `cli predict` failed
    with "Directory ... does not exist" even though path_type=AnyPath was set.
    This type validates via cloudpathlib for cloud URIs and pathlib for local
    paths, so the same flag can target either backend transparently.
    """
    name = "anypath"

    def convert(self, value, param, ctx) -> Path | CloudPath:
        path = AnyPath(str(value)) if not isinstance(value, (Path, CloudPath)) else value
        # ... validates existence, file, dir via path.exists() / path.is_file() / path.is_dir()
        # returns clean click BadParameter on failure
        return path

Replaces 13 click.Path(path_type=AnyPath, ...) call sites.

Development velocity impact

BEFORE: edit → build image → push to ECR → ECS → ~50 min pipeline → predict ← the bug
AFTER:  edit → cli predict s3://...run_dir → ~30 seconds

~100x speedup on the inner debugging loop.

Verification

All checks passed!    (ty check stages/run_predict.py)
294 passed, 1 warning in 13.92s

Files / lines touched

Additions Deletions File
+439 -0 docs/s3_path_support_demo.md
+0 -361 market_insights_models/src/commodity_hindcast/DESIGN_ADDENDUM_334.md
+0 -325 market_insights_models/src/commodity_hindcast/IMPLEMENTATION_PLAN.md
+0 -239 market_insights_models/src/commodity_hindcast/ORCHESTRATION.md
+0 -170 market_insights_models/src/commodity_hindcast/SYNTHESIS.md
+43 -51 market_insights_models/src/commodity_hindcast/TODO.md
+55 -14 market_insights_models/src/commodity_hindcast/cli.py
+19 -16 market_insights_models/src/commodity_hindcast/stages/run_predict.py
+13 -5 market_insights_models/src/commodity_hindcast/lib/tracking/decorators.py
+6 -0 AGENTS.md

Cross-references

Lessons captured

  • Never wrap an AnyPath / S3Path in pathlib.Path(...)__fspath__ returns the local cache path, not the URI.
  • pl.scan_parquet(...) requires a URI string for S3 paths; wrap with str(...).
  • Use AnyPathParam (a custom click.ParamType) instead of click.Path(path_type=AnyPath) for any CLI argument that may be an S3 URI; the built-in click.Path validates via os.path.exists (local fs only).
  • AnyPathParam validates existence via cloudpathlib so S3Path.exists() queries S3 and Path.exists() queries the local fs — the same flag handles both transparently.
  • All five top-level stage commands (predict, postprocess, evaluate, plots, deliver) and five run subcommands now accept s3:// URIs as run_dir.