Access & credentials: commodity_hindcast
This document inventories every external system, credential, environment variable, S3 prefix, AWS surface and dashboard a new engineer needs in order to run the commodity_hindcast pipeline end-to-end. Nothing here lists a real secret value: each row points at the env var, secret name, AWS console resource or YAML field where the value is read from at runtime. If you cannot resolve one of these on day one, escalate to [PLACEHOLDER: TMI engineering lead] or [PLACEHOLDER: #market-insights Slack channel] rather than committing a workaround.
Env vars (mandatory)
| Var | Purpose | Required by | Default | Source |
|---|---|---|---|---|
| INPUT_DATA_DIR | Anchor for data_root; features/, runs/, mlruns.db and all relative config paths resolve under it. Fail-loud: empty/unset raises RuntimeError. | Every CLI entry point, dashboard, eval shim. | none (mandatory) | market_insights_models/src/commodity_hindcast/config.py:50-66 (require_input_data_dir) |
| COMMODITY_HINDCAST_CONFIG | Path to the resolved experiment YAML; written by the CLI before importing ExperimentConfig so pydantic-settings picks it up. | All cli run … and stage commands. | derived from --config | market_insights_models/src/commodity_hindcast/cli.py:122; consumed at config.py:118 |
| COMMODITY_HINDCAST_ROOT | Override for monorepo root used when resolving relative config paths. | Optional; only when invoking from outside the repo. | _find_project_root() (walks up to pyproject.toml) | market_insights_models/src/commodity_hindcast/config.py:110-113 |
| CROP_YIELD_GEOBOUNDARIES_FILE | Geometry parquet used by universe_coverage and choropleth panels. Non-US runs need the all-countries file. | Diagnostics plots. | [PLACEHOLDER: bundled default in crop_yield] | market_insights_models/src/commodity_hindcast/diagnostics/plots/lib/geo.py:115-126 |
| COMMODITY_HINDCAST_BOUNDARIES_FILE | County-level boundary override for hindcast plots. | Diagnostics plots (county panels). | none | market_insights_models/src/commodity_hindcast/diagnostics/plots/lib/geo.py:42,49 |
| COMMODITY_HINDCAST_STATE_BOUNDARIES_FILE | State-level boundary override for hindcast plots. | Diagnostics plots (state panels). | none | market_insights_models/src/commodity_hindcast/diagnostics/plots/lib/geo.py:43,62 |
| COMMODITY_HINDCAST_FORCE_STRESS_ASSEMBLY | Force YTD stress zarr re-assembly even if cached. | features/builders/stress.py (rare debug use). | false | market_insights_models/src/commodity_hindcast/features/builders/stress.py:31 |
| HINDCAST_RUNS_DIR | Override the runs-tree path the Streamlit dashboard reads from. | app/app.py. | data_root / "runs" | market_insights_models/src/commodity_hindcast/app/_dashboard_config.py:41-49 |
| PIPELINE_RUN_ID | Pipeline UUID stamped onto delivery exports; required when --export is set. | cli run export, delivery/export.py. | none (must be set by ECS task) | market_insights_models/src/commodity_hindcast/cli.py:137-141; delivery/export.py:362,372 |
| MODEL_INGESTION_PATH | S3 destination for the export artefact; required when --export is set. | cli run export, delivery/export.py. | none (must be set by ECS task) | market_insights_models/src/commodity_hindcast/cli.py:137-141; delivery/export.py:362,373 |
| AWS_PROFILE (platform-qa) | Selects the AWS credentials profile used for ECR login, ECS task launch and S3 reads against the QA buckets during local dev. | dev_tools/ecr-login.sh, dev_tools/build_light_worker.sh, the QA-test workflow. | none | dev_tools/ecr-login.sh:3; .claude/skills/qa-test/SKILL.md:50 |
Note: standard AWS_REGION / AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY / AWS_SESSION_TOKEN are read by boto3 and cloudpathlib; the code does not call os.getenv on them directly. Use aws sso login --profile platform-qa rather than long-lived keys. The QA region is us-east-2 (dev_tools/ecr-login.sh:4; .github/workflows/qa-to-prod-sync.yml:21).
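The fail-loud rule for INPUT_DATA_DIR and the export-only variables can be illustrated with a minimal sketch. The helper names require_env and check_pipeline_env are hypothetical and are not the repository's require_input_data_dir; the sketch only mirrors the rules stated in the table: an empty or unset INPUT_DATA_DIR raises RuntimeError, and PIPELINE_RUN_ID and MODEL_INGESTION_PATH are needed only when --export is set.

```python
import os
from typing import Dict, Mapping, Optional


def require_env(name: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Return the value of `name`, or raise RuntimeError if it is unset or blank."""
    env = os.environ if env is None else env
    value = env.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is required but is empty or unset")
    return value


def check_pipeline_env(
    export: bool, env: Optional[Mapping[str, str]] = None
) -> Dict[str, str]:
    """Validate the variables the table marks as having no default.

    INPUT_DATA_DIR is always required; PIPELINE_RUN_ID and
    MODEL_INGESTION_PATH are only required when exporting.
    """
    names = ["INPUT_DATA_DIR"]
    if export:
        names += ["PIPELINE_RUN_ID", "MODEL_INGESTION_PATH"]
    return {name: require_env(name, env) for name in names}
```

Passing a plain dict as env keeps the check testable without mutating the real process environment; calling check_pipeline_env(export=True) with no env argument reads os.environ.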
S3 buckets
| Bucket | Purpose | Read/Write | Resolved by | Source |
|---|---|---|---|---|
| s3://{env}-treefera-greenprint-data/weather/processed/{indices,stress,climo_indices,climatology,areal_aggregation}/... | Weather indices, YTD stress, materialised climo, areal aggregation zarrs consumed by every commodity. {env} expands to qa or prod via expand_env_template. | Read | expand_env_template in treefera_market_insights.shared.utils.env_templates, called from resolve_data_path | market_insights_models/src/commodity_hindcast/configs/corn_usa.yaml:230,242,249,256,346,348; wheat_usa.yaml:277,291,302,309,380,381; cotton_usa.yaml:18,19,225,232; soybeans_usa.yaml:223; config.py:82-100 |
| s3://qa-treefera-greenprint-data/usda/nass/, s3://qa-treefera-greenprint-data/wasde/ | NASS yields and WASDE final values feeding the hindcast. Synced QA -> prod nightly. | Read | Resolved against data_root via the data symlink at the repo root, or via s3:// URIs in configs. | config/sync/qa_to_prod.yaml:18-19 |
| s3://qa-market-insights-pipeline-data/... | ECS pipeline working data: usda_<commodity>_yield/<uuid>/runs/<stamp>_<key>/. Each ECS task receives MODEL_INGESTION_PATH rooted here. | Read/write (per-task) | delivery/export.py writes the export artefact; the example URI is the canonical layout. | market_insights_models/src/commodity_hindcast/delivery/export.py:403 |
| s3://prod-treefera-greenprint-data/..., s3://prod-market-insights-pipeline-data/... | Production mirror of the above. | Read/write | Same {env} template expansion. | .github/workflows/qa-to-prod-sync.yml:9-23; config/sync/qa_to_prod.yaml:5 |
The full list of QA prefixes covered by the nightly sync lives in config/sync/qa_to_prod.yaml. New paths must be added there and to the sync IAM role before they will reach prod.
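The {env} expansion used by the bucket URIs can be sketched as follows. This is a hypothetical stand-in, not the real expand_env_template, whose signature is not shown in this document; the function name expand_env and the allowed-values check are assumptions made for illustration.

```python
ALLOWED_ENVS = ("qa", "prod")


def expand_env(template: str, env: str) -> str:
    """Replace the literal `{env}` token in an S3 URI template.

    str.replace is used instead of str.format so that any other braces
    in the template are left untouched rather than raising an error.
    """
    if env not in ALLOWED_ENVS:
        raise ValueError(f"env must be one of {ALLOWED_ENVS}, got {env!r}")
    return template.replace("{env}", env)


# Example: the same config line resolves to the QA or the prod bucket.
TEMPLATE = "s3://{env}-treefera-greenprint-data/weather/processed/indices/"
QA_URI = expand_env(TEMPLATE, "qa")
PROD_URI = expand_env(TEMPLATE, "prod")
```

Here QA_URI and PROD_URI differ only in the bucket prefix, which is why a path missing from the nightly sync list exists in QA but never appears in prod.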
MLflow
- Tracking URI default: sqlite:///mlruns.db (per-config; corn_usa.yaml:9, soybeans_usa.yaml:10, wheat_usa.yaml:11, cotton_usa.yaml:10, soybeans_bra.yaml:20).
- Backing file location: anchored at data_root (= INPUT_DATA_DIR) by tracking_uri_anchored (market_insights_models/src/commodity_hindcast/lib/tracking/decorators.py:43-78). On a dev EC2 with INPUT_DATA_DIR=/data/processing/github/treefera-market-insights, that is /<repo_root>/mlruns.db.
- When data_root is an s3:// URI, the relative sqlite URI passes through unchanged and a warning is logged. For ECS, set mlflow_tracking_uri to an absolute local path (e.g. sqlite:////tmp/mlruns.db) or an HTTP(S) tracking server.
- MLFLOW_TRACKING_URI env var is not read by the code; set the URI in the YAML, not the environment.
- Experiment naming: mlflow.set_experiment(config.commodity.experiment_key) (lib/tracking/decorators.py:84); experiment_key is <commodity>_<iso3> (e.g. corn_usa, soybeans_bra) per config.py:433.
- Run naming: <stage>_<commodity>_<stamp> (README.md:144).
- Concurrent runs of the same commodity will deadlock SQLite; see MEMORY.md.
- UI: [PLACEHOLDER: hosted MLflow UI URL; currently file-only sqlite, no shared server].
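The anchoring rules for the tracking URI can be sketched as below. This is an illustrative reimplementation under stated assumptions, not the code of tracking_uri_anchored: the function name anchor_sqlite_uri is hypothetical, and data_root is assumed to be either an absolute POSIX path or an s3:// URI.

```python
import logging
from pathlib import PurePosixPath

logger = logging.getLogger(__name__)
SQLITE_PREFIX = "sqlite:///"


def anchor_sqlite_uri(tracking_uri: str, data_root: str) -> str:
    """Anchor a relative sqlite tracking URI under a local data_root."""
    if not tracking_uri.startswith(SQLITE_PREFIX):
        # Not a sqlite file URI, e.g. an HTTP(S) tracking server.
        return tracking_uri
    db_path = tracking_uri[len(SQLITE_PREFIX):]
    if db_path.startswith("/"):
        # Already absolute, e.g. sqlite:////tmp/mlruns.db (four slashes).
        return tracking_uri
    if data_root.startswith("s3://"):
        # A sqlite file cannot live on S3: pass through and warn.
        logger.warning("data_root is on S3; leaving %s unanchored", tracking_uri)
        return tracking_uri
    return SQLITE_PREFIX + str(PurePosixPath(data_root) / db_path)
```

With a local data_root of /data/repo, the default sqlite:///mlruns.db becomes sqlite:////data/repo/mlruns.db; the fourth slash is the leading slash of the absolute path.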
AWS resources
- ECR registry: 390844765299.dkr.ecr.us-east-2.amazonaws.com (dev_tools/ecr-login.sh:4; .claude/skills/qa-test/SKILL.md:53).
- QA ECR repo: qa-market-insights-repo; image tag qa-light_worker-latest (.claude/skills/qa-test/references/troubleshooting.md:45,59,62).
- Region: us-east-2 (.github/workflows/qa-to-prod-sync.yml:21).
- Account ID: 390844765299 (the ECR registry prefix). [PLACEHOLDER: human-readable AWS account name].
- ECS clusters / task definitions: defined in terraform/envs/qa/ and terraform/terraform_modules/{ecr-repo,ecs-cluster-task-def,lambda-function-and-ecs-task-def}/. State backend: terraform/envs/qa/backend.s3.tfbackend. [PLACEHOLDER: prod terraform env path; only qa is checked in here].
- Build pipeline: dev_tools/build_light_worker.sh (build, smoke-test, push), launched by dev_tools/launch_models_scheduler.py --env qa --model <name> --wait (.claude/skills/qa-test/SKILL.md:32-66).
- Role assumption: GitHub Actions assume an OIDC role for the nightly sync (.github/workflows/qa-to-prod-sync.yml:25-28 declares id-token: write). [PLACEHOLDER: role ARN / trust policy owner].
- CodeArtifact: token sourced via scripts/codeartifact/get_pip_index_url.sh and the CA_INDEX_URL env var (.claude/skills/qa-test/SKILL.md:62).
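The account ID, region, repository and tag listed above combine into a single image reference. The sketch below assumes the standard ECR layout, registry/repository:tag; the EcrImage class and the QA_LIGHT_WORKER constant are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EcrImage:
    """The pieces of an ECR image reference."""

    account_id: str
    region: str
    repository: str
    tag: str

    @property
    def registry(self) -> str:
        # Registry host, i.e. the value passed to `docker login`.
        return f"{self.account_id}.dkr.ecr.{self.region}.amazonaws.com"

    @property
    def uri(self) -> str:
        # Full reference usable with `docker pull` or `docker push`.
        return f"{self.registry}/{self.repository}:{self.tag}"


# Values taken from the list above.
QA_LIGHT_WORKER = EcrImage(
    account_id="390844765299",
    region="us-east-2",
    repository="qa-market-insights-repo",
    tag="qa-light_worker-latest",
)
```

QA_LIGHT_WORKER.registry reproduces the registry host quoted in the first bullet, which is a quick way to confirm that the account ID and region are consistent.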
CLI commands
All commands are reachable as uv run commodity-hindcast <cmd> or uv run python -m market_insights_models.src.commodity_hindcast.cli <cmd>. Source: market_insights_models/src/commodity_hindcast/cli.py.
| Command | What it does | Source |
|---|---|---|
| run features | Build fit.parquet + pred.parquet from yields/weather/climo/NDVI/stress builders. | cli.py:180 |
| run hindcast | Walk-forward CV folds + production fit; consumes existing feature parquets. | cli.py:209 |
| run all | Hindcast + forecast (+ optional --export). | cli.py:233 |
| run fit-production | Production model fit only. | cli.py:334 |
| run forecast-features | Build forecast-time features at --init-date. | cli.py:355 |
| run forecast-predict | Predict only (forecast features must already exist). | cli.py:406 |
| run forecast | Point-in-time forecast (--season-year, --init-date) against an existing --run-dir. | cli.py:450 |
| run export | Emit client-facing artefact to MODEL_INGESTION_PATH. | cli.py:500 |
| postprocess | Aggregate to national, bias correction, conformal CIs. | cli.py:532 |
| evaluate | Metrics + plots for an existing run_dir. | cli.py:549 |
| investigate | Post-hoc scenario sweeps. | cli.py:568 |
| plots | Regenerate PNGs only. | cli.py:600 |
| predict | Stand-alone predict stage (no postprocess). | cli.py:626 |
| deliver | Generate client-facing CSVs from postprocessed results. | cli.py:654 |
There is no cli run preflight subcommand — preflight (run/preflight.py) is invoked internally at the start of run features/run hindcast/run forecast. To validate access manually, run uv run commodity-hindcast run features --config <name> against a known-good config and stop after the preflight log lines.
The Makefile exposes make features|hindcast|fit-production|forecast|postprocess|evaluate|investigate|plots|deliver, each parameterised by EXPERIMENT_KEY=<commodity>_<iso3> and run from REPO_ROOT (market_insights_models/src/commodity_hindcast/Makefile:1-96).
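The split between the run group and the top-level commands can be captured in a small helper that builds the argv for uv run commodity-hindcast. This is a hypothetical wrapper, not part of the repository; it encodes only the command names from the table and rejects anything else, including the non-existent preflight subcommand. Flags such as --config are left to the caller, since this document does not list the options each command accepts.

```python
from typing import List, Sequence

# Subcommands grouped under `run`, per the table above.
RUN_STAGES = frozenset({
    "features", "hindcast", "all", "fit-production",
    "forecast-features", "forecast-predict", "forecast", "export",
})
# Commands invoked directly, without the `run` prefix.
TOP_LEVEL = frozenset({
    "postprocess", "evaluate", "investigate", "plots", "predict", "deliver",
})


def cli_argv(command: str, extra_args: Sequence[str] = ()) -> List[str]:
    """Build the argv for `uv run commodity-hindcast <cmd>`."""
    base = ["uv", "run", "commodity-hindcast"]
    if command in RUN_STAGES:
        return base + ["run", command, *extra_args]
    if command in TOP_LEVEL:
        return base + [command, *extra_args]
    raise ValueError(f"unknown commodity-hindcast command: {command!r}")
```

The resulting list can be handed to subprocess.run(argv, check=True) with INPUT_DATA_DIR present in the environment.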
Dashboard
Streamlit app: market_insights_models/src/commodity_hindcast/app/app.py.
Launch: uv run streamlit run market_insights_models/src/commodity_hindcast/app/app.py with INPUT_DATA_DIR set; on a fresh EC2 also set PYTHONPATH=<repo_root> (per MEMORY.md project_streamlit_app_launch.md). Override the runs source via HINDCAST_RUNS_DIR (app/_dashboard_config.py:41-49).
Hosted URL: [PLACEHOLDER: internal Streamlit deployment URL — currently dev-only].
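The runs-directory lookup described above (HINDCAST_RUNS_DIR if set, otherwise data_root / "runs") can be sketched as follows. The function resolve_runs_dir is a hypothetical illustration and is not the code in app/_dashboard_config.py; treating a blank override as unset is an assumption.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_runs_dir(env: Optional[Mapping[str, str]] = None) -> Path:
    """Pick the runs tree the dashboard should read from."""
    env = os.environ if env is None else env
    override = env.get("HINDCAST_RUNS_DIR", "").strip()
    if override:
        # Explicit override wins over the default location.
        return Path(override)
    data_root = env.get("INPUT_DATA_DIR", "").strip()
    if not data_root:
        raise RuntimeError("INPUT_DATA_DIR is required but is empty or unset")
    # Default: the runs/ directory under data_root.
    return Path(data_root) / "runs"
```

Pointing HINDCAST_RUNS_DIR at a copy of another engineer's runs tree is a convenient way to inspect their results without changing INPUT_DATA_DIR.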
First-day checklist
- Run aws sso login --profile platform-qa and confirm that aws sts get-caller-identity --profile platform-qa returns account 390844765299.
- Export INPUT_DATA_DIR=<absolute path> (use the repo root on a dev EC2; the container mount on ECS).
- aws s3 ls s3://qa-treefera-greenprint-data/weather/processed/indices/ --profile platform-qa (sanity-check read access).
- aws s3 ls s3://qa-market-insights-pipeline-data/ --profile platform-qa (sanity-check write target listing).
- aws ecr get-login-password --region us-east-2 --profile platform-qa | docker login --username AWS --password-stdin 390844765299.dkr.ecr.us-east-2.amazonaws.com.
- uv run commodity-hindcast run features --config corn_usa against a fresh INPUT_DATA_DIR to exercise preflight + MLflow bootstrap (creates mlruns.db).
- Open mlruns.db with mlflow ui --backend-store-uri sqlite:///$INPUT_DATA_DIR/mlruns.db to verify experiment creation.
- [PLACEHOLDER: join Slack channels: #market-insights, #data-platform, on-call rotation].
- [PLACEHOLDER: confirm GitHub team membership: treefera/market-insights-engineers].
- [PLACEHOLDER: 1Password / vault access for any rotated keys].
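Before working through the checklist it can help to confirm that the command-line tools it relies on are installed. The helper below is a hypothetical convenience, not something shipped in the repository; the tool list is inferred from the commands in the checklist (aws, docker, uv, mlflow).

```python
import shutil
from typing import Callable, List, Optional, Sequence

CHECKLIST_TOOLS = ("aws", "docker", "uv", "mlflow")


def missing_tools(
    tools: Sequence[str] = CHECKLIST_TOOLS,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the tools from `tools` that are not found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    absent = missing_tools()
    if absent:
        print("Missing tools:", ", ".join(absent))
    else:
        print("All checklist tools found on PATH.")
```

The which parameter exists only so the lookup can be replaced in tests; in normal use the default shutil.which is sufficient. Note that mlflow may only be available inside the project environment, in which case the check should be run through uv run.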