Access & credentials: commodity_hindcast

This document inventories every external system, credential, environment variable, S3 prefix, AWS resource and dashboard a new engineer needs in order to run the commodity_hindcast pipeline end-to-end. No real secret value appears here: each row points at the env var, secret name, AWS console resource or YAML field from which the value is read at runtime. If you cannot resolve one of these on day one, escalate to [PLACEHOLDER: TMI engineering lead] or [PLACEHOLDER: #market-insights Slack channel] rather than committing a workaround.

Env vars (mandatory)

Var	Purpose	Required by	Default	Source
`INPUT_DATA_DIR`	Anchor for `data_root`; `features/`, `runs/`, `mlruns.db` and all relative config paths resolve under it. Fail-loud — empty/unset raises `RuntimeError`.	Every CLI entry point, dashboard, eval shim.	none (mandatory)	`market_insights_models/src/commodity_hindcast/config.py:50-66` (`require_input_data_dir`)
`COMMODITY_HINDCAST_CONFIG`	Path to the resolved experiment YAML; written by the CLI before importing `ExperimentConfig` so pydantic-settings picks it up.	All `cli run …` and stage commands.	derived from `--config`	`market_insights_models/src/commodity_hindcast/cli.py:122`; consumed at `config.py:118`
`COMMODITY_HINDCAST_ROOT`	Override for monorepo root used when resolving relative config paths.	Optional; only when invoking from outside the repo.	`_find_project_root()` (walks up to `pyproject.toml`)	`market_insights_models/src/commodity_hindcast/config.py:110-113`
`CROP_YIELD_GEOBOUNDARIES_FILE`	Geometry parquet used by `universe_coverage` and choropleth panels. Non-US runs need the all-countries file.	Diagnostics plots.	`[PLACEHOLDER: bundled default in crop_yield]`	`market_insights_models/src/commodity_hindcast/diagnostics/plots/lib/geo.py:115-126`
`COMMODITY_HINDCAST_BOUNDARIES_FILE`	County-level boundary override for hindcast plots.	Diagnostics plots (county panels).	none	`market_insights_models/src/commodity_hindcast/diagnostics/plots/lib/geo.py:42,49`
`COMMODITY_HINDCAST_STATE_BOUNDARIES_FILE`	State-level boundary override for hindcast plots.	Diagnostics plots (state panels).	none	`market_insights_models/src/commodity_hindcast/diagnostics/plots/lib/geo.py:43,62`
`COMMODITY_HINDCAST_FORCE_STRESS_ASSEMBLY`	Force YTD stress zarr re-assembly even if cached.	`features/builders/stress.py` (rare debug use).	`false`	`market_insights_models/src/commodity_hindcast/features/builders/stress.py:31`
`HINDCAST_RUNS_DIR`	Override the runs-tree path the Streamlit dashboard reads from.	`app/app.py`.	`data_root / "runs"`	`market_insights_models/src/commodity_hindcast/app/_dashboard_config.py:41-49`
`PIPELINE_RUN_ID`	Pipeline UUID stamped onto delivery exports; required when `--export` is set.	`cli run export`, `delivery/export.py`.	none (must be set by ECS task)	`market_insights_models/src/commodity_hindcast/cli.py:137-141`; `delivery/export.py:362,372`
`MODEL_INGESTION_PATH`	S3 destination for the export artefact; required when `--export` is set.	`cli run export`, `delivery/export.py`.	none (must be set by ECS task)	`market_insights_models/src/commodity_hindcast/cli.py:137-141`; `delivery/export.py:362,373`
`AWS_PROFILE` (`platform-qa`)	Selects the AWS credentials profile used for ECR login, ECS task launch and S3 reads against the QA buckets during local dev.	`dev_tools/ecr-login.sh`, `dev_tools/build_light_worker.sh`, the QA-test workflow.	none	`dev_tools/ecr-login.sh:3`; `.claude/skills/qa-test/SKILL.md:50`

Note: standard AWS_REGION / AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY / AWS_SESSION_TOKEN are read by boto3 and cloudpathlib; the code does not call os.getenv on them directly. Use aws sso login --profile platform-qa rather than long-lived keys. The QA region is us-east-2 (dev_tools/ecr-login.sh:4; .github/workflows/qa-to-prod-sync.yml:21).
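The fail-loud rules in the table above can be sketched as a small pre-run check. This is an illustration only: `check_env` is a hypothetical helper, not the repo's `require_input_data_dir`, and treating whitespace-only values as empty is an assumption.

```python
import os


def check_env(export: bool = False, environ=None) -> dict:
    """Return the required variables, or raise RuntimeError naming what is missing.

    Hypothetical helper for illustration; the real checks live in config.py and cli.py.
    """
    env = os.environ if environ is None else environ
    # INPUT_DATA_DIR is mandatory for every entry point.
    required = ["INPUT_DATA_DIR"]
    if export:
        # Both are documented as required when --export is set.
        required += ["PIPELINE_RUN_ID", "MODEL_INGESTION_PATH"]
    # Assumption: a whitespace-only value counts as empty.
    missing = [name for name in required if not env.get(name, "").strip()]
    if missing:
        raise RuntimeError("missing or empty env vars: " + ", ".join(missing))
    return {name: env[name] for name in required}
```

Calling `check_env(export=True)` in a shell before launching a long run surfaces a missing variable immediately instead of partway through the pipeline.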

S3 buckets

Bucket	Purpose	Read/Write	Resolved by	Source
`s3://{env}-treefera-greenprint-data/weather/processed/{indices,stress,climo_indices,climatology,areal_aggregation}/...`	Weather indices, YTD stress, materialised climo, areal aggregation zarrs consumed by every commodity. `{env}` expands to `qa` or `prod` via `expand_env_template`.	Read	`expand_env_template` in `treefera_market_insights.shared.utils.env_templates`, called from `resolve_data_path`	`market_insights_models/src/commodity_hindcast/configs/corn_usa.yaml:230,242,249,256,346,348`; `wheat_usa.yaml:277,291,302,309,380,381`; `cotton_usa.yaml:18,19,225,232`; `soybeans_usa.yaml:223`; `config.py:82-100`
`s3://qa-treefera-greenprint-data/usda/nass/`, `s3://qa-treefera-greenprint-data/wasde/`	NASS yields and WASDE final values feeding the hindcast. Synced QA -> prod nightly.	Read	Resolved against `data_root` via the `data` symlink at the repo root, or via `s3://` URIs in configs.	`config/sync/qa_to_prod.yaml:18-19`
`s3://qa-market-insights-pipeline-data/...`	ECS pipeline working data: `usda_<commodity>_yield/<uuid>/runs/<stamp>_<key>/`. Each ECS task receives `MODEL_INGESTION_PATH` rooted here.	Read/write (per-task)	`delivery/export.py` writes the export artefact; the layout shown under Purpose is canonical.	`market_insights_models/src/commodity_hindcast/delivery/export.py:403`
`s3://prod-treefera-greenprint-data/...`, `s3://prod-market-insights-pipeline-data/...`	Production mirror of the above.	Read/write	Same `{env}` template expansion.	`.github/workflows/qa-to-prod-sync.yml:9-23`; `config/sync/qa_to_prod.yaml:5`

The full list of QA prefixes covered by the nightly sync lives in config/sync/qa_to_prod.yaml. New paths must be added there and to the sync IAM role before they will reach prod.
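To make the `{env}` expansion in the bucket table concrete, here is a minimal sketch. `expand_env` is a hypothetical stand-in written for this page; the real `expand_env_template` may have a different signature and may pick the environment differently.

```python
ALLOWED_ENVS = ("qa", "prod")


def expand_env(template: str, env: str) -> str:
    """Substitute the {env} placeholder in a bucket template.

    Illustrative only; not the implementation in env_templates.
    """
    if env not in ALLOWED_ENVS:
        raise ValueError(f"env must be one of {ALLOWED_ENVS}, got {env!r}")
    # str.replace rather than str.format, so any other braces in the
    # path are left untouched.
    return template.replace("{env}", env)


qa_uri = expand_env(
    "s3://{env}-treefera-greenprint-data/weather/processed/indices", "qa"
)
```

With these inputs `qa_uri` is `s3://qa-treefera-greenprint-data/weather/processed/indices`, matching the QA bucket named in the table.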

MLflow

Tracking URI default: sqlite:///mlruns.db (per-config; corn_usa.yaml:9, soybeans_usa.yaml:10, wheat_usa.yaml:11, cotton_usa.yaml:10, soybeans_bra.yaml:20).
Backing file location: anchored at data_root (= INPUT_DATA_DIR) by tracking_uri_anchored (market_insights_models/src/commodity_hindcast/lib/tracking/decorators.py:43-78). On a dev EC2 with INPUT_DATA_DIR=/data/processing/github/treefera-market-insights, that is /data/processing/github/treefera-market-insights/mlruns.db, i.e. <repo_root>/mlruns.db.
When data_root is an s3:// URI, the relative sqlite URI passes through unchanged and a warning is logged. For ECS, set mlflow_tracking_uri to an absolute local path (e.g. sqlite:////tmp/mlruns.db) or an HTTP(S) tracking server.
MLFLOW_TRACKING_URI env var is not read by the code — set the URI in the YAML, not the environment.
Experiment naming: mlflow.set_experiment(config.commodity.experiment_key) (lib/tracking/decorators.py:84); experiment_key is <commodity>_<iso3> (e.g. corn_usa, soybeans_bra) per config.py:433.
Run naming: <stage>_<commodity>_<stamp> (README.md:144).
Concurrent runs of the same commodity will deadlock SQLite — see MEMORY.md.
UI: [PLACEHOLDER: hosted MLflow UI URL — currently file-only sqlite, no shared server].
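The anchoring behaviour described above can be sketched as follows. This is not the code in `tracking_uri_anchored`: the function name `anchor_sqlite_uri` is hypothetical, and the sketch assumes a local `data_root` is an absolute POSIX path.

```python
import logging
from pathlib import PurePosixPath

log = logging.getLogger(__name__)
SQLITE_PREFIX = "sqlite:///"


def anchor_sqlite_uri(uri: str, data_root: str) -> str:
    """Resolve a relative sqlite tracking URI under data_root (illustrative)."""
    if not uri.startswith(SQLITE_PREFIX):
        return uri  # e.g. an HTTP(S) tracking server
    db_path = uri[len(SQLITE_PREFIX):]
    if db_path.startswith("/"):
        return uri  # already absolute, e.g. sqlite:////tmp/mlruns.db
    if data_root.startswith("s3://"):
        # Documented behaviour: the relative URI passes through and a warning is logged.
        log.warning("data_root is on S3; leaving %s unanchored", uri)
        return uri
    # Assumption: data_root is an absolute POSIX path on the local host.
    return SQLITE_PREFIX + str(PurePosixPath(data_root) / db_path)
```

For a local root the result carries four slashes after `sqlite:`, the SQLAlchemy form for an absolute file path. That is why the ECS example above is written `sqlite:////tmp/mlruns.db`.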

AWS resources

ECR registry: 390844765299.dkr.ecr.us-east-2.amazonaws.com (dev_tools/ecr-login.sh:4; .claude/skills/qa-test/SKILL.md:53).
QA ECR repo: qa-market-insights-repo; image tag qa-light_worker-latest (.claude/skills/qa-test/references/troubleshooting.md:45,59,62).
Region: us-east-2 (.github/workflows/qa-to-prod-sync.yml:21).
Account ID: 390844765299 (the ECR registry prefix). [PLACEHOLDER: human-readable AWS account name].
ECS clusters / task definitions: defined in terraform/envs/qa/ and terraform/terraform_modules/{ecr-repo,ecs-cluster-task-def,lambda-function-and-ecs-task-def}/. State backend: terraform/envs/qa/backend.s3.tfbackend. [PLACEHOLDER: prod terraform env path — only qa is checked in here].
Build pipeline: dev_tools/build_light_worker.sh (build, smoke-test, push), launched by dev_tools/launch_models_scheduler.py --env qa --model <name> --wait (.claude/skills/qa-test/SKILL.md:32-66).
Role assumption: GitHub Actions assume an OIDC role for the nightly sync (.github/workflows/qa-to-prod-sync.yml:25-28 declares id-token: write). [PLACEHOLDER: role ARN / trust policy owner].
CodeArtifact: token sourced via scripts/codeartifact/get_pip_index_url.sh and the CA_INDEX_URL env var (.claude/skills/qa-test/SKILL.md:62).
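The registry, repo and tag listed above combine into a full image reference using the standard ECR form `<account>.dkr.ecr.<region>.amazonaws.com/<repo>:<tag>`. The helper below is a hypothetical illustration; whether `dev_tools/build_light_worker.sh` assembles the reference this way is an assumption.

```python
def ecr_image_uri(account_id: str, region: str, repo: str, tag: str) -> str:
    """Compose an ECR image reference from its parts (illustrative helper)."""
    registry = f"{account_id}.dkr.ecr.{region}.amazonaws.com"
    return f"{registry}/{repo}:{tag}"


# Values taken from the list above.
qa_light_worker = ecr_image_uri(
    account_id="390844765299",
    region="us-east-2",
    repo="qa-market-insights-repo",
    tag="qa-light_worker-latest",
)
```

Here `qa_light_worker` evaluates to `390844765299.dkr.ecr.us-east-2.amazonaws.com/qa-market-insights-repo:qa-light_worker-latest`.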

CLI commands

All commands are reachable as uv run commodity-hindcast <cmd> or uv run python -m market_insights_models.src.commodity_hindcast.cli <cmd>. Source: market_insights_models/src/commodity_hindcast/cli.py.

Command	What it does	Source
`run features`	Build `fit.parquet` + `pred.parquet` from yields/weather/climo/NDVI/stress builders.	`cli.py:180`
`run hindcast`	Walk-forward CV folds + production fit; consumes existing feature parquets.	`cli.py:209`
`run all`	Hindcast + forecast (+ optional `--export`).	`cli.py:233`
`run fit-production`	Production model fit only.	`cli.py:334`
`run forecast-features`	Build forecast-time features at `--init-date`.	`cli.py:355`
`run forecast-predict`	Predict only (forecast features must already exist).	`cli.py:406`
`run forecast`	Point-in-time forecast (`--season-year`, `--init-date`) against an existing `--run-dir`.	`cli.py:450`
`run export`	Emit client-facing artefact to `MODEL_INGESTION_PATH`.	`cli.py:500`
`postprocess`	Aggregate to national, bias correction, conformal CIs.	`cli.py:532`
`evaluate`	Metrics + plots for an existing `run_dir`.	`cli.py:549`
`investigate`	Post-hoc scenario sweeps.	`cli.py:568`
`plots`	Regenerate PNGs only.	`cli.py:600`
`predict`	Stand-alone predict stage (no postprocess).	`cli.py:626`
`deliver`	Generate client-facing CSVs from postprocessed results.	`cli.py:654`

There is no cli run preflight subcommand — preflight (run/preflight.py) is invoked internally at the start of run features/run hindcast/run forecast. To validate access manually, run uv run commodity-hindcast run features --config <name> against a known-good config and stop after the preflight log lines.
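When scripting the `run ...` stages, it can help to build the argument list in one place. The sketch below only covers the `run` subcommands from the table; `build_run_command` is a hypothetical helper, and `corn_usa` is used as an example `--config` value on the assumption that config names match the YAML file stems.

```python
import shlex

# Stage names copied from the `run ...` rows of the command table.
RUN_STAGES = (
    "features", "hindcast", "all", "fit-production",
    "forecast-features", "forecast-predict", "forecast", "export",
)


def build_run_command(stage: str, config: str, *extra: str) -> list[str]:
    """Return the argv for `uv run commodity-hindcast run <stage> --config <config>`."""
    if stage not in RUN_STAGES:
        raise ValueError(f"unknown run stage: {stage!r}")
    return ["uv", "run", "commodity-hindcast", "run", stage, "--config", config, *extra]


if __name__ == "__main__":
    # Print the command rather than executing it; pass the list to
    # subprocess.run(...) to launch it with INPUT_DATA_DIR set.
    print(shlex.join(build_run_command("features", "corn_usa")))
```

Passing a list to `subprocess.run` rather than a single string avoids shell quoting problems with paths or dates supplied as extra arguments.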

Dashboard

Streamlit app: market_insights_models/src/commodity_hindcast/app/app.py.

Launch: uv run streamlit run market_insights_models/src/commodity_hindcast/app/app.py with INPUT_DATA_DIR set; on a fresh EC2 also set PYTHONPATH=<repo_root> (per MEMORY.md project_streamlit_app_launch.md). Override the runs source via HINDCAST_RUNS_DIR (app/_dashboard_config.py:41-49).
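The runs-directory lookup described above (override first, then `data_root / "runs"`) can be sketched like this. It is an illustration, not the code in `_dashboard_config.py`; `resolve_runs_dir` is a hypothetical name, and falling back to the default when `HINDCAST_RUNS_DIR` is set but empty is an assumption.

```python
import os
from pathlib import Path


def resolve_runs_dir(environ=None) -> Path:
    """Pick the runs tree the dashboard would read from (illustrative)."""
    env = os.environ if environ is None else environ
    # Assumption: an empty override is treated the same as an unset one.
    override = env.get("HINDCAST_RUNS_DIR", "").strip()
    if override:
        return Path(override)
    data_root = env.get("INPUT_DATA_DIR", "").strip()
    if not data_root:
        # Mirrors the documented fail-loud behaviour of INPUT_DATA_DIR.
        raise RuntimeError("INPUT_DATA_DIR is unset or empty")
    return Path(data_root) / "runs"
```

Printing `resolve_runs_dir()` before launching Streamlit is a quick way to confirm which runs tree the dashboard will show.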

Hosted URL: [PLACEHOLDER: internal Streamlit deployment URL — currently dev-only].