Access & credentials — commodity_hindcast

This document inventories every external system, credential, environment variable, S3 prefix, AWS surface and dashboard a new engineer needs in order to run the commodity_hindcast pipeline end-to-end. Nothing here lists a real secret value: each row points at the env var, secret name, AWS console resource or YAML field where the value is read from at runtime. If you cannot resolve one of these on day one, escalate to [PLACEHOLDER: TMI engineering lead] or [PLACEHOLDER: #market-insights Slack channel] rather than committing a workaround.

Env vars

| Var | Purpose | Required by | Default | Source |
| --- | --- | --- | --- | --- |
| INPUT_DATA_DIR | Anchor for data_root; features/, runs/, mlruns.db and all relative config paths resolve under it. Fail-loud: empty or unset raises RuntimeError. | Every CLI entry point, dashboard, eval shim. | none (mandatory) | market_insights_models/src/commodity_hindcast/config.py:50-66 (require_input_data_dir) |
| COMMODITY_HINDCAST_CONFIG | Path to the resolved experiment YAML; written by the CLI before importing ExperimentConfig so pydantic-settings picks it up. | All cli run … and stage commands. | derived from --config | market_insights_models/src/commodity_hindcast/cli.py:122; consumed at config.py:118 |
| COMMODITY_HINDCAST_ROOT | Override for monorepo root used when resolving relative config paths. | Optional; only when invoking from outside the repo. | _find_project_root() (walks up to pyproject.toml) | market_insights_models/src/commodity_hindcast/config.py:110-113 |
| CROP_YIELD_GEOBOUNDARIES_FILE | Geometry parquet used by universe_coverage and choropleth panels. Non-US runs need the all-countries file. | Diagnostics plots. | [PLACEHOLDER: bundled default in crop_yield] | market_insights_models/src/commodity_hindcast/diagnostics/plots/lib/geo.py:115-126 |
| COMMODITY_HINDCAST_BOUNDARIES_FILE | County-level boundary override for hindcast plots. | Diagnostics plots (county panels). | none | market_insights_models/src/commodity_hindcast/diagnostics/plots/lib/geo.py:42,49 |
| COMMODITY_HINDCAST_STATE_BOUNDARIES_FILE | State-level boundary override for hindcast plots. | Diagnostics plots (state panels). | none | market_insights_models/src/commodity_hindcast/diagnostics/plots/lib/geo.py:43,62 |
| COMMODITY_HINDCAST_FORCE_STRESS_ASSEMBLY | Force YTD stress zarr re-assembly even if cached. | features/builders/stress.py (rare debug use). | false | market_insights_models/src/commodity_hindcast/features/builders/stress.py:31 |
| HINDCAST_RUNS_DIR | Override the runs-tree path the Streamlit dashboard reads from. | app/app.py. | data_root / "runs" | market_insights_models/src/commodity_hindcast/app/_dashboard_config.py:41-49 |
| PIPELINE_RUN_ID | Pipeline UUID stamped onto delivery exports; required when --export is set. | cli run export, delivery/export.py. | none (must be set by ECS task) | market_insights_models/src/commodity_hindcast/cli.py:137-141; delivery/export.py:362,372 |
| MODEL_INGESTION_PATH | S3 destination for the export artefact; required when --export is set. | cli run export, delivery/export.py. | none (must be set by ECS task) | market_insights_models/src/commodity_hindcast/cli.py:137-141; delivery/export.py:362,373 |
| AWS_PROFILE (platform-qa) | Selects the AWS credentials profile used for ECR login, ECS task launch and S3 reads against the QA buckets during local dev. | dev_tools/ecr-login.sh, dev_tools/build_light_worker.sh, the QA-test workflow. | none | dev_tools/ecr-login.sh:3; .claude/skills/qa-test/SKILL.md:50 |

Note: the standard AWS_DEFAULT_REGION / AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY / AWS_SESSION_TOKEN variables are read by boto3 and cloudpathlib; the code does not call os.getenv on them directly. boto3 takes its region from AWS_DEFAULT_REGION or the active profile, not from AWS_REGION. Use aws sso login --profile platform-qa rather than long-lived keys. The QA region is us-east-2 (dev_tools/ecr-login.sh:4; .github/workflows/qa-to-prod-sync.yml:21).
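
The fail-loud behaviour described for INPUT_DATA_DIR, and the export-only requirement on PIPELINE_RUN_ID and MODEL_INGESTION_PATH, can be illustrated with a minimal sketch. The helper names require_env and require_export_env are hypothetical and are not the functions in config.py or cli.py; the sketch only assumes that blank values are treated the same as unset ones.

```python
import os


def require_env(name: str) -> str:
    """Return the value of `name`, raising if it is unset or blank (hypothetical helper)."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} must be set to a non-empty value")
    return value


def require_export_env(export: bool) -> dict[str, str]:
    """Collect the export-only variables, but only when --export was requested."""
    if not export:
        return {}
    names = ("PIPELINE_RUN_ID", "MODEL_INGESTION_PATH")
    return {name: require_env(name) for name in names}
```

Failing at lookup time rather than falling back to a default keeps a misconfigured shell from silently writing features or runs to the wrong directory.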

S3 buckets

| Bucket | Purpose | Read/Write | Resolved by | Source |
| --- | --- | --- | --- | --- |
| s3://{env}-treefera-greenprint-data/weather/processed/{indices,stress,climo_indices,climatology,areal_aggregation}/... | Weather indices, YTD stress, materialised climo, areal aggregation zarrs consumed by every commodity. {env} expands to qa or prod via expand_env_template. | Read | expand_env_template in treefera_market_insights.shared.utils.env_templates, called from resolve_data_path | market_insights_models/src/commodity_hindcast/configs/corn_usa.yaml:230,242,249,256,346,348; wheat_usa.yaml:277,291,302,309,380,381; cotton_usa.yaml:18,19,225,232; soybeans_usa.yaml:223; config.py:82-100 |
| s3://qa-treefera-greenprint-data/usda/nass/, s3://qa-treefera-greenprint-data/wasde/ | NASS yields and WASDE final values feeding the hindcast. Synced QA -> prod nightly. | Read | Resolved against data_root via the data symlink at the repo root, or via s3:// URIs in configs. | config/sync/qa_to_prod.yaml:18-19 |
| s3://qa-market-insights-pipeline-data/... | ECS pipeline working data: usda_<commodity>_yield/<uuid>/runs/<stamp>_<key>/. Each ECS task receives MODEL_INGESTION_PATH rooted here. | Read/write (per-task) | delivery/export.py writes the export artefact; the example URI is the canonical layout. | market_insights_models/src/commodity_hindcast/delivery/export.py:403 |
| s3://prod-treefera-greenprint-data/..., s3://prod-market-insights-pipeline-data/... | Production mirror of the above. | Read/write | Same {env} template expansion. | .github/workflows/qa-to-prod-sync.yml:9-23; config/sync/qa_to_prod.yaml:5 |

The full list of QA prefixes covered by the nightly sync lives in config/sync/qa_to_prod.yaml. New paths must be added there and to the sync IAM role before they will reach prod.
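
As an illustration of the {env} expansion mentioned in the bucket table, here is a minimal sketch. It is not the real expand_env_template from treefera_market_insights.shared.utils.env_templates; the function name expand_env and the restriction to qa and prod are assumptions made for this example.

```python
ALLOWED_ENVS = ("qa", "prod")  # assumption: the only two environments named in this document


def expand_env(template: str, env: str) -> str:
    """Replace every literal "{env}" in an S3 URI template with the environment name."""
    if env not in ALLOWED_ENVS:
        raise ValueError(f"env must be one of {ALLOWED_ENVS}, got {env!r}")
    # str.replace rather than str.format: paths may contain other braces,
    # such as the {indices,stress,...} shorthand used in the table above.
    return template.replace("{env}", env)


uri = expand_env("s3://{env}-treefera-greenprint-data/weather/processed/indices/", "qa")
print(uri)  # s3://qa-treefera-greenprint-data/weather/processed/indices/
```

Using a plain string replacement keeps unrelated braces in a path from being interpreted as format fields.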

MLflow

  • Tracking URI default: sqlite:///mlruns.db (per-config; corn_usa.yaml:9, soybeans_usa.yaml:10, wheat_usa.yaml:11, cotton_usa.yaml:10, soybeans_bra.yaml:20).
  • Backing file location: anchored at data_root (= INPUT_DATA_DIR) by tracking_uri_anchored (market_insights_models/src/commodity_hindcast/lib/tracking/decorators.py:43-78). On a dev EC2 with INPUT_DATA_DIR=/data/processing/github/treefera-market-insights, that is /data/processing/github/treefera-market-insights/mlruns.db, i.e. <repo_root>/mlruns.db.
  • When data_root is an s3:// URI, the relative sqlite URI passes through unchanged and a warning is logged. For ECS, set mlflow_tracking_uri to an absolute local path (e.g. sqlite:////tmp/mlruns.db) or an HTTP(S) tracking server.
  • MLFLOW_TRACKING_URI env var is not read by the code — set the URI in the YAML, not the environment.
  • Experiment naming: mlflow.set_experiment(config.commodity.experiment_key) (lib/tracking/decorators.py:84); experiment_key is <commodity>_<iso3> (e.g. corn_usa, soybeans_bra) per config.py:433.
  • Run naming: <stage>_<commodity>_<stamp> (README.md:144).
  • Concurrent runs of the same commodity will deadlock SQLite — see MEMORY.md.
  • UI: [PLACEHOLDER: hosted MLflow UI URL — currently file-only sqlite, no shared server].
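
The anchoring rules in the bullets above (relative sqlite URI resolved under data_root, S3 roots passed through with a warning, absolute or HTTP(S) URIs left alone) can be sketched as follows. This is a simplified stand-in under those stated assumptions, not the code of tracking_uri_anchored; the function name anchor_sqlite_uri is hypothetical.

```python
import logging
from pathlib import PurePosixPath

log = logging.getLogger(__name__)

SQLITE_PREFIX = "sqlite:///"


def anchor_sqlite_uri(tracking_uri: str, data_root: str) -> str:
    """Anchor a relative sqlite tracking URI at data_root (illustrative only)."""
    if not tracking_uri.startswith(SQLITE_PREFIX):
        return tracking_uri  # e.g. an HTTP(S) tracking server
    db_path = tracking_uri[len(SQLITE_PREFIX):]
    if db_path.startswith("/"):
        return tracking_uri  # already absolute, e.g. sqlite:////tmp/mlruns.db
    if data_root.startswith("s3://"):
        # A sqlite file cannot live on S3, so the URI is left unchanged.
        log.warning("data_root is on S3; leaving %s unanchored", tracking_uri)
        return tracking_uri
    return f"{SQLITE_PREFIX}{PurePosixPath(data_root) / db_path}"
```

Note that an absolute path after the sqlite:/// prefix yields four slashes in total, which is the form the ECS recommendation above uses.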

AWS resources

  • ECR registry: 390844765299.dkr.ecr.us-east-2.amazonaws.com (dev_tools/ecr-login.sh:4; .claude/skills/qa-test/SKILL.md:53).
  • QA ECR repo: qa-market-insights-repo; image tag qa-light_worker-latest (.claude/skills/qa-test/references/troubleshooting.md:45,59,62).
  • Region: us-east-2 (.github/workflows/qa-to-prod-sync.yml:21).
  • Account ID: 390844765299 (the ECR registry prefix). [PLACEHOLDER: human-readable AWS account name].
  • ECS clusters / task definitions: defined in terraform/envs/qa/ and terraform/terraform_modules/{ecr-repo,ecs-cluster-task-def,lambda-function-and-ecs-task-def}/. State backend: terraform/envs/qa/backend.s3.tfbackend. [PLACEHOLDER: prod terraform env path — only qa is checked in here].
  • Build pipeline: dev_tools/build_light_worker.sh (build, smoke-test, push), launched by dev_tools/launch_models_scheduler.py --env qa --model <name> --wait (.claude/skills/qa-test/SKILL.md:32-66).
  • Role assumption: the nightly sync workflow in GitHub Actions assumes an IAM role via OIDC (.github/workflows/qa-to-prod-sync.yml:25-28 declares id-token: write). [PLACEHOLDER: role ARN / trust policy owner].
  • CodeArtifact: token sourced via scripts/codeartifact/get_pip_index_url.sh and the CA_INDEX_URL env var (.claude/skills/qa-test/SKILL.md:62).
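
To show how the account ID, region, repository and tag listed above fit together, here is a small sketch that composes them into an image reference using the standard ECR registry/repository:tag form. The EcrImage class is hypothetical and exists only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EcrImage:
    """Hypothetical container for the four values that identify an ECR image."""

    account_id: str
    region: str
    repository: str
    tag: str

    @property
    def registry(self) -> str:
        return f"{self.account_id}.dkr.ecr.{self.region}.amazonaws.com"

    @property
    def uri(self) -> str:
        return f"{self.registry}/{self.repository}:{self.tag}"


qa_light_worker = EcrImage(
    account_id="390844765299",
    region="us-east-2",
    repository="qa-market-insights-repo",
    tag="qa-light_worker-latest",
)
print(qa_light_worker.uri)
```

The registry property matches the host passed to docker login; the uri property is what a docker pull or an ECS task definition would reference.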

CLI commands

All commands are reachable as uv run commodity-hindcast <cmd> or uv run python -m market_insights_models.src.commodity_hindcast.cli <cmd>. Source: market_insights_models/src/commodity_hindcast/cli.py.

| Command | What it does | Source |
| --- | --- | --- |
| run features | Build fit.parquet + pred.parquet from yields/weather/climo/NDVI/stress builders. | cli.py:180 |
| run hindcast | Walk-forward CV folds + production fit; consumes existing feature parquets. | cli.py:209 |
| run all | Hindcast + forecast (+ optional --export). | cli.py:233 |
| run fit-production | Production model fit only. | cli.py:334 |
| run forecast-features | Build forecast-time features at --init-date. | cli.py:355 |
| run forecast-predict | Predict only (forecast features must already exist). | cli.py:406 |
| run forecast | Point-in-time forecast (--season-year, --init-date) against an existing --run-dir. | cli.py:450 |
| run export | Emit client-facing artefact to MODEL_INGESTION_PATH. | cli.py:500 |
| postprocess | Aggregate to national, bias correction, conformal CIs. | cli.py:532 |
| evaluate | Metrics + plots for an existing run_dir. | cli.py:549 |
| investigate | Post-hoc scenario sweeps. | cli.py:568 |
| plots | Regenerate PNGs only. | cli.py:600 |
| predict | Stand-alone predict stage (no postprocess). | cli.py:626 |
| deliver | Generate client-facing CSVs from postprocessed results. | cli.py:654 |

There is no cli run preflight subcommand — preflight (run/preflight.py) is invoked internally at the start of run features/run hindcast/run forecast. To validate access manually, run uv run commodity-hindcast run features --config <name> against a known-good config and stop after the preflight log lines.

The Makefile exposes make features|hindcast|fit-production|forecast|postprocess|evaluate|investigate|plots|deliver, each parameterised by EXPERIMENT_KEY=<commodity>_<iso3> and run from REPO_ROOT (market_insights_models/src/commodity_hindcast/Makefile:1-96).
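
A small sketch of validating an EXPERIMENT_KEY and building the manual preflight command mentioned above. Both function names are hypothetical, and the check that the country part is exactly three letters is an assumption based on the <commodity>_<iso3> pattern rather than something the pipeline is known to enforce.

```python
def split_experiment_key(key: str) -> tuple[str, str]:
    """Split "<commodity>_<iso3>" (e.g. "corn_usa") into (commodity, iso3)."""
    commodity, sep, iso3 = key.rpartition("_")
    # Assumption: iso3 is exactly three alphabetic characters.
    if not sep or not commodity or len(iso3) != 3 or not iso3.isalpha():
        raise ValueError(f"expected <commodity>_<iso3>, got {key!r}")
    return commodity, iso3


def features_command(key: str) -> list[str]:
    """Build the argv for the features stage, whose preflight checks access."""
    split_experiment_key(key)  # reject malformed keys before spawning anything
    return ["uv", "run", "commodity-hindcast", "run", "features", "--config", key]


print(split_experiment_key("soybeans_bra"))
print(" ".join(features_command("corn_usa")))
```

The argv list can be handed to subprocess.run as-is, which avoids shell quoting issues when the key comes from user input.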

Dashboard

Streamlit app: market_insights_models/src/commodity_hindcast/app/app.py.

Launch: uv run streamlit run market_insights_models/src/commodity_hindcast/app/app.py with INPUT_DATA_DIR set; on a fresh EC2 also set PYTHONPATH=<repo_root> (per MEMORY.md project_streamlit_app_launch.md). Override the runs source via HINDCAST_RUNS_DIR (app/_dashboard_config.py:41-49).

Hosted URL: [PLACEHOLDER: internal Streamlit deployment URL — currently dev-only].
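
The HINDCAST_RUNS_DIR override can be pictured with the sketch below: use the variable when it is set, otherwise fall back to data_root / "runs". This is an illustration of the documented default, not the code in app/_dashboard_config.py; resolve_runs_dir is a hypothetical name, and treating a blank value as unset is an assumption.

```python
import os
from pathlib import Path


def resolve_runs_dir(data_root: Path) -> Path:
    """Pick the runs tree the dashboard reads from (illustrative only)."""
    override = os.environ.get("HINDCAST_RUNS_DIR", "").strip()
    if override:
        return Path(override)
    return data_root / "runs"
```

This makes it possible to point a local dashboard at a runs tree copied from elsewhere without changing INPUT_DATA_DIR.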

First-day checklist

  • aws sso login --profile platform-qa and confirm aws sts get-caller-identity --profile platform-qa returns account 390844765299.
  • Export INPUT_DATA_DIR=<absolute path> (use the repo root on a dev EC2; the container mount on ECS).
  • aws s3 ls s3://qa-treefera-greenprint-data/weather/processed/indices/ --profile platform-qa — sanity-check read access.
  • aws s3 ls s3://qa-market-insights-pipeline-data/ --profile platform-qa — sanity-check that you can list the pipeline (write target) bucket.
  • aws ecr get-login-password --region us-east-2 --profile platform-qa | docker login --username AWS --password-stdin 390844765299.dkr.ecr.us-east-2.amazonaws.com.
  • uv run commodity-hindcast run features --config corn_usa against a fresh INPUT_DATA_DIR to exercise preflight + MLflow bootstrap (creates mlruns.db).
  • Open mlruns.db with mlflow ui --backend-store-uri sqlite:///$INPUT_DATA_DIR/mlruns.db to verify experiment creation.
  • [PLACEHOLDER: join Slack channels — #market-insights, #data-platform, on-call rotation].
  • [PLACEHOLDER: confirm GitHub team membership — treefera/market-insights-engineers].
  • [PLACEHOLDER: 1Password / vault access for any rotated keys].
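
The account check in the first checklist item can be scripted. The sketch below parses the JSON that aws sts get-caller-identity prints (it contains UserId, Account and Arn keys) and compares the Account field with the expected ID. The function name account_matches is hypothetical, and the sample identity values are made up for illustration.

```python
import json

EXPECTED_ACCOUNT = "390844765299"


def account_matches(caller_identity_json: str, expected: str = EXPECTED_ACCOUNT) -> bool:
    """Return True if the get-caller-identity output belongs to the expected account."""
    identity = json.loads(caller_identity_json)
    return identity.get("Account") == expected


# Made-up sample in the shape returned by `aws sts get-caller-identity --output json`.
sample = json.dumps(
    {
        "UserId": "EXAMPLEUSERID:new.engineer",
        "Account": "390844765299",
        "Arn": "arn:aws:sts::390844765299:assumed-role/example-role/new.engineer",
    }
)
print(account_matches(sample))
```

In practice the JSON would come from running the CLI with --profile platform-qa and capturing its standard output.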