
QA to prod artefact sync

Failure-mode flowchart — QA to prod sync

Scope: commodity_hindcast only. This runbook covers the nightly QA -> prod S3 sync that promotes weather-feature zarrs, NASS yields and WASDE finals from s3://qa-treefera-greenprint-data/... to s3://prod-treefera-greenprint-data/... so the prod hindcast/forecast runs read the same inputs the QA cycle has signed off on.

The implementation is a thin wrapper around aws s3 sync ... --delete --quiet. The wrapper is the qa-prod-sync console script defined in treefera_market_insights/s3_sync_util/cli.py:38-128; the shell-out is at treefera_market_insights/s3_sync_util/sync.py:175-194. The QA prefix list lives in config/sync/qa_to_prod.yaml:12-22. The GitHub Actions schedule that drives it is .github/workflows/qa-to-prod-sync.yml:9-28. The --quiet flag was added in commit 0e302410 (2026-05-07, treefera_market_insights/s3_sync_util/sync.py:183); it silences per-object copy: / delete: log lines but the wrapper still parses summary stats from captured stdout/stderr.
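As a rough illustration of the "thin wrapper" shape described above, the sketch below builds the aws s3 sync argument list and shells out with captured output. It is not the code in sync.py: the function names build_sync_command and run_sync are hypothetical, and mapping a dry-run option onto the AWS CLI's --dryrun flag is an assumption about how the wrapper behaves.

```python
import subprocess


def build_sync_command(qa_uri: str, prod_uri: str, dry_run: bool = False) -> list[str]:
    """Return the argv for one prefix sync (hypothetical helper, not the repo's code)."""
    cmd = ["aws", "s3", "sync", qa_uri, prod_uri, "--delete", "--quiet"]
    if dry_run:
        # --dryrun is the AWS CLI flag that reports operations without performing them.
        cmd.append("--dryrun")
    return cmd


def run_sync(qa_uri: str, prod_uri: str, dry_run: bool = False) -> subprocess.CompletedProcess:
    """Shell out and capture stdout/stderr so the caller can parse them afterwards."""
    return subprocess.run(
        build_sync_command(qa_uri, prod_uri, dry_run),
        capture_output=True,
        text=True,
        check=False,  # leave the decision about a non-zero exit to the caller
    )
```

Keeping the argv construction separate from the subprocess call means the flag list can be inspected or unit-tested without touching S3.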

1. When to use

Use this runbook when:

  • A QA cycle has signed off — every commodity hindcast for the cycle has produced an exported delivery and the per-commodity QA report has been reviewed.
  • A production refresh is required — the scheduled 0 8 * * * UTC run failed or was skipped (concurrency cancelled, OIDC outage, runner unavailable) and prod needs to catch up before the next forecast init.
  • A client deliverable is being promoted — the QA-side delivery CSV / parquet is the artefact of record and the same inputs must be visible to the prod pipeline before the prod export runs.

Do NOT use this runbook for routine dev work (QA reads from platform-qa are fine without a sync), single-object promotes (the wrapper syncs whole prefixes with --delete — use aws s3 cp directly), or point-in-time backfills (the sync mirrors current QA state only).

2. Preconditions

Confirm all of the following before triggering the workflow manually or running the fallback by hand:

  • QA hindcast/forecast runs are FINISHED in MLflow (no RUNNING status). A live QA run can be writing to s3://qa-treefera-greenprint-data/weather/processed/... indirectly via the weather pipeline; syncing mid-write copies a half-rewritten zarr to prod.
  • QA delivery CSVs have been validated by the per-commodity QA report (see the qa-test skill output and the commodity's acceptance criteria).
  • Prod terraform env path [PLACEHOLDER: prod terraform env path; only terraform/envs/qa/ is checked in here; confirm with the platform team] is healthy and has the role permissions for any newly added buckets/prefixes.
  • ECR login works: bash dev_tools/ecr-login.sh (dev_tools/ecr-login.sh:3-4; AWS_PROFILE=platform-qa, ECR registry 390844765299.dkr.ecr.us-east-2.amazonaws.com). You do not need ECR for the sync itself, but a green login is the cheapest proxy for "AWS account access works".
  • AWS account 390844765299 is reachable — aws sts get-caller-identity --profile platform-qa returns it.
  • The prod GitHub Environment has SENTRY_DSN set (.github/workflows/qa-to-prod-sync.yml:73-78); without it, sync errors will not surface in Sentry.
  • The schedule comment in config/sync/qa_to_prod.yaml:6 matches .github/workflows/qa-to-prod-sync.yml:12,22. The workflow header (.github/workflows/qa-to-prod-sync.yml:1-5) flags a preflight failure on mismatch; re-check both if either was just edited.
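The account-reachability precondition can be checked mechanically rather than by eye. The sketch below parses the JSON printed by aws sts get-caller-identity and compares its Account field with the account ID quoted above. The helper names are hypothetical and this is not part of the wrapper.

```python
import json

EXPECTED_ACCOUNT = "390844765299"  # account ID quoted in the preconditions


def caller_account(sts_json: str) -> str:
    """Extract the Account field from `aws sts get-caller-identity` JSON output."""
    return json.loads(sts_json)["Account"]


def assert_expected_account(sts_json: str, expected: str = EXPECTED_ACCOUNT) -> str:
    """Raise if the active credentials belong to a different AWS account."""
    account = caller_account(sts_json)
    if account != expected:
        raise RuntimeError(f"expected account {expected}, got {account}")
    return account
```

One way to feed it is to capture the output of aws sts get-caller-identity --profile platform-qa --output json and pass the resulting string to assert_expected_account.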

3. Procedure

Step 1 — Inspect the workflow

The canonical sync is .github/workflows/qa-to-prod-sync.yml. Key points (line ranges in that file):

  • Trigger: schedule: cron "0 8 * * *" plus workflow_dispatch (9-13).
  • Concurrency group qa-to-prod-greenprint-sync with cancel-in-progress: false — a second run queues rather than cancelling the in-flight one (15-17).
  • OIDC: permissions: id-token: write plus a chained aws-actions/configure-aws-credentials@v4 call assuming vars.AWS_ROLE (24-71). Region us-east-2 (20).
  • Entry point: qa-prod-sync console script (80), defined by config/sync/qa_to_prod.yaml:8-9 and implemented in treefera_market_insights/s3_sync_util/cli.py.

Step 2 — Read the sync configuration

QA prefixes are listed in config/sync/qa_to_prod.yaml:12-22: per-commodity stress zarrs (corn, wheat, Ghana cocoa), areal aggregation and materialised climatology zarrs, per-commodity weather indices zarrs plus the shared climo indices zarr, s3://qa-treefera-greenprint-data/usda/nass/ (config/sync/qa_to_prod.yaml:21) and s3://qa-treefera-greenprint-data/wasde/ (config/sync/qa_to_prod.yaml:22). The prod URI is computed by qa- -> prod- string replacement (treefera_market_insights/s3_sync_util/sync.py:197-219). New prefixes require both an edit here and a bump to the sync IAM role; see the "S3 buckets" row in drafts/access.md.
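A minimal sketch of the qa- -> prod- mapping follows. The function name to_prod_uri is hypothetical, and rewriting only the leading bucket prefix (rather than every occurrence of "qa-") is an assumption made here so that a "qa-" inside an object key is left alone; the actual logic in sync.py may differ.

```python
QA_BUCKET_PREFIX = "s3://qa-"
PROD_BUCKET_PREFIX = "s3://prod-"


def to_prod_uri(qa_uri: str) -> str:
    """Map a QA S3 URI to its prod counterpart by swapping the bucket prefix."""
    if not qa_uri.startswith(QA_BUCKET_PREFIX):
        # Refuse anything that is not a QA bucket so a typo cannot target prod twice.
        raise ValueError(f"not a QA URI: {qa_uri}")
    return PROD_BUCKET_PREFIX + qa_uri[len(QA_BUCKET_PREFIX):]
```

For example, to_prod_uri("s3://qa-treefera-greenprint-data/usda/nass/") returns "s3://prod-treefera-greenprint-data/usda/nass/".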

Step 3 — Trigger the sync

Preferred: gh workflow run qa-to-prod-sync.yml --ref main (or GitHub UI -> Actions -> "QA to prod greenprint sync" -> "Run workflow"; workflow_dispatch is at .github/workflows/qa-to-prod-sync.yml:13). Watch with gh run watch <run-id>. The underlying call is aws s3 sync <qa> <prod> --delete --quiet per prefix, batched by treefera_market_insights/s3_sync_util/cli.py:80-108. With --quiet (commit 0e302410, treefera_market_insights/s3_sync_util/sync.py:183) you see one aws_s3_sync log line per prefix and a final qa_to_prod_done line with copied and objects_seen counts.

Step 4 — Manual fallback

Use only if the workflow is offline (Actions incident, OIDC role broken, runner regression). Local credentials must be able to write to s3://prod-treefera-greenprint-data/; platform-qa cannot, so assume the prod sync role manually ([PLACEHOLDER: prod sync role ARN; same role the workflow chains to via vars.AWS_ROLE]). Then either run the wrapper:

uv run qa-prod-sync               # uses config/sync/qa_to_prod.yaml
uv run qa-prod-sync --dry-run     # validate prefixes without writing

…or hand-roll aws s3 sync per prefix:

aws s3 sync \
  s3://qa-treefera-greenprint-data/usda/nass/ \
  s3://prod-treefera-greenprint-data/usda/nass/ \
  --delete --quiet

Prefer the wrapper: it iterates the configured prefix list, flips qa- -> prod- consistently, and raises if any prefix came back with no copies AND prod was not modified within 24h (treefera_market_insights/s3_sync_util/cli.py:104-118).
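The stale-prefix guard described above can be sketched as a small pure function. This is an illustration of the stated rule (no copies AND prod not modified within 24h), not the code in cli.py; the function name is_stale_prefix and its signature are assumptions.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

FRESHNESS_WINDOW = timedelta(hours=24)


def is_stale_prefix(
    had_copy_or_delete: bool,
    dest_last_modified: Optional[datetime],
    now: Optional[datetime] = None,
) -> bool:
    """True when a prefix synced nothing and prod has not changed within 24h."""
    if had_copy_or_delete:
        return False  # the sync did real work, so the prefix is fresh
    if dest_last_modified is None:
        return True  # nothing copied and nothing found in prod
    now = now or datetime.now(timezone.utc)
    return now - dest_last_modified > FRESHNESS_WINDOW
```

Under this rule a no-op re-run shortly after a successful sync passes, while a prefix that has silently stopped receiving data is flagged.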

4. Verification

After the workflow finishes (or the manual fallback exits 0):

  • Open the GitHub Actions run log. The final qa_to_prod_done line reports copied (objects copied) and objects_seen (total). For a settled QA cycle, copied is typically small (single-digit per zarr) plus the daily NASS/WASDE delta. [PLACEHOLDER: typical copied count after a clean QA cycle; record after first observed successful run].
  • Spot-check a delivery CSV from prod: aws s3 cp s3://prod-treefera-greenprint-data/... - to inspect headers and the latest as_of_date row. Any QA-only NASS/WASDE value should now be visible from prod.
  • Confirm the downstream consumer is green — the prod commodity_hindcast ECS task should succeed on its next scheduled run; check [PLACEHOLDER: prod ECS task scheduler / CloudWatch dashboard].
  • Sanity-check Sentry — no new qa_to_prod_* events on the prod Sentry environment.

5. Failure modes

5a. Workflow auth failure

Symptom: Configure AWS credentials or Assume deployment role step fails with Could not load credentials or AccessDenied: AssumeRoleWithWebIdentity.

Cause: OIDC trust between GitHub and the AWS role has drifted (role ARN renamed, trust policy claim changed, branch protection blocking the trusted ref).

Fix: (1) Run gh auth status locally as a basic sanity check. (2) Inspect vars.ROLE_TO_ASSUME and vars.AWS_ROLE on the repo / prod environment (.github/workflows/qa-to-prod-sync.yml:61,67). (3) If the ARN looks fine, have [PLACEHOLDER: platform / SRE owner of the prod sync role] re-apply the trust policy from terraform. (4) Re-trigger via gh workflow run qa-to-prod-sync.yml --ref main.

5b. Partial sync (interrupted)

Symptom: the workflow is cancelled mid-way, or Run qa-prod-sync exits non-zero from a transient aws s3 sync failure (S3 5xx, runner network blip).

Cause: aws s3 sync is idempotent (a re-run resumes from the current diff), but the --quiet flag (commit 0e302410) masks per-object copy: / delete: output, so the log shows nothing between start and failure and you cannot eyeball "how far did it get".

Fix: (1) Re-trigger the workflow; no manual cleanup is needed. The wrapper's had_copy_or_delete / dest_modified_within_24h guard (treefera_market_insights/s3_sync_util/cli.py:104-108) treats a no-op re-run within 24h as success rather than as a stale prefix. (2) If you need per-object visibility for a debug run, run the wrapper locally with --dry-run (uv run qa-prod-sync --dry-run) — the --dryrun summary is not silenced by --quiet.

5c. Schema mismatch in prod

Symptom: the prod commodity_hindcast ECS task fails shortly after a successful sync with a column-not-found, zarr-coordinate-mismatch, or schema-validation error.

Cause: the QA cycle introduced a schema change (new feature column, renamed coordinate, region added) that prod consumers are not yet deployed to handle. The sync did its job; the deploy is out of date.

Fix: (1) Halt downstream prod runs immediately ([PLACEHOLDER: how to halt prod ECS task scheduler — pause the EventBridge rule or scale to 0]). (2) Roll back prod artefacts (section 6). (3) Coordinate with the team owning the schema change to rebuild the prod consumer image before re-running the sync.

6. Rollback

[PLACEHOLDER: confirm S3 versioning is enabled on s3://prod-treefera-greenprint-data/; the sync uses --delete, so without versioning a bad QA prefix overwrites prod with no recourse. Owner to confirm: [PLACEHOLDER: platform team]. If versioning IS enabled, restore via aws s3api list-object-versions followed by aws s3api copy-object per object/version. If versioning is NOT enabled, the rollback path is to fix QA and re-run the sync; there is no point-in-time recovery for prod.]
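If versioning does turn out to be enabled, the restore step could look roughly like the sketch below. It picks, for one key, the newest version last modified at or before a chosen cutoff from the Versions array that aws s3api list-object-versions returns, then builds the matching aws s3api copy-object arguments. The helper names are hypothetical, the cutoff is something the operator chooses, and keys removed by --delete (which leave delete markers) are not handled here.

```python
from datetime import datetime, timezone
from typing import Optional


def parse_last_modified(value: str) -> datetime:
    """Parse the ISO 8601 LastModified string from the AWS CLI JSON output."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def pick_restore_version(versions: list[dict], key: str, cutoff: datetime) -> Optional[dict]:
    """Newest version of `key` last modified at or before `cutoff`, else None."""
    candidates = [
        v for v in versions
        if v["Key"] == key and parse_last_modified(v["LastModified"]) <= cutoff
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda v: parse_last_modified(v["LastModified"]))


def copy_object_args(bucket: str, key: str, version_id: str) -> list[str]:
    """argv that copies an older version over the current object.

    Keys containing special characters would need URL-encoding in --copy-source.
    """
    return [
        "aws", "s3api", "copy-object",
        "--bucket", bucket,
        "--key", key,
        "--copy-source", f"{bucket}/{key}?versionId={version_id}",
    ]
```

Copying a prior version onto the same key creates a new current version rather than deleting history, so the bad object remains available for inspection afterwards.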

Knobs to be aware of when planning a rollback:

  • config/sync/qa_to_prod.yaml:12-22 — the prefix list. Removing a prefix from this file stops further mirroring, but does not undo a sync that already ran.
  • treefera_market_insights/s3_sync_util/sync.py:183 — the aws s3 sync argument list, including --delete and --quiet. --delete is the reason a bad QA state overwrites prod; [PLACEHOLDER: confirm whether emergency runs should drop --delete; current code path always passes it].
  • drafts/access.md "S3 buckets" section — the read/write surface between qa-treefera-greenprint-data and prod-treefera-greenprint-data and which roles can write where.

After a rollback, re-run the verification steps in section 4 against the restored prod state before unpausing downstream consumers.