QA to prod artefact sync¶

Scope: commodity_hindcast only. This runbook covers the nightly
QA -> prod S3 sync that promotes weather-feature zarrs, NASS yields and
WASDE finals from s3://qa-treefera-greenprint-data/... to
s3://prod-treefera-greenprint-data/... so the prod hindcast/forecast
runs read the same inputs the QA cycle has signed off on.
The implementation is a thin wrapper around aws s3 sync ... --delete
--quiet. The wrapper is the qa-prod-sync console script defined in
treefera_market_insights/s3_sync_util/cli.py:38-128; the shell-out is
at treefera_market_insights/s3_sync_util/sync.py:175-194. The QA
prefix list lives in config/sync/qa_to_prod.yaml:12-22. The GitHub
Actions schedule that drives it is
.github/workflows/qa-to-prod-sync.yml:9-28. The --quiet flag was
added in commit 0e302410 (2026-05-07,
treefera_market_insights/s3_sync_util/sync.py:183); it silences
per-object copy: / delete: log lines but the wrapper still parses
summary stats from captured stdout/stderr.
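To make the capture-and-parse step concrete, here is a minimal sketch of counting copy/delete lines in captured sync output. The helper name and regexes are illustrative assumptions, not the real sync.py code:

```python
import re


def parse_sync_output(text: str) -> dict:
    """Count copy:/delete: lines in captured `aws s3 sync` output.

    Hypothetical helper: with --quiet the per-object lines are mostly
    suppressed, so any that survive are a lower bound; the real wrapper
    may parse different summary fields.
    """
    copied = len(re.findall(r"^copy:", text, flags=re.M))
    deleted = len(re.findall(r"^delete:", text, flags=re.M))
    return {"copied": copied, "deleted": deleted}
```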
1. When to use¶
Use this runbook when:
- A QA cycle has signed off — every commodity hindcast for the cycle has produced an exported delivery and the per-commodity QA report has been reviewed.
- A production refresh is required — the scheduled `0 8 * * *` UTC run failed or was skipped (concurrency cancelled, OIDC outage, runner unavailable) and prod needs to catch up before the next forecast init.
- A client deliverable is being promoted — the QA-side delivery CSV / parquet is the artefact of record and the same inputs must be visible to the prod pipeline before the prod export runs.
Do NOT use this runbook for routine dev work (QA reads from
platform-qa are fine without a sync), single-object promotes (the
wrapper syncs whole prefixes with --delete — use aws s3 cp
directly), or point-in-time backfills (the sync mirrors current QA
state only).
2. Preconditions¶
Confirm all of the following before triggering the workflow manually or running the fallback by hand:
- QA hindcast/forecast runs are FINISHED in MLflow (no `RUNNING` status). A live QA run can be writing to `s3://qa-treefera-greenprint-data/weather/processed/...` indirectly via the weather pipeline; syncing mid-write copies a half-rewritten zarr to prod.
- QA delivery CSVs have been validated by the per-commodity QA report (see the `qa-test` skill output and the commodity's acceptance criteria).
- Prod terraform env path [PLACEHOLDER: prod terraform env path — only `terraform/envs/qa/` is checked in here; confirm with the platform team] is healthy and has the role permissions for any newly added buckets/prefixes.
- ECR login works: `bash dev_tools/ecr-login.sh` (`dev_tools/ecr-login.sh:3-4` — `AWS_PROFILE=platform-qa`, ECR registry `390844765299.dkr.ecr.us-east-2.amazonaws.com`). You do not need ECR for the sync itself, but a green login is the cheapest proxy for "AWS account access works".
- AWS account `390844765299` is reachable — `aws sts get-caller-identity --profile platform-qa` returns it.
- The `prod` GitHub Environment has `SENTRY_DSN` set (`.github/workflows/qa-to-prod-sync.yml:73-78`); without it, sync errors will not surface in Sentry.
- The schedule comment in `config/sync/qa_to_prod.yaml:6` matches `.github/workflows/qa-to-prod-sync.yml:12,22`. The workflow header (`.github/workflows/qa-to-prod-sync.yml:1-5`) flags a preflight failure on mismatch; re-check both if either was just edited.
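The schedule-comment check in the last precondition amounts to a string comparison between the cron in the YAML comment and the workflow cron. A hedged sketch (the helper name is an assumption — the real preflight lives in the workflow header):

```python
import re


def crons_match(yaml_comment: str, workflow_cron: str) -> bool:
    """Compare the cron embedded in a config comment against the workflow cron.

    Hypothetical helper; assumes the comment contains 'cron:' followed by
    the expression, optionally quoted.
    """
    m = re.search(r'cron:\s*"?([\d*/, -]+)"?', yaml_comment)
    return bool(m) and m.group(1).strip() == workflow_cron.strip()
```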
3. Procedure¶
Step 1 — Inspect the workflow¶
The canonical sync is .github/workflows/qa-to-prod-sync.yml. Key
points (line ranges in that file):
- Trigger: `schedule:` cron `"0 8 * * *"` plus `workflow_dispatch` (9-13).
- Concurrency group `qa-to-prod-greenprint-sync` with `cancel-in-progress: false` — a second run queues rather than cancelling the in-flight one (15-17).
- OIDC: `permissions: id-token: write` plus a chained `aws-actions/configure-aws-credentials@v4` call assuming `vars.AWS_ROLE` (24-71). Region `us-east-2` (20).
- Entry point: `qa-prod-sync` console script (80), defined by `config/sync/qa_to_prod.yaml:8-9` and implemented in `treefera_market_insights/s3_sync_util/cli.py`.
Step 2 — Read the sync configuration¶
QA prefixes are listed in config/sync/qa_to_prod.yaml:12-22:
per-commodity stress zarrs (corn, wheat, ghana cocoa), areal
aggregation and materialised climatology zarrs, per-commodity weather
indices zarrs plus the shared climo indices zarr,
s3://qa-treefera-greenprint-data/usda/nass/
(config/sync/qa_to_prod.yaml:21) and
s3://qa-treefera-greenprint-data/wasde/
(config/sync/qa_to_prod.yaml:22). The prod URI is computed by
qa- -> prod- string replacement
(treefera_market_insights/s3_sync_util/sync.py:197-219). New prefixes
require both an edit here and a bump to the sync IAM role; see the
"S3 buckets" row in drafts/access.md.
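The qa- to prod- mapping amounts to a prefix swap on the bucket name. A sketch, assuming the swap only ever applies to the bucket (the real sync.py may implement it as a plain string replace):

```python
def qa_to_prod_uri(qa_uri: str) -> str:
    """Map a QA S3 URI to its prod counterpart by swapping the bucket prefix.

    Illustrative sketch of the qa- -> prod- replacement described above;
    swapping only the leading bucket prefix keeps any literal "qa-" that
    might appear deeper in the object key untouched.
    """
    scheme = "s3://"
    qa_prefix = scheme + "qa-"
    assert qa_uri.startswith(qa_prefix), f"not a QA URI: {qa_uri}"
    return scheme + "prod-" + qa_uri[len(qa_prefix):]
```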
Step 3 — Trigger the sync¶
Preferred: gh workflow run qa-to-prod-sync.yml --ref main (or
GitHub UI -> Actions -> "QA to prod greenprint sync" -> "Run
workflow"; workflow_dispatch is at
.github/workflows/qa-to-prod-sync.yml:13). Watch with
gh run watch <run-id>. The underlying call is
aws s3 sync <qa> <prod> --delete --quiet per prefix, batched by
treefera_market_insights/s3_sync_util/cli.py:80-108. With --quiet
(commit 0e302410,
treefera_market_insights/s3_sync_util/sync.py:183) you see one
aws_s3_sync log line per prefix and a final qa_to_prod_done line
with copied and objects_seen counts.
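For intuition, the final summary line can be sketched as an aggregation over per-prefix counts. Illustrative only — the field names follow the log line described here, but the real formatting in cli.py may differ:

```python
def qa_to_prod_done_line(prefix_results: dict) -> str:
    """Build the final qa_to_prod_done summary from per-prefix counts.

    Hypothetical sketch: each value is a dict with 'copied' and
    'objects_seen' for one synced prefix.
    """
    copied = sum(r["copied"] for r in prefix_results.values())
    seen = sum(r["objects_seen"] for r in prefix_results.values())
    return f"qa_to_prod_done copied={copied} objects_seen={seen}"
```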
Step 4 — Manual fallback¶
Use only if the workflow is offline (Actions incident, OIDC role
broken, runner regression). Local credentials must be able to write
to s3://prod-treefera-greenprint-data/ — platform-qa cannot;
assume the prod sync role manually
([PLACEHOLDER: prod sync role ARN; same role the workflow chains to
via vars.AWS_ROLE]). Then either run the wrapper:
```bash
uv run qa-prod-sync            # uses config/sync/qa_to_prod.yaml
uv run qa-prod-sync --dry-run  # validate prefixes without writing
```
…or hand-roll aws s3 sync per prefix:
```bash
aws s3 sync \
  s3://qa-treefera-greenprint-data/usda/nass/ \
  s3://prod-treefera-greenprint-data/usda/nass/ \
  --delete --quiet
```
Prefer the wrapper: it iterates the configured prefix list, flips
qa- -> prod- consistently, and raises if any prefix came back
with no copies AND prod was not modified within 24h
(treefera_market_insights/s3_sync_util/cli.py:104-118).
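A minimal sketch of that guard, assuming the 24h threshold and boolean inputs described above (the real field and function names in cli.py may differ):

```python
from datetime import datetime, timedelta, timezone


def prefix_ok(had_copy_or_delete: bool, dest_last_modified: datetime) -> bool:
    """Guard sketch: a no-op prefix is acceptable only if prod was touched recently.

    Mirrors the described had_copy_or_delete / dest_modified_within_24h
    logic: either the sync changed something, or the destination already
    reflects a sync within the last 24 hours.
    """
    fresh = datetime.now(timezone.utc) - dest_last_modified < timedelta(hours=24)
    return had_copy_or_delete or fresh
```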
4. Verification¶
After the workflow finishes (or the manual fallback exits 0):
- Open the GitHub Actions run log. The final `qa_to_prod_done` line reports `copied` (objects copied) and `objects_seen` (total). For a settled QA cycle, `copied` is typically small (single digits per zarr) plus the daily NASS/WASDE delta. [PLACEHOLDER: typical copied count after a clean QA cycle; record after first observed successful run].
- Spot-check a delivery CSV from prod: `aws s3 cp s3://prod-treefera-greenprint-data/... -` to inspect headers and the latest `as_of_date` row. Any QA-only NASS/WASDE value should now be visible from prod.
- Confirm the downstream consumer is green — the prod `commodity_hindcast` ECS task should succeed on its next scheduled run; check [PLACEHOLDER: prod ECS task scheduler / CloudWatch dashboard].
- Sanity-check Sentry — no new `qa_to_prod_*` events on the `prod` Sentry environment.
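If you want the delivery-CSV spot-check scripted rather than eyeballed, a small sketch (hypothetical helper; assumes an ISO-formatted as_of_date column in the CSV streamed via `aws s3 cp ... -`):

```python
import csv
import io


def latest_as_of_date(csv_text: str) -> str:
    """Return the max as_of_date in a delivery CSV.

    ISO date strings sort lexicographically, so max() on the raw strings
    gives the most recent row without date parsing.
    """
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return max(row["as_of_date"] for row in rows)
```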
5. Failure modes¶
5a. Workflow auth failure¶
Symptom: Configure AWS credentials or Assume deployment role step
fails with Could not load credentials or
AccessDenied: AssumeRoleWithWebIdentity.
Cause: OIDC trust between GitHub and the AWS role has drifted (role ARN renamed, trust policy claim changed, branch protection blocking the trusted ref).
Fix: (1) gh auth status locally as a basic sanity check. (2)
Inspect vars.ROLE_TO_ASSUME and vars.AWS_ROLE on the repo / prod
environment (.github/workflows/qa-to-prod-sync.yml:61,67). (3) If
the ARN looks fine, have [PLACEHOLDER: platform / SRE owner of the
prod sync role] re-apply the trust policy from terraform. (4)
Re-trigger via gh workflow run qa-to-prod-sync.yml --ref main.
5b. Partial sync (interrupted)¶
Symptom: the workflow is cancelled mid-way, or Run qa-prod-sync
exits non-zero from a transient aws s3 sync failure (S3 5xx, runner
network blip).
Cause: a transient interruption, not data corruption. aws s3 sync is
idempotent — a re-run resumes from the current diff — but the
--quiet flag (commit 0e302410) masks per-object copy: / delete:
output, so the log shows nothing between start and failure and you
cannot eyeball how far it got.
Fix: (1) Re-trigger the workflow; no manual cleanup is needed. The
wrapper's had_copy_or_delete / dest_modified_within_24h guard
(treefera_market_insights/s3_sync_util/cli.py:104-108) treats a
no-op re-run within 24h as success rather than as a stale prefix.
(2) If you need per-object visibility for a debug run, run the
wrapper locally with --dry-run (uv run qa-prod-sync --dry-run) —
the --dryrun summary is not silenced by --quiet.
5c. Schema mismatch in prod¶
Symptom: the prod commodity_hindcast ECS task fails shortly after a
successful sync with a column-not-found, zarr-coordinate-mismatch, or
schema-validation error. Cause: the QA cycle introduced a schema
change (new feature column, renamed coordinate, region added) that
prod consumers are not yet deployed to handle. The sync did its job;
the deploy is out of date.
Fix: (1) Halt downstream prod runs immediately
([PLACEHOLDER: how to halt prod ECS task scheduler — pause the
EventBridge rule or scale to 0]). (2) Roll back prod artefacts
(section 6). (3) Coordinate with the team owning the schema change
to rebuild the prod consumer image before re-running the sync.
6. Rollback¶
[PLACEHOLDER: confirm S3 versioning is enabled on
`s3://prod-treefera-greenprint-data/` — the sync uses `--delete`, so
without versioning a bad QA prefix overwrites prod with no recourse.
Owner to confirm: [PLACEHOLDER: platform team]. If versioning IS
enabled, restore via `aws s3api list-object-versions` followed by
`aws s3api copy-object` per object/version. If versioning is NOT
enabled, the rollback path is to fix QA and re-run the sync — there is
no point-in-time recovery for prod.]
Knobs to be aware of when planning a rollback:
- `config/sync/qa_to_prod.yaml:12-22` — the prefix list. Removing a prefix from this file stops further mirroring, but does not undo a sync that already ran.
- `treefera_market_insights/s3_sync_util/sync.py:183` — the `aws s3 sync` argument list, including `--delete` and `--quiet`. `--delete` is the reason a bad QA state overwrites prod; [PLACEHOLDER: confirm whether emergency runs should drop --delete; current code path always passes it].
- `drafts/access.md` "S3 buckets" section — the read/write surface between `qa-treefera-greenprint-data` and `prod-treefera-greenprint-data` and which roles can write where.
After a rollback, re-run the verification steps in section 4 against the restored prod state before unpausing downstream consumers.