Entity: Region
Definition
A geographic administrative unit at one of three levels: ADM0 (country), ADM1 (state/province), or ADM2 (county/district). Identified exclusively by a GeoIdentifier — a NewType("GeoIdentifier", str) alias — whose canonical form matches the pattern ADM0:[A-Z]{3}(/ADM1:[a-z...]+(/ADM2:[a-z...]+)?)?. The ADM level is inferred from the string structure; no separate level field is stored. ADM0 and ADM1 representations are derived by aggregation from the county-level panel.
Kind
Value object (NewType("GeoIdentifier", str) alias at lib/geo/identifiers.py:111). There is no Region class. The concept is fully encoded in the GeoIdentifier string and the AggregationLevel literal (lib/geo/aggregation.py).
Source of truth
market_insights_models/src/commodity_hindcast/lib/geo/identifiers.py:111 — GeoIdentifier = NewType("GeoIdentifier", str) declaration.
lib/geo/identifiers.py:107 — GEO_ID_PATTERN regex defining the canonical ADM path format.
lib/geo/identifiers.py:207 — make_geo_identifier(county, state, country_code) is the canonical factory; uses CommodityConfig.country_code at every pipeline call site.
Key attributes / structure
| Attribute | Type | Notes |
|---|---|---|
| GeoIdentifier string | NewType("GeoIdentifier", str) | Canonical identity; e.g. "ADM0:USA/ADM1:iowa/ADM2:polk" |
| ADM level | inferred from string | ADM0 only → country; ADM0/ADM1 → state; ADM0/ADM1/ADM2 → county |
| AggregationLevel | Literal["ADM0","ADM1","ADM2"] | Used by delivery and aggregation helpers |
| included_geo_identifiers | frozenset[str] | Top-N counties by production; persisted in run_dir/included_geo_identifiers.txt |
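
Because no level field is stored, the level has to be read off the identifier itself. The following is a minimal sketch of that inference, assuming segments are separated by `/` and each one starts with its level tag. `infer_adm_level` is a hypothetical helper introduced for illustration; it is not confirmed to exist in lib/geo.

```python
from typing import Literal, Tuple

AggregationLevel = Literal["ADM0", "ADM1", "ADM2"]

# Level tags in path order; position in the tuple equals depth in the path.
_LEVELS: Tuple[AggregationLevel, ...] = ("ADM0", "ADM1", "ADM2")


def infer_adm_level(geo_identifier: str) -> AggregationLevel:
    """Return the deepest ADM level present in a canonical identifier.

    Hypothetical helper: assumes '/'-separated segments that each start
    with their level tag, as in "ADM0:USA/ADM1:iowa/ADM2:polk".
    """
    segments = geo_identifier.split("/")
    if not 1 <= len(segments) <= len(_LEVELS):
        raise ValueError(f"unexpected segment count: {geo_identifier!r}")
    for segment, tag in zip(segments, _LEVELS):
        if not segment.startswith(tag + ":"):
            raise ValueError(f"segment {segment!r} does not start with {tag}:")
    return _LEVELS[len(segments) - 1]


print(infer_adm_level("ADM0:USA"))                      # ADM0
print(infer_adm_level("ADM0:USA/ADM1:iowa"))            # ADM1
print(infer_adm_level("ADM0:USA/ADM1:iowa/ADM2:polk"))  # ADM2
```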
Format rules (from GEO_ID_PATTERN):
- ADM0 segment: ADM0: followed by exactly 3 uppercase ASCII letters (ISO-3 country code).
- ADM1 segment: ADM1: followed by a lowercase name (letters, spaces, dots, apostrophes, hyphens).
- ADM2 segment: ADM2: followed by a lowercase name with the same character set.
- Country codes are always uppercased; name parts are always lowercased and ASCII-folded (doña ana → dona ana).
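
The format rules can be sketched as a regex plus a folding step. This is an illustration only: the character class below is reconstructed from the prose rules and may not match the real GEO_ID_PATTERN, and `fold_name` and `sketch_geo_identifier` are hypothetical names, not the functions in lib/geo/identifiers.py.

```python
import re
import unicodedata

# ASSUMPTION: character class rebuilt from the prose rules (lowercase
# letters, spaces, dots, apostrophes, hyphens). The real GEO_ID_PATTERN
# in lib/geo/identifiers.py may differ.
_NAME = r"[a-z .'\-]+"
SKETCH_GEO_ID_PATTERN = re.compile(
    rf"ADM0:[A-Z]{{3}}(/ADM1:{_NAME}(/ADM2:{_NAME})?)?"
)


def fold_name(raw: str) -> str:
    """Lowercase a name part and strip diacritics (doña ana -> dona ana)."""
    decomposed = unicodedata.normalize("NFKD", raw)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return ascii_only.strip().lower()


def sketch_geo_identifier(county: str, state: str, country_code: str) -> str:
    """Illustrative stand-in for make_geo_identifier; not the real factory."""
    candidate = (
        f"ADM0:{country_code.upper()}"
        f"/ADM1:{fold_name(state)}"
        f"/ADM2:{fold_name(county)}"
    )
    if SKETCH_GEO_ID_PATTERN.fullmatch(candidate) is None:
        raise ValueError(f"not a canonical identifier: {candidate!r}")
    return candidate


print(sketch_geo_identifier("Doña Ana", "New Mexico", "usa"))
# ADM0:USA/ADM1:new mexico/ADM2:dona ana
```

One property of this particular folding approach is worth knowing: NFKD decomposition handles accented letters such as ñ, ã, and ç, but characters with no decomposition (for example ø or ß) are dropped outright by the ASCII encode step rather than transliterated.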
Cardinality: ~3,000–4,000 county-level GeoIdentifiers per commodity in the full panel. For model training this is reduced to included_geo_identifiers, the top-95%-by-production subset (~800–1,200 counties). ADM0 and ADM1 identifiers are derived by aggregation at delivery time.
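
The reduction from the full panel to the training subset can be illustrated with a cumulative-share selection. This sketch assumes "top 95% by production" means the smallest set of highest-producing counties whose combined production reaches 95% of the total; the actual selection rule, the function name `select_by_production_share`, and the toy figures are all assumptions.

```python
from typing import Dict, FrozenSet


def select_by_production_share(
    production: Dict[str, float], share: float = 0.95
) -> FrozenSet[str]:
    """Pick the highest producers until `share` of total production is covered."""
    total = sum(production.values())
    if total <= 0:
        return frozenset()
    selected = []
    covered = 0.0
    # Largest producers first; ties broken by identifier for determinism.
    ranked = sorted(production.items(), key=lambda kv: (-kv[1], kv[0]))
    for geo_id, amount in ranked:
        if covered >= share * total:
            break
        selected.append(geo_id)
        covered += amount
    return frozenset(selected)


# Toy production figures (arbitrary units), invented for the example.
toy = {
    "ADM0:USA/ADM1:iowa/ADM2:polk": 60.0,
    "ADM0:USA/ADM1:iowa/ADM2:story": 30.0,
    "ADM0:USA/ADM1:iowa/ADM2:adair": 8.0,
    "ADM0:USA/ADM1:iowa/ADM2:taylor": 2.0,
}
included = select_by_production_share(toy)
print(sorted(included))  # adair, polk, story; taylor falls below the cut
```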
Lifecycle
Created: make_geo_identifier(county, state, country_code) (lib/geo/identifiers.py:207) is the canonical factory, called from builder modules during feature assembly. normalise_geo_identifier (lib/geo/identifiers.py:75) handles legacy or alternative formats before the identifier is stored. verify_geo_identifier (lib/geo/identifiers.py:114) validates a scalar value and returns a typed GeoIdentifier.
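
The validation step matters because a NewType is erased at runtime: wrapping a string in GeoIdentifier checks nothing. The sketch below shows that behaviour next to a validating function. `sketch_verify` is a hypothetical stand-in for verify_geo_identifier, and the regex is an assumed approximation of GEO_ID_PATTERN.

```python
import re
from typing import NewType

GeoIdentifier = NewType("GeoIdentifier", str)

# ASSUMPTION: simplified stand-in for GEO_ID_PATTERN; the real regex may differ.
_NAME = r"[a-z .'\-]+"
_PATTERN = re.compile(rf"ADM0:[A-Z]{{3}}(/ADM1:{_NAME}(/ADM2:{_NAME})?)?")


def sketch_verify(value: object) -> GeoIdentifier:
    """Validate a scalar and return it typed; raise on anything non-canonical."""
    if not isinstance(value, str):
        raise TypeError(f"expected str, got {type(value).__name__}")
    if _PATTERN.fullmatch(value) is None:
        raise ValueError(f"not a canonical geo identifier: {value!r}")
    return GeoIdentifier(value)


# NewType performs no runtime check: this malformed value is accepted as-is.
unchecked = GeoIdentifier("adm0:usa/iowa")
print(type(unchecked) is str)  # True

# Only the validating function rejects malformed input.
checked = sketch_verify("ADM0:USA/ADM1:iowa/ADM2:polk")
print(checked)
```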
Consumed:
- Feature parquets — geo_identifier is the second column of INDEX_COLS = ("year", "geo_identifier", "init_date") (features/builders/interface.py:21).
- FIT stage — ExperimentResult.save_included_geo_identifiers() writes the top-N county frozenset to run_dir/included_geo_identifiers.txt (lib/results/run_result.py:157); load_included_geo_identifiers() reads it back at PREDICT.
- Delivery — DeliveryRow.geo_identifier (delivery/schemas.py:132) is an identity column in every delivery CSV row; ADM0 and ADM1 rows carry aggregated identifiers.
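
The FIT-to-PREDICT hand-off of the inclusion set can be pictured as a plain-text round trip. This is a sketch under an assumed file layout (one identifier per line, sorted); the real format written by save_included_geo_identifiers() is not specified here, and `save_included` and `load_included` are hypothetical names.

```python
import tempfile
from pathlib import Path
from typing import FrozenSet, Iterable

FILENAME = "included_geo_identifiers.txt"


def save_included(run_dir: Path, geo_ids: Iterable[str]) -> Path:
    """Write identifiers sorted, one per line (ASSUMED file layout)."""
    path = run_dir / FILENAME
    path.write_text("\n".join(sorted(set(geo_ids))) + "\n", encoding="utf-8")
    return path


def load_included(run_dir: Path) -> FrozenSet[str]:
    """Read the file back, skipping blank lines."""
    text = (run_dir / FILENAME).read_text(encoding="utf-8")
    return frozenset(line.strip() for line in text.splitlines() if line.strip())


with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp)
    original = frozenset({
        "ADM0:USA/ADM1:iowa/ADM2:polk",
        "ADM0:USA/ADM1:new mexico/ADM2:dona ana",
    })
    save_included(run_dir, original)
    restored = load_included(run_dir)

print(restored == original)  # True
```

Sorting before writing keeps the file stable across runs, which makes diffs between run directories readable. Names containing spaces survive because lines are only trimmed at the ends.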
Destroyed: Never destroyed; immutable once written to parquet or CSV.
Relationships to other entities
- Commodity (scoped by): CommodityConfig.country_code determines the ADM0 segment of every identifier minted for a commodity run
- Yield (indexes): every yield value is keyed by (geo_identifier, season_year, init_date) or (geo_identifier, season_year); ADM0 aggregation weights by area_harvested_ha
- Fold (filtered per): included_geo_identifiers is a run-level constant shared across all folds; the inclusion set is fixed at FIT time
Concepts and pipelines that touch this entity
- Pipeline: hindcast (P5) — FIT stage selects counties by production threshold; DELIVER aggregates county predictions to ADM0/ADM1/ADM2
- Concept: geo identifier format (P5) — canonical ADM path format, normalisation rules, and validation
- Concept: spatial aggregation (P5) — area-weighted aggregation from ADM2 → ADM1 → ADM0 at delivery time
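
The ADM2 → ADM1 → ADM0 roll-up can be sketched as an area-weighted mean applied one level at a time, with the parent identifier obtained by dropping the last path segment. This is an illustration, not the pipeline's aggregation helper: `parent_of`, `aggregate_up`, and the toy values are hypothetical, and the weight is assumed to be area_harvested_ha.

```python
from collections import defaultdict
from typing import Dict, Tuple

# Maps geo identifier -> (yield value, area harvested in ha).
YieldArea = Dict[str, Tuple[float, float]]


def parent_of(geo_identifier: str) -> str:
    """Drop the last path segment: .../ADM1:iowa/ADM2:polk -> .../ADM1:iowa."""
    head, sep, _ = geo_identifier.rpartition("/")
    if not sep:
        raise ValueError(f"{geo_identifier!r} is ADM0 and has no parent")
    return head


def aggregate_up(rows: YieldArea) -> YieldArea:
    """Roll rows up one level, returning (area-weighted yield, total area).

    Carrying the summed area forward lets the output be fed back in
    to produce the next level up.
    """
    weighted_sum: Dict[str, float] = defaultdict(float)
    area_sum: Dict[str, float] = defaultdict(float)
    for geo_id, (yield_value, area_ha) in rows.items():
        parent = parent_of(geo_id)
        weighted_sum[parent] += yield_value * area_ha
        area_sum[parent] += area_ha
    return {
        parent: (weighted_sum[parent] / area, area)
        for parent, area in area_sum.items()
        if area > 0
    }


# Toy county rows, invented for the example.
adm2 = {
    "ADM0:USA/ADM1:iowa/ADM2:polk": (10.0, 100.0),
    "ADM0:USA/ADM1:iowa/ADM2:story": (12.0, 300.0),
    "ADM0:USA/ADM1:illinois/ADM2:cook": (8.0, 100.0),
}
adm1 = aggregate_up(adm2)
adm0 = aggregate_up(adm1)
print(adm1["ADM0:USA/ADM1:iowa"])  # (11.5, 400.0)
print(adm0["ADM0:USA"])
```

Because each level carries its total area, aggregating ADM1 results to ADM0 gives the same answer as aggregating the counties directly.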
PRs and commits
- PR-360: Adds Brazil soybean support; make_geo_identifier call sites must pass country_code="BRA" explicitly; validates that CONAB identifiers match the canonical ADM pattern
- PR-339: Structural refactor that moved lib/geo/identifiers.py to its canonical location and consolidated normalise_geo_identifier / assert_valid_geo_identifiers into one module
- PR-345: S3 path fixes; geo_identifier column joins between local and S3 parquets exposed a latent normalisation inconsistency in a zarr builder
Open questions
- GeoIdentifier is a NewType, but nothing enforces that string values flowing through the pipeline have been validated via verify_geo_identifier; invalid strings can silently propagate until a join fails.
- The included_geo_identifiers set is fixed at FIT time and shared across all folds; a county that crosses the production threshold late in the panel will be absent from early-fold training data.
- ADM1-level GeoIdentifier strings (ADM0:USA/ADM1:iowa) are produced by aggregation helpers but are not minted by make_geo_identifier (which always produces ADM2); the aggregation path should be made explicit.
- The GEO_ID_PATTERN regex allows dots, apostrophes, and hyphens in ADM1/ADM2 names, but the normaliser may not handle all edge cases for international geographies (e.g. Brazilian municipality names with cedillas).