Entity: Region

Definition

A geographic administrative unit at one of three levels: ADM0 (country), ADM1 (state/province), or ADM2 (county/district). Identified exclusively by a GeoIdentifier — a NewType("GeoIdentifier", str) alias — whose canonical form matches the pattern ADM0:[A-Z]{3}(/ADM1:[a-z...]+(/ADM2:[a-z...]+)?)?. The ADM level is inferred from the string structure; no separate level field is stored. ADM0 and ADM1 representations are derived by aggregation from the county-level panel.
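
Because the level lives in the string itself, it can be recovered by counting path segments. The following is a minimal sketch under that assumption; `infer_adm_level` is a hypothetical helper written for illustration, not a function known to exist in the repository.

```python
from typing import Literal

AggregationLevel = Literal["ADM0", "ADM1", "ADM2"]


def infer_adm_level(geo_identifier: str) -> AggregationLevel:
    """Return the deepest ADM level present in a canonical path (hypothetical helper)."""
    # "ADM0:USA/ADM1:iowa" -> ["ADM0", "ADM1"]
    prefixes = [segment.split(":", 1)[0] for segment in geo_identifier.split("/")]
    # Segments must appear in order, starting at ADM0, with at most three levels.
    expected = ["ADM0", "ADM1", "ADM2"][: len(prefixes)]
    if prefixes != expected:
        raise ValueError(f"not a canonical ADM path: {geo_identifier!r}")
    return expected[-1]  # type: ignore[return-value]
```

This only checks segment order and depth; character-level validation of each segment is a separate concern.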

Kind

Value object (NewType("GeoIdentifier", str) alias at lib/geo/identifiers.py:111). There is no Region class. The concept is fully encoded in the GeoIdentifier string and the AggregationLevel literal (lib/geo/aggregation.py).
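
As a reminder of what a `NewType` alias does and does not provide, the snippet below uses only the standard library. The alias is visible to static type checkers, but at runtime the call returns its argument unchanged, so no validation takes place.

```python
from typing import NewType

GeoIdentifier = NewType("GeoIdentifier", str)

gid = GeoIdentifier("ADM0:USA/ADM1:iowa/ADM2:polk")

# At runtime the value is a plain str: no wrapper class is created.
print(type(gid) is str)  # True

# Nothing stops a malformed string from being "typed" as a GeoIdentifier.
bogus = GeoIdentifier("not an identifier")
print(bogus == "not an identifier")  # True
```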

Source of truth

- market_insights_models/src/commodity_hindcast/lib/geo/identifiers.py:111 — GeoIdentifier = NewType("GeoIdentifier", str) declaration.
- lib/geo/identifiers.py:107 — GEO_ID_PATTERN, the regex defining the canonical ADM path format.
- lib/geo/identifiers.py:207 — make_geo_identifier(county, state, country_code), the canonical factory; every pipeline call site passes CommodityConfig.country_code as country_code.

Key attributes / structure

Attribute	Type	Notes
`GeoIdentifier` string	`NewType("GeoIdentifier", str)`	Canonical identity; e.g. `"ADM0:USA/ADM1:iowa/ADM2:polk"`
ADM level	inferred from string	`ADM0` only → country; `ADM0/ADM1` → state; `ADM0/ADM1/ADM2` → county
`AggregationLevel`	`Literal["ADM0","ADM1","ADM2"]`	Used by delivery and aggregation helpers
`included_geo_identifiers`	`frozenset[str]`	Top-N counties by production; persisted in `run_dir/included_geo_identifiers.txt`

Format rules (from GEO_ID_PATTERN):

- ADM0 segment: ADM0: followed by exactly 3 uppercase ASCII letters (ISO-3 country code).
- ADM1 segment: ADM1: followed by a lowercase name (letters, spaces, dots, apostrophes, hyphens).
- ADM2 segment: ADM2: followed by a lowercase name with the same character set.
- Country codes are always uppercased; name parts are always lowercased and ASCII-folded (doña ana → dona ana).
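
The format rules above can be turned into a regular expression along the following lines. This is a reconstruction from the prose rules only: `SKETCH_PATTERN` and `is_canonical` are hypothetical names, and the real GEO_ID_PATTERN may differ in detail (for example in whether digits are allowed in names).

```python
import re

# Hypothetical character class: lowercase letters, spaces, dots,
# apostrophes, and hyphens, as listed in the format rules.
NAME = r"[a-z .'\-]+"

# Assumed shape: ADM0 is mandatory; ADM2 may only appear under an ADM1.
SKETCH_PATTERN = re.compile(
    rf"ADM0:[A-Z]{{3}}(/ADM1:{NAME}(/ADM2:{NAME})?)?"
)


def is_canonical(value: str) -> bool:
    """True if the whole string matches the sketched canonical format."""
    return SKETCH_PATTERN.fullmatch(value) is not None


print(is_canonical("ADM0:USA/ADM1:new mexico/ADM2:dona ana"))  # True
print(is_canonical("ADM0:USA/ADM1:new mexico/ADM2:doña ana"))  # False: not ASCII-folded
print(is_canonical("ADM0:usa/ADM1:iowa"))                      # False: country code not uppercase
```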

Cardinality: ~3,000–4,000 county-level GeoIdentifiers per commodity in the full panel; reduced to the top-95%-by-production subset included_geo_identifiers (~800–1,200) for model training. ADM0 and ADM1 identifiers are derived by aggregation at delivery time.
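
One plausible reading of "top 95% by production" is: rank counties by production and keep the largest producers until their cumulative share reaches 95%. The sketch below implements that reading. It is an assumption about the selection rule, not a description of the actual FIT-stage code, and `select_top_share` is a hypothetical name.

```python
def select_top_share(
    production_by_county: dict[str, float], share: float = 0.95
) -> frozenset[str]:
    """Keep the largest producers until their cumulative share reaches `share`."""
    total = sum(production_by_county.values())
    if total <= 0:
        return frozenset()
    selected: list[str] = []
    running = 0.0
    ranked = sorted(production_by_county.items(), key=lambda kv: kv[1], reverse=True)
    for geo_id, production in ranked:
        if running >= share * total:
            break  # threshold already covered by larger producers
        selected.append(geo_id)
        running += production
    return frozenset(selected)


# Illustrative numbers only, not real production data.
example = {
    "ADM0:USA/ADM1:iowa/ADM2:polk": 60.0,
    "ADM0:USA/ADM1:iowa/ADM2:story": 30.0,
    "ADM0:USA/ADM1:iowa/ADM2:boone": 6.0,
    "ADM0:USA/ADM1:iowa/ADM2:adair": 4.0,
}
included = select_top_share(example)
print(sorted(included))  # polk, story, and boone are kept; adair is dropped
```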

Lifecycle

Created: make_geo_identifier(county, state, country_code) (lib/geo/identifiers.py:207) is the canonical factory, called from builder modules during feature assembly. normalise_geo_identifier (lib/geo/identifiers.py:75) handles legacy or alternative formats before the identifier is stored. verify_geo_identifier (lib/geo/identifiers.py:114) validates a scalar value and returns a typed GeoIdentifier.
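
To make the creation step concrete, here is a rough sketch of what a factory with this signature could look like, following the documented rules (uppercase country code, lowercased and ASCII-folded names). `make_geo_identifier_sketch` and `ascii_fold` are hypothetical stand-ins; the folding strategy (NFKD decomposition, then dropping non-ASCII code points) is an assumption, not the confirmed implementation.

```python
import unicodedata


def ascii_fold(text: str) -> str:
    """Strip diacritics by decomposing and dropping non-ASCII code points (assumed strategy)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def make_geo_identifier_sketch(county: str, state: str, country_code: str) -> str:
    """Hypothetical stand-in for make_geo_identifier; always yields an ADM2 path."""
    adm0 = country_code.strip().upper()
    adm1 = ascii_fold(state).strip().lower()
    adm2 = ascii_fold(county).strip().lower()
    return f"ADM0:{adm0}/ADM1:{adm1}/ADM2:{adm2}"


print(make_geo_identifier_sketch("Doña Ana", "New Mexico", "usa"))
# ADM0:USA/ADM1:new mexico/ADM2:dona ana
```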

Consumed:

- Feature parquets — geo_identifier is the second column of INDEX_COLS = ("year", "geo_identifier", "init_date") (features/builders/interface.py:21).
- FIT stage — ExperimentResult.save_included_geo_identifiers() writes the top-N county frozenset to run_dir/included_geo_identifiers.txt (lib/results/run_result.py:157); load_included_geo_identifiers() reads it back at PREDICT.
- Delivery — DeliveryRow.geo_identifier (delivery/schemas.py:132) is an identity column in every delivery CSV row; ADM0 and ADM1 rows carry aggregated identifiers.
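
The FIT-to-PREDICT hand-off of the inclusion set can be pictured as a plain-text round trip. The sketch below assumes one identifier per line, which the source does not confirm; `save_included` and `load_included` are hypothetical free functions, not the ExperimentResult methods themselves.

```python
import tempfile
from pathlib import Path

FILENAME = "included_geo_identifiers.txt"


def save_included(run_dir: Path, geo_ids: frozenset[str]) -> Path:
    """Write identifiers one per line, sorted so the file diffs cleanly (assumed layout)."""
    path = run_dir / FILENAME
    path.write_text("\n".join(sorted(geo_ids)) + "\n", encoding="utf-8")
    return path


def load_included(run_dir: Path) -> frozenset[str]:
    """Read the file back, skipping blank lines; names may contain spaces, so lines are not split."""
    lines = (run_dir / FILENAME).read_text(encoding="utf-8").splitlines()
    return frozenset(line for line in lines if line.strip())


with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp)
    original = frozenset({
        "ADM0:USA/ADM1:iowa/ADM2:polk",
        "ADM0:USA/ADM1:new mexico/ADM2:dona ana",
    })
    save_included(run_dir, original)
    restored = load_included(run_dir)

print(restored == original)  # True
```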

Destroyed: Never destroyed; immutable once written to parquet or CSV.

Relationships to other entities

Commodity — scoped by — CommodityConfig.country_code determines the ADM0 segment of every identifier minted for a commodity run
Yield — indexes — every yield value is keyed by (geo_identifier, season_year, init_date) or (geo_identifier, season_year); ADM0 aggregation weights by area_harvested_ha
Fold — filtered per — included_geo_identifiers is a run-level constant shared across all folds; the inclusion set is fixed at FIT time

Concepts and pipelines that touch this entity

Pipeline: hindcast (P5) — FIT stage selects counties by production threshold; DELIVER emits county-level (ADM2) predictions and aggregates them to ADM1 and ADM0
Concept: geo identifier format (P5) — canonical ADM path format, normalisation rules, and validation
Concept: spatial aggregation (P5) — area-weighted aggregation from ADM2 → ADM1 → ADM0 at delivery time
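
Area-weighted aggregation pairs naturally with the path format: truncating a county identifier gives its parent, and yields are averaged with area_harvested_ha as the weight. The sketch below illustrates the idea on made-up numbers. `parent_identifier` and `aggregate_yield` are hypothetical helpers, and aggregating straight from county rows (rather than in two hops through ADM1) is a simplifying assumption.

```python
from collections import defaultdict


def parent_identifier(geo_identifier: str, level: str) -> str:
    """Truncate an ADM2 path to the requested level, e.g. ADM1 keeps two segments."""
    depth = {"ADM0": 1, "ADM1": 2, "ADM2": 3}[level]
    return "/".join(geo_identifier.split("/")[:depth])


def aggregate_yield(rows, level: str) -> dict[str, float]:
    """rows: iterable of (geo_identifier, yield_value, area_harvested_ha) tuples."""
    weighted_sum: dict[str, float] = defaultdict(float)
    area_sum: dict[str, float] = defaultdict(float)
    for geo_id, yield_value, area in rows:
        key = parent_identifier(geo_id, level)
        weighted_sum[key] += yield_value * area
        area_sum[key] += area
    # Regions with zero total area are skipped to avoid division by zero.
    return {k: weighted_sum[k] / area_sum[k] for k in area_sum if area_sum[k] > 0}


# Illustrative values only.
rows = [
    ("ADM0:USA/ADM1:iowa/ADM2:polk", 10.0, 100.0),
    ("ADM0:USA/ADM1:iowa/ADM2:story", 12.0, 300.0),
    ("ADM0:USA/ADM1:illinois/ADM2:cook", 8.0, 100.0),
]
by_state = aggregate_yield(rows, "ADM1")
by_country = aggregate_yield(rows, "ADM0")
print(by_state["ADM0:USA/ADM1:iowa"])  # 11.5
print(by_country["ADM0:USA"])          # 10.8
```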

PRs and commits

PR-360 — Adds Brazil soybean support; make_geo_identifier call sites must pass country_code="BRA" explicitly; validates that CONAB identifiers match the canonical ADM pattern
PR-339 — Structural refactor that moved lib/geo/identifiers.py to its canonical location and consolidated normalise_geo_identifier / assert_valid_geo_identifiers into one module
PR-345 — S3 path fixes; geo_identifier column joins between local and S3 parquets exposed a latent normalisation inconsistency in a zarr builder

Open questions

GeoIdentifier is a NewType but there is no enforcement that string values flowing through the pipeline have been validated via verify_geo_identifier — invalid strings can silently propagate until a join fails.
The included_geo_identifiers set is fixed at FIT time and shared across all folds; a county that crosses the production threshold late in the panel will be absent from early-fold training data.
ADM1-level GeoIdentifier strings (ADM0:USA/ADM1:iowa) are produced by aggregation helpers but are not minted by make_geo_identifier (which always produces ADM2); the aggregation path should be made explicit.
The GEO_ID_PATTERN regex allows dots, apostrophes, and hyphens in ADM1/ADM2 names but the normaliser may not handle all edge cases for international geographies (e.g. Brazilian municipality names with cedillas).
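
To give the last question something concrete to test against, the snippet below shows how one common folding strategy behaves on a few international names. It assumes NFKD decomposition followed by dropping non-ASCII code points; whether normalise_geo_identifier works this way is not stated in the source. Under this assumption cedillas and tildes fold cleanly, while letters without a Unicode decomposition are silently dropped.

```python
import unicodedata


def ascii_fold(text: str) -> str:
    """Assumed folding strategy: decompose, then discard anything outside ASCII."""
    return unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")


print(ascii_fold("conceição"))  # conceicao
print(ascii_fold("doña ana"))   # dona ana
print(ascii_fold("tromsø"))     # troms (ø has no decomposition, so it is lost)
```

A name that loses characters this way still matches the canonical pattern, so the loss would only surface as a failed or ambiguous join.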