Entity: Region

Definition

A geographic administrative unit at one of three levels: ADM0 (country), ADM1 (state/province), or ADM2 (county/district). Identified exclusively by a GeoIdentifier — a NewType("GeoIdentifier", str) alias — whose canonical form matches the pattern ADM0:[A-Z]{3}(/ADM1:[a-z...]+(/ADM2:[a-z...]+)?)?. The ADM level is inferred from the string structure; no separate level field is stored. ADM0 and ADM1 representations are derived by aggregation from the county-level panel.

Kind

Value object (NewType("GeoIdentifier", str) alias at lib/geo/identifiers.py:111). There is no Region class. The concept is fully encoded in the GeoIdentifier string and the AggregationLevel literal (lib/geo/aggregation.py).
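Because there is no Region class and no stored level field, any consumer that needs the ADM level has to read it off the string. The following is a minimal sketch of that idea, not the project's code: `infer_level` is a hypothetical helper name, and the only thing taken from the source is the `GeoIdentifier` NewType, the `AggregationLevel` literal, and the path layout.

```python
from typing import Literal, NewType

GeoIdentifier = NewType("GeoIdentifier", str)
AggregationLevel = Literal["ADM0", "ADM1", "ADM2"]


def infer_level(geo_id: str) -> AggregationLevel:
    """Hypothetical helper: read the ADM level off the deepest path segment."""
    # "ADM0:USA/ADM1:iowa/ADM2:polk" -> last segment "ADM2:polk" -> prefix "ADM2"
    last_segment = geo_id.split("/")[-1]
    prefix = last_segment.split(":", 1)[0]
    if prefix == "ADM0":
        return "ADM0"
    if prefix == "ADM1":
        return "ADM1"
    if prefix == "ADM2":
        return "ADM2"
    raise ValueError(f"unrecognised segment prefix in {geo_id!r}")


county = GeoIdentifier("ADM0:USA/ADM1:iowa/ADM2:polk")
level = infer_level(county)

# NewType exists only for the type checker; at runtime the value is a plain str.
is_plain_str = type(county) is str
```

The last line is the reason the value-object framing matters: wrapping a string in `GeoIdentifier(...)` performs no validation by itself.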

Source of truth

  • market_insights_models/src/commodity_hindcast/lib/geo/identifiers.py:111 declares GeoIdentifier = NewType("GeoIdentifier", str).
  • lib/geo/identifiers.py:107 defines GEO_ID_PATTERN, the regex for the canonical ADM path format.
  • lib/geo/identifiers.py:207 defines make_geo_identifier(county, state, country_code), the canonical factory; it uses CommodityConfig.country_code at every pipeline call site.

Key attributes / structure

| Attribute | Type | Notes |
|---|---|---|
| GeoIdentifier string | NewType("GeoIdentifier", str) | Canonical identity; e.g. "ADM0:USA/ADM1:iowa/ADM2:polk" |
| ADM level | inferred from string | ADM0 only → country; ADM0/ADM1 → state; ADM0/ADM1/ADM2 → county |
| AggregationLevel | Literal["ADM0","ADM1","ADM2"] | Used by delivery and aggregation helpers |
| included_geo_identifiers | frozenset[str] | Top-N counties by production; persisted in run_dir/included_geo_identifiers.txt |

Format rules (from GEO_ID_PATTERN):

  • ADM0 segment: ADM0: followed by exactly 3 uppercase ASCII letters (ISO-3 country code).
  • ADM1 segment: ADM1: followed by a lowercase name (letters, spaces, dots, apostrophes, hyphens).
  • ADM2 segment: ADM2: followed by a lowercase name with the same character set.
  • Country codes are always uppercased; name parts are always lowercased and ASCII-folded (doña ana → dona ana).
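The format rules above can be sketched as a regular expression. This is a reconstruction from the prose rules only: the character class is an assumption, the real GEO_ID_PATTERN at lib/geo/identifiers.py:107 may differ, and `SKETCH_GEO_ID_PATTERN` and `is_canonical` are hypothetical names.

```python
import re

# Assumed name character set: lowercase ASCII letters, space, dot, apostrophe, hyphen.
NAME = r"[a-z .'\-]+"

# Reconstructed from the prose rules; NOT the project's GEO_ID_PATTERN.
SKETCH_GEO_ID_PATTERN = re.compile(
    rf"ADM0:[A-Z]{{3}}(/ADM1:{NAME}(/ADM2:{NAME})?)?"
)


def is_canonical(value: str) -> bool:
    """Return True when the whole string matches the sketched ADM path format."""
    return SKETCH_GEO_ID_PATTERN.fullmatch(value) is not None


examples = {
    "ADM0:USA": is_canonical("ADM0:USA"),
    "ADM0:USA/ADM1:new mexico/ADM2:dona ana": is_canonical(
        "ADM0:USA/ADM1:new mexico/ADM2:dona ana"
    ),
    # Not ASCII-folded, so it falls outside the assumed character set.
    "ADM0:USA/ADM1:new mexico/ADM2:doña ana": is_canonical(
        "ADM0:USA/ADM1:new mexico/ADM2:doña ana"
    ),
    # ADM2 cannot appear without an ADM1 parent.
    "ADM0:USA/ADM2:polk": is_canonical("ADM0:USA/ADM2:polk"),
}
```

Using `fullmatch` rather than `match` matters here: a prefix match would accept strings with trailing junk after a valid ADM0 segment.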

Cardinality: ~3,000–4,000 county-level GeoIdentifiers per commodity in the full panel; reduced to the top-95%-by-production subset included_geo_identifiers (~800–1,200) for model training. ADM0 and ADM1 identifiers are derived by aggregation at delivery time.
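The reduction from the full panel to the included subset can be illustrated with a cumulative-share cut. The source does not state the exact selection rule, so this sketch assumes one plausible reading: rank counties by production, then keep the smallest leading set whose cumulative production reaches 95% of the total. `select_top_share` and its inputs are hypothetical.

```python
def select_top_share(production: dict[str, float], share: float = 0.95) -> frozenset[str]:
    """Assumed rule: smallest set of top producers covering `share` of total production."""
    total = sum(production.values())
    if total <= 0:
        return frozenset()

    selected: list[str] = []
    running = 0.0
    # Sort by production descending; break ties by identifier for determinism.
    for geo_id, value in sorted(production.items(), key=lambda kv: (-kv[1], kv[0])):
        if running >= share * total:
            break
        selected.append(geo_id)
        running += value
    return frozenset(selected)


# Toy production figures (made up for illustration only).
toy_production = {
    "ADM0:USA/ADM1:iowa/ADM2:polk": 60.0,
    "ADM0:USA/ADM1:iowa/ADM2:story": 30.0,
    "ADM0:USA/ADM1:iowa/ADM2:boone": 6.0,
    "ADM0:USA/ADM1:iowa/ADM2:adair": 4.0,
}
included = select_top_share(toy_production)
```

With the toy figures, the first three counties cover 96% of production, so the fourth is dropped.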

Lifecycle

Created: make_geo_identifier(county, state, country_code) (lib/geo/identifiers.py:207) is the canonical factory, called from builder modules during feature assembly. normalise_geo_identifier (lib/geo/identifiers.py:75) handles legacy or alternative formats before the identifier is stored. verify_geo_identifier (lib/geo/identifiers.py:114) validates a scalar value and returns a typed GeoIdentifier.
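A factory with this signature might look roughly like the sketch below. Only the argument order (county, state, country_code) and the documented casing and ASCII-folding rules come from the source; the function body, the `ascii_fold` helper, and the whitespace handling are assumptions, and the real make_geo_identifier may normalise differently.

```python
import unicodedata
from typing import NewType

GeoIdentifier = NewType("GeoIdentifier", str)


def ascii_fold(name: str) -> str:
    """Hypothetical helper: lowercase, strip diacritics, collapse whitespace."""
    # NFKD splits "ñ" into "n" plus a combining tilde; the ASCII encode drops the mark.
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = decomposed.encode("ascii", "ignore").decode("ascii")
    return " ".join(stripped.lower().split())


def sketch_make_geo_identifier(county: str, state: str, country_code: str) -> GeoIdentifier:
    """Sketch of an ADM2 factory; not the project's make_geo_identifier."""
    return GeoIdentifier(
        f"ADM0:{country_code.upper()}/ADM1:{ascii_fold(state)}/ADM2:{ascii_fold(county)}"
    )


us_county = sketch_make_geo_identifier("Doña Ana", "New Mexico", "usa")
br_municipality = sketch_make_geo_identifier("São Gonçalo", "Rio de Janeiro", "BRA")
```

One caveat with this folding approach: characters that have no Unicode decomposition (for example "ø") are dropped outright by the ASCII encode step instead of being mapped to a base letter, so a production normaliser would need an explicit policy for them.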

Consumed:

  • Feature parquets: geo_identifier is the second column of INDEX_COLS = ("year", "geo_identifier", "init_date") (features/builders/interface.py:21).
  • FIT stage: ExperimentResult.save_included_geo_identifiers() writes the top-N county frozenset to run_dir/included_geo_identifiers.txt (lib/results/run_result.py:157); load_included_geo_identifiers() reads it back at PREDICT.
  • Delivery: DeliveryRow.geo_identifier (delivery/schemas.py:132) is an identity column in every delivery CSV row; ADM0 and ADM1 rows carry aggregated identifiers.
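The FIT-to-PREDICT hand-off of the inclusion set can be sketched as a plain-text round trip. The on-disk layout is not described in the source, so one identifier per line in sorted order is an assumption, and `save_included` / `load_included` are hypothetical stand-ins for the ExperimentResult methods.

```python
import tempfile
from pathlib import Path

FILENAME = "included_geo_identifiers.txt"


def save_included(run_dir: Path, included: frozenset[str]) -> Path:
    """Write one identifier per line, sorted so the file is stable across runs."""
    path = run_dir / FILENAME
    path.write_text("\n".join(sorted(included)) + "\n", encoding="utf-8")
    return path


def load_included(run_dir: Path) -> frozenset[str]:
    """Read the file back, ignoring blank lines."""
    text = (run_dir / FILENAME).read_text(encoding="utf-8")
    return frozenset(line.strip() for line in text.splitlines() if line.strip())


included = frozenset({
    "ADM0:USA/ADM1:iowa/ADM2:polk",
    "ADM0:USA/ADM1:new mexico/ADM2:dona ana",
})

# Use a throwaway directory in place of a real run_dir.
with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp)
    save_included(run_dir, included)
    restored = load_included(run_dir)
```

Sorting before writing keeps diffs between runs readable, and a frozenset on the reading side preserves the immutability implied by a run-level constant.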

Destroyed: Never destroyed; immutable once written to parquet or CSV.

Relationships to other entities

  • Commodity — scoped by — CommodityConfig.country_code determines the ADM0 segment of every identifier minted for a commodity run
  • Yield — indexes — every yield value is keyed by (geo_identifier, season_year, init_date) or (geo_identifier, season_year); ADM0 aggregation weights by area_harvested_ha
  • Fold — filtered per — included_geo_identifiers is a run-level constant shared across all folds; the inclusion set is fixed at FIT time

Concepts and pipelines that touch this entity

  • Pipeline: hindcast (P5) — FIT stage selects counties by production threshold; DELIVER emits county (ADM2) predictions and aggregates them to ADM1 and ADM0
  • Concept: geo identifier format (P5) — canonical ADM path format, normalisation rules, and validation
  • Concept: spatial aggregation (P5) — area-weighted aggregation from ADM2 → ADM1 → ADM0 at delivery time
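Area-weighted aggregation up the ADM path can be sketched by truncating each identifier to its parent and taking a weighted mean. This is an illustration under assumptions: the source only says aggregation is weighted by area_harvested_ha, while `parent_of`, `aggregate_one_level`, and the toy numbers are hypothetical.

```python
from collections import defaultdict


def parent_of(geo_id: str) -> str:
    """Drop the deepest segment: ".../ADM1:iowa/ADM2:polk" -> ".../ADM1:iowa"."""
    head, sep, _ = geo_id.rpartition("/")
    if not sep:
        raise ValueError(f"{geo_id!r} is already at ADM0 and has no parent")
    return head


def aggregate_one_level(
    rows: dict[str, tuple[float, float]],
) -> dict[str, tuple[float, float]]:
    """Map geo_id -> (yield, area_harvested_ha) to parent -> (weighted yield, total area)."""
    weighted_sum: dict[str, float] = defaultdict(float)
    area_sum: dict[str, float] = defaultdict(float)
    for geo_id, (yield_value, area) in rows.items():
        parent = parent_of(geo_id)
        weighted_sum[parent] += yield_value * area
        area_sum[parent] += area
    # Skip parents with zero total area to avoid dividing by zero.
    return {
        parent: (weighted_sum[parent] / area, area)
        for parent, area in area_sum.items()
        if area > 0
    }


# Toy county rows (made-up values): geo_id -> (yield, area_harvested_ha).
counties = {
    "ADM0:USA/ADM1:iowa/ADM2:polk": (10.0, 100.0),
    "ADM0:USA/ADM1:iowa/ADM2:story": (12.0, 300.0),
    "ADM0:USA/ADM1:illinois/ADM2:cook": (8.0, 100.0),
}
states = aggregate_one_level(counties)   # ADM2 -> ADM1
country = aggregate_one_level(states)    # ADM1 -> ADM0
```

Carrying the summed area alongside each weighted mean is what lets the ADM1 result feed the ADM0 step and still agree with aggregating the counties directly.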

PRs and commits

  • PR-360 — Adds Brazil soybean support; make_geo_identifier call sites must pass country_code="BRA" explicitly; validates that CONAB identifiers match the canonical ADM pattern
  • PR-339 — Structural refactor that moved lib/geo/identifiers.py to its canonical location and consolidated normalise_geo_identifier / assert_valid_geo_identifiers into one module
  • PR-345 — S3 path fixes; geo_identifier column joins between local and S3 parquets exposed a latent normalisation inconsistency in a zarr builder

Open questions

  • GeoIdentifier is a NewType but there is no enforcement that string values flowing through the pipeline have been validated via verify_geo_identifier — invalid strings can silently propagate until a join fails.
  • The included_geo_identifiers set is fixed at FIT time and shared across all folds; a county that crosses the production threshold late in the panel will be absent from early-fold training data.
  • ADM1-level GeoIdentifier strings (ADM0:USA/ADM1:iowa) are produced by aggregation helpers but are not minted by make_geo_identifier (which always produces ADM2); the aggregation path should be made explicit.
  • The GEO_ID_PATTERN regex allows dots, apostrophes, and hyphens in ADM1/ADM2 names but the normaliser may not handle all edge cases for international geographies (e.g. Brazilian municipality names with cedillas).