ADM Levels

What it is

The commodity_hindcast pipeline represents every geographic unit — from a county to a country — as a string called a geo_identifier. This string encodes both the geographic entity and its administrative level (ADM) in a single human-readable path. There are three levels:

Level   Name                Example
ADM0    Country             ADM0:USA
ADM1    State / province    ADM0:USA/ADM1:iowa
ADM2    County / district   ADM0:USA/ADM1:iowa/ADM2:johnson

The canonical format is defined in DESIGN.md Clause 17:

"all geographic identifiers normalised to lowercase ADM path ADM0:usa/ADM1:{state}/ADM2:{county} — no FIPS codes, no mixed case"

(Note: Clause 17 shows ADM0:usa in lower case in the example, but the validation regex enforces an uppercase three-letter country code: ADM0:[A-Z]{3}. The clause example is informal shorthand; the enforced canonical form is ADM0:USA/ADM1:iowa/....)

Where it lives

Symbol                                File                      Line
GEO_ID_PATTERN                        lib/geo/identifiers.py    107
GeoIdentifier (NewType)               lib/geo/identifiers.py    111
normalise_geo_identifier              lib/geo/identifiers.py    75
verify_geo_identifier                 lib/geo/identifiers.py    114
make_geo_identifier                   lib/geo/identifiers.py    207
apply_geo_lookup                      lib/geo/identifiers.py    161
AggregationLevel                      lib/geo/aggregation.py    10
aggregate_weighted_frame              lib/geo/aggregation.py    96
walk_forward_preds_to_delivery_rows   delivery/conversions.py   182

The GeoIdentifier type

GeoIdentifier (identifiers.py:111) is a NewType("GeoIdentifier", str) — a typed string that has passed pattern validation. It is returned by verify_geo_identifier and make_geo_identifier. Using NewType rather than a plain str makes the distinction between validated and unvalidated identifiers visible in type hints.

The regex GEO_ID_PATTERN (identifiers.py:107) accepts exactly three formats:

ADM0:[A-Z]{3}
ADM0:[A-Z]{3}/ADM1:[a-z][a-z .'\-]*
ADM0:[A-Z]{3}/ADM1:[a-z][a-z .'\-]*/ADM2:[a-z][a-z .'\-]*

The level is therefore implicit in the string structure: the presence of /ADM2: signals ADM2; a terminal /ADM1: without /ADM2: signals ADM1; a bare ADM0: with no slash signals ADM0.
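
As a minimal sketch of how a consumer might recover the level from the string alone, the three accepted formats can be folded into one pattern with optional trailing groups. SKETCH_PATTERN and adm_level are hypothetical names used only for illustration; the actual GEO_ID_PATTERN in identifiers.py may be written differently.

```python
import re

# Name segments: a lowercase letter, then lowercase letters, spaces, dots,
# apostrophes or hyphens (copied from the three formats listed above).
_NAME = r"[a-z][a-z .'\-]*"

# Hypothetical single-regex equivalent of the three formats; this is not
# the pipeline's actual GEO_ID_PATTERN.
SKETCH_PATTERN = re.compile(
    rf"ADM0:[A-Z]{{3}}(?:/ADM1:{_NAME}(?:/ADM2:{_NAME})?)?"
)


def adm_level(geo_identifier: str) -> str:
    """Return "ADM0", "ADM1" or "ADM2" for a valid identifier."""
    if SKETCH_PATTERN.fullmatch(geo_identifier) is None:
        raise ValueError(f"not a canonical geo_identifier: {geo_identifier!r}")
    # Name segments cannot contain "/", so the slash count equals the level.
    return f"ADM{geo_identifier.count('/')}"
```

Counting slashes is safe only after the pattern check, because the check guarantees that every slash separates two ADM segments.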

make_geo_identifier — construction from raw survey fields

make_geo_identifier (identifiers.py:207) is the canonical constructor for ADM2 identifiers from NASS-style survey data:

def make_geo_identifier(county: str, state: str, country_code: str) -> GeoIdentifier:

Its behaviour:

  1. Upper-cases country_code and validates it against r"[A-Z]{3}" — a non-ISO-3 code raises ValueError immediately (fails fast here rather than producing a non-canonical identifier that only fails later inside verify_geo_identifier).
  2. Passes state and county through _normalise_name, which lowercases, strips, ASCII-folds accented characters, and collapses whitespace.
  3. Calls verify_geo_identifier on the assembled path to produce a typed GeoIdentifier.

Call sites pass cfg.commodity.country_code as the country_code argument; no caller hardcodes an ISO-3 code inline. The NASS loader (lib/reference_data/nass.py:31) calls this function to turn the raw NASS county and state columns into canonical ADM2 identifiers.
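
The three steps listed above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in identifiers.py: make_geo_identifier_sketch and _normalise_name_sketch are hypothetical stand-ins, the final regex check stands in for verify_geo_identifier, and the sketch returns a plain str rather than a GeoIdentifier.

```python
import re
import unicodedata


def _normalise_name_sketch(raw: str) -> str:
    """Lowercase, strip, ASCII-fold accents and collapse whitespace."""
    decomposed = unicodedata.normalize("NFD", raw.strip().lower())
    folded = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(folded.split())


def make_geo_identifier_sketch(county: str, state: str, country_code: str) -> str:
    code = country_code.upper()
    # Step 1: fail fast on anything that is not three capital letters.
    if re.fullmatch(r"[A-Z]{3}", code) is None:
        raise ValueError(f"country_code must be three letters, got {country_code!r}")
    # Step 2: normalise the free-text name fields and assemble the path.
    state_name = _normalise_name_sketch(state)
    county_name = _normalise_name_sketch(county)
    path = f"ADM0:{code}/ADM1:{state_name}/ADM2:{county_name}"
    # Step 3: validate the assembled path (stand-in for verify_geo_identifier).
    name = r"[a-z][a-z .'\-]*"
    if re.fullmatch(rf"ADM0:[A-Z]{{3}}/ADM1:{name}/ADM2:{name}", path) is None:
        raise ValueError(f"assembled identifier is not canonical: {path!r}")
    return path
```

With these assumptions, make_geo_identifier_sketch(" Johnson ", "IOWA", "usa") yields ADM0:USA/ADM1:iowa/ADM2:johnson, while a two-letter code such as "US" raises before any path is assembled.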

Normalisation — normalise_geo_identifier

normalise_geo_identifier (identifiers.py:75) normalises a possibly malformed identifier to canonical ASCII form. It handles:

  • Accented characters via _to_ascii (NFD decomposition + accent strip) and a hand-curated _SPECIAL_CHAR_MAP (identifiers.py:29) for characters that do not decompose cleanly (ł → l, ø → o, æ → ae, ß → ss, etc.).
  • Mixed case: ADM0 segments are uppercased; ADM1 and ADM2 segments are lowercased.
  • Whitespace: collapsed to single spaces.

Example: "ADM0:BRA/ADM1:São Paulo/ADM2:Campinas" → "ADM0:BRA/ADM1:sao paulo/ADM2:campinas".

The vectorised form normalise_geo_identifiers accepts a pd.Series or np.ndarray.
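
A rough sketch of the normalisation rules described above, for a single identifier. SPECIAL_CHARS holds only the four mappings named in the text, and to_ascii_sketch and normalise_sketch are hypothetical names; the real _to_ascii, _SPECIAL_CHAR_MAP and normalise_geo_identifier may differ in detail.

```python
import unicodedata

# Illustrative subset of characters that NFD cannot decompose; the real
# _SPECIAL_CHAR_MAP is hand-curated and may contain more entries.
SPECIAL_CHARS = {"ł": "l", "ø": "o", "æ": "ae", "ß": "ss"}


def to_ascii_sketch(text: str) -> str:
    """Replace special characters, then strip accents via NFD decomposition."""
    replaced = "".join(SPECIAL_CHARS.get(ch, ch) for ch in text)
    decomposed = unicodedata.normalize("NFD", replaced)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalise_sketch(identifier: str) -> str:
    parts = []
    for segment in identifier.split("/"):
        prefix, _, name = segment.partition(":")
        name = " ".join(name.split())  # trim and collapse runs of whitespace
        if prefix.upper() == "ADM0":
            name = to_ascii_sketch(name).upper()  # country code: upper case
        else:
            name = to_ascii_sketch(name.lower())  # ADM1/ADM2 names: lower case
        parts.append(f"{prefix.upper()}:{name}")
    return "/".join(parts)
```

The names are lowercased before the special-character map is applied so that the map only needs lowercase keys.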

apply_geo_lookup — hash-to-ADM remapping

apply_geo_lookup (identifiers.py:161) maps an array of raw geo values (typically opaque hash IDs from a zarr) to canonical ADM identifiers via a lookup CSV:

def apply_geo_lookup(geo_values, geo_lookup, keycol="geo_id", valcol="identifier") -> np.ndarray:

It uses pandas.Series.map for O(n) hash-table lookup. A special edge case is handled: if all rows miss the keycol lookup but every input value already appears in the valcol column, the inputs are returned unchanged — this covers zarrs that have already been remapped to ADM identifiers (e.g. after a preprocessing step).
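
The lookup and its pass-through edge case can be sketched without dependencies. Note the substitutions: this sketch uses a plain dict in place of pandas.Series.map (the same hash-table idea), takes the lookup as a list of row dicts instead of a loaded CSV, and returns None for misses. apply_geo_lookup_sketch is a hypothetical name, and how the real function reports partial misses is an assumption here.

```python
from typing import Optional, Sequence


def apply_geo_lookup_sketch(
    geo_values: Sequence[str],
    lookup_rows: Sequence[dict],
    keycol: str = "geo_id",
    valcol: str = "identifier",
) -> list:
    # Build the hash table once, then map every input in O(n).
    mapping = {row[keycol]: row[valcol] for row in lookup_rows}
    mapped = [mapping.get(value) for value in geo_values]

    # Edge case: nothing matched keycol, but every input is already a known
    # identifier in valcol, so the inputs were remapped earlier.
    all_missed = all(m is None for m in mapped)
    already_canonical = set(geo_values) <= set(mapping.values())
    if geo_values and all_missed and already_canonical:
        return list(geo_values)
    return mapped  # assumption: unmatched inputs are left as None
```

The pass-through applies only when every row misses; a partial miss still returns the mapped values with gaps, which keeps genuinely unknown IDs visible.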

AggregationLevel — typed ADM level for delivery

lib/geo/aggregation.py:10 defines:

AggregationLevel = Literal["ADM0", "ADM1", "ADM2"]

This is the type used by the delivery and diagnostics layers to specify which level to aggregate to. It appears in aggregate_weighted_frame and walk_forward_preds_to_delivery_rows.

Delivery aggregation: ADM2 → ADM1 → ADM0

delivery/conversions.py:182 (walk_forward_preds_to_delivery_rows) aggregates walk-forward predictions from their native ADM2 county resolution up to the requested delivery level:

if level == "ADM2":
    # No aggregation needed — keep county rows directly
    agg_pd = combined[keep_cols].copy()
else:
    agg_pd = _aggregate_to_level(
        combined,
        value_cols,
        level=level,
        group_cols=["year", "init_date"],
        area_col=area_col,
    )

The _aggregate_to_level helper (conversions.py:55) delegates to aggregate_weighted_frame (lib/geo/aggregation.py:96), which strips the trailing ADM segment(s) from each geo_identifier to compute the target-level group key, then applies area_weighted_mean within each (year, init_date, target_geo) group.

For example, when level="ADM1", the ADM2 identifier ADM0:USA/ADM1:iowa/ADM2:johnson is reduced to ADM0:USA/ADM1:iowa and all Johnson County rows are merged into the Iowa group weighted by area_harvested_ha.

When level="ADM0", the identifier is further reduced to ADM0:USA and state-level groups are merged into the national aggregate.
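
The segment stripping that produces the target-level group key can be illustrated as below. truncate_to_level is a hypothetical helper written for this page; aggregate_weighted_frame may derive the key differently.

```python
from typing import Literal

AggregationLevel = Literal["ADM0", "ADM1", "ADM2"]


def truncate_to_level(geo_identifier: str, level: AggregationLevel) -> str:
    """Keep only the leading segments needed for the requested level."""
    keep = int(level[-1]) + 1  # ADM0 keeps 1 segment, ADM1 keeps 2, ADM2 keeps 3
    segments = geo_identifier.split("/")
    if len(segments) < keep:
        # An identifier cannot be refined to a finer level than it encodes.
        raise ValueError(f"{geo_identifier!r} is coarser than {level}")
    return "/".join(segments[:keep])
```

Rows that share the same truncated key, together with year and init_date, then fall into the same aggregation group.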

An unweighted mean of yields is explicitly forbidden (DESIGN.md). area_weighted_mean raises ValueError when a group has a non-NaN yield but all-NaN area, rather than silently falling back to an unweighted mean.
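
A minimal sketch of an area-weighted mean with that guard, for one group of rows. area_weighted_mean_sketch is a hypothetical name; returning NaN for a group with no yields, dropping rows whose area is NaN when other areas exist, and rejecting a zero total area are assumptions of this sketch, not documented behaviour of area_weighted_mean.

```python
import math
from typing import Sequence


def area_weighted_mean_sketch(yields: Sequence[float], areas: Sequence[float]) -> float:
    rows = [(y, a) for y, a in zip(yields, areas) if not math.isnan(y)]
    if not rows:
        return math.nan  # assumption: a group with no yields stays NaN
    weighted = [(y, a) for y, a in rows if not math.isnan(a)]
    if not weighted:
        # Yields exist but no area weights: refuse to fall back to a plain mean.
        raise ValueError("group has non-NaN yield but all-NaN area")
    total_area = sum(a for _, a in weighted)
    if total_area == 0:
        raise ValueError("total area is zero")  # assumption of this sketch
    return sum(y * a for y, a in weighted) / total_area
```

For yields of 10 and 20 with areas of 1 and 3, the weighted result is (10·1 + 20·3) / 4 = 17.5, whereas an unweighted mean would give 15.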

National-level benchmark columns

At level="ADM0", walk_forward_preds_to_delivery_rows also joins two NASS benchmark series from the full NASS county universe (not just the production-filtered subset used for the model):

  • nass_actual_area_weighted_all — survey yield, area-weighted per year
  • nass_actual_prod_div_area_all — national production ÷ total area, per year

These are loaded via nass_benchmarks.py and joined on year. They appear only at ADM0 level in the delivery CSV.

Key invariants

  • Every geo_identifier in a feature parquet or prediction artefact is at ADM2 level. ADM1 and ADM0 rows are produced only at the delivery boundary.
  • make_geo_identifier is the only constructor for new ADM2 identifiers from raw survey fields. Inline string formatting of ADM0:.../ADM1:.../ADM2:... outside this function is forbidden (DESIGN.md Clause 22: single canonical implementation).
  • The ADM level is determined by parsing the string, not by a separate column. Consumers must use GEO_ID_PATTERN or string splitting to determine level; there is no adm_level column in any parquet.

How it interacts with the pipeline

ADM2 identifiers are created at the NASS-loading step (nass.py:31) and flow through every feature parquet, fit parquet, and prediction parquet unchanged. At the deliver stage, walk_forward_preds_to_delivery_rows is called three times — once for each of ADM0, ADM1, and ADM2 — producing three separate delivery CSVs, one per level. The run_dir/delivery/ subtree therefore contains three files per mode (hindcast or forecast).

Pitfalls

  • Zarrs produced by the weather pipeline use opaque integer or hash-based geo IDs. These must be remapped to ADM identifiers via apply_geo_lookup before they can be joined to feature parquets. Skipping this step produces unmapped NaN rows.
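  • Names containing punctuation (apostrophes in d'iberville, hyphens in winston-salem) are preserved by _normalise_name: the regex allows [a-z .'\-] after the first character. Callers that strip punctuation before calling make_geo_identifier will produce identifiers such as ADM2:diberville that differ from the canonical form. Because a punctuation-free lowercase name still matches GEO_ID_PATTERN, verify_geo_identifier will not catch the difference; it typically surfaces later as rows that fail to join.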
  • assert_valid_geo_identifiers (identifiers.py:138) raises ValueError if any value in a column is invalid, listing up to five examples. It should be called after any step that constructs or normalises identifiers to surface bugs early.
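
The column check in the last pitfall might look roughly like this. assert_valid_sketch and SKETCH_PATTERN are hypothetical names, the pattern is a reconstruction from the three formats documented on this page, and the error message wording is invented for illustration.

```python
import re
from typing import Iterable

_NAME = r"[a-z][a-z .'\-]*"
# Reconstruction of the documented formats; not the pipeline's GEO_ID_PATTERN.
SKETCH_PATTERN = re.compile(
    rf"ADM0:[A-Z]{{3}}(?:/ADM1:{_NAME}(?:/ADM2:{_NAME})?)?"
)


def assert_valid_sketch(values: Iterable[object], max_examples: int = 5) -> None:
    """Raise ValueError if any value is not a canonical identifier."""
    invalid = [
        v for v in values
        if not (isinstance(v, str) and SKETCH_PATTERN.fullmatch(v))
    ]
    if invalid:
        # Show only the first few offenders to keep the message readable.
        shown = ", ".join(repr(v) for v in invalid[:max_examples])
        raise ValueError(
            f"{len(invalid)} invalid geo_identifier value(s); examples: {shown}"
        )
```

Capping the examples keeps the error useful on columns with thousands of bad rows while still reporting the total count.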

See Also

  • DeliveryRow — the client-facing schema that carries geo_identifier at each ADM level in delivery CSVs

Open questions

  • GEO_ID_PATTERN allows hyphens and apostrophes in name segments, but the normalisation pipeline is not explicitly tested to confirm that round-tripping through normalise_geo_identifier → verify_geo_identifier preserves all allowed characters.
  • There is no ADM3 (sub-county) level; adding one would require changes to GEO_ID_PATTERN, make_geo_identifier, normalise_geo_identifier, and all delivery aggregation logic.