ADM Levels
What it is
The commodity_hindcast pipeline represents every geographic unit — from a county to a
country — as a string called a geo_identifier. This string encodes both the
geographic entity and its administrative level (ADM) in a single human-readable
path. There are three levels:
| Level | Name | Example |
|---|---|---|
| ADM0 | Country | ADM0:USA |
| ADM1 | State / province | ADM0:USA/ADM1:iowa |
| ADM2 | County / district | ADM0:USA/ADM1:iowa/ADM2:johnson |
The canonical format is defined in DESIGN.md Clause 17:
"all geographic identifiers normalised to lowercase ADM path
ADM0:usa/ADM1:{state}/ADM2:{county}— no FIPS codes, no mixed case"
(Note: Clause 17 shows ADM0:usa in lower case in the example, but the validation
regex enforces an uppercase three-letter country code: ADM0:[A-Z]{3}. The clause
example is informal shorthand; the enforced canonical form is ADM0:USA/ADM1:iowa/....)
Where it lives
| Symbol | File | Line |
|---|---|---|
| GEO_ID_PATTERN | lib/geo/identifiers.py | 107 |
| GeoIdentifier (NewType) | lib/geo/identifiers.py | 111 |
| normalise_geo_identifier | lib/geo/identifiers.py | 75 |
| verify_geo_identifier | lib/geo/identifiers.py | 114 |
| make_geo_identifier | lib/geo/identifiers.py | 207 |
| apply_geo_lookup | lib/geo/identifiers.py | 161 |
| AggregationLevel | lib/geo/aggregation.py | 10 |
| aggregate_weighted_frame | lib/geo/aggregation.py | 96 |
| walk_forward_preds_to_delivery_rows | delivery/conversions.py | 182 |
The GeoIdentifier type
GeoIdentifier (identifiers.py:111) is a NewType("GeoIdentifier", str) — a typed
string that has passed pattern validation. It is returned by verify_geo_identifier
and make_geo_identifier. Using NewType rather than a plain str makes the
distinction between validated and unvalidated identifiers visible in type hints.
The regex GEO_ID_PATTERN (identifiers.py:107) accepts exactly three formats:
ADM0:[A-Z]{3}
ADM0:[A-Z]{3}/ADM1:[a-z][a-z .'\-]*
ADM0:[A-Z]{3}/ADM1:[a-z][a-z .'\-]*/ADM2:[a-z][a-z .'\-]*
The level is therefore implicit in the string structure: the presence of /ADM2:
signals ADM2; a terminal /ADM1: without /ADM2: signals ADM1; a bare ADM0: with
no slash signals ADM0.
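As a minimal sketch of how the level can be read off the string, the three formats above can be folded into a single regular expression with nested optional groups. The pattern text below is reconstructed from the three formats listed here, not copied from identifiers.py, and adm_level is a hypothetical helper name.

```python
import re

# Assumed reconstruction of the three accepted formats as one pattern;
# the real GEO_ID_PATTERN in identifiers.py may be written differently.
_NAME = r"[a-z][a-z .'\-]*"
GEO_ID_SKETCH = re.compile(
    rf"ADM0:[A-Z]{{3}}(?:/ADM1:{_NAME}(?:/ADM2:{_NAME})?)?"
)


def adm_level(geo_identifier: str) -> str:
    """Return "ADM0", "ADM1" or "ADM2" for a valid identifier (hypothetical helper)."""
    if GEO_ID_SKETCH.fullmatch(geo_identifier) is None:
        raise ValueError(f"not a canonical geo_identifier: {geo_identifier!r}")
    # Name segments cannot contain "/", so each slash adds exactly one level.
    return f"ADM{geo_identifier.count('/')}"


print(adm_level("ADM0:USA"))                         # ADM0
print(adm_level("ADM0:USA/ADM1:iowa"))               # ADM1
print(adm_level("ADM0:USA/ADM1:iowa/ADM2:johnson"))  # ADM2
```

Validating before counting slashes keeps the helper from assigning a level to a malformed string such as ADM0:usa/ADM1:iowa.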
make_geo_identifier: construction from raw survey fields
identifiers.py:207 is the canonical constructor for ADM2 identifiers from NASS-style
survey data. Its behaviour:
- Upper-cases country_code and validates it against r"[A-Z]{3}". A code that does not match raises ValueError immediately, so the error surfaces here rather than later inside verify_geo_identifier, after a non-canonical identifier has already been assembled.
- Passes state and county through _normalise_name, which lowercases, strips, ASCII-folds accented characters, and collapses whitespace.
- Calls verify_geo_identifier on the assembled path to produce a typed GeoIdentifier.
Call sites pass cfg.commodity.country_code as the country_code argument; no caller
hardcodes an ISO-3 code inline. The NASS loader (lib/reference_data/nass.py:31)
calls this function to turn the raw NASS county and state columns into canonical
ADM2 identifiers.
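The three steps described in this section can be sketched as follows. This is an illustration under stated assumptions, not the code in identifiers.py: the function names carry a _sketch suffix to mark them as hypothetical, the name normaliser is a guess at what _normalise_name does, and the final regex check stands in for verify_geo_identifier. The sketch returns a plain str rather than the GeoIdentifier NewType.

```python
import re
import unicodedata


def _normalise_name_sketch(raw: str) -> str:
    """Lowercase, ASCII-fold, strip, and collapse whitespace (assumed behaviour)."""
    decomposed = unicodedata.normalize("NFD", raw)
    # Drop combining marks so "é" becomes "e"; apostrophes and hyphens survive.
    folded = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(folded.lower().split())


def make_geo_identifier_sketch(country_code: str, state: str, county: str) -> str:
    """Build an ADM2 path from raw survey fields (illustrative only)."""
    code = country_code.upper()
    if re.fullmatch(r"[A-Z]{3}", code) is None:
        # Fail fast on a malformed country code.
        raise ValueError(f"country_code must be three letters, got {country_code!r}")
    path = (
        f"ADM0:{code}"
        f"/ADM1:{_normalise_name_sketch(state)}"
        f"/ADM2:{_normalise_name_sketch(county)}"
    )
    # Stand-in for verify_geo_identifier: validate the assembled ADM2 path.
    name = r"[a-z][a-z .'\-]*"
    if re.fullmatch(rf"ADM0:[A-Z]{{3}}/ADM1:{name}/ADM2:{name}", path) is None:
        raise ValueError(f"invalid geo_identifier: {path!r}")
    return path


print(make_geo_identifier_sketch("usa", "  Iowa ", "Johnson"))
# ADM0:USA/ADM1:iowa/ADM2:johnson
```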
Normalisation: normalise_geo_identifier
identifiers.py:75 normalises a possibly-malformed identifier to canonical ASCII form.
It handles:
- Accented characters via _to_ascii (NFD decomposition + accent strip) and a hand-curated _SPECIAL_CHAR_MAP (identifiers.py:29) for characters that do not decompose cleanly (ł → l, ø → o, æ → ae, ß → ss, etc.).
- Mixed case: ADM0 segments are uppercased; ADM1 and ADM2 segments are lowercased.
- Whitespace: collapsed to single spaces.
Example: "ADM0:BRA/ADM1:São Paulo/ADM2:Campinas" → "ADM0:BRA/ADM1:sao paulo/ADM2:campinas".
The vectorised form normalise_geo_identifiers accepts a pd.Series or np.ndarray.
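A rough sketch of the scalar normalisation, assuming it works segment by segment. The special-character table here is only the four examples quoted above, not the real _SPECIAL_CHAR_MAP, and both function names are hypothetical.

```python
import unicodedata

# Illustrative subset only; the real _SPECIAL_CHAR_MAP is not reproduced here.
SPECIAL_CHARS = {"ł": "l", "ø": "o", "æ": "ae", "ß": "ss"}


def to_ascii_sketch(text: str) -> str:
    """Map special characters, then NFD-decompose and drop combining accents."""
    mapped = "".join(SPECIAL_CHARS.get(ch, ch) for ch in text)
    decomposed = unicodedata.normalize("NFD", mapped)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalise_geo_identifier_sketch(raw: str) -> str:
    """Uppercase the ADM0 value, lowercase ADM1/ADM2 values, collapse whitespace."""
    segments = []
    for segment in raw.split("/"):
        prefix, _, value = segment.partition(":")
        # Lowercase first so the special-character map only needs lowercase keys.
        value = " ".join(to_ascii_sketch(value.lower()).split())
        if prefix == "ADM0":
            value = value.upper()
        segments.append(f"{prefix}:{value}")
    return "/".join(segments)


print(normalise_geo_identifier_sketch("ADM0:BRA/ADM1:São Paulo/ADM2:Campinas"))
# ADM0:BRA/ADM1:sao paulo/ADM2:campinas
```

Lowercasing before the character map is a choice made for this sketch; it keeps the table small because uppercase variants such as Ł never reach it.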
apply_geo_lookup: hash-to-ADM remapping
identifiers.py:161 maps an array of raw geo values (typically opaque hash IDs from a
zarr) to canonical ADM identifiers via a lookup CSV.
It uses pandas.Series.map for O(n) hash-table lookup. A special edge case is handled:
if all rows miss the keycol lookup but every input value already appears in the
valcol column, the inputs are returned unchanged — this covers zarrs that have
already been remapped to ADM identifiers (e.g. after a preprocessing step).
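The pass-through edge case is easier to see in code. The sketch below deliberately swaps pandas.Series.map for a plain dict so that it runs without dependencies; the lookup logic is the same hash-table idea. The column names hash_id and geo_identifier, the list-of-dicts stand-in for the lookup CSV, and the function name are all assumptions made for illustration.

```python
import math


def apply_geo_lookup_sketch(values, lookup_rows, keycol="hash_id", valcol="geo_identifier"):
    """Map raw geo values to ADM identifiers; unmapped values become NaN.

    lookup_rows stands in for the lookup CSV as a list of dicts.
    """
    values = list(values)
    mapping = {row[keycol]: row[valcol] for row in lookup_rows}
    mapped = [mapping.get(v, math.nan) for v in values]  # one dict lookup per value

    all_missed = all(isinstance(m, float) and math.isnan(m) for m in mapped)
    known_targets = set(mapping.values())
    if values and all_missed and all(v in known_targets for v in values):
        # Inputs are already ADM identifiers (e.g. a pre-remapped zarr): pass through.
        return values
    return mapped


lookup = [
    {"hash_id": "a1f3", "geo_identifier": "ADM0:USA/ADM1:iowa/ADM2:johnson"},
    {"hash_id": "9c2e", "geo_identifier": "ADM0:USA/ADM1:iowa/ADM2:linn"},
]
print(apply_geo_lookup_sketch(["9c2e", "a1f3"], lookup))
print(apply_geo_lookup_sketch(["ADM0:USA/ADM1:iowa/ADM2:linn"], lookup))
```

The second call returns its input unchanged because every value misses the key column but is already present among the lookup targets.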
AggregationLevel: typed ADM level for delivery
lib/geo/aggregation.py:10 defines AggregationLevel.
This is the type used by the delivery and diagnostics layers to specify which level
to aggregate to. It appears in aggregate_weighted_frame and
walk_forward_preds_to_delivery_rows.
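The definition itself is not reproduced on this page. Since the delivery code compares level against the string "ADM2", one plausible shape is a typing.Literal over the three level names. That shape is an assumption, and parse_level is a hypothetical helper showing how such a type can also be checked at runtime.

```python
from typing import Literal, get_args

# Assumed shape; the actual definition in lib/geo/aggregation.py may differ.
AggregationLevel = Literal["ADM0", "ADM1", "ADM2"]


def parse_level(raw: str) -> AggregationLevel:
    """Validate a user-supplied level string at runtime (hypothetical helper)."""
    if raw not in get_args(AggregationLevel):
        raise ValueError(f"unknown aggregation level: {raw!r}")
    return raw  # type: ignore[return-value]


print(parse_level("ADM1"))  # ADM1
```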
Delivery aggregation: ADM2 → ADM1 → ADM0
delivery/conversions.py:182 (walk_forward_preds_to_delivery_rows) aggregates
walk-forward predictions from their native ADM2 county resolution up to the requested
delivery level:

    if level == "ADM2":
        # No aggregation needed: keep county rows directly
        agg_pd = combined[keep_cols].copy()
    else:
        agg_pd = _aggregate_to_level(
            combined,
            value_cols,
            level=level,
            group_cols=["year", "init_date"],
            area_col=area_col,
        )

The _aggregate_to_level helper (conversions.py:55) delegates to
aggregate_weighted_frame (lib/geo/aggregation.py:96), which strips the trailing
ADM segment(s) from each geo_identifier to compute the target-level group key, then
applies area_weighted_mean within each (year, init_date, target_geo) group.
For example, when level="ADM1", the ADM2 identifier
ADM0:USA/ADM1:iowa/ADM2:johnson is reduced to ADM0:USA/ADM1:iowa and all Johnson
County rows are merged into the Iowa group weighted by area_harvested_ha.
When level="ADM0", the identifier is further reduced to ADM0:USA and state-level
groups are merged into the national aggregate.
Unweighted mean of yields is explicitly forbidden (DESIGN.md). area_weighted_mean
raises ValueError when a group has non-NaN yield but all-NaN area rather than
silently falling back to an unweighted mean.
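The grouping and weighting described above can be sketched without pandas. Everything below is illustrative: the function names mirror the ones in the text but are not the real implementations, rows are plain tuples rather than a DataFrame, and skipping rows where either yield or area is NaN is an assumption about how partial data is treated.

```python
import math
from collections import defaultdict

_DEPTH = {"ADM0": 1, "ADM1": 2, "ADM2": 3}


def truncate_to_level(geo_identifier: str, level: str) -> str:
    """Keep only the leading path segments needed for the target level."""
    return "/".join(geo_identifier.split("/")[: _DEPTH[level]])


def area_weighted_mean(yields, areas):
    """Weighted mean over rows where both yield and area are present."""
    has_yield = any(not math.isnan(y) for y in yields)
    if has_yield and all(math.isnan(a) for a in areas):
        # Refuse to fall back to an unweighted mean.
        raise ValueError("yield present but area is all-NaN")
    pairs = [(y, a) for y, a in zip(yields, areas)
             if not math.isnan(y) and not math.isnan(a)]
    total_area = sum(a for _, a in pairs)
    if total_area == 0:
        return math.nan
    return sum(y * a for y, a in pairs) / total_area


def aggregate_rows(rows, level):
    """rows: (year, init_date, geo_identifier, yield, area) tuples."""
    groups = defaultdict(lambda: ([], []))
    for year, init_date, geo, y, area in rows:
        key = (year, init_date, truncate_to_level(geo, level))
        groups[key][0].append(y)
        groups[key][1].append(area)
    return {key: area_weighted_mean(ys, areas) for key, (ys, areas) in groups.items()}


rows = [
    (2020, "2020-06-01", "ADM0:USA/ADM1:iowa/ADM2:johnson", 10.0, 1.0),
    (2020, "2020-06-01", "ADM0:USA/ADM1:iowa/ADM2:linn", 20.0, 3.0),
]
print(aggregate_rows(rows, "ADM1"))
# {(2020, '2020-06-01', 'ADM0:USA/ADM1:iowa'): 17.5}
```

With areas of 1 and 3, the result is (10·1 + 20·3) / 4 = 17.5, whereas an unweighted mean would give 15.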
National-level benchmark columns
At level="ADM0", walk_forward_preds_to_delivery_rows also joins two NASS
benchmark series from the full NASS county universe (not just the production-filtered
subset used for the model):
- nass_actual_area_weighted_all: survey yield, area-weighted per year
- nass_actual_prod_div_area_all: national production ÷ total area, per year
These are loaded via nass_benchmarks.py and joined on year. They appear only at
ADM0 level in the delivery CSV.
Key invariants
- Every geo_identifier in a feature parquet or prediction artefact is at ADM2 level. ADM1 and ADM0 rows are produced only at the delivery boundary.
- make_geo_identifier is the only constructor for new ADM2 identifiers from raw survey fields. Inline string formatting of ADM0:.../ADM1:.../ADM2:... outside this function is forbidden (DESIGN.md Clause 22: single canonical implementation).
- The ADM level is determined by parsing the string, not by a separate column. Consumers must use GEO_ID_PATTERN or string splitting to determine level; there is no adm_level column in any parquet.
How it interacts with the pipeline
ADM2 identifiers are created at the NASS-loading step (nass.py:31) and flow through
every feature parquet, fit parquet, and prediction parquet unchanged. At the deliver
stage, walk_forward_preds_to_delivery_rows is called three times — once for each of
ADM0, ADM1, and ADM2 — producing three separate delivery CSVs, one per level.
The run_dir/delivery/ subtree therefore contains three files per mode (hindcast or
forecast).
Pitfalls
- Zarrs produced by the weather pipeline use opaque integer or hash-based geo IDs. These must be remapped to ADM identifiers via apply_geo_lookup before they can be joined to feature parquets. Skipping this step produces unmapped NaN rows.
- Names containing punctuation (apostrophes in d'iberville, hyphens in winston-salem) are preserved by _normalise_name: the regex allows [a-z .'\-] after the first character. Callers that strip punctuation before calling make_geo_identifier will produce non-canonical identifiers (diberville rather than d'iberville). Such a name still matches the three formats accepted by GEO_ID_PATTERN, so pattern validation in verify_geo_identifier may not reject it and the mismatch can instead surface as a failed join.
- assert_valid_geo_identifiers (identifiers.py:138) raises ValueError if any value in a column is invalid, listing up to five examples. It should be called after any step that constructs or normalises identifiers to surface bugs early.
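A minimal sketch of the fail-fast check that assert_valid_geo_identifiers is described as performing. The pattern is reconstructed from the three accepted formats, the error message wording is invented, and the function name carries a _sketch suffix because the real signature is not shown on this page.

```python
import re

# Assumed reconstruction of the accepted formats, not the real GEO_ID_PATTERN.
_NAME = r"[a-z][a-z .'\-]*"
_PATTERN = re.compile(rf"ADM0:[A-Z]{{3}}(?:/ADM1:{_NAME}(?:/ADM2:{_NAME})?)?")


def assert_valid_geo_identifiers_sketch(values, max_examples: int = 5) -> None:
    """Raise ValueError naming up to max_examples invalid values."""
    invalid = [
        v for v in values
        if not isinstance(v, str) or _PATTERN.fullmatch(v) is None
    ]
    if invalid:
        examples = ", ".join(repr(v) for v in invalid[:max_examples])
        raise ValueError(
            f"{len(invalid)} invalid geo_identifier value(s), e.g. {examples}"
        )


# Valid identifiers pass silently.
assert_valid_geo_identifiers_sketch(["ADM0:USA/ADM1:iowa/ADM2:johnson"])

try:
    assert_valid_geo_identifiers_sketch(["ADM0:usa/ADM1:Iowa", "19103"])
except ValueError as exc:
    print(exc)
```

Capping the number of examples keeps the message readable when an entire column is wrong, for instance when a FIPS column is passed by mistake.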
Related entities and concepts
- DESIGN.md Clause 17: the canonical ADM format requirement
- DESIGN.md Clause 22: single canonical implementation per concept
- lib.md: identifiers.py and aggregation.py survey
See Also
- DeliveryRow: the client-facing schema that carries geo_identifier at each ADM level in delivery CSVs
Open questions
- GEO_ID_PATTERN allows hyphens and apostrophes in name segments, but no test explicitly checks that a round trip through normalise_geo_identifier → verify_geo_identifier preserves all allowed characters.
- There is no ADM3 (sub-county) level; adding one would require changes to GEO_ID_PATTERN, make_geo_identifier, normalise_geo_identifier, and all delivery aggregation logic.