Entity: FeatureBuilderConfig
Definition
FeatureBuilderConfig is the collective name for the five concrete Pydantic models that configure how each data source is loaded into a feature parquet keyed by (year, geo_identifier, init_date). They share a common base (BaseBuilderConfig) and are collected in the CommodityConfig.builders: dict[str, Builder] registry. The Builder type alias is a Pydantic discriminated union that dispatches on the type field.
Each builder config corresponds to a module-level function in features/builders/ that satisfies the BuilderFn protocol.
Kind
BaseBuilderConfig is a Pydantic BaseModel with frozen=True. Concrete subclasses add a type: Literal[...] discriminator.
The Builder union is declared at market_insights_models/src/commodity_hindcast/config.py:250:
```python
Builder = Annotated[
    YieldsBuilder | StressBuilder | ClimoBuilder | WeatherBuilder | NDVIBuilder,
    Field(discriminator="type"),
]
```
Source of truth
- `BaseBuilderConfig`: `config.py:168`
- `YieldsBuilder`: `config.py:190`
- `StressBuilder`: `config.py:216`
- `WeatherBuilder`: `config.py:226`
- `ClimoBuilder`: `config.py:233`
- `NDVIBuilder`: `config.py:242`
- `Builder` union: `config.py:250`
Base attributes (BaseBuilderConfig)
| Field | Type | Default | Meaning | YAML example |
|---|---|---|---|---|
| `filepath` | `ResolvablePath` | required | Path to the source file/zarr. Relative paths are resolved against `data_root` at ExperimentConfig load time | `data/nass/preprocessed_corn.parquet` |
| `geo_id_col` | `str` | `"geo_id"` | Column name in the source that carries the raw geo identifier before normalisation | `geo_id` |
| `required_for_pred_parquet` | `bool` | `False` | When `True`, the builder is inner-joined to define the (year, geo, init_date) universe of pred.parquet; its feature columns are structurally non-NaN. When `False`, it is left-joined: coverage gaps produce NaN and a WARNING log. Weather/stress/climo should be true; yields should be false so pred.parquet can extend beyond the last labelled year | `true` |
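The join semantics of `required_for_pred_parquet` can be illustrated with a small, dependency-free sketch. It is not the pipeline's implementation: `join_builders`, the dict-of-rows representation, the sample column names, and the use of `None` in place of NaN are all assumptions made for illustration.

```python
import logging
from typing import Any

logger = logging.getLogger(__name__)

Key = tuple[int, str, str]  # (year, geo_identifier, init_date)
Rows = dict[Key, dict[str, Any]]


def join_builders(frames: dict[str, Rows], required: dict[str, bool]) -> Rows:
    """Hypothetical helper: inner-join required builders, left-join the rest."""
    required_names = [name for name in frames if required.get(name, False)]
    if not required_names:
        raise ValueError("at least one builder must be required to define the universe")

    # Inner join: the key universe is the intersection of all required builders.
    universe = set.intersection(*(set(frames[name]) for name in required_names))

    # Collect each builder's feature columns once, so gaps can be filled per column.
    columns = {
        name: sorted({col for row in rows.values() for col in row})
        for name, rows in frames.items()
    }

    joined: Rows = {}
    for key in sorted(universe):
        merged: dict[str, Any] = {}
        for name, rows in frames.items():
            if key not in rows:
                # Only optional builders can miss a key inside the universe.
                logger.warning("builder %r has no row for %r", name, key)
            source = rows.get(key, {})
            for col in columns[name]:
                merged[col] = source.get(col)  # None stands in for NaN
        joined[key] = merged
    return joined


weather = {
    (2023, "19001", "06-01"): {"gdd_window": 812.0},
    (2024, "19001", "06-01"): {"gdd_window": 790.5},
}
yields = {(2023, "19001", "06-01"): {"yield": 201.3}}

joined = join_builders(
    {"weather": weather, "yields": yields},
    {"weather": True, "yields": False},
)
```

Because the yields builder is optional here, the 2024 row survives with a missing label, which is the behaviour the table describes for extending pred.parquet beyond the last labelled year.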
Builder variants
YieldsBuilder (type: "yields")
Reads the preprocessed NASS (US) or IBGE-PAM (Brazil) yields parquet and applies row-level Fellegi-Holt edit rules before the feature pivot.
Source: config.py:190
| Field | Type | Default | Meaning | YAML example |
|---|---|---|---|---|
| `type` | `Literal["yields"]` | n/a | Discriminator (auto-injected from dict key) | n/a |
| `county_col` | `str` | required | Column carrying the county identifier | `county_ansi` |
| `state_col` | `str` | required | Column carrying the state identifier | `state_alpha` |
| `production_col` | `str \| None` | `None` | Production column; if set, enables deductive yield imputation from production/area | `production_bu` |
| `crop_type` | `str \| None` | `None` | Filter rows to this crop type string (e.g. "CORN") | `CORN` |
| `edits` | `list[EditRuleConfig]` | `[]` | Ordered Fellegi-Holt edit rules applied before the yield pivot | see EditRuleConfig entity |
required_for_pred_parquet defaults to False for yields (the survey series is the target variable and must not constrain the prediction universe).
WeatherBuilder (type: "weather")
Reads a weather-indices zarr (e.g. QUBE CONUS ADM2 corn indices). Produces windowed accumulations over CommodityConfig.weather_windows.
Source: config.py:226
| Field | Type | Default | Meaning | YAML example |
|---|---|---|---|---|
| `type` | `Literal["weather"]` | n/a | Discriminator | n/a |
| `time_dim` | `str` | `"time"` | Name of the time dimension in the zarr | `time` |
filepath typically points to an S3 zarr with an {env} template (e.g. s3://{env}-treefera-greenprint-data/weather/processed/indices/conus_adm2_corn.zarr).
required_for_pred_parquet is true for this builder, because weather coverage defines the prediction universe.
ClimoBuilder (type: "climo")
Reads a climatology-indices zarr. Produces z-score and climo-window features over CommodityConfig.climo_windows and CommodityConfig.climo_zscore_vars. Optionally remaps geo identifiers via a lookup table.
Source: config.py:233
| Field | Type | Default | Meaning | YAML example |
|---|---|---|---|---|
| `type` | `Literal["climo"]` | n/a | Discriminator | n/a |
| `geo_lookup_path` | `ResolvablePath \| None` | `None` | Optional lookup parquet to remap source geo IDs to canonical GeoIdentifiers | `None` |
| `geo_lookup_keycol` | `str \| None` | `None` | Source column in the lookup table | `None` |
| `geo_lookup_valcol` | `str \| None` | `None` | Target column in the lookup table | `None` |
Brazil soy uses geo_id_col: "identifier" (not the US-standard "geoid") — a key difference documented in the Brazil config source page.
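A rough sketch of how the `geo_lookup_*` fields could drive a remap is shown here. The function `remap_geo_ids`, the list-of-dicts row format, the sample column names `source_id` and `canonical_id`, and the choice to drop IDs that have no lookup entry are all assumptions; the source does not say how unmapped IDs are handled.

```python
from typing import Any


def remap_geo_ids(
    rows: list[dict[str, Any]],
    geo_id_col: str,
    lookup: list[dict[str, Any]],
    keycol: str,
    valcol: str,
) -> list[dict[str, Any]]:
    """Hypothetical helper: translate raw geo IDs through a key -> value lookup."""
    mapping = {entry[keycol]: entry[valcol] for entry in lookup}
    remapped = []
    for row in rows:
        raw_id = row[geo_id_col]
        if raw_id not in mapping:
            # Assumption: rows without a lookup entry are dropped.
            continue
        # Copy the row so the caller's data is left untouched.
        remapped.append({**row, geo_id_col: mapping[raw_id]})
    return remapped


lookup_table = [
    {"source_id": "A1", "canonical_id": "BR-5100102"},
    {"source_id": "B2", "canonical_id": "BR-5100201"},
]
climo_rows = [
    {"identifier": "A1", "precip_z": 0.4},
    {"identifier": "ZZ", "precip_z": -1.2},
]

remapped_rows = remap_geo_ids(
    climo_rows, "identifier", lookup_table, "source_id", "canonical_id"
)
```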
NDVIBuilder (type: "ndvi")
Reads NDVI satellite data keyed by county + state FIPS codes.
Source: config.py:242
| Field | Type | Default | Meaning | YAML example |
|---|---|---|---|---|
| `type` | `Literal["ndvi"]` | n/a | Discriminator | n/a |
| `county_col` | `str` | required | County FIPS column in source | `countyfp` |
| `statefp_col` | `str` | required | State FIPS column in source | `statefp` |
StressBuilder (type: "stress")
Reads a composite stress-index parquet; optionally regenerates that parquet from a raw indices zarr via AssembleStressConfig. Used only for corn (US). Carries an optional column rename map and year/lag configuration.
Source: config.py:216
| Field | Type | Default | Meaning | YAML example |
|---|---|---|---|---|
| `type` | `Literal["stress"]` | n/a | Discriminator | n/a |
| `assemble_stress_from_indices` | `AssembleStressConfig \| None` | `None` | If set, regenerates the stress parquet from `indices_zarr` during the features stage; preflight skips the `filepath` existence check in this case | see below |
| `rename_map` | `dict[str, str]` | `{}` | Column renames applied after load (e.g. stress_score → stress_score_lag1) | `{stress_score: stress_score_lag1}` |
| `year_col` | `str \| None` | `None` | Override for the year column name in the stress source | `None` |
| `lag_years` | `int \| None` | `None` | Shift year values by this many years (implements the lag-1 carryover signal) | `None` |
AssembleStressConfig fields: indices_zarr (ResolvablePath), gs_start_doy, gs_end_doy, baseline_start, baseline_end (all int), overwrite: bool = False.
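The combined effect of `rename_map` and `lag_years` might look like the sketch below. The intended use of `lag_years` is not documented, so the shift direction (stress observed in year Y becomes a feature for year Y + lag_years), the helper name `apply_stress_options`, and the row format are assumptions rather than the builder's actual behaviour.

```python
from typing import Any, Optional


def apply_stress_options(
    rows: list[dict[str, Any]],
    rename_map: dict[str, str],
    year_col: str = "year",
    lag_years: Optional[int] = None,
) -> list[dict[str, Any]]:
    """Hypothetical helper: shift the year column, then rename columns."""
    out = []
    for row in rows:
        shifted = dict(row)  # copy so the input rows are not mutated
        if lag_years is not None:
            # Assumption: a positive lag moves each observation forward in time,
            # so last season's stress lines up with this season's key.
            shifted[year_col] = shifted[year_col] + lag_years
        # Columns missing from rename_map keep their original names.
        out.append({rename_map.get(col, col): val for col, val in shifted.items()})
    return out


stress_rows = [{"year": 2022, "geo_id": "19001", "stress_score": 0.37}]
lagged_rows = apply_stress_options(
    stress_rows,
    rename_map={"stress_score": "stress_score_lag1"},
    lag_years=1,
)
```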
Discriminator auto-injection
CommodityConfig._inject_builder_type_from_key (config.py:354, mode=before) copies each builders dict key into the payload's type field when type is absent. YAML authors write a builders entry under a key such as stress without a type field,
and the validator inserts type: "stress" automatically. Explicit type: in YAML takes precedence.
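The injection rule can be sketched as a plain function over the raw builders mapping. This is a dependency-free approximation of what a before-mode validator would do, not the code at config.py:354; the function body, the sample keys, and the file names are assumptions.

```python
from typing import Any, Mapping


def inject_builder_type_from_key(
    builders: Mapping[str, Mapping[str, Any]],
) -> dict[str, dict[str, Any]]:
    """Copy each dict key into the payload's `type` field when it is absent."""
    injected: dict[str, dict[str, Any]] = {}
    for key, payload in builders.items():
        patched = dict(payload)  # copy so the raw config is not mutated
        # setdefault leaves an explicit `type` untouched, so it takes precedence.
        patched.setdefault("type", key)
        injected[key] = patched
    return injected


raw_builders = {
    "stress": {"filepath": "stress.parquet"},
    "stress_lagged": {"type": "stress", "filepath": "stress_prev.parquet"},
}
patched_builders = inject_builder_type_from_key(raw_builders)
```

After this step every payload carries a `type` value, which is what a discriminated union needs in order to pick the concrete model.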
Lifecycle
- Builders are defined in `CommodityConfig.builders` and validated at config load.
- `ExperimentConfig._resolve_data_paths` resolves all `filepath` (`ResolvablePath`) fields against `data_root`.
- `preflight_paths_for_features` skips stress `filepath` when `assemble_stress_from_indices` is set.
- During the features stage, each builder's `filepath` is read by the corresponding `BuilderFn` function and the results are joined on `(year, geo_identifier, init_date)`.
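The path-resolution and preflight steps above can be approximated as follows. The helper names `resolve_filepath` and `paths_to_check`, the plain-dict config shape, and the rule that URIs and absolute paths pass through unchanged are assumptions; the real `ResolvablePath` handling may differ.

```python
from pathlib import PurePosixPath
from typing import Any


def resolve_filepath(filepath: str, data_root: str) -> str:
    """Resolve a relative path against data_root."""
    # Assumption: URIs (e.g. s3://...) and absolute paths are left as written.
    if "://" in filepath or PurePosixPath(filepath).is_absolute():
        return filepath
    return str(PurePosixPath(data_root) / filepath)


def paths_to_check(builders: dict[str, dict[str, Any]], data_root: str) -> list[str]:
    """Return the resolved filepaths whose existence would be checked up front."""
    checked = []
    for cfg in builders.values():
        regenerated = cfg.get("assemble_stress_from_indices") is not None
        if cfg.get("type") == "stress" and regenerated:
            # The stress parquet is rebuilt during the features stage,
            # so it does not need to exist beforehand.
            continue
        checked.append(resolve_filepath(cfg["filepath"], data_root))
    return checked


example_builders = {
    "yields": {"type": "yields", "filepath": "data/nass/preprocessed_corn.parquet"},
    "stress": {
        "type": "stress",
        "filepath": "data/stress/composite.parquet",
        "assemble_stress_from_indices": {"indices_zarr": "indices.zarr"},
    },
}
required_paths = paths_to_check(example_builders, "/mnt/data_root")
```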
Relationships
- Owned by `CommodityConfig.builders` (1:N from `CommodityConfig`).
- Drives `FeatureBuilder` protocol implementations (`features/builders/{yields,weather,climo,ndvi,stress}.py`).
- `YieldsBuilder` contains `list[EditRuleConfig]` (Fellegi-Holt edit rules applied before the pivot).
- `StressBuilder` contains optional `AssembleStressConfig`.
Concepts and pipelines that touch this entity
- Pipeline: feature build. Builders are iterated; `required_for_pred_parquet` controls join semantics.
- Entity: ExperimentConfig. `_resolve_data_paths` resolves all `filepath` fields.
- Concept: ResolvablePath safety. Builder `filepath` fields are a primary source of `ResolvablePath` values in the config tree.
PRs and commits
- PR #345 (PR-345.md): `ClimoBuilder.geo_lookup_*` fields added for Brazil geo-ID remapping.
- PR #353: Brazil soybean config introduced the `identifier` vs `geoid` `geo_id_col` distinction for the climo builder.
Open questions
- `NDVIBuilder` is declared but no active commodity YAML uses it; the US wheat and cotton configs do not include an NDVI builder. It may be a forward stub.
- `StressBuilder.lag_years` and `year_col` are present but `None` in all production configs; the intended use case is not documented in code comments.