Entity: FeatureBuilderConfig

Definition

FeatureBuilderConfig is the collective name for the five concrete Pydantic models that configure how each data source is loaded into a feature parquet keyed by (year, geo_identifier, init_date). They share a common base (BaseBuilderConfig) and are registered in CommodityConfig.builders: dict[str, Builder]. The Builder type alias is a Pydantic discriminated union that dispatches on the type field.

Each builder config corresponds to a module-level function in features/builders/ that satisfies the BuilderFn protocol.

Kind

BaseBuilderConfig — Pydantic BaseModel, frozen=True. Concrete subclasses add a type: Literal[...] discriminator.

Builder union declared at market_insights_models/src/commodity_hindcast/config.py:250:

Builder = Annotated[
    YieldsBuilder | StressBuilder | ClimoBuilder | WeatherBuilder | NDVIBuilder,
    Field(discriminator="type"),
]
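
To illustrate what dispatching on the type field means, the following stdlib-only sketch mimics the behaviour with frozen dataclasses and a lookup table. It is not the project's code: the real models are Pydantic and carry more fields, and every name ending in Sketch, the _BY_TYPE table, and parse_builder are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)  # mirrors frozen=True on the real base model
class BaseBuilderSketch:
    filepath: str
    geo_id_col: str = "geo_id"
    required_for_pred_parquet: bool = False


@dataclass(frozen=True)
class WeatherBuilderSketch(BaseBuilderSketch):
    time_dim: str = "time"


@dataclass(frozen=True)
class StressBuilderSketch(BaseBuilderSketch):
    lag_years: Optional[int] = None


# Hypothetical dispatch table: the "type" value selects the concrete class.
_BY_TYPE = {"weather": WeatherBuilderSketch, "stress": StressBuilderSketch}


def parse_builder(payload: dict) -> BaseBuilderSketch:
    """Pick the concrete class from payload['type'] and build it."""
    fields = dict(payload)  # copy so the caller's dict is not mutated
    kind = fields.pop("type", None)
    if kind not in _BY_TYPE:
        raise ValueError(f"unknown builder type: {kind!r}")
    return _BY_TYPE[kind](**fields)
```

A payload with type "weather" yields a WeatherBuilderSketch, and an unknown type is rejected up front, which is the same guarantee the discriminated union gives at config load.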

Source of truth

  • BaseBuilderConfig: config.py:168
  • YieldsBuilder: config.py:190
  • StressBuilder: config.py:216
  • WeatherBuilder: config.py:226
  • ClimoBuilder: config.py:233
  • NDVIBuilder: config.py:242
  • Builder union: config.py:250

Base attributes (BaseBuilderConfig)

| Field | Type | Default | Meaning | YAML example |
|---|---|---|---|---|
| filepath | ResolvablePath | required | Path to the source file/zarr. Relative paths are resolved against data_root at ExperimentConfig load time | data/nass/preprocessed_corn.parquet |
| geo_id_col | str | "geo_id" | Column name in the source that carries the raw geo identifier before normalisation | geo_id |
| required_for_pred_parquet | bool | False | When True, the builder is inner-joined to define the (year, geo, init_date) universe of pred.parquet, so its feature columns are structurally non-NaN. When False, it is left-joined: coverage gaps produce NaN and a WARNING log. Weather/stress/climo should be true; yields should be false so that pred.parquet can extend beyond the last labelled year | true |
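
The inner-join versus left-join behaviour of required_for_pred_parquet can be sketched with plain dictionaries. This is a minimal illustration under stated assumptions, not the pipeline's implementation: join_builders, the logger name, and the table representation (a dict from key tuple to column values) are all hypothetical.

```python
import logging
import math

logger = logging.getLogger("builder_join_sketch")  # hypothetical logger name


def join_builders(required, optional):
    """Combine per-builder tables keyed by (year, geo_identifier, init_date).

    required / optional: lists of dicts mapping a key tuple to {column: value}.
    """
    if not required:
        raise ValueError("at least one required builder must define the universe")

    # Inner join: only keys present in every required builder survive.
    universe = set(required[0])
    for table in required[1:]:
        universe &= set(table)

    rows = {}
    for key in sorted(universe):
        row = {}
        for table in required:
            row.update(table[key])
        # Left join: optional builders never drop keys; gaps become NaN.
        for table in optional:
            columns = sorted({col for values in table.values() for col in values})
            if key not in table:
                logger.warning("no coverage for %s in optional builder", key)
            for col in columns:
                row[col] = table.get(key, {}).get(col, math.nan)
        rows[key] = row
    return rows
```

With weather as the only required builder and yields as optional, a year that has weather but no survey yield still appears in the output, with NaN in the yield columns.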

Builder variants

YieldsBuilder (type: "yields")

Reads the preprocessed NASS (US) or IBGE-PAM (Brazil) yields parquet and applies row-level Fellegi-Holt edit rules before the feature pivot.

Source: config.py:190

| Field | Type | Default | Meaning | YAML example |
|---|---|---|---|---|
| type | Literal["yields"] | | Discriminator (auto-injected from dict key) | |
| county_col | str | required | Column carrying the county identifier | county_ansi |
| state_col | str | required | Column carrying the state identifier | state_alpha |
| production_col | str \| None | None | Production column; if set, enables deductive yield imputation from production/area | production_bu |
| crop_type | str \| None | None | Filter rows to this crop type string (e.g. "CORN") | CORN |
| edits | list[EditRuleConfig] | [] | Ordered Fellegi-Holt edit rules applied before the yield pivot | see EditRuleConfig entity |

YieldsBuilder keeps the base default required_for_pred_parquet = False: the survey series is the target variable and must not constrain the prediction universe.
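
The deductive imputation enabled by production_col amounts to recovering a missing yield as production divided by area. The sketch below shows that arithmetic only; impute_yield, the area_harvested and yield column names, and the row-as-dict representation are assumptions made for illustration, and the source does not say how the project names the area column.

```python
def impute_yield(row, production_col="production_bu",
                 area_col="area_harvested", yield_col="yield"):
    """Fill a missing yield from production / area; leave the row alone otherwise."""
    if row.get(yield_col) is not None:
        return row  # reported yield wins; nothing to deduce
    production = row.get(production_col)
    area = row.get(area_col)
    if production is None or not area:
        return row  # cannot deduce without both inputs (or with zero area)
    return {**row, yield_col: production / area}
```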

WeatherBuilder (type: "weather")

Reads a weather-indices zarr (e.g. QUBE CONUS ADM2 corn indices). Produces windowed accumulations over CommodityConfig.weather_windows.

Source: config.py:226

| Field | Type | Default | Meaning | YAML example |
|---|---|---|---|---|
| type | Literal["weather"] | | Discriminator | |
| time_dim | str | "time" | Name of the time dimension in the zarr | time |

filepath typically points to an S3 zarr whose URI contains an {env} template (e.g. s3://{env}-treefera-greenprint-data/weather/processed/indices/conus_adm2_corn.zarr). Set required_for_pred_parquet: true in YAML, because weather coverage defines the prediction universe and the base default is False.
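
As a rough picture of what the {env} placeholder does, the sketch below substitutes an environment name into the URI. The expand_env helper and the "dev" environment name are hypothetical; the source does not describe where or how the project performs this substitution.

```python
def expand_env(template: str, env: str) -> str:
    """Replace the {env} placeholder in a path template (hypothetical helper)."""
    return template.replace("{env}", env)


WEATHER_TEMPLATE = (
    "s3://{env}-treefera-greenprint-data/weather/processed/indices/"
    "conus_adm2_corn.zarr"
)
```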

ClimoBuilder (type: "climo")

Reads a climatology-indices zarr. Produces z-score and climo-window features over CommodityConfig.climo_windows and CommodityConfig.climo_zscore_vars. Optionally remaps geo identifiers via a lookup table.

Source: config.py:233

| Field | Type | Default | Meaning | YAML example |
|---|---|---|---|---|
| type | Literal["climo"] | | Discriminator | |
| geo_lookup_path | ResolvablePath \| None | None | Optional lookup parquet to remap source geo IDs to canonical GeoIdentifiers | None |
| geo_lookup_keycol | str \| None | None | Source column in the lookup table | None |
| geo_lookup_valcol | str \| None | None | Target column in the lookup table | None |

For the climo builder, Brazil soy sets geo_id_col: "identifier" rather than the US-standard "geoid"; the Brazil config source page documents this difference.
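
The geo_lookup_* fields describe a key-to-value remap of geo identifiers. A minimal sketch of that remap follows, assuming rows and the lookup table are lists of dicts; remap_geo_ids and the choice to return unmatched IDs separately are illustrative, not the project's behaviour.

```python
def remap_geo_ids(rows, lookup, geo_id_col, keycol, valcol):
    """Rewrite rows[geo_id_col] using the lookup's keycol -> valcol mapping.

    Returns (remapped_rows, unmatched_source_ids).
    """
    mapping = {entry[keycol]: entry[valcol] for entry in lookup}
    remapped, unmatched = [], []
    for row in rows:
        source_id = row[geo_id_col]
        if source_id in mapping:
            remapped.append({**row, geo_id_col: mapping[source_id]})
        else:
            unmatched.append(source_id)  # caller decides whether to warn or fail
    return remapped, unmatched
```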

NDVIBuilder (type: "ndvi")

Reads NDVI satellite data keyed by county + state FIPS codes.

Source: config.py:242

| Field | Type | Default | Meaning | YAML example |
|---|---|---|---|---|
| type | Literal["ndvi"] | | Discriminator | |
| county_col | str | required | County FIPS column in source | countyfp |
| statefp_col | str | required | State FIPS column in source | statefp |

StressBuilder (type: "stress")

Reads a composite stress-index parquet; optionally regenerates that parquet from a raw indices zarr via AssembleStressConfig. Used only for corn (US). Carries an optional column rename map and year/lag configuration.

Source: config.py:216

| Field | Type | Default | Meaning | YAML example |
|---|---|---|---|---|
| type | Literal["stress"] | | Discriminator | |
| assemble_stress_from_indices | AssembleStressConfig \| None | None | If set, regenerates the stress parquet from indices_zarr during the features stage; preflight skips the filepath existence check in this case | see below |
| rename_map | dict[str, str] | {} | Column renames applied after load (e.g. stress_score → stress_score_lag1) | {stress_score: stress_score_lag1} |
| year_col | str \| None | None | Override for the year column name in the stress source | None |
| lag_years | int \| None | None | Shift year values by this many years (implements the lag-1 carryover signal) | None |

AssembleStressConfig fields: indices_zarr (ResolvablePath), gs_start_doy, gs_end_doy, baseline_start, baseline_end (all int), overwrite: bool = False.
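
The combined effect of rename_map and lag_years can be sketched as a rename followed by a year shift. This is an illustration under assumptions: apply_rename_and_lag is hypothetical, rows are modelled as dicts, and the shift direction (adding lag_years so that last season's stress lines up with the following year) is a guess, since the source does not document the intended use.

```python
def apply_rename_and_lag(rows, rename_map, year_col="year", lag_years=None):
    """Rename columns, then optionally shift the year column by lag_years."""
    shifted = []
    for row in rows:
        # Columns absent from rename_map keep their original name.
        renamed = {rename_map.get(col, col): value for col, value in row.items()}
        if lag_years is not None:
            # Assumed direction: stress observed in year Y is attached to Y + lag.
            renamed[year_col] = renamed[year_col] + lag_years
        shifted.append(renamed)
    return shifted
```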

Discriminator auto-injection

CommodityConfig._inject_builder_type_from_key (config.py:354, a mode="before" validator) copies each builders dict key into the payload's type field when type is absent. YAML authors write:

builders:
  stress:
    filepath: data/stress/preprocessed_corn_stress.parquet

and the validator inserts type: "stress" automatically. Explicit type: in YAML takes precedence.
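
The injection rule can be expressed as a small pure function over the raw builders mapping. The sketch below shows the logic only; the real validator lives on CommodityConfig and its exact signature is not given in the source, so inject_builder_type_from_key here is a hypothetical stand-in.

```python
def inject_builder_type_from_key(builders: dict) -> dict:
    """Copy each dict key into payload['type'] when the payload has no type."""
    injected = {}
    for key, payload in builders.items():
        if isinstance(payload, dict) and "type" not in payload:
            payload = {**payload, "type": key}  # new dict; input stays untouched
        injected[key] = payload  # an explicit type is left as written
    return injected
```

Running it before model validation means the discriminated union always sees a type field, whether or not the YAML author wrote one.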

Lifecycle

  1. Builders are defined in CommodityConfig.builders and validated at config load.
  2. ExperimentConfig._resolve_data_paths resolves all filepath (ResolvablePath) fields against data_root.
  3. preflight_paths_for_features skips the existence check for the stress builder's filepath when assemble_stress_from_indices is set, because the features stage regenerates that parquet.
  4. During the features stage, each builder's filepath is read by the corresponding BuilderFn function and the results are joined on (year, geo_identifier, init_date).
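
Steps 2 and 3 of the lifecycle can be sketched as two small helpers. Both are hypothetical simplifications: the source does not say how ResolvablePath treats absolute paths or URIs, so leaving them unchanged is an assumption, and builders are modelled as plain dicts rather than the Pydantic models.

```python
from pathlib import PurePosixPath


def resolve_against_root(filepath: str, data_root: str) -> str:
    """Resolve a relative path against data_root; leave URIs and absolute paths alone."""
    if "://" in filepath or PurePosixPath(filepath).is_absolute():
        return filepath  # assumption: s3://... and absolute paths pass through
    return str(PurePosixPath(data_root) / filepath)


def paths_to_preflight(builders: dict) -> list:
    """List the filepaths whose existence should be checked before the features stage."""
    return [
        payload["filepath"]
        for payload in builders.values()
        # A stress parquet that will be regenerated need not exist yet.
        if not payload.get("assemble_stress_from_indices")
    ]
```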

Relationships

  • Owned by CommodityConfig.builders (1:N from CommodityConfig).
  • Drives the BuilderFn protocol implementations (features/builders/{yields,weather,climo,ndvi,stress}.py).
  • YieldsBuilder contains list[EditRuleConfig] (Fellegi-Holt edit rules applied before the pivot).
  • StressBuilder contains optional AssembleStressConfig.

Concepts and pipelines that touch this entity

PRs and commits

  • PR #345 (PR-345.md) — ClimoBuilder.geo_lookup_* fields added for Brazil geo-ID remapping.
  • PR #353 — Brazil soybean config introduced the identifier vs geoid geo_id_col distinction for the climo builder.

Open questions

  • NDVIBuilder is declared but no active commodity YAML uses it; the US wheat and cotton configs do not include an NDVI builder. It may be a forward stub.
  • StressBuilder.lag_years and year_col are present but None in all production configs; the intended use case is not documented in code comments.