FeatureBuilder¶
Definition¶
FeatureBuilder is the conceptual role label for any callable that loads one raw data source and returns a validated feature DataFrame. The actual contract is expressed as the BuilderFn Protocol defined at features/builders/interface.py:25. There is no ABC — the protocol is structural, not nominal. Any callable whose signature matches (path, cfg, years) → pd.DataFrame satisfies it.
Kind¶
Protocol (@runtime_checkable structural typing). Not an ABC. Implementations are plain module-level functions, not classes.
Source of truth¶
market_insights_models/src/commodity_hindcast/features/builders/interface.py:25 — BuilderFn Protocol definition.
Required interface¶
class BuilderFn(Protocol):
    def __call__(
        self,
        path: Path | CloudPath,
        cfg: ExperimentConfig,
        years: range,
        /,
    ) -> pd.DataFrame: ...
The returned DataFrame must:
- Contain all three INDEX_COLS = ("year", "geo_identifier", "init_date") (interface.py:21).
- Have no duplicate column names.
- Have no duplicate (year, geo_identifier, init_date) key tuples.
- Contain only valid geo_identifier values (validated by assert_valid_geo_identifiers).
These invariants are enforced by validate_builder_output (interface.py:34) before the frame is handed to assemble.
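The first three invariants above can be sketched with plain pandas. This is a minimal illustration, not the code at interface.py:34: the name validate_builder_output_sketch, the error messages, and the choice of ValueError are assumptions, and the geo_identifier check is omitted because the rules inside assert_valid_geo_identifiers are not described here.

```python
import pandas as pd

INDEX_COLS = ("year", "geo_identifier", "init_date")


def validate_builder_output_sketch(df: pd.DataFrame, name: str) -> pd.DataFrame:
    """Hypothetical re-statement of the documented invariants (geo check omitted)."""
    # 1. All three index columns must be present.
    missing = [col for col in INDEX_COLS if col not in df.columns]
    if missing:
        raise ValueError(f"{name}: missing index columns {missing}")

    # 2. Column names must be unique.
    dup_cols = df.columns[df.columns.duplicated()].tolist()
    if dup_cols:
        raise ValueError(f"{name}: duplicate column names {dup_cols}")

    # 3. Each (year, geo_identifier, init_date) key may appear only once.
    n_dup_keys = int(df.duplicated(subset=list(INDEX_COLS)).sum())
    if n_dup_keys:
        raise ValueError(f"{name}: {n_dup_keys} duplicate key rows")

    return df
```

Checking column-name uniqueness before key uniqueness keeps the third check well defined, since a duplicated index column would make the key subset ambiguous.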
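The protocol is @runtime_checkable, so isinstance(fn, BuilderFn) works at dispatch time and in tests (interface.py:24). Note that this runtime check only confirms that the object is callable (it has a __call__ attribute); it does not verify the parameter list or the return type, which are left to static type checkers.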
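As a rough illustration of structural typing with a runtime-checkable protocol, the sketch below redeclares a simplified BuilderFn and a hypothetical builder. The names build_example and wrong_signature, the feature column, and the use of Any for cfg are assumptions made for the example; the real protocol also accepts CloudPath and takes an ExperimentConfig.

```python
from pathlib import Path
from typing import Any, Protocol, runtime_checkable

import pandas as pd


@runtime_checkable
class BuilderFn(Protocol):
    # Simplified copy of the documented signature (CloudPath and ExperimentConfig omitted).
    def __call__(self, path: Path, cfg: Any, years: range, /) -> pd.DataFrame: ...


def build_example(path: Path, cfg: Any, years: range, /) -> pd.DataFrame:
    """Hypothetical builder: one row per year for a single county and init date."""
    return pd.DataFrame(
        {
            "year": list(years),
            "geo_identifier": "19001",
            "init_date": pd.Timestamp("2000-06-01"),
            "example_feature": 0.0,
        }
    )


def wrong_signature() -> None:
    """Callable, but its signature does not match the protocol."""


# isinstance only checks for the presence of __call__, not the signature.
is_builder = isinstance(build_example, BuilderFn)
also_passes = isinstance(wrong_signature, BuilderFn)
not_callable = isinstance("weather", BuilderFn)
```

Here is_builder and also_passes are both True while not_callable is False, which is why the runtime check is best treated as a guard against registering non-callables, with signature conformance left to a static checker such as mypy or pyright.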
Dispatch model¶
Builders are not dispatched via a discriminated union at runtime. The registry is a plain dict[str, BuilderFn] in features/builders/registry.py:
BUILDER_REGISTRY: dict[str, BuilderFn] = {
    "yields": build_yields,
    "stress": build_stress,
    "weather": build_weather,
    "climo": build_climo,
    "ndvi": build_ndvi,
}
run_builder(name, cfg, years) looks up the function by the YAML builder-block key, resolves cfg.commodity.builders[name].filepath, calls the function, then calls validate_builder_output on the result (registry.py:39–47).
Adding a builder requires three steps: implement the BuilderFn protocol, add an entry to BUILDER_REGISTRY, and declare the builder under commodity.builders in the relevant YAML config.
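The lookup-resolve-call flow and the registration step can be sketched as follows. Everything here is a stand-in: the dataclasses mimic only the cfg.commodity.builders[name].filepath path mentioned above, and build_soil, REGISTRY_SKETCH, and run_builder_sketch are hypothetical names, not part of the codebase.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable

import pandas as pd


# Hypothetical stand-ins for the real config classes.
@dataclass
class BuilderCfg:
    filepath: Path


@dataclass
class CommodityCfg:
    builders: dict[str, BuilderCfg]


@dataclass
class Cfg:
    commodity: CommodityCfg


Builder = Callable[[Path, Cfg, range], pd.DataFrame]


def build_soil(path: Path, cfg: Cfg, years: range, /) -> pd.DataFrame:
    """Step 1 (hypothetical): a new builder matching the protocol's signature."""
    return pd.DataFrame(
        {
            "year": list(years),
            "geo_identifier": "19001",
            "init_date": pd.Timestamp("2000-06-01"),
            "soil_moisture": 0.5,
        }
    )


# Step 2: register the function under the key used in the YAML builder block.
REGISTRY_SKETCH: dict[str, Builder] = {"soil": build_soil}


def run_builder_sketch(name: str, cfg: Cfg, years: range) -> pd.DataFrame:
    """Look up the builder, resolve its file path from the config, and call it."""
    if name not in REGISTRY_SKETCH:
        raise KeyError(f"unknown builder {name!r}; known: {sorted(REGISTRY_SKETCH)}")
    path = cfg.commodity.builders[name].filepath
    df = REGISTRY_SKETCH[name](path, cfg, years)
    # The real run_builder validates the frame at this point.
    return df


# Step 3: the YAML declaration, represented here as an in-memory config.
cfg = Cfg(CommodityCfg({"soil": BuilderCfg(Path("soil.parquet"))}))
soil_features = run_builder_sketch("soil", cfg, range(2001, 2004))
```

A plain dict keeps dispatch explicit: a missing registration fails with a KeyError at lookup rather than silently producing no features.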
Concrete implementations¶
| Builder function | Registry key | File | Aggregation | Source format | When to use |
|---|---|---|---|---|---|
| build_yields | yields | builders/yields.py:174 | expand_init_dates (no windowing) | NASS parquet | Lagged yield signals and area metadata; required for every commodity |
| build_weather | weather | builders/weather.py:83 | sum over SeasonWindows | Daily zarr (GDD, EDD, precip) | Current-season cumulative weather; primary signal |
| build_climo | climo | builders/climo.py:60 | mean over SeasonWindows | Annual z-score zarr | Climatology anomalies (z-scores relative to baseline) |
| build_ndvi | ndvi | builders/ndvi.py:147 | max (peak) + mean (cumulative) | Monthly county CSV | Vegetation greenness signal; lag_days=0 |
| build_stress | stress | builders/stress.py:87 | expand_init_dates with lag-1 year shift | Pre-computed stress parquet | Composite weather stress score from prior season |
The three windowed builders (weather, climo, ndvi) share the build_windowed_features engine in builders/core.py. Yields and stress do not window their inputs; they use expand_init_dates (core.py:37) instead.
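To make the "sum over SeasonWindows" and "mean over SeasonWindows" entries concrete, here is a minimal pandas sketch of aggregating daily values inside one window. It is not the build_windowed_features implementation: the function name aggregate_window, the long-format input with an sdoy column, and the output column naming are all assumptions.

```python
import pandas as pd


def aggregate_window(
    daily: pd.DataFrame,
    value_col: str,
    start_sdoy: int,
    end_sdoy: int,
    how: str,
) -> pd.DataFrame:
    """Aggregate value_col per (year, geo_identifier) over an inclusive sdoy window.

    Hypothetical helper; `how` is any pandas aggregation name such as
    "sum" (weather), "mean" (climo), or "max" (NDVI peak).
    """
    in_window = daily[daily["sdoy"].between(start_sdoy, end_sdoy)]
    out = in_window.groupby(["year", "geo_identifier"], as_index=False)[value_col].agg(how)
    return out.rename(columns={value_col: f"{value_col}_{how}"})


# Toy daily series for one county and one season year.
daily = pd.DataFrame(
    {
        "year": [2020] * 4,
        "geo_identifier": ["19001"] * 4,
        "sdoy": [1, 2, 3, 4],
        "gdd": [1.0, 2.0, 3.0, 4.0],
    }
)
gdd_sum = aggregate_window(daily, "gdd", start_sdoy=2, end_sdoy=3, how="sum")
gdd_mean = aggregate_window(daily, "gdd", start_sdoy=2, end_sdoy=3, how="mean")
```

With this toy input the window covers sdoy 2 and 3, so the sum is 5.0 and the mean is 2.5.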
Lifecycle¶
Instantiation: There is no instantiation — builders are module-level functions. BUILDER_REGISTRY is populated at module import time.
Invocation: build_features in features/run.py:81 iterates cfg.commodity.builders.keys() in declaration order and calls run_builder for each. Each builder's output is written to builders/<name>.parquet before assemble is called. Intermediate DataFrames are explicitly deleted after write to free memory (run.py:92).
Skip semantics: If builders/<name>.parquet already exists, the builder is skipped unless force=True (run.py:83–88).
Tear-down: None. Builders are stateless functions; no clean-up is needed.
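The invocation order and skip semantics described above can be sketched with the standard library alone. This is an assumption-laden outline, not the code in run.py: build_features_sketch and fake_builder are hypothetical names, and the stand-in writer produces placeholder bytes rather than a real parquet file.

```python
import tempfile
from pathlib import Path
from typing import Callable, Iterable


def build_features_sketch(
    builder_names: Iterable[str],
    out_dir: Path,
    run_one: Callable[[str, Path], None],
    force: bool = False,
) -> list[str]:
    """Run each builder unless its output exists; return the names that ran."""
    out_dir.mkdir(parents=True, exist_ok=True)
    ran: list[str] = []
    for name in builder_names:  # declaration order is preserved
        target = out_dir / f"{name}.parquet"
        if target.exists() and not force:
            continue  # cached output found, so the builder is skipped
        run_one(name, target)
        ran.append(name)
    return ran


def fake_builder(name: str, target: Path) -> None:
    """Hypothetical stand-in that writes placeholder bytes, not real parquet."""
    target.write_bytes(name.encode())


with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "builders"
    names = ["yields", "weather"]
    first_run = build_features_sketch(names, out, fake_builder)
    second_run = build_features_sketch(names, out, fake_builder)
    forced_run = build_features_sketch(names, out, fake_builder, force=True)
```

The first call runs both builders, the second runs none because both outputs already exist, and the forced call runs both again.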
Relationships¶
- ExperimentConfig (config.py:573): passed as cfg to every builder; provides commodity.builders[name].filepath, window definitions, and feature_cols.
- FeatureBuilderConfig variants (YieldsBuilder, WeatherBuilder, ClimoBuilder, NDVIBuilder, StressBuilder): typed config objects in config.py:187–213 that carry the per-builder filepath and parameters; the YAML type field maps directly to the registry key.
- assemble (features/assemble.py): consumes the list of builder parquet paths; applies inner/left-join logic driven by required_for_pred_parquet.
- validate_builder_output (interface.py:34): guards every builder's output before it reaches assemble.
- INDEX_COLS (interface.py:21): the merge contract, ("year", "geo_identifier", "init_date").
- ExperimentConfig: root config object.
Concepts and pipelines¶
- pipeline_features: stage-by-stage walkthrough of the full feature build (page not yet written).
- source_features: detailed module-by-module breakdown of the feature subsystem.
- SeasonWindow: window definition consumed by build_windowed_features; controls the (start_sdoy, end_sdoy) range for each cumulative feature.
- freeze_cap_sdoy: commodity-level config value that clamps the progressive window once the init date passes the crop's physiological freeze point (core.py:109–124).
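One way to read the interaction between a SeasonWindow and freeze_cap_sdoy is sketched below. This is an interpretation, not the logic at core.py:109–124: it assumes the progressive window ends at the earliest of the window's own end, the init date, and the freeze cap, and the names SeasonWindowSketch and effective_window are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SeasonWindowSketch:
    """Hypothetical stand-in carrying only the (start_sdoy, end_sdoy) range."""

    start_sdoy: int
    end_sdoy: int


def effective_window(
    window: SeasonWindowSketch,
    init_sdoy: int,
    freeze_cap_sdoy: Optional[int] = None,
) -> Optional[tuple[int, int]]:
    """Return the inclusive sdoy range observable at init_sdoy, or None if empty.

    Assumption: data is available up to the init date, and the freeze cap
    stops the window from growing once the init date has passed it.
    """
    end = min(window.end_sdoy, init_sdoy)
    if freeze_cap_sdoy is not None:
        end = min(end, freeze_cap_sdoy)
    if end < window.start_sdoy:
        return None  # the window has not opened yet at this init date
    return (window.start_sdoy, end)


window = SeasonWindowSketch(start_sdoy=100, end_sdoy=250)
mid_season = effective_window(window, init_sdoy=180)
after_freeze = effective_window(window, init_sdoy=300, freeze_cap_sdoy=220)
before_start = effective_window(window, init_sdoy=50)
```

Under these assumptions the mid-season window is (100, 180), the post-freeze window is held at (100, 220), and an init date before the window opens yields None.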
PRs and commits¶
- PR #369 (feat(commodity_hindcast): forecast multiple season_years per init_date): introduced synthesise_long_range_climo_for_unseen_years and synthesise_long_range_stress_for_unseen_years stubs that pre-populate builder parquets for future season years before build_features runs.
Open questions¶
- build_weather has a TODO at core.py:75 flagging that the windowed-feature logic needs a manual sense-check for correctness at season boundaries.
- build_yields has a TODO at yields.py:219 noting that area forecasts are not yet incorporated; the builder uses Y-1 area estimates throughout.
- build_weather hard-codes corn daily index specs in forecast_weather.py:29–33; extending to other commodities requires updating _CORN_DAILY_INDICES.
- The STATEFP_TO_STATE lookup in ndvi.py:32–33 is acknowledged as a workaround for dirty NDVI data and is flagged for removal.