FeatureBuilder

Definition

FeatureBuilder is the conceptual role label for any callable that loads one raw data source and returns a validated feature DataFrame. The actual contract is expressed as the BuilderFn Protocol defined at features/builders/interface.py:25. There is no ABC — the protocol is structural, not nominal. Any callable whose signature matches (path, cfg, years) → pd.DataFrame satisfies it.

Kind

Protocol (@runtime_checkable structural typing). Not an ABC. Implementations are plain module-level functions, not classes.

Source of truth

market_insights_models/src/commodity_hindcast/features/builders/interface.py:25 (BuilderFn Protocol definition).

Required interface

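@runtime_checkable  # interface.py:24; enables isinstance(fn, BuilderFn)
class BuilderFn(Protocol):
    def __call__(
        self,
        path: Path | CloudPath,   # location of the raw data source
        cfg: ExperimentConfig,    # root experiment config
        years: range,             # season years to build features for
        /,                        # all three arguments are positional-only
    ) -> pd.DataFrame: ...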

The returned DataFrame must:

  • Contain all three INDEX_COLS = ("year", "geo_identifier", "init_date") (interface.py:21).
  • Have no duplicate column names.
  • Have no duplicate (year, geo_identifier, init_date) key tuples.
  • Contain only valid geo_identifier values (validated by assert_valid_geo_identifiers).

These invariants are enforced by validate_builder_output (interface.py:34) before the frame is handed to assemble.
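The first three invariants can be sketched as a small pandas check. This is a hypothetical re-implementation for illustration, not the code in interface.py: the function name check_builder_output and the sample column gdd are assumptions, and the geo_identifier validation is omitted because its rules are not described here.

```python
import pandas as pd

INDEX_COLS = ("year", "geo_identifier", "init_date")


def check_builder_output(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for validate_builder_output (geo check omitted)."""
    missing = [c for c in INDEX_COLS if c not in df.columns]
    if missing:
        raise ValueError(f"missing index columns: {missing}")
    # Check column names first: duplicated names would make the key check ambiguous.
    if df.columns.duplicated().any():
        raise ValueError("duplicate column names")
    if df.duplicated(subset=list(INDEX_COLS)).any():
        raise ValueError("duplicate (year, geo_identifier, init_date) keys")
    return df


good = pd.DataFrame(
    {
        "year": [2020, 2021],
        "geo_identifier": ["19001", "19001"],
        "init_date": ["2020-06-01", "2021-06-01"],
        "gdd": [900.0, 950.0],  # illustrative feature column
    }
)
# Appending a copy of the first row creates a duplicate key tuple.
bad = pd.concat([good, good.iloc[[0]]], ignore_index=True)

validated = check_builder_output(good)
```

Calling check_builder_output(bad) raises ValueError because the first key tuple appears twice.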

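The protocol is @runtime_checkable (interface.py:24), so isinstance(fn, BuilderFn) works at dispatch time and in tests. The runtime check only confirms that the object is callable; conformance to the (path, cfg, years) signature is left to static type checkers.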
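A minimal sketch of how the structural check behaves, using a simplified stand-in protocol. SimpleBuilderFn, build_demo, and wrong_arity are hypothetical names, and the stand-in returns a list of dicts rather than a DataFrame to stay dependency-free.

```python
from pathlib import Path
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class SimpleBuilderFn(Protocol):
    """Hypothetical stand-in for BuilderFn with simplified types."""

    def __call__(self, path: Path, cfg: Any, years: range, /) -> list[dict]: ...


def build_demo(path: Path, cfg: Any, years: range, /) -> list[dict]:
    """A plain module-level function; it never inherits from the protocol."""
    return [
        {"year": y, "geo_identifier": "19001", "init_date": f"{y}-06-01"}
        for y in years
    ]


def wrong_arity(x: int) -> int:
    """Callable, but its signature does not match the protocol."""
    return x


is_builder = isinstance(build_demo, SimpleBuilderFn)
# isinstance only looks for a __call__ member, so a mismatched signature also passes.
also_passes = isinstance(wrong_arity, SimpleBuilderFn)
not_callable = isinstance("not a function", SimpleBuilderFn)
```

Here is_builder and also_passes are both True while not_callable is False, which is why signature mismatches are better caught by a static type checker than by isinstance.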

Dispatch model

Builders are not dispatched via a discriminated union at runtime. The registry is a plain dict[str, BuilderFn] in features/builders/registry.py:

BUILDER_REGISTRY: dict[str, BuilderFn] = {
    "yields":  build_yields,
    "stress":  build_stress,
    "weather": build_weather,
    "climo":   build_climo,
    "ndvi":    build_ndvi,
}

run_builder(name, cfg, years) looks up the function by the YAML builder-block key, resolves cfg.commodity.builders[name].filepath, calls the function, then calls validate_builder_output on the result (registry.py:39–47).
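The lookup, path resolution, call, and validation sequence can be sketched as follows. Everything here is a simplified assumption: the dataclasses stand in for ExperimentConfig, build_stub stands in for a real builder, validate stands in for validate_builder_output, and rows are plain dicts instead of a DataFrame.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Callable


@dataclass
class BuilderBlock:
    """Hypothetical stand-in for one entry under commodity.builders."""
    filepath: Path


@dataclass
class Commodity:
    builders: dict[str, BuilderBlock] = field(default_factory=dict)


@dataclass
class Config:
    """Hypothetical stand-in for ExperimentConfig."""
    commodity: Commodity


def build_stub(path: Path, cfg: Config, years: range, /) -> list[dict]:
    """Illustrative builder: one row per requested year."""
    return [{"year": y, "source": path.name} for y in years]


REGISTRY: dict[str, Callable[[Path, Config, range], list[dict]]] = {
    "weather": build_stub,
}


def validate(rows: list[dict]) -> list[dict]:
    """Placeholder for validate_builder_output."""
    if not rows:
        raise ValueError("builder returned no rows")
    return rows


def run_builder_sketch(name: str, cfg: Config, years: range) -> list[dict]:
    fn = REGISTRY[name]  # raises KeyError for an unregistered name
    path = cfg.commodity.builders[name].filepath
    return validate(fn(path, cfg, years))


cfg = Config(Commodity({"weather": BuilderBlock(Path("data/weather.zarr"))}))
rows = run_builder_sketch("weather", cfg, range(2020, 2022))
```

A plain dict keeps dispatch explicit: an unknown builder name fails with KeyError at lookup, before any data is read.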

Adding a builder requires three steps: implement the BuilderFn protocol, add an entry to BUILDER_REGISTRY, and declare the builder under commodity.builders in the relevant YAML config.

Concrete implementations

| Builder function | Registry key | File | Aggregation | Source format | When to use |
|---|---|---|---|---|---|
| build_yields | yields | builders/yields.py:174 | expand_init_dates (no windowing) | NASS parquet | Lagged yield signals and area metadata; required for every commodity |
| build_weather | weather | builders/weather.py:83 | sum over SeasonWindows | Daily zarr (GDD, EDD, precip) | Current-season cumulative weather; primary signal |
| build_climo | climo | builders/climo.py:60 | mean over SeasonWindows | Annual z-score zarr | Climatology anomalies (z-scores relative to baseline) |
| build_ndvi | ndvi | builders/ndvi.py:147 | max (peak) + mean (cumulative) | Monthly county CSV | Vegetation greenness signal; lag_days=0 |
| build_stress | stress | builders/stress.py:87 | expand_init_dates with lag-1 year shift | Pre-computed stress parquet | Composite weather stress score from prior season |

The three windowed builders (weather, climo, ndvi) share the build_windowed_features engine in builders/core.py. Yields and stress use expand_init_dates (core.py:37) instead.
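The sum, mean, and max aggregations used by the windowed builders can be illustrated with a toy function over values keyed by season day of year. This is a sketch under assumptions, not the build_windowed_features implementation: aggregate_window is a hypothetical name, the window bounds are treated as inclusive, and the input is a plain dict rather than a zarr or CSV source.

```python
from statistics import mean

# Aggregation names mirror the table: sum (weather), mean (climo), max (ndvi peak).
AGGREGATORS = {"sum": sum, "mean": mean, "max": max}


def aggregate_window(
    daily_by_sdoy: dict[int, float],
    start_sdoy: int,
    end_sdoy: int,
    how: str,
) -> float:
    """Aggregate values whose season day of year falls in [start_sdoy, end_sdoy]."""
    values = [v for sdoy, v in daily_by_sdoy.items() if start_sdoy <= sdoy <= end_sdoy]
    if not values:
        raise ValueError("window contains no observations")
    return AGGREGATORS[how](values)


daily = {1: 2.0, 2: 4.0, 3: 6.0, 4: 8.0}
window_sum = aggregate_window(daily, 2, 4, "sum")    # 4 + 6 + 8
window_mean = aggregate_window(daily, 2, 4, "mean")  # (4 + 6 + 8) / 3
window_max = aggregate_window(daily, 2, 4, "max")
```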

Lifecycle

Instantiation: There is no instantiation — builders are module-level functions. BUILDER_REGISTRY is populated at module import time.

Invocation: build_features in features/run.py:81 iterates cfg.commodity.builders.keys() in declaration order and calls run_builder for each. Each builder's output is written to builders/<name>.parquet before assemble is called. Intermediate DataFrames are explicitly deleted after write to free memory (run.py:92).

Skip semantics: If builders/<name>.parquet already exists, the builder is skipped unless force=True (run.py:83–88).
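The skip rule can be sketched with pathlib alone. The function name builders_to_run and the directory layout under a temporary folder are assumptions for illustration; the real check lives in run.py.

```python
import tempfile
from pathlib import Path


def builders_to_run(out_dir: Path, names: list[str], force: bool = False) -> list[str]:
    """Return builder names, in declaration order, whose parquet must be (re)built."""
    pending = []
    for name in names:
        target = out_dir / "builders" / f"{name}.parquet"
        if target.exists() and not force:
            continue  # cached output is reused
        pending.append(name)
    return pending


with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp)
    (out_dir / "builders").mkdir()
    # Pretend the yields builder already ran in an earlier invocation.
    (out_dir / "builders" / "yields.parquet").touch()

    names = ["yields", "weather"]
    normal_run = builders_to_run(out_dir, names)
    forced_run = builders_to_run(out_dir, names, force=True)
```

With the cached yields output present, normal_run contains only "weather", while forced_run contains both names.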

Tear-down: None. Builders are stateless functions; no clean-up is needed.

Relationships

  • ExperimentConfig (config.py:573) — passed as cfg to every builder; provides commodity.builders[name].filepath, window definitions, and feature_cols.
  • FeatureBuilderConfig variants (YieldsBuilder, WeatherBuilder, ClimoBuilder, NDVIBuilder, StressBuilder) — typed config objects in config.py:187–213 that carry the per-builder filepath and parameters; the YAML type field maps directly to the registry key.
  • assemble (features/assemble.py) — consumes the list of builder parquet paths; applies inner/left-join logic driven by required_for_pred_parquet.
  • validate_builder_output (interface.py:34) — guards every builder's output before it reaches assemble.
  • INDEX_COLS (interface.py:21) — the merge contract: ("year", "geo_identifier", "init_date").
  • ExperimentConfig — root config object.
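One way to picture the join step is a chain of merges on INDEX_COLS. The sketch below assumes that a builder marked as required is inner-joined and an optional one is left-joined; that reading of required_for_pred_parquet, the function name assemble_sketch, and the sample feature columns are all assumptions rather than the behaviour of features/assemble.py.

```python
import pandas as pd

INDEX_COLS = ["year", "geo_identifier", "init_date"]


def assemble_sketch(frames: list[tuple[pd.DataFrame, bool]]) -> pd.DataFrame:
    """Merge (frame, required) pairs on INDEX_COLS; the first frame is the base."""
    (merged, _), *rest = frames
    for df, required in rest:
        # Assumed rule: required builders drop unmatched keys, optional ones keep them.
        merged = merged.merge(df, on=INDEX_COLS, how="inner" if required else "left")
    return merged


keys = pd.DataFrame(
    {
        "year": [2020, 2021],
        "geo_identifier": ["19001", "19001"],
        "init_date": ["2020-06-01", "2021-06-01"],
    }
)
yields = keys.assign(yield_lag1=[170.0, 175.0])
weather = keys.assign(gdd=[900.0, 950.0])
ndvi = keys.iloc[[0]].assign(ndvi_peak=[0.8])  # covers only the first key

assembled = assemble_sketch([(yields, True), (weather, True), (ndvi, False)])
strict = assemble_sketch([(yields, True), (ndvi, True)])
```

In this toy data, assembled keeps both keys with a missing ndvi_peak for the second, whereas strict keeps only the key that ndvi covers.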

Concepts and pipelines

  • pipeline_features — stage-by-stage walkthrough of the full feature build (page not yet written).
  • source_features — detailed module-by-module breakdown of the feature subsystem.
  • SeasonWindow — window definition consumed by build_windowed_features; controls the (start_sdoy, end_sdoy) range for each cumulative feature.
  • freeze_cap_sdoy — commodity-level config value that clamps the progressive window once the init date passes the crop's physiological freeze point (core.py:109–124).
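The interaction between a progressive window and freeze_cap_sdoy can be sketched as a clamp on the window's effective end. The Window dataclass, the effective_end function, and the rule that the window grows with the init date until it reaches either its own end or the cap are assumptions for illustration; core.py:109–124 holds the actual logic.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Window:
    """Hypothetical stand-in for SeasonWindow."""
    start_sdoy: int
    end_sdoy: int


def effective_end(
    window: Window,
    init_sdoy: int,
    freeze_cap_sdoy: Optional[int] = None,
) -> int:
    """Last season day of year included for a given init date."""
    # Progressive window: never look past the init date or the window's own end.
    end = min(window.end_sdoy, init_sdoy)
    # Assumed clamp: once the init date passes the freeze point, stop extending.
    if freeze_cap_sdoy is not None:
        end = min(end, freeze_cap_sdoy)
    return end


season = Window(start_sdoy=1, end_sdoy=300)
before_cap = effective_end(season, init_sdoy=120, freeze_cap_sdoy=250)
after_cap = effective_end(season, init_sdoy=280, freeze_cap_sdoy=250)
no_cap = effective_end(season, init_sdoy=280)
```

Under these assumptions before_cap is 120, after_cap is 250, and no_cap is 280.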

PRs and commits

  • PR #369 (feat(commodity_hindcast): forecast multiple season_years per init_date) — introduced synthesise_long_range_climo_for_unseen_years and synthesise_long_range_stress_for_unseen_years stubs that pre-populate builder parquets for future season years before build_features runs.

Open questions

  • build_weather has a TODO at core.py:75 flagging that the windowed-feature logic needs a manual sense-check for correctness at season boundaries.
  • build_yields has a TODO at yields.py:219 noting that area forecasts are not yet incorporated; the builder uses Y-1 area estimates throughout.
  • build_weather hard-codes corn daily index specs in forecast_weather.py:29–33; extending to other commodities requires updating _CORN_DAILY_INDICES.
  • The STATEFP_TO_STATE lookup in ndvi.py:32–33 is acknowledged as a workaround for dirty NDVI data and is flagged for removal.