Source: features/README.md — Feature Assembly Orchestrator

What it is

A concise (10-line) description of the two-function orchestration layer for feature assembly: build_features (the orchestrator) and assemble (the finaliser). Together they define the contract that all builder outputs must satisfy and describe the two canonical feature parquets (fit.parquet and pred.parquet) consumed by every downstream stage.

Section-by-section summary

build_features

The orchestrator runs each builder sequentially, saves intermediate parquets, then calls assemble. It is the top-level entry point invoked by cli run features.
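
The control flow described above can be sketched roughly as follows. This is an illustration only, not the project's code: the function name build_features_sketch, its signature, the injected save and assemble callables, and the "<name>.parquet" naming scheme are all assumptions made for the example.

```python
from pathlib import Path
from typing import Any, Callable, Dict, List


def build_features_sketch(
    builders: Dict[str, Callable[[], Any]],
    out_dir: Path,
    save: Callable[[Any, Path], None],
    assemble: Callable[[List[Path]], Any],
) -> Any:
    """Hypothetical orchestrator: run builders in order, persist each, then assemble."""
    intermediate_paths: List[Path] = []
    for name, builder in builders.items():  # dicts preserve insertion order
        frame = builder()                    # run one builder at a time
        path = out_dir / f"{name}.parquet"   # assumed naming scheme
        save(frame, path)                    # persist the intermediate output
        intermediate_paths.append(path)
    # Hand the intermediate files to the finaliser.
    return assemble(intermediate_paths)
```

Passing save and assemble in as callables keeps the sketch independent of any particular parquet library; the real entry point may well hard-wire both.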

assemble — the finaliser

Four steps:

  1. Merge: reads each builder parquet and inner-joins them on the index columns (geo_identifier, year, init_date).
  2. Fit split: keeps only the rows whose init_date equals the configured harvest date for that year; these end-of-harvest rows become the fit set.
  3. Write: writes two unified parquets, fit.parquet and pred.parquet, each containing index columns + feature columns + target column. The README states: "Identity columns are always included to guarantee row alignment by construction."
  4. Metadata: writes metadata.json with column lists (index_cols, feature_cols, target_col), row counts, and year range.
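
The four steps can be sketched with pandas on in-memory frames. This is a minimal sketch under stated assumptions, not the actual implementation: the function name assemble_sketch, the harvest_dates mapping (year to date string), the metadata keys for row counts and year range, and the toy columns rain, ndvi and yield are all hypothetical. It also assumes that pred keeps every init date, as the "all init dates" description suggests, and it returns the results instead of writing files.

```python
from functools import reduce
from typing import Dict, List, Tuple

import pandas as pd

INDEX_COLS = ["geo_identifier", "year", "init_date"]


def assemble_sketch(
    frames: List[pd.DataFrame],
    harvest_dates: Dict[int, str],  # hypothetical config: year -> harvest date
    target_col: str,
) -> Tuple[pd.DataFrame, pd.DataFrame, dict]:
    # 1. Merge: inner-join all builder outputs on the index columns.
    merged = reduce(
        lambda left, right: left.merge(right, on=INDEX_COLS, how="inner"), frames
    )
    # 2. Fit split: keep rows whose init_date is that year's harvest date.
    harvest = merged["year"].map(harvest_dates)
    fit = merged[merged["init_date"] == harvest].reset_index(drop=True)
    pred = merged.reset_index(drop=True)  # assumption: pred keeps all init dates
    # 3. The real step would write fit.parquet and pred.parquet here.
    # 4. Metadata: column lists, row counts, year range (key names assumed).
    feature_cols = [c for c in merged.columns if c not in INDEX_COLS + [target_col]]
    metadata = {
        "index_cols": INDEX_COLS,
        "feature_cols": feature_cols,
        "target_col": target_col,
        "n_rows_fit": len(fit),
        "n_rows_pred": len(pred),
        "year_min": int(merged["year"].min()),
        "year_max": int(merged["year"].max()),
    }
    return fit, pred, metadata


# Toy builder outputs: geo "B" exists only in the first frame.
weather = pd.DataFrame({
    "geo_identifier": ["A", "A", "B"],
    "year": [2020, 2020, 2020],
    "init_date": ["2020-06-01", "2020-09-30", "2020-09-30"],
    "rain": [10.0, 25.0, 30.0],
})
crop = pd.DataFrame({
    "geo_identifier": ["A", "A"],
    "year": [2020, 2020],
    "init_date": ["2020-06-01", "2020-09-30"],
    "ndvi": [0.4, 0.7],
    "yield": [3.1, 3.1],
})
fit, pred, metadata = assemble_sketch([weather, crop], {2020: "2020-09-30"}, "yield")
```

In this toy run geo "B" disappears from both outputs because the second frame has no matching row, which is the inner-join behaviour the merge step describes.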

fit vs pred semantics

fit = end-of-harvest features (one row per geo per year). pred = in-season features used for walk-forward point-in-time estimates (all init dates).

The harvest-date split is the sole mechanism that partitions the two parquets — there is no separate "train/test split" at the feature level; that split happens downstream at the fold level.

Notable claims (the load-bearing ones)

  • Index columns (geo_identifier, year, init_date) are the canonical join keys. They are always included in both fit.parquet and pred.parquet and are never split out into a separate file.
  • metadata.json is the machine-readable contract: downstream consumers use it to slice index_cols, feature_cols, target_col from the unified parquet. This avoids hardcoding column names in stage code.
  • The inner join on index columns at the merge step means that a geo/year/init_date present in one builder but absent in another is silently dropped — the intersection defines the modelled universe for that run.
  • fit.parquet is the only input to the FIT stage; pred.parquet is the input to the PREDICT walk-forward loop. The two files are never mixed.
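
To illustrate the metadata.json contract, here is a hedged sketch of how a downstream consumer might slice the unified parquet without hardcoding column names. The helper split_columns, the inline JSON string, and the toy columns rain, ndvi and yield are assumptions for the example; a real stage would presumably load the files with json.load and pandas.read_parquet instead of using in-memory stand-ins.

```python
import json

import pandas as pd

# Stand-in for the contents of metadata.json (only the column lists are shown).
metadata_text = """
{
  "index_cols": ["geo_identifier", "year", "init_date"],
  "feature_cols": ["rain", "ndvi"],
  "target_col": "yield"
}
"""
metadata = json.loads(metadata_text)

# Stand-in for the frame a stage would read from fit.parquet.
unified = pd.DataFrame({
    "geo_identifier": ["A"],
    "year": [2020],
    "init_date": ["2020-09-30"],
    "rain": [25.0],
    "ndvi": [0.7],
    "yield": [3.1],
})


def split_columns(frame: pd.DataFrame, meta: dict):
    """Hypothetical helper: slice index, features and target using the contract."""
    index = frame[meta["index_cols"]]
    features = frame[meta["feature_cols"]]
    target = frame[meta["target_col"]]
    return index, features, target


index, X, y = split_columns(unified, metadata)
```

Because the index columns travel in the same frame as the features and target, the three slices stay row-aligned without any extra join.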

What this document is NOT

This README does not describe builder internals or the builder protocol — that is features/builders/README.md. It does not specify the schema of any column beyond the index keys.

Cross-references