Source: features/README.md (Feature Assembly Orchestrator)
What it is
A concise (10-line) description of the two-function orchestration layer for feature assembly: build_features (the orchestrator) and assemble (the finaliser). Together they define the contract that all builder outputs must satisfy and describe the two canonical feature parquets (fit.parquet and pred.parquet) consumed by every downstream stage.
Section-by-section summary
build_features
The orchestrator runs each builder sequentially, saves intermediate parquets, then calls assemble. It is the top-level entry point invoked by cli run features.
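The control flow described above can be sketched in a few lines. This is only an illustration under stated assumptions: the builder signature, the `build_features_sketch` name, the toy `count_rows` finaliser, and the `weather`/`ndvi` builder names are all hypothetical, and JSON files stand in for the intermediate parquets so the example runs with the standard library alone.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Dict, List

Row = Dict[str, object]
Builder = Callable[[], List[Row]]  # hypothetical builder signature


def build_features_sketch(
    builders: Dict[str, Builder],
    out_dir: Path,
    assemble: Callable[[List[Path]], Dict[str, object]],
) -> Dict[str, object]:
    """Run builders in order, persist each output, then hand off to the finaliser."""
    intermediate_paths: List[Path] = []
    for name, builder in builders.items():  # dicts preserve insertion order
        rows = builder()
        path = out_dir / f"{name}.json"  # JSON stands in for an intermediate parquet
        path.write_text(json.dumps(rows))
        intermediate_paths.append(path)
    return assemble(intermediate_paths)


def count_rows(paths: List[Path]) -> Dict[str, object]:
    """Toy finaliser (assumption): only reports how many rows each builder saved."""
    return {p.stem: len(json.loads(p.read_text())) for p in paths}


with tempfile.TemporaryDirectory() as tmp:
    summary = build_features_sketch(
        {
            "weather": lambda: [
                {"geo_identifier": "A", "year": 2020,
                 "init_date": "2020-06-01", "rain": 1.2}
            ],
            "ndvi": lambda: [],
        },
        Path(tmp),
        count_rows,
    )
```

The point of the sketch is the ordering: every builder finishes and its output is on disk before the finaliser runs, so the finaliser only ever reads saved intermediates.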
assemble: the finaliser
Four steps:
- Merge: reads each builder parquet and inner-joins them on the index columns (`geo_identifier`, `year`, `init_date`).
- Fit split: filters rows where `init_date` matches the configured harvest date for that year (end-of-harvest features).
- Writes two unified parquets: `fit.parquet` and `pred.parquet`, each containing index columns + feature columns + target column. "Identity columns are always included to guarantee row alignment by construction."
- Metadata: writes `metadata.json` with column lists (`index_cols`, `feature_cols`, `target_col`), row counts, and year range.
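A minimal sketch of the merge, fit-split, and metadata steps, written in plain Python over lists of dicts so it runs without dependencies (a real implementation would more likely use a dataframe library). The function names, the `rain` and `yield` columns, the assumption that one builder table carries the target, and the metadata keys `n_fit`, `n_pred`, and `year_range` are hypothetical; only `index_cols`, `feature_cols`, and `target_col` come from the README summary. Writing the parquet files is omitted.

```python
import json
from typing import Dict, List, Optional, Tuple

INDEX_COLS = ("geo_identifier", "year", "init_date")
Row = Dict[str, object]
Key = Tuple[object, ...]


def key_of(row: Row) -> Key:
    return tuple(row[c] for c in INDEX_COLS)


def inner_join(tables: List[List[Row]]) -> List[Row]:
    """Keep only index keys present in every table and merge their columns."""
    indexed = [{key_of(r): r for r in table} for table in tables]
    shared = set(indexed[0])
    for table in indexed[1:]:
        shared &= set(table)
    merged: List[Row] = []
    for key in sorted(shared):
        row: Row = {}
        for table in indexed:
            row.update(table[key])
        merged.append(row)
    return merged


def assemble_sketch(
    tables: List[List[Row]], harvest_dates: Dict[int, str], target_col: str
) -> Tuple[List[Row], List[Row], Dict[str, object]]:
    pred = inner_join(tables)  # all init dates
    # Fit split: keep rows whose init_date is the harvest date for that year.
    fit = [r for r in pred if r["init_date"] == harvest_dates.get(r["year"])]
    feature_cols = (
        [c for c in pred[0] if c not in INDEX_COLS and c != target_col]
        if pred else []
    )
    years = [r["year"] for r in pred]
    year_range: Optional[List[object]] = [min(years), max(years)] if years else None
    metadata = {
        "index_cols": list(INDEX_COLS),
        "feature_cols": feature_cols,
        "target_col": target_col,
        "n_fit": len(fit),        # hypothetical key name
        "n_pred": len(pred),      # hypothetical key name
        "year_range": year_range, # hypothetical key name
    }
    return fit, pred, metadata


weather = [
    {"geo_identifier": "A", "year": 2020, "init_date": "2020-06-01", "rain": 1.0},
    {"geo_identifier": "A", "year": 2020, "init_date": "2020-09-30", "rain": 2.0},
    {"geo_identifier": "B", "year": 2020, "init_date": "2020-09-30", "rain": 3.0},
]
target = [
    {"geo_identifier": "A", "year": 2020, "init_date": "2020-06-01", "yield": 5.0},
    {"geo_identifier": "A", "year": 2020, "init_date": "2020-09-30", "yield": 5.0},
]
fit, pred, metadata = assemble_sketch([weather, target], {2020: "2020-09-30"}, "yield")
metadata_json = json.dumps(metadata, indent=2)  # what would land in metadata.json
```

In this toy run geo `B` appears only in the `weather` table, so the inner join drops it from both outputs without raising an error.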
fit vs pred semantics
- `fit` = end-of-harvest features (one row per geo per year).
- `pred` = in-season features used for walk-forward point-in-time estimates (all init dates).
The harvest-date split is the sole mechanism that partitions the two parquets — there is no separate "train/test split" at the feature level; that split happens downstream at the fold level.
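The "one row per geo per year" property of the fit rows is easy to check mechanically. The helper below is a hypothetical validation sketch, not part of the documented pipeline; it assumes rows are available as dicts carrying the index columns.

```python
from collections import Counter
from typing import Dict, List, Tuple

Row = Dict[str, object]


def duplicate_geo_years(rows: List[Row]) -> List[Tuple[object, object]]:
    """Return (geo_identifier, year) pairs that appear on more than one row."""
    counts = Counter((r["geo_identifier"], r["year"]) for r in rows)
    return sorted(pair for pair, n in counts.items() if n > 1)


fit_rows = [
    {"geo_identifier": "A", "year": 2020, "init_date": "2020-09-30"},
    {"geo_identifier": "A", "year": 2021, "init_date": "2021-09-28"},
]
pred_rows = fit_rows + [
    {"geo_identifier": "A", "year": 2020, "init_date": "2020-06-01"},
]

fit_duplicates = duplicate_geo_years(fit_rows)    # should be empty for fit rows
pred_duplicates = duplicate_geo_years(pred_rows)  # repeats are normal for pred rows
```

A non-empty result on the fit rows would suggest that more than one init date matched the configured harvest date for some year.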
Notable claims (the load-bearing ones)
- Index columns (`geo_identifier`, `year`, `init_date`) are the canonical join keys. They are always included, never separated into a distinct file.
- `metadata.json` is the machine-readable contract: downstream consumers use it to slice `index_cols`, `feature_cols`, `target_col` from the unified parquet. This avoids hardcoding column names in stage code.
- The inner join on index columns at the merge step means that a geo/year/init_date present in one builder but absent in another is silently dropped; the intersection defines the modelled universe for that run.
- `fit.parquet` is the only input to the FIT stage; `pred.parquet` is the input to the PREDICT walk-forward loop. The two files are never mixed.
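As an illustration of the metadata-as-contract claim, a downstream stage could slice a unified table using only the three column lists. This is a sketch under assumptions: `slice_by_metadata` is a hypothetical helper, rows are plain dicts rather than a parquet-backed table, and the `rain` and `yield` column names are invented for the example.

```python
import json
from typing import Dict, List, Tuple

Row = Dict[str, object]


def slice_by_metadata(
    rows: List[Row], metadata: Dict[str, object]
) -> Tuple[List[Row], List[Row], List[object]]:
    """Split unified rows into index, features, and target using metadata only."""
    index_cols = metadata["index_cols"]
    feature_cols = metadata["feature_cols"]
    target_col = metadata["target_col"]
    index = [{c: r[c] for c in index_cols} for r in rows]
    features = [{c: r[c] for c in feature_cols} for r in rows]
    target = [r[target_col] for r in rows]
    return index, features, target


# Stand-in for the contents of metadata.json (row counts and year range omitted).
metadata = json.loads("""
{"index_cols": ["geo_identifier", "year", "init_date"],
 "feature_cols": ["rain"],
 "target_col": "yield"}
""")
rows = [
    {"geo_identifier": "A", "year": 2020, "init_date": "2020-09-30",
     "rain": 2.0, "yield": 5.0}
]
index, features, target = slice_by_metadata(rows, metadata)
```

No column name appears in the slicing function itself, which is the property the README summary attributes to the contract: stage code stays unchanged when builders add or rename features.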
What this document is NOT
This README does not describe builder internals or the builder protocol — that is features/builders/README.md. It does not specify the schema of any column beyond the index keys.
Cross-references
- features_builders_README.md: builder protocol and registry
- DESIGN.md: clause on `fit.parquet` column layout (INDEX_COLS leading, never separated)
- in_package_DOMAIN_MODEL.md: `fit.parquet` / `pred.parquet` / `metadata.json` in the Data-shape vocabulary
- README.md: CLI command `cli run features` and the pipeline diagram showing where features land