Pipeline: Preflight
Purpose
Preflight is the guard layer that runs immediately before any compute-intensive stage. It walks every ResolvablePath field on the ExperimentConfig tree, resolves each against data_root, and calls check_path_exists (cloud-safe via cloudpathlib.AnyPath). On the first critical failure, run_preflight raises SystemExit with a message naming the missing path. This ensures a run never starts only to fail 30 minutes later deep inside a builder or model loader. Failing loud at config-load time is a first-class design decision (DESIGN.md Clause 31).
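As a rough illustration of the resolve-then-check step described above, the sketch below joins a configured value onto data_root and tests existence through cloudpathlib.AnyPath. The helper names resolve_against_root and path_exists are hypothetical, the joining rule (relative values are joined, absolute paths and URIs pass through) is an assumption, and the pathlib fallback exists only so the sketch runs without cloudpathlib installed.

```python
from pathlib import Path

try:
    # AnyPath dispatches to a CloudPath for s3://, gs:// and az:// URIs,
    # and to a regular pathlib.Path for local paths.
    from cloudpathlib import AnyPath
except ImportError:
    # Fallback for this sketch only: local paths still work.
    AnyPath = Path


def resolve_against_root(data_root: str, value: str):
    """Hypothetical helper: join relative values onto data_root.

    Assumption: absolute paths and URIs are used as given.
    """
    if "://" in value or Path(value).is_absolute():
        return AnyPath(value)
    return AnyPath(data_root) / value


def path_exists(data_root: str, value: str) -> bool:
    """Return True when the resolved target exists."""
    return resolve_against_root(data_root, value).exists()
```

Routing both local and cloud targets through one path type is what lets a single existence check cover every storage backend.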
Inputs
| Input | Path | Format | Producer |
|---|---|---|---|
| ExperimentConfig | In-memory from YAML + env | Pydantic model | _prepare_config() in cli.py:112 |
| Raw data files | Various, resolved via config.data_root | .parquet, .zarr, .csv, .pkl | External sources (NASS, weather zarr, climo zarr) |
| Feature parquets (hindcast/forecast stages) | {features_dir}/{experiment_key}/fit.parquet + pred.parquet | Parquet | run features / build_features |
| Production model artefacts (forecast-predict) | {run_dir}/models/{experiment_key}/production/detrender.pkl + feature_fill_values.parquet | Pickle / Parquet | run_fit stage |
| Forecast features parquet (forecast-predict) | {run_dir}/forecast/{season_year}/{init_date}/features/pred.parquet | Parquet | run_forecast.run_features |
Outputs
Preflight has no file outputs. It either returns normally (all checks pass) or raises SystemExit (first critical failure). A log line at SUCCESS or ERROR level is emitted for each check that runs; checks after the first critical failure are never executed.
Step-by-step flow
The five public check-set functions share the same execution model: build a list[Check], pass it to run_preflight, which iterates and halts on the first critical failure.
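The shared execution model can be sketched as follows. This is a minimal illustration, not the code in preflight.py: the Check fields (name, run, critical) are assumed, the message text is modelled on the SystemExit strings listed under failure modes, and standard-library logging stands in for whatever logger provides the SUCCESS level.

```python
import logging
from dataclasses import dataclass
from typing import Callable

logger = logging.getLogger("preflight")


@dataclass
class Check:
    """Assumed shape of a preflight check (field names are hypothetical)."""
    name: str                  # e.g. "path_exists:/data/features/fit.parquet"
    run: Callable[[], bool]    # returns True when the check passes
    critical: bool = True      # critical failures abort the run


def run_preflight(checks: list[Check]) -> None:
    """Run checks in order and halt on the first critical failure."""
    for check in checks:
        if check.run():
            logger.info("SUCCESS %s", check.name)
            continue
        if check.critical:
            logger.error("FAILED %s", check.name)
            # Later checks are never executed once this is raised.
            raise SystemExit(f"Critical preflight check failed: {check.name}")
        # Non-critical failures only warn and the loop continues.
        logger.warning("WARNING %s", check.name)
```

Stopping at the first critical failure keeps the error message short and unambiguous, at the cost of reporting only one missing path per run.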
Check-set 1: preflight_paths_for_features (preflight.py:110)
Called by cli.run_features_cmd before build_features.
- Iterate _iter_resolvable_fields(config) to yield every (owner, field_name) pair typed ResolvablePath (preflight.py:122).
- Skip the stress builder's filepath when assemble_stress_from_indices is configured, because that parquet is produced during this stage; _skip_stress_filepath_preflight (preflight.py:93) applies duck-typing on owner.type == "stress" and owner.assemble_stress_from_indices is not None.
- For each remaining non-None resolved value, emit check_path_exists(value) (preflight.py:125–128).
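The three steps above can be sketched with a type-driven walker. Everything here is a stand-in under stated assumptions: plain dataclasses replace the Pydantic models, the _Resolvable marker and the Annotated alias are hypothetical ways to represent the ResolvablePath typing, StressBuilder and Config are invented for the example, and values are resolved by simple joining onto data_root.

```python
from dataclasses import dataclass, fields, is_dataclass
from typing import Annotated, Any, Iterator, Optional, get_type_hints


class _Resolvable:
    """Hypothetical marker standing in for the ResolvablePath type."""


ResolvablePath = Annotated[Optional[str], _Resolvable]


@dataclass
class StressBuilder:  # invented example builder
    filepath: ResolvablePath = None
    type: str = "stress"
    assemble_stress_from_indices: Optional[dict] = None


@dataclass
class Config:  # invented stand-in for ExperimentConfig
    data_root: str
    raw_obs_filepath: ResolvablePath = None
    stress: Optional[StressBuilder] = None


def iter_resolvable_fields(node: Any) -> Iterator[tuple[Any, str]]:
    """Yield (owner, field_name) for every field typed ResolvablePath."""
    if not is_dataclass(node):
        return
    hints = get_type_hints(type(node), include_extras=True)
    for f in fields(node):
        if _Resolvable in getattr(hints[f.name], "__metadata__", ()):
            yield node, f.name
        else:
            # Recurse into nested config objects.
            yield from iter_resolvable_fields(getattr(node, f.name))


def skip_stress_filepath(owner: Any, field_name: str) -> bool:
    """Duck-typed skip: the stress parquet is produced during this stage."""
    return (
        field_name == "filepath"
        and getattr(owner, "type", None) == "stress"
        and getattr(owner, "assemble_stress_from_indices", None) is not None
    )


def paths_to_check(config: Config) -> list[str]:
    """Collect resolved paths that a features preflight would verify."""
    root = config.data_root.rstrip("/")
    out = []
    for owner, name in iter_resolvable_fields(config):
        value = getattr(owner, name)
        if value is None or skip_stress_filepath(owner, name):
            continue
        out.append(f"{root}/{value}")
    return out
```

Because the walker keys off the annotation rather than a list of field names, adding a new field typed ResolvablePath to either dataclass would extend paths_to_check without any change to the walker.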
Check-set 2: preflight_paths_for_hindcast (preflight.py:59)
Called by run_hindcast.run() before the walk-forward phase.
- Iterate config.check_data_exists entries, resolving each via _resolve_check_path (preflight.py:67–68).
- Check {features_dir}/{experiment_key}/fit.parquet: training features must pre-exist (preflight.py:72).
- Check {features_dir}/{experiment_key}/pred.parquet: prediction features must pre-exist (preflight.py:73).
Reference parquet paths (from config.reference_data[i].filepath) are covered automatically by the ResolvablePath walker rather than a hand-maintained list (DESIGN.md Clause 31).
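A minimal sketch of how the hindcast path list might be assembled is shown below. The function name hindcast_required_paths is hypothetical, the check_data_exists entries are assumed to be already-resolved strings (the behaviour of _resolve_check_path is not reproduced), and string joining is used so that URI prefixes such as s3:// survive intact.

```python
def hindcast_required_paths(
    features_dir: str,
    experiment_key: str,
    check_data_exists: list[str],
) -> list[str]:
    """Return the paths a hindcast preflight would verify, in check order."""
    base = f"{features_dir.rstrip('/')}/{experiment_key}"
    return [
        *check_data_exists,        # configured entries come first
        f"{base}/fit.parquet",     # training features
        f"{base}/pred.parquet",    # prediction features
    ]
```

For example, hindcast_required_paths("/data/features", "corn_usa", []) yields the two feature parquet paths under /data/features/corn_usa.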
Check-set 3: preflight_paths_for_forecast_features (preflight.py:132)
Called by run_forecast.run_features() before materialising forecast indices.
- preflight_paths_for_resolvable_inputs(config): all ResolvablePath fields including raw_obs_filepath and materialised_climo_filepath (preflight.py:141).
- Check canonical pred.parquet at {features_dir}/{experiment_key}/pred.parquet, required for area imputation in _impute_forecast_area (preflight.py:142–143; DESIGN.md line 114).
Check-set 4: preflight_paths_for_forecast_predict (preflight.py:147)
Called by run_forecast.run_predict() before scoring.
- preflight_paths_for_resolvable_inputs(config): all ResolvablePath fields (preflight.py:163).
- Check {run_dir}/forecast/{season_year}/{init_date:%Y-%m-%d}/features/pred.parquet: per-(season_year, init_date) forecast features built by the prior sub-stage (preflight.py:166–174).
- Check {run_dir}/models/{experiment_key}/production/detrender.pkl (preflight.py:177).
- Check {run_dir}/models/{experiment_key}/production/feature_fill_values.parquet (preflight.py:178).
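The stage-specific paths listed above follow directly from the templates. The sketch below shows one way to build them; forecast_predict_required_paths is a hypothetical name, init_date is assumed to be a datetime.date, and the ResolvablePath inputs checked in the first step are omitted.

```python
from datetime import date


def forecast_predict_required_paths(
    run_dir: str,
    experiment_key: str,
    season_year: int,
    init_date: date,
) -> list[str]:
    """Return the artefact paths a forecast-predict preflight would verify."""
    root = run_dir.rstrip("/")
    production = f"{root}/models/{experiment_key}/production"
    return [
        # init_date is rendered as YYYY-MM-DD via the date format spec.
        f"{root}/forecast/{season_year}/{init_date:%Y-%m-%d}/features/pred.parquet",
        f"{production}/detrender.pkl",
        f"{production}/feature_fill_values.parquet",
    ]
```

Formatting init_date explicitly avoids depending on the default string form of whatever date type the config carries.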
Check-set 5: preflight_paths_for_export (preflight.py:205)
Called by cli.run_export_cmd before export.
- preflight_paths_for_resolvable_inputs(config): resolvable inputs only (preflight.py:207).
A sixth convenience function, preflight_paths_for_forecast (preflight.py:183), is the union of check-sets 3 and 4; it is used by the thin run_forecast composer.
Mermaid flow diagram
flowchart LR
CFG["ExperimentConfig\n(data_root resolved)"]
ITER["_iter_resolvable_fields(config)\npreflight.py:85"]
C1["preflight_paths_for_features\npreflight.py:110"]
C2["preflight_paths_for_hindcast\npreflight.py:59"]
C3["preflight_paths_for_forecast_features\npreflight.py:132"]
C4["preflight_paths_for_forecast_predict\npreflight.py:147"]
C5["preflight_paths_for_export\npreflight.py:205"]
RP["run_preflight(checks)\npreflight.py:42"]
OK["Stage proceeds"]
FAIL["SystemExit\n(critical check failed)"]
CFG --> ITER
ITER --> C1
ITER --> C2
ITER --> C3
ITER --> C4
ITER --> C5
C1 --> RP
C2 --> RP
C3 --> RP
C4 --> RP
C5 --> RP
RP -->|all pass| OK
RP -->|first critical fail| FAIL
Invariants and contracts
DESIGN.md Clause 31 (verbatim):
"WHEN a config field is typed ResolvablePath, the system SHALL assert the resolved target exists via check_path_exists in a per-stage preflight_paths_for_<stage> invoked before the consuming stage executes. The check set must be derived from _iter_resolvable_fields(config), not hand-maintained, so that adding a new ResolvablePath field extends preflight coverage by construction."
Every check_path_exists call sets critical=True (preflight.py:38). run_preflight stops on the first failure — it does not accumulate all failures before raising. Non-critical checks (warnings) are supported by the Check dataclass but none of the current check-set functions emit them.
Failure modes and recovery
| Symptom | Likely cause | Recovery |
|---|---|---|
| SystemExit: Critical preflight check failed: path_exists:{features_dir}/corn_usa/fit.parquet | Features not yet built | Run make features or cli run features --config corn_usa |
| SystemExit: … detrender.pkl | Production fit not present | Run make fit-production RUN_DIR=… or cli run fit-production |
| SystemExit: … forecast/{y}/{d}/features/pred.parquet | Forecast features not built | Run cli run forecast-features --run-dir … --season-year … --init-date … |
| SystemExit: … stress_parquet.parquet when assemble_stress_from_indices is set | Skip logic failed; this path should not be checked | Check _skip_stress_filepath_preflight at preflight.py:93 |
| Preflight passes but stage fails on missing file | A new ResolvablePath field was added to config but not typed correctly | Ensure the field annotation uses ResolvablePath, not bare str or Path |
Cross-references
- ExperimentConfig (carries check_data_exists, all ResolvablePath fields, features_dir)
- Concept: ResolvablePath (the typing convention that drives automatic preflight coverage)
- Pipeline: feature_build (calls check-set 1)
- Pipeline: fit (check-set 2 runs before the hindcast walk-forward phase)
- Pipeline: predict (check-set 4 gates forecast scoring)
- Source: orchestration (run/preflight.py full summary)
- Source: DESIGN.md (Clause 31: preflight from _iter_resolvable_fields)
PRs that materially changed this stage
- PR #369 (f5399b96): restructured forecast path from forecast/{init_date}/ to forecast/{season_year}/{init_date}/; updated preflight_paths_for_forecast_predict path construction at preflight.py:166–173.
- PR #345 (tl/fix-path-issues): added AnyPath-based check_path_exists so S3 URIs are resolved correctly (DESIGN.md Clause 27).