Pipeline: Preflight

Purpose

Preflight is the guard layer that runs immediately before any compute-intensive stage. It walks every ResolvablePath field on the ExperimentConfig tree, resolves each against data_root, and calls check_path_exists on the result (cloud-safe via cloudpathlib.AnyPath). On the first critical failure, run_preflight raises SystemExit with a message naming the missing path, so a run never starts only to fail 30 minutes later deep inside a builder or model loader. Failing loudly at config-load time is a first-class design decision (DESIGN.md Clause 31).
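As a rough illustration of the existence check described above, the sketch below wraps a path in a Check record. The Check field names, the `path_exists:` naming scheme (inferred from the error message shown under "Failure modes and recovery"), and the fallback to pathlib when cloudpathlib is not installed are all assumptions made for this example; the real code in preflight.py may differ.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path

try:
    # AnyPath dispatches to a cloud path for URIs such as s3://... and to
    # pathlib.Path for local paths.
    from cloudpathlib import AnyPath
except ImportError:
    # Assumption for this sketch only: fall back to local paths.
    AnyPath = Path


@dataclass(frozen=True)
class Check:
    """Hypothetical shape of a single preflight result."""
    name: str
    passed: bool
    critical: bool = True


def check_path_exists(path: str) -> Check:
    """Return a critical Check recording whether `path` exists."""
    return Check(
        name=f"path_exists:{path}",
        passed=AnyPath(path).exists(),
        critical=True,
    )


# Usage: one file that exists and one that does not.
with tempfile.TemporaryDirectory() as tmp:
    present = Path(tmp) / "fit.parquet"
    present.touch()
    ok = check_path_exists(str(present))
    missing = check_path_exists(str(Path(tmp) / "pred.parquet"))

print(ok.passed, missing.passed)
```

Keeping the check as plain data (name, passed, critical) separates detecting a problem from deciding what to do about it, which is left to run_preflight.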

Inputs

| Input | Path | Format | Producer |
|---|---|---|---|
| ExperimentConfig | In-memory from YAML + env | Pydantic model | _prepare_config() in cli.py:112 |
| Raw data files | Various; resolved via config.data_root | .parquet, .zarr, .csv, .pkl | External sources (NASS, weather zarr, climo zarr) |
| Feature parquets (hindcast/forecast stages) | {features_dir}/{experiment_key}/fit.parquet + pred.parquet | Parquet | run features / build_features |
| Production model artefacts (forecast-predict) | {run_dir}/models/{experiment_key}/production/detrender.pkl + feature_fill_values.parquet | Pickle / Parquet | run_fit stage |
| Forecast features parquet (forecast-predict) | {run_dir}/forecast/{season_year}/{init_date}/features/pred.parquet | Parquet | run_forecast.run_features |

Outputs

Preflight has no file outputs. It either returns normally (all checks pass) or raises SystemExit (first critical failure). A log line at SUCCESS or ERROR level is emitted for each check that runs.

Step-by-step flow

The five public check-set functions share the same execution model: build a list[Check], pass it to run_preflight, which iterates and halts on the first critical failure.
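The shared execution model can be sketched as follows. This is a minimal approximation, not the code in preflight.py: the Check fields are assumed, the standard logging module stands in for the project's logger (INFO replaces the SUCCESS level, which stdlib logging does not have), and the exit message format is inferred from the symptom listed under "Failure modes and recovery".

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger("preflight")


@dataclass(frozen=True)
class Check:
    """Hypothetical shape of a single preflight result."""
    name: str
    passed: bool
    critical: bool = True


def run_preflight(checks: list[Check]) -> None:
    """Iterate checks in order and halt on the first critical failure."""
    for check in checks:
        if check.passed:
            logger.info("preflight ok: %s", check.name)
            continue
        logger.error("preflight failed: %s", check.name)
        if check.critical:
            # Fail fast: later checks are never reported.
            raise SystemExit(f"Critical preflight check failed: {check.name}")


# Usage: the second check fails, so the third is never reported.
checks = [
    Check("path_exists:a", passed=True),
    Check("path_exists:b", passed=False),
    Check("path_exists:c", passed=False),
]
try:
    run_preflight(checks)
except SystemExit as exc:
    message = str(exc)

print(message)
```

Because the loop raises on the first critical failure, the message names exactly one path, even when several are missing.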

Check-set 1 — preflight_paths_for_features (preflight.py:110)

Called by cli.run_features_cmd before build_features.

  1. Iterate _iter_resolvable_fields(config) to yield every (owner, field_name) pair typed ResolvablePath (preflight.py:122).
  2. Skip the stress builder's filepath when assemble_stress_from_indices is configured, because that parquet is produced during this stage — _skip_stress_filepath_preflight (preflight.py:93) applies duck-typing on owner.type == "stress" and owner.assemble_stress_from_indices is not None.
  3. For each remaining non-None resolved value, emit check_path_exists(value) (preflight.py:125–128).
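The three steps above can be sketched with plain dataclasses standing in for the Pydantic config tree. Everything here is illustrative: the class layout, the helper names (iter_resolvable_fields, skip_stress_filepath, paths_to_check), and the simple string join against data_root are assumptions, not the real implementation.

```python
from dataclasses import dataclass, fields, is_dataclass
from typing import Iterator, NewType, Optional, get_args, get_type_hints

# Stand-in marker type; the real ResolvablePath is defined by the project.
ResolvablePath = NewType("ResolvablePath", str)


@dataclass
class StressBuilder:
    type: str
    filepath: Optional[ResolvablePath] = None
    assemble_stress_from_indices: Optional[dict] = None


@dataclass
class ReferenceData:
    filepath: Optional[ResolvablePath] = None


@dataclass
class ExperimentConfig:
    data_root: str
    stress: StressBuilder
    reference: ReferenceData


def _is_resolvable(annotation) -> bool:
    """True for ResolvablePath and Optional[ResolvablePath] annotations."""
    return annotation is ResolvablePath or ResolvablePath in get_args(annotation)


def iter_resolvable_fields(node) -> Iterator[tuple[object, str]]:
    """Yield (owner, field_name) for every ResolvablePath field in the tree."""
    hints = get_type_hints(type(node))
    for f in fields(node):
        value = getattr(node, f.name)
        if _is_resolvable(hints[f.name]):
            yield node, f.name
        elif is_dataclass(value):
            yield from iter_resolvable_fields(value)


def skip_stress_filepath(owner, field_name: str) -> bool:
    """Duck-typed skip: the stress parquet is produced by this stage."""
    return (
        field_name == "filepath"
        and getattr(owner, "type", None) == "stress"
        and getattr(owner, "assemble_stress_from_indices", None) is not None
    )


def paths_to_check(config: ExperimentConfig) -> list[str]:
    """Resolve every non-skipped, non-None field against data_root."""
    root = config.data_root.rstrip("/")
    selected = []
    for owner, name in iter_resolvable_fields(config):
        if skip_stress_filepath(owner, name):
            continue
        value = getattr(owner, name)
        if value is not None:
            selected.append(f"{root}/{value}")
    return selected


config = ExperimentConfig(
    data_root="/data",
    stress=StressBuilder(
        type="stress",
        filepath=ResolvablePath("stress/stress.parquet"),
        assemble_stress_from_indices={"indices": ["spi"]},
    ),
    reference=ReferenceData(filepath=ResolvablePath("reference/nass.parquet")),
)
selected = paths_to_check(config)
print(selected)
```

Since the walker keys on the type annotation rather than on a list of field names, adding another ResolvablePath field to any node extends the check set without touching paths_to_check.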

Check-set 2 — preflight_paths_for_hindcast (preflight.py:59)

Called by run_hindcast.run() before the walk-forward phase.

  1. Iterate config.check_data_exists entries, resolving each via _resolve_check_path (preflight.py:67–68).
  2. Check {features_dir}/{experiment_key}/fit.parquet — training features must pre-exist (preflight.py:72).
  3. Check {features_dir}/{experiment_key}/pred.parquet — prediction features must pre-exist (preflight.py:73).

Reference parquet paths (from config.reference_data[i].filepath) are covered automatically by the ResolvablePath walker rather than a hand-maintained list (DESIGN.md Clause 31).

Check-set 3 — preflight_paths_for_forecast_features (preflight.py:132)

Called by run_forecast.run_features() before materialising forecast indices.

  1. preflight_paths_for_resolvable_inputs(config) — all ResolvablePath fields including raw_obs_filepath and materialised_climo_filepath (preflight.py:141).
  2. Check canonical pred.parquet at {features_dir}/{experiment_key}/pred.parquet — required for area imputation in _impute_forecast_area (preflight.py:142–143; DESIGN.md line 114).

Check-set 4 — preflight_paths_for_forecast_predict (preflight.py:147)

Called by run_forecast.run_predict() before scoring.

  1. preflight_paths_for_resolvable_inputs(config) — all ResolvablePath fields (preflight.py:163).
  2. Check {run_dir}/forecast/{season_year}/{init_date:%Y-%m-%d}/features/pred.parquet — per-(season_year, init_date) forecast features built by the prior sub-stage (preflight.py:166–174).
  3. Check {run_dir}/models/{experiment_key}/production/detrender.pkl (preflight.py:177).
  4. Check {run_dir}/models/{experiment_key}/production/feature_fill_values.parquet (preflight.py:178).
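To make the path layout in steps 2 to 4 concrete, the sketch below builds the three stage-specific paths from their components. The function name forecast_predict_paths and the example run directory are hypothetical; only the path templates come from the steps above. Plain string joins are used here so that a URI-style run_dir keeps its scheme intact, which is a choice made for this sketch.

```python
from datetime import date


def forecast_predict_paths(
    run_dir: str, experiment_key: str, season_year: int, init_date: date
) -> list[str]:
    """Build the stage-specific paths that forecast-predict requires."""
    root = run_dir.rstrip("/")
    production = f"{root}/models/{experiment_key}/production"
    return [
        # init_date is rendered as YYYY-MM-DD via the date format spec.
        f"{root}/forecast/{season_year}/{init_date:%Y-%m-%d}/features/pred.parquet",
        f"{production}/detrender.pkl",
        f"{production}/feature_fill_values.parquet",
    ]


# Hypothetical run directory and experiment key, for illustration only.
paths = forecast_predict_paths(
    "s3://example-bucket/runs/r1", "corn_usa", 2024, date(2024, 6, 1)
)
for p in paths:
    print(p)
```

Each resulting string would then be passed to check_path_exists alongside the checks produced by the ResolvablePath walk.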

Check-set 5 — preflight_paths_for_export (preflight.py:205)

Called by cli.run_export_cmd before export.

  1. preflight_paths_for_resolvable_inputs(config) — resolvable inputs only (preflight.py:207).

A sixth convenience function, preflight_paths_for_forecast (preflight.py:183), is the union of check-sets 3 and 4. It is used by the thin run_forecast composer.

Mermaid flow diagram

flowchart LR
    CFG["ExperimentConfig\n(data_root resolved)"]
    ITER["_iter_resolvable_fields(config)\npreflight.py:85"]
    C1["preflight_paths_for_features\npreflight.py:110"]
    C2["preflight_paths_for_hindcast\npreflight.py:59"]
    C3["preflight_paths_for_forecast_features\npreflight.py:132"]
    C4["preflight_paths_for_forecast_predict\npreflight.py:147"]
    C5["preflight_paths_for_export\npreflight.py:205"]
    RP["run_preflight(checks)\npreflight.py:42"]
    OK["Stage proceeds"]
    FAIL["SystemExit\n(critical check failed)"]

    CFG --> ITER
    ITER --> C1
    ITER --> C2
    ITER --> C3
    ITER --> C4
    ITER --> C5
    C1 --> RP
    C2 --> RP
    C3 --> RP
    C4 --> RP
    C5 --> RP
    RP -->|all pass| OK
    RP -->|first critical fail| FAIL

Invariants and contracts

DESIGN.md Clause 31 (verbatim):

"WHEN a config field is typed ResolvablePath, the system SHALL assert the resolved target exists via check_path_exists in a per-stage preflight_paths_for_<stage> invoked before the consuming stage executes. The check set must be derived from _iter_resolvable_fields(config), not hand-maintained, so that adding a new ResolvablePath field extends preflight coverage by construction."

Every check_path_exists call sets critical=True (preflight.py:38). run_preflight stops on the first critical failure; it does not accumulate all failures before raising, so fixing one missing path may expose the next on the following run. Non-critical checks (warnings) are supported by the Check dataclass, but none of the current check-set functions emit them.

Failure modes and recovery

| Symptom | Likely cause | Recovery |
|---|---|---|
| SystemExit: Critical preflight check failed: path_exists:{features_dir}/corn_usa/fit.parquet | Features not yet built | Run make features or cli run features --config corn_usa |
| SystemExit: … detrender.pkl | Production fit not present | Run make fit-production RUN_DIR=… or cli run fit-production |
| SystemExit: … forecast/{y}/{d}/features/pred.parquet | Forecast features not built | Run cli run forecast-features --run-dir … --season-year … --init-date … |
| SystemExit: … stress_parquet.parquet when assemble_stress_from_indices is set | Skip logic failed; this path should not be checked | Check _skip_stress_filepath_preflight at preflight.py:93 |
| Preflight passes but stage fails on missing file | A new path field was added to config but not typed as ResolvablePath | Ensure the field annotation uses ResolvablePath, not bare str or Path |

Cross-references

PRs that materially changed this stage

  • PR #369 (f5399b96) — restructured forecast path from forecast/{init_date}/ to forecast/{season_year}/{init_date}/; updated preflight_paths_for_forecast_predict path construction at preflight.py:166–173.
  • PR #345 (tl/fix-path-issues) — added AnyPath-based check_path_exists so S3 URIs are resolved correctly (DESIGN.md Clause 27).