Entity: ExperimentProtocolConfig

Definition

ExperimentProtocolConfig is the Pydantic model that declares the cross-validation schedule for a commodity run. It specifies which years are held out for walk-forward evaluation (test_years), the CV strategy name (cv_strategy, always "expanding" today), and the production-fold inclusion threshold for county selection. It is consumed by ExpandingFoldGenerator in run/experiment_protocol.py, which generates (fold_label, train_data, test_data, year_data, references_fold) tuples.

FoldSchedule in the dashboard layer (app/_dashboard_config.py) is a derived read-only view of this schedule and is out of scope for the core pipeline domain model.

Kind

Pydantic BaseModel. The class does not declare model_config, so it inherits Pydantic's defaults and is not frozen (there is no frozen=True in the source). Nested inside ExperimentConfig.

Source of truth

market_insights_models/src/commodity_hindcast/config.py:483

Key attributes

| Field | Type | Default | Meaning | YAML example |
|---|---|---|---|---|
| cv_strategy | str | required | Walk-forward CV variant. Currently always "expanding". Declared as a string (not a Literal) to allow future extension without a schema break. | expanding |
| test_years | list[int] | required | Ordered list of harvest years held out in sequence. Each entry produces one numeric fold_label (e.g. "2020") and its corresponding HindcastSlice. | [2020, 2021, 2022, 2023, 2024] |
| production_cumulative_threshold | float | 1.0 | Top-N% of counties by recent production retained in the included_geo_identifiers universe. 0.95 → top 95%; 1.0 → all counties. | 0.95 |
| production_recent_years | int | 5 | Number of most-recent years used to rank counties for the cumulative threshold. | 5 |
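To make the field shapes and defaults concrete, here is a minimal sketch. It uses a stdlib dataclass as a stand-in, not the real Pydantic class in config.py; the name ExperimentProtocolConfigSketch is hypothetical, and only the field names, types, and defaults are taken from the attributes listed above.

```python
from dataclasses import dataclass


@dataclass
class ExperimentProtocolConfigSketch:
    """Hypothetical stand-in; the real class is a Pydantic BaseModel."""

    cv_strategy: str  # required; currently always "expanding"
    test_years: list[int]  # required; harvest years held out in sequence
    production_cumulative_threshold: float = 1.0  # 1.0 keeps all counties
    production_recent_years: int = 5  # years used to rank counties


# Values taken from the "YAML example" entries above.
protocol = ExperimentProtocolConfigSketch(
    cv_strategy="expanding",
    test_years=[2020, 2021, 2022, 2023, 2024],
    production_cumulative_threshold=0.95,
)
```

Omitting production_cumulative_threshold leaves it at 1.0, which is the silent keep-all-counties behaviour noted under Open questions.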

Fold generation

ExpandingFoldGenerator (run/experiment_protocol.py:110) iterates test_years in sorted order. For each test_year:

  • Training data: fit_df[year < test_year] (all years strictly before the test year).
  • Test data: pred_df[year == test_year] (the hold-out year only).
  • Fold label: str(test_year) (e.g. "2020").
  • References fold: subset of each reference series for marketing_year == test_year.

After all numeric folds, a "production" fold is also generated, trained on all available data up to feature_end_year. The production fold has no test-year holdout.
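The expanding schedule described above can be sketched roughly as follows. This is not the ExpandingFoldGenerator code: rows are plain dicts standing in for fit_df and pred_df, the function name expanding_folds is hypothetical, the reference-series and year_data elements of the tuple are omitted, and "up to feature_end_year" is assumed to be inclusive.

```python
def expanding_folds(rows, test_years, feature_end_year):
    """Yield (fold_label, train_rows, test_rows) for an expanding schedule.

    rows: list of dicts with a "year" key (hypothetical stand-in for the
    fit and prediction frames).
    """
    for test_year in sorted(test_years):
        # Train on every year strictly before the held-out year.
        train = [r for r in rows if r["year"] < test_year]
        # Test on the held-out year only.
        test = [r for r in rows if r["year"] == test_year]
        yield str(test_year), train, test

    # Production fold: all data up to feature_end_year (assumed inclusive),
    # with no test-year hold-out.
    production_train = [r for r in rows if r["year"] <= feature_end_year]
    yield "production", production_train, []


rows = [{"year": y} for y in range(2015, 2025)]
folds = list(expanding_folds(rows, [2021, 2020], feature_end_year=2024))
labels = [label for label, _, _ in folds]
```

With two test years the sketch yields three folds, matching the len(test_years) + 1 cardinality, and the labels come out in sorted order regardless of the order of test_years.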

The cutoff for a numeric fold is date(int(fold_label), 1, 1); for the production fold it is date(feature_end_year + 1, 1, 1). (See lib/results/results_slice.py:151.)
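As a small illustration of the cutoff rule, the helper below encodes the two cases. The function name fold_cutoff and its signature are hypothetical and are not taken from results_slice.py.

```python
from datetime import date


def fold_cutoff(fold_label: str, feature_end_year: int) -> date:
    """Return the cutoff date for a fold label (hypothetical helper)."""
    if fold_label == "production":
        # Production fold: first day of the year after the last feature year.
        return date(feature_end_year + 1, 1, 1)
    # Numeric fold: first day of the held-out year.
    return date(int(fold_label), 1, 1)
```

For example, fold_cutoff("2020", 2024) gives 1 January 2020, while fold_cutoff("production", 2024) gives 1 January 2025.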

Lifecycle

  1. Constructed as part of ExperimentConfig.
  2. Consumed by run/runner.run_walk_forward() which passes a DataFoldGenerator derived from this config.
  3. production_cumulative_threshold and production_recent_years are used during the FIT stage to select included_geo_identifiers — the county universe written to run_dir/included_geo_identifiers.txt.
  4. Persisted only as part of config_resolved.yaml; not a standalone artefact.
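The county selection in step 3 might look like the sketch below. The exact rule used in the FIT stage is not shown here, so this is an assumption: counties are ranked by total production over the most recent production_recent_years years and kept until their cumulative share reaches production_cumulative_threshold. The function name select_counties and the nested-dict input layout are hypothetical.

```python
def select_counties(production, threshold=1.0, recent_years=5):
    """Pick counties by cumulative share of recent production (assumed rule).

    production: {county_id: {year: production}} (hypothetical layout).
    """
    all_years = sorted({y for by_year in production.values() for y in by_year})
    recent = set(all_years[-recent_years:])

    # Total production per county over the recent window.
    totals = {
        county: sum(v for y, v in by_year.items() if y in recent)
        for county, by_year in production.items()
    }
    grand_total = sum(totals.values())
    if grand_total <= 0:
        return sorted(totals)  # nothing to rank on; keep every county

    selected, running = [], 0.0
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    for county, total in ranked:
        if running >= threshold * grand_total:
            break  # threshold already covered by larger counties
        selected.append(county)
        running += total
    return selected


production = {
    "A": {2023: 60, 2024: 60},
    "B": {2023: 15, 2024: 15},
    "C": {2023: 20, 2024: 20},
    "D": {2023: 5, 2024: 5},
}
kept = select_counties(production, threshold=0.9)
```

Under this assumed rule a threshold of 1.0 keeps every county, which is consistent with the class-level default described above.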

Relationships

  • Owned by ExperimentConfig (config.py:682).
  • Drives ExpandingFoldGenerator — one generator instance per run, constructed in run/runner.run_walk_forward.
  • Drives Fold / HindcastSlice cardinality: len(test_years) + 1 slices per run (numeric folds + production).
  • Inspected by FoldSchedule (dashboard layer only) to map season dates to available fold labels.

Concepts and pipelines that touch this entity

PRs and commits

  • PR #331 (PR-331.md) — production_cumulative_threshold and production_recent_years added to control the included-geo-identifiers selection.

Open questions

  • cv_strategy is declared as a plain str rather than a Literal["expanding"]. A future strategy (e.g. sliding window) would add a value here and a corresponding AbstractFoldGenerator subclass. No implementation exists yet.
  • production_cumulative_threshold defaults to 1.0 (all counties) at the class level, but all production YAMLs set it to 0.90 or 0.95. The default silently keeps all counties, which may inflate uncertainty at the tail; the intent was to make the threshold explicit in every config.