Entity: ExperimentConfig

Definition

ExperimentConfig is the frozen, validated root configuration object for a single commodity pipeline run. It inherits pydantic_settings.BaseSettings and is the sole config authority passed to every stage: FEATURES, FIT, POSTPROCESS, EVALUATE, DELIVER, and FORECAST. All subordinate config blocks (CommodityConfig, ModelConfig, etc.) are nested fields. Every ResolvablePath field anywhere in the nested tree is resolved against data_root at construction time, before any stage begins.
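
The path resolution described above can be pictured with a small library-free sketch. It is not the project's code: RootConfig, NestedBlock, resolve_against, and the example file name are hypothetical, plain dataclasses stand in for the Pydantic models, and "resolved against data_root" is assumed to mean "joined to data_root when relative".

```python
from dataclasses import dataclass, fields, is_dataclass, replace
from pathlib import PurePosixPath


@dataclass(frozen=True)
class NestedBlock:  # hypothetical stand-in for a nested config block
    input_file: PurePosixPath


@dataclass(frozen=True)
class RootConfig:  # hypothetical stand-in for the root config
    data_root: PurePosixPath
    nested: NestedBlock


def resolve_against(node, data_root: PurePosixPath):
    """Return a copy of node with every relative path field joined to data_root."""
    updates = {}
    for f in fields(node):
        value = getattr(node, f.name)
        if is_dataclass(value):
            # Recurse so paths at any nesting depth are handled.
            updates[f.name] = resolve_against(value, data_root)
        elif isinstance(value, PurePosixPath) and not value.is_absolute():
            # Assumption: absolute paths are left untouched.
            updates[f.name] = data_root / value
    return replace(node, **updates)


raw = RootConfig(
    data_root=PurePosixPath("/data"),
    nested=NestedBlock(input_file=PurePosixPath("weather/example.parquet")),
)
resolved = resolve_against(raw, raw.data_root)
```

Here resolved.nested.input_file becomes /data/weather/example.parquet while data_root, already absolute, is unchanged.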

Kind

Pydantic BaseSettings subclass; root config aggregate. Frozen by convention rather than by Pydantic enforcement: model_config is a SettingsConfigDict and does not set {"frozen": True}. In practice the config is treated as immutable once it has been written to config_resolved.yaml; before that point _create_run_root still repoints models_dir and preds_dir.

Source of truth

market_insights_models/src/commodity_hindcast/config.py:611

YAML loading order

settings_customise_sources at config.py:860 overrides the pydantic-settings default resolution chain. Priority from highest to lowest:

| Priority | Source | Notes |
|---|---|---|
| 1 (highest) | CliSettingsSource | Click CLI flags, forwarded by _prepare_config(), which sets the COMMODITY_HINDCAST_CONFIG env var |
| 2 | env_settings | Environment variables; nested delimiter __ (e.g. MODEL__REGRESSION=ridge) |
| 3 | YamlConfigSettingsSource | YAML file resolved by _experiment_config_yaml_path(): the COMMODITY_HINDCAST_CONFIG env var if set, otherwise <project_root>/configs/config.yaml |
| 4 (lowest) | init_settings | Keyword arguments passed to the constructor; Pydantic field defaults apply when no source supplies a value |

dotenv_settings and file_secret_settings are explicitly removed from the chain. cli_parse_args=False (config.py:630) prevents pydantic-settings from consuming sys.argv directly — Click handles CLI flags independently (issue #264).
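
How a priority chain like this resolves can be modelled without the library. The sketch below is only an illustration of the ordering (env over YAML over init values), not pydantic-settings code: env_to_nested, deep_merge, and resolve are hypothetical helpers, the sample values ("lasso", "detrend", "linear", seed 7) are invented, and the CLI slot is left out because flags reach the config through the env var.

```python
from typing import Any


def env_to_nested(env: dict[str, str], delimiter: str = "__") -> dict[str, Any]:
    """Expand MODEL__REGRESSION=ridge into {"model": {"regression": "ridge"}}."""
    nested: dict[str, Any] = {}
    for key, value in env.items():
        # Lower-casing assumes case-insensitive env matching.
        *parents, leaf = key.lower().split(delimiter)
        node = nested
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return nested


def deep_merge(low: dict[str, Any], high: dict[str, Any]) -> dict[str, Any]:
    """Return low overlaid with high; nested dicts merge key by key."""
    merged = dict(low)
    for key, value in high.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def resolve(sources_high_to_low: list[dict[str, Any]]) -> dict[str, Any]:
    """Apply sources from lowest to highest priority so higher ones win."""
    resolved: dict[str, Any] = {}
    for source in reversed(sources_high_to_low):
        resolved = deep_merge(resolved, source)
    return resolved


env_values = env_to_nested({"MODEL__REGRESSION": "ridge"})
yaml_values = {"model": {"regression": "lasso", "detrend": "linear"}, "random_seed": 7}
init_values = {"random_seed": 42}

resolved = resolve([env_values, yaml_values, init_values])
```

In this example the environment variable overrides model.regression from the YAML, the YAML's random_seed overrides the init value, and the untouched YAML key model.detrend survives the merge.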

Key attributes

| Field | Type | Default | Meaning | YAML example |
|---|---|---|---|---|
| data_root | AnyPath | require_input_data_dir() | Base directory; all relative paths anchor here. Read from INPUT_DATA_DIR env var (alias data_root) | (env var only) |
| experiment_name | str | required | Slug-safe identifier used for MLflow experiment name and run_dir naming. Pattern: [a-zA-Z0-9_-]+ | corn_yield_prediction |
| random_seed | int | 42 | Global RNG seed for reproducibility | 42 |
| mlflow_tracking_uri | str | sqlite:///mlruns.db | MLflow tracking backend URI | sqlite:///mlruns.db |
| feature_start_year | int | 1980 | Earliest year for which feature rows are built | 1980 |
| feature_end_year | int | 2025 | Latest year (inclusive) for feature parquets | 2025 |
| check_data_exists | list[str] | [] | Extra preflight paths that must exist before a run | [] |
| raw_dir | AnyPath \| None | data_root / "raw" | Raw data directory; filled by _fill_defaults_from_data_root | |
| features_dir | AnyPath \| None | data_root / "features" | Features parquet directory | |
| models_dir | AnyPath \| None | data_root / "models" | Trained model artefacts directory | |
| preds_dir | AnyPath \| None | data_root / "predictions" | Fold prediction parquets directory | |
| run_dir_base | AnyPath \| None | data_root / "runs" | Root for all timestamped run_dirs | |
| commodity | CommodityConfig | required | Commodity-specific constants (calendar, builders, feature columns) | inline dict or "corn" stem |
| experiment_protocol | ExperimentProtocolConfig | required | Walk-forward CV schedule | inline dict |
| model | ModelConfig | ModelConfig() | Detrend strategy + regression estimator | inline dict |
| reference_data | list[ReferenceYieldSpec] | [] | External benchmark specs (WASDE / CONAB). Each name must be unique | list of dicts |
| postprocess | PostprocessConfig | required | Bias correction + conformal calibration settings | inline dict |
| delivery | DeliveryConfig | required | CI levels, public model name | inline dict |
| forecast | ForecastConfig \| None | None | Forecast-specific paths + residual_mode; None → hindcast mode | inline dict or omit |
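
Two rules from the table, the directory defaults derived from data_root and the experiment_name pattern, can be sketched as follows. This is an assumption-laden illustration rather than the code in config.py: fill_directory_defaults and is_valid_experiment_name are hypothetical helpers, pathlib stands in for AnyPath, the override path is invented, and the pattern is assumed to apply to the whole string.

```python
import re
from pathlib import PurePosixPath
from typing import Optional

# Field name -> subdirectory of data_root, as listed in the table.
DEFAULT_SUBDIRS = {
    "raw_dir": "raw",
    "features_dir": "features",
    "models_dir": "models",
    "preds_dir": "predictions",
    "run_dir_base": "runs",
}

EXPERIMENT_NAME_PATTERN = re.compile(r"[a-zA-Z0-9_-]+")


def fill_directory_defaults(
    data_root: PurePosixPath,
    overrides: dict[str, Optional[PurePosixPath]],
) -> dict[str, PurePosixPath]:
    """Keep explicit directories; derive any missing one from data_root."""
    filled = {}
    for field, subdir in DEFAULT_SUBDIRS.items():
        explicit = overrides.get(field)
        filled[field] = explicit if explicit is not None else data_root / subdir
    return filled


def is_valid_experiment_name(name: str) -> bool:
    """True when the whole name consists of slug-safe characters."""
    return EXPERIMENT_NAME_PATTERN.fullmatch(name) is not None


dirs = fill_directory_defaults(
    PurePosixPath("/data"),
    {"models_dir": PurePosixPath("/scratch/models")},  # hypothetical override
)
```

With these inputs, dirs["raw_dir"] is /data/raw and dirs["models_dir"] keeps the explicit override; a name such as corn_yield_prediction passes the check, while one containing a space does not.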

Lifecycle

  1. cli._prepare_config() sets COMMODITY_HINDCAST_CONFIG env var to the resolved YAML path, then calls ExperimentConfig() (no model_validate; the BaseSettings constructor fires settings_customise_sources).
  2. Pydantic validators run in declaration order: _prepare_commodity (mode=before, resolves nested commodity YAML), then _fill_defaults_from_data_root, _resolve_data_paths, _reference_data_names_unique, ensure_dirs (all mode=after).
  3. Validated config is passed to stages/run_hindcast._create_run_root, which writes config_resolved.yaml and mutates models_dir/preds_dir in-place to point inside the new run_dir.
  4. All downstream stages receive the config from config_resolved.yaml via _load_config(run_dir) (cached per process).
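
Steps 1 and 4 can be sketched with the standard library alone. The function names below are hypothetical stand-ins for _prepare_config, _experiment_config_yaml_path, and _load_config; the loader returns raw text instead of a validated ExperimentConfig, and functools.lru_cache is only one assumed way to get per-process caching.

```python
import functools
import os
from pathlib import Path

CONFIG_ENV_VAR = "COMMODITY_HINDCAST_CONFIG"


def prepare_config(yaml_path: Path) -> None:
    """Step 1 (sketch): publish the resolved YAML path through the env var."""
    os.environ[CONFIG_ENV_VAR] = str(yaml_path.resolve())


def experiment_config_yaml_path(project_root: Path) -> Path:
    """Prefer the env var; otherwise use <project_root>/configs/config.yaml."""
    override = os.environ.get(CONFIG_ENV_VAR)
    if override:
        return Path(override)
    return project_root / "configs" / "config.yaml"


@functools.lru_cache(maxsize=None)
def load_resolved_config_text(run_dir: Path) -> str:
    """Step 4 (sketch): read config_resolved.yaml once per run_dir per process."""
    return (run_dir / "config_resolved.yaml").read_text(encoding="utf-8")
```

One consequence of caching per process is that edits to config_resolved.yaml made after the first load are not seen by later calls in the same process, which fits the convention that the config is immutable once written.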

Relationships

  • Root of the ExperimentConfig aggregate: directly contains CommodityConfig, ModelConfig, ExperimentProtocolConfig, PostprocessConfig (→ BiasCorrectorConfig), DeliveryConfig, ForecastConfig | None, list[ReferenceYieldSpec].
  • Held by ExperimentResult as its config field — the single in-memory config instance per run.
  • Loaded lazily by HindcastSlice and ForecastSlice via _load_config(run_dir) from config_resolved.yaml.
  • Consumed by every stage module, all builder functions, and all model factory methods.

Concepts and pipelines that touch this entity

PRs and commits

  • PR #361 (PR-361.md) — added postprocess.conformalise tuple; ForecastConfig gained a residual_mode placeholder.
  • PR #372 (PR-372.md) — made forecast.residual_mode mandatory; extracted ResidualMode to models/meta_models/types.py to avoid a circular import.

Open questions

  • build_detrender() and build_regressor() factory methods live on ExperimentConfig; a TODO at config.py:719 acknowledges this as a misplacement. A future refactor should move them to the model layer.
  • training_dropna_subset() and _fill_defaults_from_data_root are also noted as candidates for relocation (config.py:772).