MLflow Tracking

What it is

Every run hindcast, run fit-production, and run forecast invocation opens an MLflow run. The tracking layer is implemented as a set of helpers in lib/tracking/ that isolate stage code from direct MLflow API calls. The key components are:

  • tracking_uri_anchored — pins the SQLite database path next to runs/ and features/
  • configure_tracking — sets the MLflow tracking URI and experiment name
  • prepare_hindcast_mlflow — seeds RNG, writes run YAMLs, sets up tracking
  • hindcast_mlflow_run — context manager that wraps mlflow.start_run with tags, params, and initial artefact uploads
  • log_artifact / log_artifacts — S3-aware wrappers that stage cloud paths to a temp directory before calling mlflow.log_artifact*

Design authority: DESIGN.md Clause 7.

Where it lives

Symbol                           File                         Line
tracking_uri_anchored            lib/tracking/decorators.py     43
configure_tracking               lib/tracking/decorators.py     81
mlflow_fold_report_logger        lib/tracking/decorators.py     87
prepare_hindcast_mlflow          lib/tracking/decorators.py    102
hindcast_mlflow_run              lib/tracking/decorators.py    129
log_artifact                     lib/tracking/log.py            48
log_artifacts                    lib/tracking/log.py            75
data_file_sha256_prefix          lib/tracking/log.py            96
bounded_hindcast_params          lib/tracking/log.py           112
log_hindcast_dataset_artifacts   lib/tracking/log.py           129
capture_git                      lib/tracking/log.py           151
capture_environment              lib/tracking/log.py           175
seed_everything                  lib/tracking/log.py           204

SQLite default backend

The default mlflow_tracking_uri in ExperimentConfig is sqlite:///mlruns.db. tracking_uri_anchored (decorators.py:43) detects that this is a relative SQLite URI and anchors it at config.data_root (i.e. INPUT_DATA_DIR):

# inside tracking_uri_anchored: `p` is the path parsed out of the
# sqlite:/// URI, `anchor` is config.data_root
if not p.is_absolute():
    resolved = (anchor / p).resolve()
    return f"sqlite:///{resolved.as_posix()}"

This keeps mlruns.db next to runs/ and features/ rather than resolving against whichever directory the CLI was invoked from (a historical source of scattered files).

Exception: when data_root is a CloudPath (QA environment with INPUT_DATA_DIR=s3://...), SQLite cannot live on object storage. The function detects isinstance(anchor, CloudPath), logs a warning, and leaves the URI cwd-relative. The warning text tells operators to set an absolute local path (e.g. sqlite:////tmp/mlruns.db) or an HTTP(S) MLflow tracking server.
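The two branches above can be sketched as a small pure function. This is a minimal illustration, not the real helper: `anchor_sqlite_uri` and the `anchor_is_cloud` flag are stand-ins (the real code checks `isinstance(anchor, CloudPath)` and logs a warning), and `.resolve()` is omitted so the example is deterministic.

```python
from pathlib import Path, PurePosixPath

def anchor_sqlite_uri(uri: str, anchor: Path, anchor_is_cloud: bool = False) -> str:
    """Sketch of the anchoring rule: a relative sqlite:/// URI is pinned
    under `anchor`; absolute URIs, non-SQLite URIs, and cloud anchors
    pass through unchanged."""
    prefix = "sqlite:///"
    if not uri.startswith(prefix):
        return uri  # e.g. an http(s):// tracking server URI
    p = PurePosixPath(uri[len(prefix):])
    if p.is_absolute() or anchor_is_cloud:
        return uri  # already absolute, or SQLite cannot live on object storage
    return f"{prefix}{(anchor / p).as_posix()}"
```

For example, `anchor_sqlite_uri("sqlite:///mlruns.db", Path("/data/root"))` yields a URI rooted under the data root rather than the process cwd.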

To inspect runs locally: uv run mlflow ui --backend-store-uri sqlite:///mlruns.db.

hindcast_mlflow_run — the main context manager

decorators.py:129 is a @contextmanager that opens one MLflow run per pipeline invocation:

@contextmanager
def hindcast_mlflow_run(*, run_name, config, run_root, config_path, git_meta, env_meta):
    tags = {"stage": "hindcast", "run_dir": str(run_root), "config_path": str(config_path)}
    tags.update({k: str(v) for k, v in git_meta.items()})
    tags.update({k: str(v) for k, v in env_meta.items()})
    with mlflow.start_run(run_name=run_name):
        mlflow.set_tags(tags)
        mlflow.log_params(bounded_hindcast_params(config))
        log_artifact(str(run_root / "config_resolved.yaml"))
        log_artifact(str(run_root / "metadata.yaml"))
        yield

Tags set on each run:

  • stage — always "hindcast"
  • run_dir — the absolute path (or S3 URI) of the run directory
  • config_path — the YAML config file path used for this invocation
  • git_commit, git_short, git_dirty — from capture_git()
  • Python version, platform, and key package versions — from capture_environment()

bounded_hindcast_params — the params logged

log.py:112 produces the small per-run param dict logged alongside the full YAML artefact:

{
    "random_seed": ...,
    "data_root": ...,
    "feature_start_year": ...,
    "feature_end_year": ...,
    "experiment_name": ...,
    "experiment_key": ...,
    "detrend": ...,
    "regression": ...,
    "production_cumulative_threshold": ...,
}

The full resolved config is always logged as config_resolved.yaml alongside these summary params (DESIGN.md Clause 12: dual persistence).
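A minimal sketch of how such a bounded param dict can be built. The `ExperimentConfig` dataclass here is an illustrative stand-in for the real config class, and the field tuple mirrors the keys listed above; MLflow stores params as strings, so everything is stringified.

```python
from dataclasses import dataclass

@dataclass
class ExperimentConfig:  # minimal stand-in for the real config class
    random_seed: int
    data_root: str
    feature_start_year: int
    feature_end_year: int
    experiment_name: str
    experiment_key: str
    detrend: str
    regression: str
    production_cumulative_threshold: float

def bounded_hindcast_params(config: ExperimentConfig) -> dict:
    # Keep the dict small and stable so the run-comparison view in the
    # MLflow UI stays useful across experiments; the full config goes to
    # config_resolved.yaml as an artefact.
    fields = (
        "random_seed", "data_root", "feature_start_year", "feature_end_year",
        "experiment_name", "experiment_key", "detrend", "regression",
        "production_cumulative_threshold",
    )
    return {name: str(getattr(config, name)) for name in fields}
```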

Artefact tagging convention

Artefacts are uploaded under namespaced artifact_path prefixes:

Path                        Contents                              Logged by
(root)                      config_resolved.yaml, metadata.yaml   hindcast_mlflow_run
reports/folds/{fold_key}/   Per-fold detrend/report PNGs          mlflow_fold_report_logger
datasets/                   fit.parquet, pred.parquet             log_hindcast_dataset_artifacts
datasets/{spec.name}/       Per-spec reference data files         log_hindcast_dataset_artifacts

Multiple reference_data specs (e.g. CONAB-final + CONAB-LEV for Brazil soy) are separated by spec name so they do not collide on artefact path (log.py:139–148).
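The namespacing rule amounts to a one-line path builder; `dataset_artifact_path` is a hypothetical name used for illustration, not the actual helper in log.py.

```python
from typing import Optional

def dataset_artifact_path(spec_name: Optional[str] = None) -> str:
    """Reference-data files for each spec land under datasets/<spec name>/
    so multiple specs (e.g. CONAB-final vs CONAB-LEV) never collide;
    the shared fit/pred parquet files go to datasets/ directly."""
    return f"datasets/{spec_name}" if spec_name else "datasets"
```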

mlflow_fold_report_logger (decorators.py:87) returns an on_saved callback for per-fold PNG writes. If no MLflow run is active when on_saved fires, it logs a warning and skips the upload — this prevents crashes when plots are generated outside a tracking context.
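The warn-and-skip behaviour can be sketched as a callback factory. This is an illustrative reconstruction, not the real code: the active-run check and the upload call are injected so the sketch runs without MLflow (in the real helper they would be `mlflow.active_run() is not None` and an `mlflow.log_artifact` wrapper).

```python
import logging
from pathlib import Path
from typing import Callable

logger = logging.getLogger(__name__)

def fold_report_logger(is_run_active: Callable[[], bool],
                       upload: Callable[[str, str], None]) -> Callable[[Path, str], None]:
    """Return an on_saved callback for per-fold PNG writes."""
    def on_saved(png_path: Path, fold_key: str) -> None:
        if not is_run_active():
            # no tracking context (e.g. plots generated ad hoc): warn, don't crash
            logger.warning("No active MLflow run; skipping upload of %s", png_path)
            return
        upload(str(png_path), f"reports/folds/{fold_key}")
    return on_saved
```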

S3-aware log_artifact / log_artifacts

MLflow's own log_artifact* functions only accept local paths. When run_dir is an S3 URI, all artefact paths are S3 paths. log.py:48 and log.py:75 handle this by staging the S3 object or directory to a tempfile.TemporaryDirectory before calling mlflow.log_artifact*:

with tempfile.TemporaryDirectory() as tmp:
    dest = Path(tmp) / cp.name          # cp is the CloudPath to the artefact
    cp.download_to(dest)                # stage the S3 object locally
    mlflow.log_artifact(str(dest), artifact_path=artifact_path)

DESIGN.md Clause 7 — key requirements

"WHEN an experiment runs, the system SHALL track it with MLflow (mlflow>=3, hard dependency). In create mode, a new MLflow run is started; in resume mode, the existing mlflow_run_id from metadata_<stage>.yaml is used to resume the same MLflow run."

Additional requirements from Clause 7:

  • MLflow params are prefixed by stage name (e.g. train/random_seed) to avoid write-once collisions on resume.
  • Training scripts should call mlflow.autolog(log_models=False) — models are not logged via autolog; callers use mlflow.<flavour>.log_model() directly.
  • run forecast --run-dir D resumes the MLflow run identified by the mlflow_run_id recorded in metadata_<stage>.yaml — no new run per init date.
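The stage-prefixing rule exists because MLflow params are write-once per key: resuming a run from a later stage must never re-log a key the earlier stage already wrote. A minimal sketch (helper name is illustrative):

```python
def prefix_params(stage: str, params: dict) -> dict:
    """Namespace param keys by stage, e.g. random_seed -> train/random_seed,
    so each stage writes a disjoint key set on the shared MLflow run."""
    return {f"{stage}/{k}": v for k, v in params.items()}
```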

Parallel-run DB-locking issue

Concurrent pipeline runs for the same commodity that share one mlruns.db file can fail with a SQLAlchemy OperationalError: SQLite holds a single-writer lock, so it cannot accept concurrent writes from multiple processes. The symptom is a failed MLflow write in one of the processes, which terminates that pipeline invocation early.

The technical mitigation is straightforward: run same-commodity pipelines sequentially rather than in parallel. Different-commodity pipelines sharing the same mlruns.db are less likely to conflict because SQLite uses file-level locking and acquisitions are brief, but concurrent hindcast runs for the same experiment key should be avoided.

Key invariants

  • tracking_uri_anchored is the only place that resolves a relative SQLite URI. Stage code never calls mlflow.set_tracking_uri directly.
  • hindcast_mlflow_run is a context manager; the yield point is where stage code runs. If the stage raises, MLflow marks the run as FAILED automatically.
  • bounded_hindcast_params produces a small, stable dict; the full config is always logged as a YAML artefact separately. This follows Clause 12 (dual persistence).
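The second invariant (the stage runs at the yield point; a raised exception marks the run FAILED) can be demonstrated with a toy analogue of hindcast_mlflow_run. The `status` dict stands in for MLflow's run state; the real tags, params, and artefact uploads are elided.

```python
from contextlib import contextmanager

@contextmanager
def hindcast_run(status: dict):
    """Toy analogue of hindcast_mlflow_run: the stage body executes at the
    yield point; mlflow.start_run ends the run FAILED if the body raises,
    FINISHED otherwise -- mirrored here with a status dict."""
    try:
        yield
    except Exception:
        status["state"] = "FAILED"
        raise
    else:
        status["state"] = "FINISHED"
```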

How it interacts with the pipeline

prepare_hindcast_mlflow is called once at the start of run_hindcast.run(). It seeds the RNG, writes config_resolved.yaml and metadata.yaml, and configures the MLflow tracking URI and experiment name. hindcast_mlflow_run is then entered and wraps the entire stage execution. Per-fold detrend plots are uploaded incrementally via the mlflow_fold_report_logger callback passed down to the fit stage.

Pitfalls

  • In QA (S3 data_root), the SQLite URI is left cwd-relative and mlruns.db resolves against the process cwd. This can scatter databases if the CLI is invoked from different directories across runs.
  • mlflow.autolog(log_models=False) is a recommendation in DESIGN.md, not enforced; if a stage calls mlflow.autolog() without log_models=False, large model artefacts may be uploaded unintentionally.

Open questions

  • Resume mode (mlflow_run_id in metadata_<stage>.yaml) is specified in DESIGN.md Clause 7 but the current hindcast_mlflow_run always starts a fresh run; resume appears to be implemented only for the forecast sub-pipeline.
  • There is no test asserting that bounded_hindcast_params keys are stable across config schema changes; a field rename could silently break the run comparison view in the MLflow UI.