MLflow Tracking¶
What it is¶
Every run hindcast, run fit-production, and run forecast invocation opens an
MLflow run. The tracking layer is implemented as a set of helpers in
lib/tracking/ that isolate stage code from direct MLflow API calls. The key
components are:
- tracking_uri_anchored — pins the SQLite database path next to runs/ and features/
- configure_tracking — sets the MLflow tracking URI and experiment name
- prepare_hindcast_mlflow — seeds the RNG, writes run YAMLs, sets up tracking
- hindcast_mlflow_run — context manager that wraps mlflow.start_run with tags, params, and initial artefact uploads
- log_artifact / log_artifacts — S3-aware wrappers that stage cloud paths to a temp directory before calling mlflow.log_artifact*
Design authority: DESIGN.md Clause 7.
Where it lives¶
| Symbol | File | Line |
|---|---|---|
| tracking_uri_anchored | lib/tracking/decorators.py | 43 |
| configure_tracking | lib/tracking/decorators.py | 81 |
| mlflow_fold_report_logger | lib/tracking/decorators.py | 87 |
| prepare_hindcast_mlflow | lib/tracking/decorators.py | 102 |
| hindcast_mlflow_run | lib/tracking/decorators.py | 129 |
| log_artifact | lib/tracking/log.py | 48 |
| log_artifacts | lib/tracking/log.py | 75 |
| data_file_sha256_prefix | lib/tracking/log.py | 96 |
| bounded_hindcast_params | lib/tracking/log.py | 112 |
| log_hindcast_dataset_artifacts | lib/tracking/log.py | 129 |
| capture_git | lib/tracking/log.py | 151 |
| capture_environment | lib/tracking/log.py | 175 |
| seed_everything | lib/tracking/log.py | 204 |
SQLite default backend¶
The default mlflow_tracking_uri in ExperimentConfig is sqlite:///mlruns.db.
tracking_uri_anchored (decorators.py:43) detects that this is a relative SQLite URI
and anchors it at config.data_root (i.e. INPUT_DATA_DIR).
This keeps mlruns.db next to runs/ and features/ rather than resolving against
whichever directory the CLI was invoked from (a historical source of scattered files).
Exception: when data_root is a CloudPath (QA environment with
INPUT_DATA_DIR=s3://...), SQLite cannot live on object storage. The function detects
isinstance(anchor, CloudPath), logs a warning, and leaves the URI cwd-relative. The
warning text tells operators to set an absolute local path
(e.g. sqlite:////tmp/mlruns.db) or an HTTP(S) MLflow tracking server.
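The anchoring rule can be sketched in a few lines. This is a minimal illustration, not the real decorators.py code; anchor_sqlite_uri is a hypothetical name, and the CloudPath case is approximated by "anything that is not a local Path":

```python
from pathlib import Path

def anchor_sqlite_uri(uri: str, data_root) -> str:
    """Re-root a relative sqlite:/// URI at data_root (illustrative sketch)."""
    prefix = "sqlite:///"
    # Absolute URIs (sqlite:////abs/path) and non-SQLite URIs pass through.
    if not uri.startswith(prefix) or uri[len(prefix):].startswith("/"):
        return uri
    # In QA, data_root may be a CloudPath; SQLite cannot live on object
    # storage, so the URI is left cwd-relative (the real helper warns here).
    if not isinstance(data_root, Path):
        return uri
    return prefix + str(data_root / uri[len(prefix):])
```

Under these assumptions, `anchor_sqlite_uri("sqlite:///mlruns.db", Path("/data"))` yields `sqlite:////data/mlruns.db`, while an S3 data_root leaves the URI unchanged.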
To inspect runs locally: uv run mlflow ui --backend-store-uri sqlite:///mlruns.db.
hindcast_mlflow_run — the main context manager¶
decorators.py:129 is a @contextmanager that opens one MLflow run per pipeline
invocation:
```python
from contextlib import contextmanager

import mlflow

@contextmanager
def hindcast_mlflow_run(*, run_name, config, run_root, config_path, git_meta, env_meta):
    tags = {"stage": "hindcast", "run_dir": str(run_root), "config_path": str(config_path)}
    tags.update({k: str(v) for k, v in git_meta.items()})
    tags.update({k: str(v) for k, v in env_meta.items()})
    with mlflow.start_run(run_name=run_name):
        mlflow.set_tags(tags)
        mlflow.log_params(bounded_hindcast_params(config))
        log_artifact(str(run_root / "config_resolved.yaml"))
        log_artifact(str(run_root / "metadata.yaml"))
        yield
```
Tags set on each run:
- stage — always "hindcast"
- run_dir — the absolute path (or S3 URI) of the run directory
- config_path — the YAML config file path used for this invocation
- git_commit, git_short, git_dirty — from capture_git()
- Python version, platform, and key package versions — from capture_environment()
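The git and environment metadata behind those tags can be collected with helpers along these lines. This is a hedged sketch: the field names match the tags listed above, but the actual log.py implementation may collect more or differently:

```python
import platform
import subprocess
import sys

def capture_git() -> dict:
    """Sketch of capture_git: shell out to git (assumes a git checkout)."""
    def git(*args: str) -> str:
        return subprocess.run(
            ["git", *args], capture_output=True, text=True, check=True
        ).stdout.strip()
    commit = git("rev-parse", "HEAD")
    return {
        "git_commit": commit,
        "git_short": commit[:7],
        "git_dirty": str(bool(git("status", "--porcelain"))),
    }

def capture_environment() -> dict:
    """Sketch of capture_environment: interpreter and platform fingerprint."""
    return {
        "python_version": sys.version.split()[0],
        "platform": platform.platform(),
    }
```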
bounded_hindcast_params — the params logged¶
log.py:112 produces the small per-run param dict logged alongside the full YAML
artefact:
```python
{
    "random_seed": ...,
    "data_root": ...,
    "feature_start_year": ...,
    "feature_end_year": ...,
    "experiment_name": ...,
    "experiment_key": ...,
    "detrend": ...,
    "regression": ...,
    "production_cumulative_threshold": ...,
}
```
The full resolved config is always logged as config_resolved.yaml alongside these
summary params (DESIGN.md Clause 12: dual persistence).
Artefact tagging convention¶
Artefacts are uploaded under namespaced artifact_path prefixes:
| Path | Contents | Logged by |
|---|---|---|
| (root) | config_resolved.yaml, metadata.yaml | hindcast_mlflow_run |
| reports/folds/{fold_key}/ | Per-fold detrend/report PNGs | mlflow_fold_report_logger |
| datasets/ | fit.parquet, pred.parquet | log_hindcast_dataset_artifacts |
| datasets/{spec.name}/ | Per-spec reference data files | log_hindcast_dataset_artifacts |
Multiple reference_data specs (e.g. CONAB-final + CONAB-LEV for Brazil soy) are
separated by spec name so they do not collide on artefact path (log.py:139–148).
mlflow_fold_report_logger (decorators.py:87) returns an on_saved callback for
per-fold PNG writes. If no MLflow run is active when on_saved fires, it logs a
warning and skips the upload — this prevents crashes when plots are generated outside
a tracking context.
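The guard can be sketched like this, with the MLflow calls injected so the skip path is easy to see; the real helper presumably calls mlflow.active_run and mlflow.log_artifact directly, and the artifact_path follows the reports/folds/{fold_key}/ convention from the table above:

```python
import logging

log = logging.getLogger(__name__)

def fold_report_logger(fold_key, *, active_run, log_artifact):
    """Return an on_saved callback for per-fold PNGs (illustrative sketch).

    active_run / log_artifact stand in for mlflow.active_run and
    mlflow.log_artifact.
    """
    def on_saved(png_path: str) -> None:
        if active_run() is None:
            # Outside a tracking context: warn and skip rather than crash.
            log.warning("no active MLflow run; skipping upload of %s", png_path)
            return
        log_artifact(png_path, artifact_path=f"reports/folds/{fold_key}")
    return on_saved
```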
S3-aware log_artifact / log_artifacts¶
MLflow's own log_artifact* functions only accept local paths. When run_dir is an
S3 URI, all artefact paths are S3 paths. log.py:48 and log.py:75 handle this by
staging the S3 object or directory to a tempfile.TemporaryDirectory before calling
mlflow.log_artifact*:
```python
with tempfile.TemporaryDirectory() as tmp:
    dest = Path(tmp) / cp.name   # cp: the CloudPath of the S3 artefact
    cp.download_to(dest)         # stage the object locally
    mlflow.log_artifact(str(dest), artifact_path=artifact_path)
```
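The overall wrapper amounts to a dispatch of this shape. This is a sketch with the collaborators injected for illustration; the real log.py works with cloudpathlib types and MLflow directly:

```python
import tempfile
from pathlib import Path

def log_artifact_s3_aware(path, artifact_path=None, *, log_fn, is_cloud, download):
    """Sketch: pass local paths through, stage cloud paths via a temp dir."""
    if not is_cloud(path):
        log_fn(path, artifact_path=artifact_path)
        return
    with tempfile.TemporaryDirectory() as tmp:
        dest = Path(tmp) / str(path).rsplit("/", 1)[-1]
        download(path, dest)      # e.g. CloudPath.download_to
        log_fn(str(dest), artifact_path=artifact_path)
```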
DESIGN.md Clause 7 — key requirements¶
"WHEN an experiment runs, the system SHALL track it with MLflow (mlflow>=3, hard dependency). In create mode, a new MLflow run is started; in resume mode, the existing mlflow_run_id from metadata_<stage>.yaml is used to resume the same MLflow run."
Additional requirements from Clause 7:
- MLflow params are prefixed by stage name (e.g. train/random_seed) to avoid
write-once collisions on resume.
- Training scripts should call mlflow.autolog(log_models=False) — models are not
logged via autolog; callers use mlflow.<flavour>.log_model() directly.
- run forecast --run-dir D resumes the MLflow run identified by the
mlflow_run_id recorded in metadata_<stage>.yaml — no new run per init date.
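Two of these rules can be sketched in a few lines. The helper names are illustrative, and mlflow.start_run is injected as start_run so the sketch stays self-contained:

```python
def prefixed(stage: str, params: dict) -> dict:
    """Stage-prefix param keys (e.g. train/random_seed) so a resumed run
    never collides with write-once params logged by another stage."""
    return {f"{stage}/{key}": value for key, value in params.items()}

def start_stage_run(start_run, *, run_name, existing_run_id=None):
    """Create vs resume: reuse the mlflow_run_id from metadata_<stage>.yaml
    when present, otherwise open a fresh run."""
    if existing_run_id is not None:
        return start_run(run_id=existing_run_id)   # resume the same run
    return start_run(run_name=run_name)            # create a new run
```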
Parallel-run DB-locking issue¶
Concurrent pipeline runs for the same commodity that share the same mlruns.db
file can cause a SQLAlchemy OperationalError due to SQLite's single-writer lock.
SQLite does not support concurrent writes from multiple processes. The symptom is a
failed MLflow write in one of the concurrent processes, which causes that pipeline
invocation to terminate early.
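The failure mode is easy to reproduce with the standard library alone, independent of MLflow. Two connections contend for SQLite's single write lock; the second raises OperationalError ("database is locked") once its busy timeout expires:

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.db")
# isolation_level=None -> autocommit, so BEGIN IMMEDIATE is explicit.
writer_a = sqlite3.connect(path, isolation_level=None, timeout=0.1)
writer_b = sqlite3.connect(path, isolation_level=None, timeout=0.1)
writer_a.execute("CREATE TABLE runs (id INTEGER)")
writer_a.execute("BEGIN IMMEDIATE")        # writer_a takes the write lock
writer_a.execute("INSERT INTO runs VALUES (1)")
locked = False
try:
    writer_b.execute("BEGIN IMMEDIATE")    # writer_b cannot acquire it
except sqlite3.OperationalError:           # "database is locked"
    locked = True
writer_a.execute("COMMIT")
```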
The technical mitigation is straightforward: run same-commodity pipelines sequentially
rather than in parallel. Different-commodity pipelines sharing the same mlruns.db
are less likely to conflict because SQLite uses file-level locking and acquisitions are
brief, but concurrent hindcast runs for the same experiment key should be avoided.
Key invariants¶
- tracking_uri_anchored is the only place that resolves a relative SQLite URI. Stage code never calls mlflow.set_tracking_uri directly.
- hindcast_mlflow_run is a context manager; the yield point is where stage code runs. If the stage raises, MLflow marks the run as FAILED automatically.
- bounded_hindcast_params produces a small, stable dict; the full config is always logged as a YAML artefact separately. This follows Clause 12 (dual persistence).
How it interacts with the pipeline¶
prepare_hindcast_mlflow is called once at the start of run_hindcast.run(). It
seeds the RNG, writes config_resolved.yaml and metadata.yaml, and configures the
MLflow tracking URI and experiment name. hindcast_mlflow_run is then entered and
wraps the entire stage execution. Per-fold detrend plots are uploaded incrementally via
the mlflow_fold_report_logger callback passed down to the fit stage.
Pitfalls¶
- In QA (S3 data_root), the SQLite URI is left cwd-relative and mlruns.db resolves against the process cwd. This can scatter databases if the CLI is invoked from different directories across runs.
- mlflow.autolog(log_models=False) is a recommendation in DESIGN.md, not enforced; if a stage calls mlflow.autolog() without log_models=False, large model artefacts may be uploaded unintentionally.
Related entities and concepts¶
- RunDir — MLflow artefact layout mirrors the run_dir layout on disk
- s3_path_safety — log_artifact stages S3 objects via a temp dir
- input_data_dir_contract — data_root is the SQLite anchor
- DESIGN.md — Clause 7 (tracking), Clause 12 (dual persistence)
Open questions¶
- Resume mode (mlflow_run_id in metadata_<stage>.yaml) is specified in DESIGN.md Clause 7, but the current hindcast_mlflow_run always starts a fresh run; resume appears to be implemented only for the forecast sub-pipeline.
- There is no test asserting that bounded_hindcast_params keys are stable across config schema changes; a field rename could silently break the run comparison view in the MLflow UI.