MLflow SQLite DB lock / corruption recovery

Scope: commodity_hindcast only. Every commodity YAML in
market_insights_models/src/commodity_hindcast/configs/ ships with
mlflow_tracking_uri: "sqlite:///mlruns.db" (corn, cotton, soybeans_usa,
soybeans_bra, wheat). The five configs share a single backing SQLite file,
which is anchored under data_root (INPUT_DATA_DIR, the repo root)
by tracking_uri_anchored() in
market_insights_models/src/commodity_hindcast/lib/tracking/decorators.py.
That shared file is the asset this runbook protects.
1. When to use
Use this runbook if any of the following symptoms appear:
- A pipeline run aborts with `sqlalchemy.exc.OperationalError` originating from an `mlflow.start_run`, `mlflow.log_*`, or `mlflow.set_tags` call.
- The error message contains `database is locked` or `database disk image is malformed`.
- The MLflow UI (started via `mlflow ui --backend-store-uri sqlite:///mlruns.db`) returns HTTP 500 or fails to render the experiment list.
- Two pipelines for the same commodity were started in parallel and both failed. This is the documented failure mode in the project memory: "Running two pipelines for the same commodity in parallel can cause SQLAlchemy OperationalError due to concurrent MLflow writes" — see MEMORY.md ("Known Issues" section).
- `mlruns.db` is unexpectedly missing or zero bytes at the repo root.
If you are not sure which branch applies, run the read-only `PRAGMA integrity_check` one-liner sketched just below first; it decides between Branch A and Branch B.
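A minimal triage sketch; `PRAGMA integrity_check` only reads pages and writes nothing:

```bash
# Prints the single literal row "ok" on a healthy database;
# any other output means page-level damage, i.e. Branch B.
sqlite3 mlruns.db "PRAGMA integrity_check;"
```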
2. Preconditions
Before starting:
- You are on the EC2 host with the repository checked out at /data/processing/github/treefera-market-insights.
- You have read and write access to `mlruns.db` and to the repo root (the `cp` and recovery commands write next to the live DB).
- You can list and stop the running CLI processes — `ps`, `pgrep`, and `kill` are sufficient; no sudo is required for processes you started.
- `sqlite3` is on `PATH` (standard on Ubuntu; check with `which sqlite3`).
- You know the path to the most recent backup. The convention is `mlruns.db.bak-YYYYMMDD-HHMMSS` next to the live DB; the worktree currently contains `mlruns.db.bak-20260506-091937` as a reference example.
- `INPUT_DATA_DIR` is set to the repo root (per project memory: "commodity_hindcast: /data/processing/github/treefera-market-insights"). Without it the smoke test in section 4 will fail loudly via `require_input_data_dir()`.
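A minimal sketch for checking the tooling preconditions in one pass, assuming the shell is already at the repo root:

```bash
which sqlite3                                        # must resolve; standard on Ubuntu
echo "${INPUT_DATA_DIR:?INPUT_DATA_DIR is not set}"  # fails loudly when unset
ls -l mlruns.db mlruns.db.bak-*                      # live DB plus timestamped backups
```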
3. Procedure
The three branches are mutually exclusive: pick one based on the symptom and the integrity-check result.
Branch A — transient lock (most common)
Symptom: `OperationalError: database is locked`. The DB itself is fine;
a writer is holding the file lock.
- Identify processes holding the DB open. `fuser` lists every PID with an open handle on the file (see `man fuser`); as a fallback, `lsof` gives the same information with command names. Cross-check against active CLI invocations with `pgrep`. All three lookups are sketched after this list.
- Decide whether to wait or kill:
  - If exactly one writer is running and it is making progress (CPU > 0, log file growing), wait for it. Lock contention from a single writer normally clears within seconds.
  - If two or more writers are running for the same commodity, this is the parallel-pipeline failure mode from MEMORY.md. Stop all but one; let the survivor finish.
  - If a writer is stuck (no log progress for several minutes, no CPU), terminate it with `kill <PID>` (SIGTERM). Escalate to `kill -9 <PID>` only if SIGTERM does not clear the handle within ~30 seconds.
- Verify the database is intact before resuming work. `PRAGMA integrity_check` is the SQLite-blessed way to confirm there is no page-level damage; it returns the literal string `ok` on a healthy DB (https://www.sqlite.org/pragma.html#pragma_integrity_check). Run the one-liner from section 1. If the result is anything other than `ok`, switch to Branch B.
- Resume work serialised. Per MEMORY.md, do not start a second pipeline for the same commodity until the first has exited. Different commodities in parallel are documented as risky on a shared SQLite file (all five configs point at the same `mlruns.db`); prefer serialising any same-host MLflow writes.
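A sketch of the three process lookups from the first step; the `commodity_hindcast` pgrep pattern is an assumption based on the CLI module path and may need adjusting to match how the pipelines were actually launched:

```bash
fuser -v mlruns.db               # every PID with an open handle on the file
lsof mlruns.db                   # same handles, with command names
pgrep -af commodity_hindcast     # active CLI invocations (pattern is an assumption)
```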
Branch B — corruption
Symptom: `PRAGMA integrity_check` returns errors, or MLflow logs
`database disk image is malformed`, or the MLflow UI crashes on load.
- Stop every pipeline writing to MLflow before touching the file, then confirm with `fuser -v mlruns.db` that no writer remains.
- Preserve the broken file for forensics — do not overwrite it.
- Pick the most recent good backup. List candidates by mtime (newest last) and confirm with the on-call owner before restoring. The current example in the worktree is `mlruns.db.bak-20260506-091937`.
- Restore from that backup.
- Verify the restored file with `PRAGMA integrity_check`.
- Re-run any pipeline runs whose results were lost between the backup timestamp and the corruption event. Use the run YAMLs and `metadata.yaml` artefacts on disk under `runs/` to identify what was in flight at the time (these are written by `prepare_hindcast_mlflow` in `lib/tracking/decorators.py` regardless of MLflow state).
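A minimal sketch of the stop/preserve/restore/verify sequence, assuming the shell is at the repo root; the `.corrupt-` suffix for the forensic copy is a naming assumption rather than an existing convention:

```bash
# Stop remaining writers (PIDs from the Branch A lookups), then confirm none remain
kill <PID>
fuser -v mlruns.db

# Preserve the broken file for forensics under a timestamped name
cp -a mlruns.db "mlruns.db.corrupt-$(date -u +%Y%m%d-%H%M%S)"

# List backup candidates by mtime, newest last
ls -ltr mlruns.db.bak-*

# Restore the chosen backup over the live path (example name from the worktree)
cp -a mlruns.db.bak-20260506-091937 mlruns.db

# Verify: must print exactly "ok"
sqlite3 mlruns.db "PRAGMA integrity_check;"
```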
Branch C — DB missing
Symptom: `mlruns.db` is gone or zero bytes; MLflow raises
`OperationalError: unable to open database file` or logs
`no such table: experiments`.
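A read-only check that confirms this branch applies:

```bash
# A missing file or a zero-byte size both select Branch C
ls -l mlruns.db 2>/dev/null || echo "mlruns.db missing"
```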
[PLACEHOLDER: confirm whether MLflow auto-creates a fresh SQLite store
on the next mlflow.set_tracking_uri + run-start when the file is
absent. If it does, the operational choice is to accept loss of run
history for that experiment and start fresh. If it does not, the only
option is to copy the nearest backup as in Branch B step 4.]
Until the placeholder is resolved, default to Branch B step 4: restore
from the latest mlruns.db.bak-* rather than relying on auto-create.
Document the gap in run history in the run-log channel
[PLACEHOLDER: slack_channel].
4. Verification
Run all three checks; all must pass before declaring the system recovered.
- Schema is intact. Expect MLflow's standard tables — at minimum `experiments`, `runs`, `metrics`, `params`, `tags`, `latest_metrics`, and the `alembic_version` row (a listing command is sketched after this list).
- Smoke-test a real run end-to-end against a small commodity config. From the repo root:

  ```bash
  export INPUT_DATA_DIR=/data/processing/github/treefera-market-insights
  uv run python -m market_insights_models.src.commodity_hindcast.cli \
    run hindcast --config market_insights_models/src/commodity_hindcast/configs/corn_usa.yaml
  ```

  The run must complete without raising `OperationalError` and must produce a new row in `runs` (check via `sqlite3 mlruns.db "SELECT COUNT(*) FROM runs;"` before and after).
- UI accessible. Browse to http://127.0.0.1:5000, confirm the experiment list renders, and open one run (the launch command is sketched after this list).
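Sketches for the remaining two checks; `.tables` is the sqlite3 shell's built-in listing command, and the UI invocation is the one already quoted in section 1:

```bash
# Schema check: expect experiments, runs, metrics, params, tags,
# latest_metrics, and alembic_version among the listed tables
sqlite3 mlruns.db ".tables"

# UI check: serves on http://127.0.0.1:5000 by default
mlflow ui --backend-store-uri sqlite:///mlruns.db
```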
5. Prevention
- Never run two pipelines for the same commodity in parallel. This is the single documented MLflow failure mode for the project — see MEMORY.md "Known Issues". Serialise same-commodity invocations; one way to enforce this is sketched after this list.
- All five commodity YAMLs ship `mlflow_tracking_uri: "sqlite:///mlruns.db"` (configs/corn_usa.yaml:9, configs/cotton_usa.yaml:10, configs/soybeans_usa.yaml:10, configs/soybeans_bra.yaml:20, configs/wheat_usa.yaml:11). They resolve to the same file because `tracking_uri_anchored()` anchors the relative URI under `data_root` (lib/tracking/decorators.py:43-78). Cross-commodity parallelism is therefore also writing into one SQLite file — treat it as risky and serialise where practical.
- If a future migration to a real tracking server is approved, swap the YAML field to an HTTP(S) URI; `tracking_uri_anchored` passes those through unchanged (lib/tracking/decorators.py:59-60). That removes the SQLite single-writer bottleneck entirely.
- Keep `INPUT_DATA_DIR` set to the repo root for every shell that invokes the CLI; otherwise the DB ends up scattered per-cwd (see the warning at lib/tracking/decorators.py:65-75).
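One way to enforce the serialisation advice is a host-local lock around every CLI invocation. This flock wrapper is a suggestion, not an existing project convention, and the lock-file path is arbitrary:

```bash
# util-linux flock: a second invocation blocks until the first releases the lock,
# so at most one pipeline writes to mlruns.db at a time on this host
flock /tmp/mlruns.db.lock \
  uv run python -m market_insights_models.src.commodity_hindcast.cli \
    run hindcast --config market_insights_models/src/commodity_hindcast/configs/corn_usa.yaml
```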
6. Backup policy
[PLACEHOLDER: document the backup cadence (suggested: hourly during business hours, daily otherwise), retention window (suggested: 30 days of timestamped copies plus monthly archives), the cron job or systemd timer that produces the backups, and the on-call owner who is responsible for verifying it.]
The naming convention to preserve is the one already in use in the
repository: mlruns.db.bak-YYYYMMDD-HHMMSS, UTC. The reference example
present in the worktree is mlruns.db.bak-20260506-091937. Any
automation that produces backups must:
- Copy with `cp -a` (preserve timestamps) or use sqlite3 `.backup` for an online-safe copy (https://www.sqlite.org/cli.html#special_commands_to_sqlite3_dot_commands_). `.backup` is the supported online-backup command and is safe to run while writers are connected; a plain `cp` of a file with active writers can capture a torn page.
- Write the backup next to `mlruns.db` at the repo root so this runbook finds it via `ls mlruns.db.bak-*`.
- Log the backup timestamp to [PLACEHOLDER: slack_channel] so on-call knows the freshness window during an incident.
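A minimal sketch of an online-safe backup that follows the naming convention (UTC timestamp); cadence and scheduling stay with the placeholder above:

```bash
# .backup takes a consistent snapshot even while writers are connected
ts=$(date -u +%Y%m%d-%H%M%S)
sqlite3 mlruns.db ".backup mlruns.db.bak-${ts}"
```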