
MLflow SQLite DB lock / corruption recovery

Failure-mode flowchart — MLflow DB recovery

Scope: commodity_hindcast only. Every commodity YAML in market_insights_models/src/commodity_hindcast/configs/ ships with mlflow_tracking_uri: "sqlite:///mlruns.db" (corn_usa, cotton_usa, soybeans_usa, soybeans_bra, wheat_usa). The five configs share a single backing SQLite file, which is anchored under data_root (INPUT_DATA_DIR, the repo root) by tracking_uri_anchored() in market_insights_models/src/commodity_hindcast/lib/tracking/decorators.py. That shared file is the asset this runbook protects.
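
A quick, read-only way to confirm where that shared file resolves on a given shell; this assumes INPUT_DATA_DIR is already exported (see section 2) and is a sanity check rather than a project script:

    echo "expected shared DB: ${INPUT_DATA_DIR}/mlruns.db"
    ls -l "${INPUT_DATA_DIR}/mlruns.db"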

1. When to use

Use this runbook if any of the following symptoms appear:

  • A pipeline run aborts with sqlalchemy.exc.OperationalError originating from an mlflow.start_run, mlflow.log_*, or mlflow.set_tags call.
  • The error message contains database is locked or database disk image is malformed.
  • The MLflow UI (started via mlflow ui --backend-store-uri sqlite:///mlruns.db) returns HTTP 500 or fails to render the experiment list.
  • Two pipelines for the same commodity were started in parallel and both failed. This is the documented failure mode in the project memory: "Running two pipelines for the same commodity in parallel can cause SQLAlchemy OperationalError due to concurrent MLflow writes" — see MEMORY.md ("Known Issues" section).
  • mlruns.db is unexpectedly missing or zero bytes at the repo root.

If you are not sure which branch applies, run the read-only PRAGMA integrity_check shown in Branch A first; its result decides between Branch A and Branch B.

2. Preconditions

Before starting:

  • You are on the EC2 host with the repository checked out at /data/processing/github/treefera-market-insights.
  • You have read and write access to mlruns.db and to the repo root (the cp and recovery commands write next to the live DB).
  • You can list and stop the running CLI processes — ps, pgrep, and kill are sufficient; no sudo is required for processes you started.
  • sqlite3 is on PATH (standard on Ubuntu; check with which sqlite3).
  • You know the path to the most recent backup. The convention is mlruns.db.bak-YYYYMMDD-HHMMSS next to the live DB; the worktree currently contains mlruns.db.bak-20260506-091937 as a reference example.
  • INPUT_DATA_DIR is set to the repo root (per project memory: "commodity_hindcast: /data/processing/github/treefera-market-insights"). Without it the smoke test in section 4 will fail loudly via require_input_data_dir().
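
A combined pre-flight check covering the points above; this is a convenience sketch using the paths and conventions from this runbook, not an existing project script:

    which sqlite3                                      # must resolve
    echo "INPUT_DATA_DIR=${INPUT_DATA_DIR:?not set}"   # fails loudly if the variable is missing
    ls -lt /data/processing/github/treefera-market-insights/mlruns.db.bak-* | head -n 3   # newest backups first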

3. Procedure

The three branches are mutually exclusive: pick one based on the symptom and the integrity-check result.

Branch A — transient lock (most common)

Symptom: OperationalError: database is locked. The DB itself is fine; a writer is holding the file lock.

  1. Identify processes holding the DB open. fuser lists every PID with an open handle on the file (see man fuser):
    fuser -v /data/processing/github/treefera-market-insights/mlruns.db
    
    As a fallback, lsof gives the same information with command names:
    lsof /data/processing/github/treefera-market-insights/mlruns.db
    
  2. Cross-check against active CLI invocations:
    pgrep -af 'commodity_hindcast|mlflow'
    
  3. Decide whether to wait or kill:
     • If exactly one writer is running and it is making progress (CPU > 0, log file growing), wait for it; lock contention from a single writer normally clears within seconds. A poll-loop sketch follows this list.
     • If two or more writers are running for the same commodity, this is the parallel-pipeline failure mode from MEMORY.md. Stop all but one; let the survivor finish.
     • If a writer is stuck (no log progress for several minutes, no CPU), terminate it with kill <PID> (SIGTERM). Escalate to kill -9 <PID> only if SIGTERM does not clear the handle within ~30 seconds.
  4. Verify the database is intact before resuming work. PRAGMA integrity_check is the SQLite-blessed way to confirm there is no page-level damage; it returns the literal string ok on a healthy DB (https://www.sqlite.org/pragma.html#pragma_integrity_check):
    sqlite3 /data/processing/github/treefera-market-insights/mlruns.db \
      "PRAGMA integrity_check;"
    
    If the result is anything other than ok, switch to Branch B.
  5. Resume work serialised. Per MEMORY.md, do not start a second pipeline for the same commodity until the first has exited. Different commodities in parallel also write into the same SQLite file (all five configs point at the same mlruns.db), so prefer serialising any same-host MLflow writes.
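
If the decision in step 3 is to wait, the sketch below polls until no process holds the file open and then re-runs the integrity check. It is not an existing project script, and the 5-second interval is arbitrary:

    DB=/data/processing/github/treefera-market-insights/mlruns.db
    while fuser "$DB" >/dev/null 2>&1; do      # loop ends once no PID holds the file open
      sleep 5
    done
    sqlite3 "$DB" "PRAGMA integrity_check;"    # expect: ok before resuming work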

Branch B — corruption

Symptom: PRAGMA integrity_check returns errors, or MLflow logs database disk image is malformed, or the MLflow UI crashes on load.

  1. Stop every pipeline writing to MLflow before touching the file:
    pgrep -af 'commodity_hindcast|mlflow' | awk '{print $1}' | xargs -r kill
    
    Confirm with fuser -v mlruns.db that no writer remains.
  2. Preserve the broken file for forensics — do not overwrite it:
    cd /data/processing/github/treefera-market-insights
    cp -a mlruns.db "mlruns.db.broken-$(date -u +%Y%m%d-%H%M%S)"
    
  3. Pick the most recent good backup. List candidates by mtime (newest first) and confirm with the on-call owner before restoring:
    ls -lt mlruns.db.bak-* | head
    
    The current example in the worktree is mlruns.db.bak-20260506-091937.
  4. Restore:
    cp -a mlruns.db.bak-YYYYMMDD-HHMMSS mlruns.db
    
  5. Verify the restored file:
    sqlite3 mlruns.db "PRAGMA integrity_check;"   # expect: ok
    sqlite3 mlruns.db ".tables"                    # expect MLflow schema
    
  6. Re-run any pipelines whose results were lost between the backup timestamp and the corruption event. Use the run YAMLs and metadata.yaml artefacts on disk under runs/ to identify what was in flight at the time (these are written by prepare_hindcast_mlflow in lib/tracking/decorators.py regardless of MLflow state).
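
For step 6, a small triage helper can use the restored backup's own mtime as the cut-off; this is a sketch, and the .bak name should be replaced with the one actually restored:

    cd /data/processing/github/treefera-market-insights
    find runs/ -name metadata.yaml -newer mlruns.db.bak-20260506-091937 -print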

Branch C — DB missing

Symptom: mlruns.db is gone or zero bytes; MLflow raises OperationalError: unable to open database file or logs no such table: experiments.

[PLACEHOLDER: confirm whether MLflow auto-creates a fresh SQLite store on the next mlflow.set_tracking_uri + run-start when the file is absent. If it does, the operational choice is to accept loss of run history for that experiment and start fresh. If it does not, the only option is to copy the nearest backup as in Branch B step 4.]

Until the placeholder is resolved, default to Branch B step 4: restore from the latest mlruns.db.bak-* rather than relying on auto-create. Document the gap in run history in the run-log channel [PLACEHOLDER: slack_channel].
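
That default can be scripted as below; it follows Branch B step 4, treats a zero-byte file the same as a missing one, and picks the newest backup by mtime (a sketch, not an existing script):

    cd /data/processing/github/treefera-market-insights
    if [ ! -s mlruns.db ]; then                       # true when the file is missing or zero bytes
      latest="$(ls -t mlruns.db.bak-* | head -n 1)"   # newest backup by mtime
      echo "restoring mlruns.db from ${latest}"
      cp -a "${latest}" mlruns.db
    fi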

4. Verification

Run all three checks; all must pass before declaring the system recovered.

  1. Schema is intact:
    sqlite3 /data/processing/github/treefera-market-insights/mlruns.db \
      ".tables"
    
    Expect MLflow's standard tables — at minimum experiments, runs, metrics, params, tags, latest_metrics, and the alembic_version table.
  2. Smoke-test a real run end-to-end against a small commodity config. From the repo root:
    export INPUT_DATA_DIR=/data/processing/github/treefera-market-insights
    uv run python -m market_insights_models.src.commodity_hindcast.cli \
      run hindcast --config market_insights_models/src/commodity_hindcast/configs/corn_usa.yaml
    
    The run must complete without raising OperationalError and must produce a new row in runs (check via sqlite3 mlruns.db "SELECT COUNT(*) FROM runs;" before and after). A scripted version of that before-and-after comparison follows this list.
  3. UI accessible:
    mlflow ui --backend-store-uri sqlite:///mlruns.db --host 127.0.0.1 --port 5000
    
    Browse to http://127.0.0.1:5000, confirm the experiment list renders, and open one run.
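
The before-and-after row count from check 2 can be scripted with the commands already shown above; this is a convenience sketch only:

    cd /data/processing/github/treefera-market-insights
    before="$(sqlite3 mlruns.db 'SELECT COUNT(*) FROM runs;')"
    uv run python -m market_insights_models.src.commodity_hindcast.cli \
      run hindcast --config market_insights_models/src/commodity_hindcast/configs/corn_usa.yaml
    after="$(sqlite3 mlruns.db 'SELECT COUNT(*) FROM runs;')"
    echo "runs rows: ${before} -> ${after}"    # expect after to be greater than before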

5. Prevention

  • Never run two pipelines for the same commodity in parallel. This is the single documented MLflow failure mode for the project — see MEMORY.md "Known Issues". Serialise same-commodity invocations; a flock-based wrapper sketch follows this list.
  • All five commodity YAMLs ship mlflow_tracking_uri: "sqlite:///mlruns.db" (configs/corn_usa.yaml:9, configs/cotton_usa.yaml:10, configs/soybeans_usa.yaml:10, configs/soybeans_bra.yaml:20, configs/wheat_usa.yaml:11). They resolve to the same file because tracking_uri_anchored() anchors the relative URI under data_root (lib/tracking/decorators.py:43-78). Cross-commodity parallelism is therefore also writing into one SQLite file — treat it as risky and serialise where practical.
  • If a future migration to a real tracking server is approved, swap the YAML field to an HTTP(S) URI; tracking_uri_anchored passes those through unchanged (lib/tracking/decorators.py:59-60). That removes the SQLite single-writer bottleneck entirely.
  • Keep INPUT_DATA_DIR set to the repo root for every shell that invokes the CLI; otherwise the DB ends up scattered per-cwd (see the warning at lib/tracking/decorators.py:65-75).
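
One low-effort way to enforce that serialisation on a single host is to wrap every CLI invocation in flock against a sidecar lock file. This is a sketch, not an adopted convention, and mlruns.db.lock is a hypothetical name:

    # flock blocks until the lock is free, so concurrent invocations queue instead of colliding
    flock /data/processing/github/treefera-market-insights/mlruns.db.lock \
      uv run python -m market_insights_models.src.commodity_hindcast.cli \
        run hindcast --config market_insights_models/src/commodity_hindcast/configs/corn_usa.yaml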

6. Backup policy

[PLACEHOLDER: document the backup cadence (suggested: hourly during business hours, daily otherwise), retention window (suggested: 30 days of timestamped copies plus monthly archives), the cron job or systemd timer that produces the backups, and the on-call owner who is responsible for verifying it.]

The naming convention to preserve is the one already in use in the repository: mlruns.db.bak-YYYYMMDD-HHMMSS, UTC. The reference example present in the worktree is mlruns.db.bak-20260506-091937. Any automation that produces backups must:

  • Copy with cp -a (preserves timestamps; only safe once writers have stopped) or use sqlite3 .backup for an online-safe copy (https://www.sqlite.org/cli.html#special_commands_to_sqlite3_dot_commands_):
    sqlite3 mlruns.db ".backup 'mlruns.db.bak-$(date -u +%Y%m%d-%H%M%S)'"
    
    .backup is the supported online-backup command and is safe to run while writers are connected; a plain cp of a file with active writers can capture a torn page.
  • Write the backup next to mlruns.db at the repo root so this runbook finds it via ls mlruns.db.bak-*.
  • Log the backup timestamp to [PLACEHOLDER: slack_channel] so on-call knows the freshness window during an incident.
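
A sketch of what such automation might look like, built only from the conventions above; the cadence, scheduling mechanism, and owner remain placeholders, and the 30-day retention is the suggested value rather than agreed policy:

    #!/usr/bin/env bash
    set -euo pipefail
    cd /data/processing/github/treefera-market-insights
    stamp="$(date -u +%Y%m%d-%H%M%S)"
    sqlite3 mlruns.db ".backup 'mlruns.db.bak-${stamp}'"                        # online-safe copy
    sqlite3 "mlruns.db.bak-${stamp}" "PRAGMA integrity_check;" | grep -qx ok    # verify the copy before trusting it
    find . -maxdepth 1 -name 'mlruns.db.bak-*' -mtime +30 -delete               # prune copies older than 30 days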