- Automatic test / admin lock: lager python and the box-mutating admin commands (lager install, lager uninstall, lager update, lager install-wheel) reserve the box for the lifetime of the command.
- User lock: lager boxes lock explicitly reserves a box until you unlock it.
Automatic test lock
Every lager python <runnable> invocation automatically acquires the box
lock at start and releases it at end. This includes failures, Ctrl+C,
crashes, and signal-killed runs — the lock is released through a finally
block, a signal handler, an atexit net, and (worst case) a server-side
TTL reap.
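The layered release described above can be sketched as follows. This is a minimal illustration, not the CLI's actual code: `BoxLock` and `run_with_lock` are hypothetical names, and the real client would talk to the box server instead of flipping a flag. The point is that `release()` is idempotent, so the finally block, the signal handler, and the atexit hook can all call it safely.

```python
import atexit
import signal
import sys
import threading


class BoxLock:
    """Hypothetical stand-in for the real lock client; release() is idempotent."""

    def __init__(self):
        self.held = False
        self.release_calls = 0

    def acquire(self):
        self.held = True

    def release(self):
        # Several cleanup layers may call this; only the first call does work.
        if not self.held:
            return
        self.held = False
        self.release_calls += 1


def run_with_lock(lock, runnable):
    """Run `runnable` while holding `lock`, releasing it on every exit path."""
    lock.acquire()
    atexit.register(lock.release)  # safety net if the finally block never runs

    def on_signal(signum, frame):
        lock.release()
        sys.exit(128 + signum)

    # Signal handlers can only be installed from the main thread.
    in_main = threading.current_thread() is threading.main_thread()
    previous = signal.signal(signal.SIGTERM, on_signal) if in_main else None
    try:
        return runnable()
    finally:
        lock.release()  # normal exit, exceptions, and Ctrl+C all pass through here
        if previous is not None:
            signal.signal(signal.SIGTERM, previous)
```

None of these layers can run after a hard kill such as SIGKILL or a power loss, which is why the server-side TTL reap remains the last resort.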
Which commands auto-lock
| Command | Lock window | Why |
|---|---|---|
| lager python | Full test run (acquire → heartbeat → release) | Canonical test runner. |
| lager install | The setup_and_deploy_box.sh step (the part that restarts the container) | Container restart mid-test would kill the test outright. |
| lager uninstall | Container teardown, image wipe, ~/box and /etc/lager removal | Same reason: destructive on-box mutation. |
| lager update | Container stop → image rebuild → restart → health check | Container restart is the test-clobbering action. Read-only probe / fetch are deliberately outside the lock. |
| lager install-wheel | The pip install invocation inside the container | pip install mutates the container’s Python environment; a concurrent test could race on imports. |
Everything else (lager hello, lager boxes list, lager boxes lock /
unlock itself, status / dry-run paths, etc.) do not acquire the
auto-lock. Note: this is intentionally narrower than the v0.12–0.13.3
behavior, which slapped a --force-command-overridable lock on every
single CLI command. See Backward compatibility below for the
v0.13.4 history.
The lock identity is CI-aware so concurrent test runs in CI mutually
exclude correctly. Holder formats:
| Environment | Holder string |
|---|---|
| Dev (your machine) | OS user (same as lager defaults --user) |
| GitHub Actions | ci:github:<repo>#<run>-<attempt>/<job>@<runner>:<pid> |
| Drone | ci:drone:<repo>#<build>:<pid>@<host> |
| GitLab CI | ci:gitlab:<project>#<pipeline>/<job>:<pid>@<host> |
| Bitbucket Pipelines | ci:bitbucket:<repo>#<build>:<pid>@<host> |
| Jenkins | ci:jenkins:<tag>:<pid>@<host> |
| Generic CI fallback | ci:generic:<host>:<pid> |
The :pid (and @runner / @host) suffix guarantees that two parallel
matrix items in the same workflow run get distinct holder strings.
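As an illustration of how such a holder string might be assembled, here is a minimal sketch covering the GitHub Actions, generic CI, and dev rows of the table. The function name `lock_holder` is hypothetical, and the mapping from GitHub's default environment variables (`GITHUB_REPOSITORY`, `GITHUB_RUN_ID`, `GITHUB_RUN_ATTEMPT`, `GITHUB_JOB`, `RUNNER_NAME`) onto the `<repo>`, `<run>`, `<attempt>`, `<job>`, and `<runner>` placeholders is an assumption, as is detecting generic CI through the `CI` variable.

```python
import getpass
import os
import socket


def lock_holder(env=None, pid=None, host=None):
    """Build a holder string in the formats shown in the table (sketch)."""
    env = os.environ if env is None else env
    pid = os.getpid() if pid is None else pid
    host = socket.gethostname() if host is None else host

    override = env.get("LAGER_LOCK_HOLDER")
    if override:
        return override  # explicit identity wins over auto-detection

    if env.get("GITHUB_ACTIONS") == "true":
        # Assumed mapping of GitHub's default variables onto the template.
        return "ci:github:{}#{}-{}/{}@{}:{}".format(
            env.get("GITHUB_REPOSITORY", "?"),
            env.get("GITHUB_RUN_ID", "?"),
            env.get("GITHUB_RUN_ATTEMPT", "?"),
            env.get("GITHUB_JOB", "?"),
            env.get("RUNNER_NAME", "?"),
            pid,
        )

    if env.get("CI"):
        return f"ci:generic:{host}:{pid}"  # generic CI fallback

    return getpass.getuser()  # dev machine: plain OS user
```

Because the process ID is part of every CI holder string, two matrix items that share a repo, run, and job still end up with different identities.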
Collision behavior
When lager python tries to acquire a lock that another holder owns:
- On dev: prints an error and exits 1 immediately (no waiting).
- In CI: waits up to LAGER_LOCK_WAIT seconds (default 1800, i.e. 30 min), polling every 2s, and only fails if the wait elapses. This lets matrix jobs queue against the same self-hosted box.
If you already locked the box as yourself with lager boxes lock before
running lager python, the CLI sees the lock as already ours and does not
release it on exit, so your explicit reservation survives the test.
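The two collision policies are the same loop with different wait budgets. A minimal sketch, assuming a hypothetical `try_acquire` callable that returns True when the lock was obtained; the clock and sleep functions are injectable only so the loop can be exercised without real delays.

```python
import time


def acquire_with_wait(try_acquire, wait_seconds, poll_seconds=2.0,
                      clock=time.monotonic, sleep=time.sleep):
    """Return True once try_acquire() succeeds, False if the wait elapses."""
    deadline = clock() + wait_seconds
    while True:
        if try_acquire():
            return True
        if clock() >= deadline:
            return False  # wait_seconds=0 makes this a single fail-fast attempt
        sleep(poll_seconds)
```

With `wait_seconds=0` (the dev default) the function tries once and gives up; with `wait_seconds=1800` (the CI default) it keeps polling every 2 seconds until the other holder releases or the budget runs out.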
TTL & heartbeat
Each test lock is written with ttl_seconds: 1800 and refreshed every
60 seconds by a background heartbeat thread inside the CLI. The TTL is
not a cap on test runtime — as long as the heartbeat keeps refreshing
last_heartbeat, the lock stays valid indefinitely.
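A background heartbeat of this kind can be sketched with a daemon thread and an event. The `Heartbeat` class and its `refresh` callback are hypothetical; in the real CLI the callback would be whatever request updates last_heartbeat on the box.

```python
import threading


class Heartbeat:
    """Call `refresh` every `interval_seconds` until stop() is called (sketch)."""

    def __init__(self, refresh, interval_seconds=60.0):
        self._refresh = refresh
        self._interval = interval_seconds
        self._stop = threading.Event()
        # Daemon thread: it dies with the process instead of keeping it alive.
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        # Event.wait returns False on timeout and True once stop() sets the event.
        while not self._stop.wait(self._interval):
            self._refresh()

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()
```

Using `Event.wait` as the sleep means `stop()` takes effect immediately instead of waiting out the remainder of a 60-second interval.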
What the TTL actually bounds is the worst-case stale-lock dwell time
after a CLI crash. If your laptop loses network or the CI runner is
hard-killed, the box reaps the lock once last_heartbeat + ttl_seconds
falls in the past, so another caller waits at most one TTL.
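The reap condition is a one-line comparison. A minimal sketch, assuming timestamps are plain seconds and that a ttl_seconds of None means the lock never expires; the function name `lock_is_stale` is hypothetical.

```python
def lock_is_stale(last_heartbeat, ttl_seconds, now):
    """True once last_heartbeat + ttl_seconds lies in the past."""
    if ttl_seconds is None:
        return False  # eternal lock: only an explicit unlock clears it
    return last_heartbeat + ttl_seconds < now
```

For example, with the default TTL of 1800 seconds, a lock whose last heartbeat arrived at t = 1000 is still valid at t = 2800 and becomes reapable immediately after.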
--detach keeps the lock
lager python script.py --detach acquires the lock with ttl_seconds: null
(no auto-expiry) because the heartbeat thread dies with the CLI. The
detached script keeps running on the box, but the lock must be released
manually with lager boxes unlock.
Escape hatches
| Env var | Effect |
|---|---|
| LAGER_AUTO_LOCK_DISABLE=1 | Skip auto-lock entirely. The command still checks for someone else’s user lock but does not acquire. |
| LAGER_LOCK_WAIT=<seconds> | Override collision wait time. 0 = fail-fast (dev default), large value = patient queue (CI default). |
| LAGER_LOCK_HOLDER=<string> | Override the holder identity. Useful when you intentionally want two jobs to share a single reservation. |
| LAGER_LOCK_TTL=<seconds> | Override the TTL the CLI writes. LAGER_LOCK_TTL=none = eternal (caller must lager boxes unlock). |
| LAGER_LOCK_HEARTBEAT=<sec> | Override the heartbeat refresh interval (default 60s). |
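To make the defaults concrete, the following sketch reads the variables from the table into a settings dictionary. The function name `lock_settings`, the dictionary keys, and the `in_ci` flag are hypothetical; only the variable names and default values come from this page.

```python
import os


def lock_settings(env=None, in_ci=False):
    """Resolve the lock-related environment variables to effective values."""
    env = os.environ if env is None else env
    raw_ttl = env.get("LAGER_LOCK_TTL", "1800")
    default_wait = "1800" if in_ci else "0"  # CI queues, dev fails fast
    return {
        "auto_lock": env.get("LAGER_AUTO_LOCK_DISABLE") != "1",
        "wait_seconds": float(env.get("LAGER_LOCK_WAIT", default_wait)),
        "holder": env.get("LAGER_LOCK_HOLDER"),  # None means auto-detect
        # "none" selects an eternal lock, represented here as None.
        "ttl_seconds": None if raw_ttl.lower() == "none" else int(raw_ttl),
        "heartbeat_seconds": float(env.get("LAGER_LOCK_HEARTBEAT", "60")),
    }
```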
User lock
A user lock is an explicit, persistent reservation you place on a box. Unlike the automatic test lock, user locks never expire; you must unlock manually when you’re done. Use cases:
- Reserving a box for an extended debugging session.
- Preventing others from using a box during maintenance.
- Claiming a box when you’re not actively running a command.
lager boxes lock
- --box (required): name of the box to lock.
- --user: username to lock as (useful when running inside Docker, where the user would otherwise be root).
lager boxes unlock
- --box (required): name of the box to unlock.
- --force: force unlock even if the box was locked by another user (use this to clear a stale lager boxes lock left by a teammate).
Management operations skip the lock
The following sub-commands of lager python are management operations
on already-running processes and intentionally skip both lock checks and
auto-acquire:
- lager python --kill <ID>
- lager python --kill-all
- lager python --reattach <ID>
- lager python --continue <ID>
- lager python --console <ID>

If a running process misbehaves, you can --kill it without first having to fight an unrelated user lock.
lager boxes shows lock holders
When boxes are locked, lager boxes shows an extra column with the lock holder.
CI holders are shown in a readable form (e.g. github lager run 9182 job test on runner-3) rather than printed as raw colon-delimited strings.
CI workflow example
The always-on auto-lock + CI auto-wait combination means a CI matrix job needs no special invocation.
Each matrix item derives its own holder string (...GITHUB_JOB=hardware-tests/<runner>:<pid>
differs per item), sends POST /lock, and whichever loses the race waits up to
30 minutes for the winner to finish before retrying. No lager boxes lock
call is needed.
Backward compatibility
- lager boxes lock and lager boxes unlock behave exactly as before. The CLI now sends holder_type: "user" + ttl_seconds: null on the wire, but legacy clients (e.g. older CLIs against the new box server) get the same eternal-lock behavior automatically because the server treats a payload with neither field as legacy and applies the same defaults.
- _check_box_lock (the read-only lock check that already gates every command in resolve_and_validate_box) is unchanged.
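The legacy-payload rule can be sketched as a small normalization step on the server side. This is an illustration under assumptions: the function name `normalize_lock_request` and the `holder` key are hypothetical, and only `holder_type` and `ttl_seconds` are field names taken from this page.

```python
def normalize_lock_request(payload):
    """Apply legacy defaults when neither new field is present (sketch)."""
    if "holder_type" not in payload and "ttl_seconds" not in payload:
        # Older CLI: treat the request as an eternal user lock.
        return {**payload, "holder_type": "user", "ttl_seconds": None}
    return dict(payload)  # newer CLI: take the fields as sent
```

An older client that sends only a holder name therefore ends up with the same stored lock as a newer client that sends holder_type: "user" and ttl_seconds: null explicitly.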
How this differs from v0.13.0 – v0.13.3 (removed in v0.13.4)
v0.13.0 added an ephemeral “command-in-progress” lock that fired on every CLI command via a shared decorator, gated by a --force-command flag. v0.13.4 removed it because three corner cases
were unfixable in that design:
| v0.13.4 corner case | How this PR avoids it |
|---|---|
| “Supply commands never released the lock” | The auto-lock is only attached to 5 commands (python, install, uninstall, update, install-wheel), not every CLI surface. Supply commands etc. don’t touch the lock, so there is no decorator-on-everything to leak from. |
| “Long-running commands blocked all other commands on the same box” | Only those 5 commands check the lock; status / list / read-only paths are unaffected. For genuine concurrent test runs, dev gets fail-fast in <5s and CI gets a queue (default 1800s, configurable via LAGER_LOCK_WAIT). That’s the desired policy. |
| “Detached processes left stale locks” | --detach is opt-in for a long-lived hold (ttl_seconds: null is intentional). Non-detached runs have heartbeat + TTL reap, so an abnormal CLI exit self-recovers in ≤ TTL + grace. |
--force-command is gone. Collision policy is structured (fail-fast
in dev, queue in CI) and the existing lager boxes unlock --force is the
escape hatch when you genuinely need to override.
