Two failure modes were observed on the Verify Agent Ghost workflow run for
PR #216:

Run 25290400159 (action opened) was created at 20:47:55Z and run
25290400368 (action labeled) at 20:47:56Z, both against
commit b02e73e5. The concurrency group verify-agent-ghost
serialised them, but both still consumed runner time.

The Install built packages step exited with sudo-rs: timed out
exactly 300s after the step started. There were no useful logs, just a hard
stop. Because this step runs after ~30 min of WASM / ProGuard /
shadowJar builds, the wasted-cycles cost per failed run is large.

When a PR is created with a label (for example, gh pr create --label
needs-qa-verify), GitHub fires both opened and labeled webhook
events. The previous trigger config (types: [opened, synchronize,
reopened, labeled]) admitted both, and the if condition was true
in both cases (label present AND (action != labeled OR the
newly added label is needs-qa-verify)). Two distinct workflow runs
fired against the identical PR head SHA.

The ben user does not have a
NOPASSWD entry covering apt-get, systemctl, rm /srv/krill/...,
or journalctl -u krill. sudo apt-get install -y … therefore
prompted for a tty password, got none from the GitHub Actions
non-interactive shell, and waited until sudo-rs’s default
passwd_timeout of 300s elapsed before exiting.

Changes to .github/workflows/Verify Agent Ghost.yml:

Removed opened from pull_request.types. labeled always fires
for each label present at PR creation, so the open-with-label case
is still covered without firing twice.

Added a preflight step that runs sudo -n true and exits with
a ::error:: annotation pointing at this lessons doc if NOPASSWD
is missing. This fails in seconds instead of after 30 min of JAR builds.

Switched the Install built packages, Wipe local krill DB and
restart krill.service, and Smoke-test installed services steps
to sudo -n … so any future regression in the runner’s sudoers
file fails immediately rather than hitting the 5-min passwd
timeout.

The host-side fix (the actual NOPASSWD entry on the ghost runner) lives
outside this repo. The expected /etc/sudoers.d/krill-ghost-runner
contents:
ben ALL=(ALL) NOPASSWD: /usr/bin/apt-get, /usr/bin/systemctl, /usr/bin/journalctl, /usr/bin/rm /srv/krill/data/*.db, /usr/bin/dpkg
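
Because the entry above allows specific commands rather than everything, a per-command check on the runner can confirm that each one is usable without a prompt. The following is a minimal sketch, not the project's actual tooling: check_nopasswd and the SUDO_BIN override are hypothetical names introduced here, and it assumes the runner's sudo implementation accepts -n together with -l <command>. Whether listing itself asks for a password depends on the local sudoers policy.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report commands that cannot be run via sudo without a prompt.
# SUDO_BIN is an assumption added so the check can be exercised without real sudo.
check_nopasswd() {
  local sudo_bin="${SUDO_BIN:-sudo}"
  local missing=0
  local cmd
  for cmd in "$@"; do
    # -n: never prompt; -l <command>: ask the policy whether the command is allowed.
    if ! "$sudo_bin" -n -l "$cmd" >/dev/null 2>&1; then
      echo "no passwordless sudo for: $cmd"
      missing=1
    fi
  done
  # Returns 0 only if every command passed the check.
  return "$missing"
}
```

It could be called as check_nopasswd /usr/bin/apt-get /usr/bin/systemctl /usr/bin/journalctl /usr/bin/dpkg after editing the sudoers file, treating a non-zero exit status as a sign that an entry is missing.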

Prefer [synchronize, reopened, labeled] over the four-event list used previously.
opened + labeled both firing on PR creation is well-known but
easy to miss in review. The preserved comment block at the top of
the workflow file explains why opened is intentionally absent so a
future “let’s add opened for completeness” PR doesn’t silently
reintroduce the duplication.

Any workflow that uses sudo for installation or systemd ops should preflight with
sudo -n true before the expensive build steps. Five minutes of
wasted CI is forgivable; thirty minutes is not. Pair the preflight
with sudo -n … everywhere downstream so the failure mode is always
“exit immediately” rather than “hang until passwd_timeout.”
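
The preflight described above can be sketched as a small shell function. This is an illustration under stated assumptions, not the step as it exists in the workflow: preflight_sudo and the SUDO_BIN override are hypothetical names, and the error message wording is invented. The ::error:: prefix is the GitHub Actions workflow command that produces an error annotation.

```shell
#!/usr/bin/env bash
# Hypothetical preflight: fail fast when sudo would need a password.
# SUDO_BIN is an assumption added so the function can be tested with a stub.
preflight_sudo() {
  local sudo_bin="${SUDO_BIN:-sudo}"
  # -n makes sudo exit with an error instead of prompting for a password.
  if "$sudo_bin" -n true 2>/dev/null; then
    echo "sudo preflight ok"
    return 0
  fi
  # Emit a GitHub Actions error annotation so the cause shows up in the run summary.
  echo "::error::passwordless sudo is not available on this runner; check the NOPASSWD sudoers entry"
  return 1
}
```

Placed in a step before the expensive builds and invoked as preflight_sudo || exit 1, this stops the job within seconds when the sudoers entry is missing.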