A server (ghost.local) went offline; the desktop client correctly showed it as
disconnected. When the user opened the server’s settings and deleted the node, it
immediately re-appeared and kept re-appearing. Logs showed an endless cycle:
begin connection attempt → server unreachable, skipping connection →
ClientNodeManager: Observing node ... Server ghost WARN → begin connection attempt.
Two reconnect engines resurrect a deleted server:
- ClientServerConnector calls nodeManager.alarm(node) on every failed
  connection attempt (ClientServerConnector.kt, the unreachable branch and the
  coroutine exception handler).
- EventClient’s while (isActive) SSE retry loop calls nodeManager.alarm(node)
  on every failure and reconnects with backoff forever.

alarm() calls update(), whose structural-insert path used
getOrPut(node.id) { MutableStateFlow(node).also { observeNode(it) } }. For an id
that was just deleted (and therefore absent from the map) this re-created the
node and re-armed its observer. The observer immediately re-dispatched into the
connector, restarting the connection attempt, which produced an infinite loop.
delete() removed the node from the map but had no way to say “and keep it gone,”
so any in-flight reconnect won the race.
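
The resurrection can be reproduced in a few lines. The following is a minimal sketch, not the project's code: NaiveNodeManager, Node, and observersArmed are hypothetical stand-ins, and a plain map replaces the MutableStateFlow-backed map so the example needs only the Kotlin standard library.

```kotlin
// Hypothetical, simplified stand-ins for illustration; not the real ClientNodeManager.
data class Node(val id: String, val state: String)

class NaiveNodeManager {
    private val nodes = mutableMapOf<String, Node>()

    // Counts how often an observer would have been (re)attached.
    var observersArmed = 0
        private set

    fun update(node: Node) {
        // Implicit "create if absent": a deleted id looks exactly like a brand-new one.
        nodes.getOrPut(node.id) { observersArmed++; node }
        nodes[node.id] = node
    }

    // A failed connection attempt reports the node as WARN through update().
    fun alarm(node: Node) = update(node.copy(state = "WARN"))

    // Removes the entry but leaves no record that the removal was deliberate.
    fun delete(id: String) { nodes.remove(id) }

    fun contains(id: String) = id in nodes
}

fun main() {
    val manager = NaiveNodeManager()
    val ghost = Node("ghost", "OK")
    manager.update(ghost)
    manager.delete("ghost")
    manager.alarm(ghost)                 // in-flight reconnect failure lands after the delete
    println(manager.contains("ghost"))   // true: the node is back
    println(manager.observersArmed)      // 2: the observer was armed a second time
}
```

In the real client the re-armed observer then dispatches into the connector again, which is what turns a single late alarm() into a loop.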
Tombstone deliberately deleted servers. ClientNodeManager now holds a
deletedServers set: delete() adds the Server’s install id (synchronously);
update()’s structural-insert path drops any update whose id is tombstoned (so
alarm/late SSE can’t re-create it); allowServer(id) clears the tombstone and
isServerDeleted(id) exposes it. The tombstone is cleared only on unambiguous
“server is back” intent: ClientBeaconWireHandler calls allowServer(wire.installId)
when a live peer beacon arrives, and ClientServerConnector clears it on an explicit
manual (form) re-add. EventClient breaks its retry loop when the server is
tombstoned, and ClientServerConnector.performConnectionSanityCheck refuses
tombstoned servers — so the client actually stops trying to reconnect. The
installId() seam in ClientNodeManager was parameterised (constructor default
installIdProvider = installId) so delete() is unit-testable without reading the
real ~/.krill/install_id.
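
The tombstone mechanics described above can be sketched as follows. This is an illustration under stated assumptions rather than the actual implementation: TombstoningNodeManager, Node, and retryUntilTombstoned are hypothetical names, a single lock stands in for whatever synchronization the real ClientNodeManager uses, and update() returns a Boolean here only to make the dropped write visible.

```kotlin
// Hypothetical, simplified stand-ins for illustration; not the real ClientNodeManager.
data class Node(val id: String, val state: String)

class TombstoningNodeManager {
    private val lock = Any()
    private val nodes = mutableMapOf<String, Node>()
    private val deletedServers = mutableSetOf<String>()

    fun delete(id: String) = synchronized(lock) {
        deletedServers.add(id)   // record the intent "keep it gone"
        nodes.remove(id)
        Unit
    }

    // Single insert chokepoint: every write consults the tombstone set first.
    fun update(node: Node): Boolean = synchronized(lock) {
        if (node.id in deletedServers) return@synchronized false   // drop late writes
        nodes[node.id] = node
        true
    }

    fun alarm(node: Node): Boolean = update(node.copy(state = "WARN"))

    fun allowServer(id: String) = synchronized(lock) { deletedServers.remove(id); Unit }

    fun isServerDeleted(id: String): Boolean = synchronized(lock) { id in deletedServers }

    fun contains(id: String): Boolean = synchronized(lock) { id in nodes }
}

// Hypothetical retry loop: stops as soon as the server is tombstoned.
// Returns the number of connection attempts actually made.
fun retryUntilTombstoned(
    manager: TombstoningNodeManager,
    node: Node,
    maxAttempts: Int,
    connect: () -> Boolean,
): Int {
    var attempts = 0
    while (attempts < maxAttempts && !manager.isServerDeleted(node.id)) {
        attempts++
        if (connect()) break
        manager.alarm(node)
    }
    return attempts
}

fun main() {
    val manager = TombstoningNodeManager()
    val ghost = Node("ghost", "OK")
    manager.update(ghost)
    manager.delete("ghost")
    println(manager.alarm(ghost))                                // false: late alarm is dropped
    println(manager.contains("ghost"))                           // false: the node stays deleted
    println(retryUntilTombstoned(manager, ghost, 5) { false })   // 0: no reconnect attempts
    manager.allowServer("ghost")                                 // e.g. a live beacon arrived
    println(manager.update(ghost))                               // true: re-adding is permitted
}
```

Checking the tombstone and writing to the map under the same lock matters in this sketch: if the two steps were separate, a delete() could slip in between them and the late write would still win.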
An update()/upsert that uses getOrPut is an implicit “create if absent”:
any caller that fires after a delete (failed reconnect, late event) silently
resurrects the node. When deletion must be durable against concurrent writers,
back it with an explicit tombstone consulted at the single insert chokepoint,
not just a map removal.

ClientNodeManagerDeletedServerTest asserts that alarm() cannot
re-create a deleted server and that allowServer() re-permits it. It injects an
install-id provider and fake FileOperations/NodeObserver, so it never touches a
production path: the canonical “parameterise the platform seam” shape.
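
The seam shape can be illustrated with a small sketch. Everything here is an assumption made for the example: SeamedNodeManager and deleteLocalServer() are hypothetical, and the installId() below is a stub standing in for the production function that reads the install id from disk. Only the constructor default (installIdProvider = installId) mirrors the text.

```kotlin
// Stub for the production function; the real one reads the install id file.
fun installId(): String = error("production path: would read the install id from disk")

// Hypothetical manager: the platform call is a constructor parameter whose
// default is the production function, so existing callers are unchanged.
class SeamedNodeManager(private val installIdProvider: () -> String = ::installId) {
    private val deletedServers = mutableSetOf<String>()

    // Hypothetical delete: tombstones the install id obtained through the seam.
    fun deleteLocalServer() {
        deletedServers.add(installIdProvider())
    }

    fun isServerDeleted(id: String) = id in deletedServers
}

fun main() {
    // The test injects a fixed id, so the production installId() is never called.
    val manager = SeamedNodeManager(installIdProvider = { "test-install-id" })
    manager.deleteLocalServer()
    println(manager.isServerDeleted("test-install-id"))   // true
}
```

Defaulting the parameter to the production function keeps the seam out of production call sites while letting a test replace it with a lambda.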