2026-06-19-deleted-server-resurrected-by-reconnect

Symptom

A server (ghost.local) went offline; the desktop client correctly showed it as disconnected. When the user opened the server’s settings and deleted the node, it immediately re-appeared and kept re-appearing. Logs showed an endless cycle: begin connection attempt → server unreachable, skipping connection → ClientNodeManager: Observing node ... Server ghost WARN → begin connection attempt.

Root cause

Two reconnect engines resurrect a deleted server:

ClientServerConnector calls nodeManager.alarm(node) on every failed connection attempt (ClientServerConnector.kt, the unreachable branch and the coroutine exception handler).
EventClient’s while (isActive) SSE retry loop calls nodeManager.alarm(node) on every failure and reconnects with backoff forever.

alarm() calls update(), whose structural-insert path used getOrPut(node.id) { MutableStateFlow(node).also { observeNode(it) } }. For an id that was just deleted (and therefore absent from the map) this re-created the node and re-armed its observer. The observer immediately re-dispatched into the connector, restarting the connection attempt — an infinite loop. delete() removed the node from the map but had no way to say “and keep it gone,” so any in-flight reconnect won the race.
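The resurrection can be reproduced in a few lines. This is a minimal sketch under stated assumptions: NaiveNodeManager, Node, observersArmed and reproduceResurrection are hypothetical stand-ins, a plain map replaces the MutableStateFlow-backed map, and a counter replaces observeNode. It is not the real ClientNodeManager, only the getOrPut shape described above.

```kotlin
// Hypothetical, simplified stand-ins; not the real ClientNodeManager.
data class Node(val id: String, val state: String)

class NaiveNodeManager {
    private val nodes = mutableMapOf<String, Node>()

    // Counts how often the "observer" would have been armed (stand-in for observeNode).
    var observersArmed = 0
        private set

    fun update(node: Node) {
        // getOrPut is an implicit "create if absent": a missing id is inserted
        // again and its observer is armed again, even right after delete().
        nodes.getOrPut(node.id) { node.also { observersArmed++ } }
        nodes[node.id] = node
    }

    // Every failed connection attempt ends up here.
    fun alarm(node: Node) = update(node.copy(state = "WARN"))

    // Removes the entry but leaves no record that the removal was deliberate.
    fun delete(id: String) {
        nodes.remove(id)
    }

    fun contains(id: String): Boolean = nodes.containsKey(id)
}

// Returns (node present after the late alarm, number of times an observer was armed).
fun reproduceResurrection(): Pair<Boolean, Int> {
    val manager = NaiveNodeManager()
    val ghost = Node("ghost", "OK")
    manager.update(ghost)   // initial insert arms the observer once
    manager.delete("ghost") // user deletes the server
    manager.alarm(ghost)    // a late failed reconnect fires after the delete
    return manager.contains("ghost") to manager.observersArmed
}
```

Calling reproduceResurrection() yields (true, 2): the node is back in the map and an observer has been armed a second time, which is the point where the real observer would re-dispatch into the connector.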

Fix

Tombstone deliberately deleted servers. ClientNodeManager now holds a deletedServers set. delete() adds the Server’s install id to it synchronously, and update()’s structural-insert path drops any update whose id is tombstoned, so neither alarm() nor a late SSE event can re-create the node. allowServer(id) clears the tombstone and isServerDeleted(id) exposes it.

The tombstone is cleared only on an unambiguous “server is back” signal: ClientBeaconWireHandler calls allowServer(wire.installId) when a live peer beacon arrives, and ClientServerConnector clears it on an explicit manual (form) re-add.

Both reconnect engines also consult the tombstone, so the client actually stops trying to reconnect: EventClient breaks out of its retry loop when the server is tombstoned, and ClientServerConnector.performConnectionSanityCheck refuses tombstoned servers.

The installId() seam in ClientNodeManager was parameterised (constructor default installIdProvider = installId) so that delete() is unit-testable without reading the real ~/.krill/install_id.
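The tombstone check at the insert chokepoint, and a retry loop that honours it, might look roughly like the sketch below. The names deletedServers, update, alarm, delete, allowServer and isServerDeleted come from this note; everything else (TombstoningNodeManager, ServerNode, retryConnect, the Boolean return values, the attempt cap, and using the node id directly as the install id) is an assumption made for illustration, and backoff and coroutines are left out.

```kotlin
// Hypothetical, simplified stand-ins; the node id doubles as the install id here.
data class ServerNode(val id: String, val state: String)

class TombstoningNodeManager {
    private val nodes = mutableMapOf<String, ServerNode>()
    private val deletedServers = mutableSetOf<String>()

    // Returns false when the update was dropped because the id is tombstoned.
    @Synchronized
    fun update(node: ServerNode): Boolean {
        val isStructuralInsert = !nodes.containsKey(node.id)
        if (isStructuralInsert && node.id in deletedServers) return false
        nodes[node.id] = node
        return true
    }

    fun alarm(node: ServerNode): Boolean = update(node.copy(state = "WARN"))

    @Synchronized
    fun delete(id: String) {
        deletedServers.add(id) // tombstone first, so a racing update cannot re-insert
        nodes.remove(id)
    }

    // Called only on clear "server is back" intent (live beacon, manual re-add).
    @Synchronized
    fun allowServer(id: String) {
        deletedServers.remove(id)
    }

    @Synchronized
    fun isServerDeleted(id: String): Boolean = id in deletedServers

    @Synchronized
    fun contains(id: String): Boolean = nodes.containsKey(id)
}

// Stand-in for a reconnect loop: stops as soon as the server is tombstoned.
// Returns the number of connection attempts actually made.
fun retryConnect(
    manager: TombstoningNodeManager,
    node: ServerNode,
    maxAttempts: Int,
    connect: () -> Boolean,
): Int {
    var attempts = 0
    while (attempts < maxAttempts) {
        if (manager.isServerDeleted(node.id)) break // deleted on purpose: give up
        attempts++
        if (connect()) break
        manager.alarm(node) // failed attempt: flag the node, then try again
    }
    return attempts
}
```

In this sketch, after delete("ghost") a call to alarm() returns false and leaves the map untouched, and retryConnect makes zero attempts. Only allowServer("ghost") makes the id insertable again. Putting the check inside update() rather than in each caller keeps a single chokepoint, which is the design point of the fix.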

Prevention

A shared update()/upsert that uses getOrPut is an implicit “create if absent” — any caller that fires after a delete (failed reconnect, late event) silently resurrects the node. When deletion must be durable against concurrent writers, back it with an explicit tombstone consulted at the single insert chokepoint, not just a map removal.
Regression test ClientNodeManagerDeletedServerTest asserts alarm() cannot re-create a deleted server and that allowServer() re-permits it. It injects an install-id provider and fake FileOperations/NodeObserver, so it never touches a production path — the canonical “parameterise the platform seam” shape.
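The “parameterise the platform seam” shape can be sketched as follows. This is an illustration only: SeamedNodeManager, lastDeletedBy and deletedServerStaysDeleted are hypothetical, the fake FileOperations/NodeObserver are omitted, and the note does not say what delete() does with the install id, so the sketch merely reads it through the provider to show that the production default is never invoked. Plain check() calls stand in for a test framework.

```kotlin
import java.io.File

// Production default for the seam: reads the real install id from disk.
fun installId(): String =
    File(System.getProperty("user.home"), ".krill/install_id").readText().trim()

// Hypothetical manager; nodes are reduced to id -> state strings.
class SeamedNodeManager(
    private val installIdProvider: () -> String = ::installId,
) {
    private val nodes = mutableMapOf<String, String>()
    private val deletedServers = mutableSetOf<String>()

    // Assumption: delete() needs the local install id for something; here it is only recorded.
    var lastDeletedBy: String? = null
        private set

    fun update(id: String, state: String) {
        if (!nodes.containsKey(id) && id in deletedServers) return // tombstoned: drop
        nodes[id] = state
    }

    fun alarm(id: String) = update(id, "WARN")

    fun delete(id: String) {
        lastDeletedBy = installIdProvider() // goes through the seam, not the file system
        deletedServers.add(id)
        nodes.remove(id)
    }

    fun allowServer(id: String) {
        deletedServers.remove(id)
    }

    fun contains(id: String): Boolean = nodes.containsKey(id)
}

// Mirrors the two assertions the regression test is described as making.
fun deletedServerStaysDeleted() {
    val manager = SeamedNodeManager(installIdProvider = { "test-install-id" })
    manager.update("ghost", "OK")
    manager.delete("ghost")

    manager.alarm("ghost")
    check(!manager.contains("ghost")) { "alarm() re-created a deleted server" }

    manager.allowServer("ghost")
    manager.alarm("ghost")
    check(manager.contains("ghost")) { "allowServer() did not re-permit the server" }
}
```

Because the provider is a constructor parameter with a default, production call sites stay unchanged while the test swaps in a lambda and never touches ~/.krill/install_id.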