2026-06-24-alarm-stale-node-overwrite

Symptom

Kraken’s nightly architectural scan flagged that ClientNodeManager.alarm() used the caller’s captured (potentially stale) Node snapshot for its write, while reset() and startPairing() already correctly read the current stored value first. In production, EventClient and ClientServerConnector both call alarm(node) with a node captured at connection-start time. During exponential backoff (up to 60 s per attempt), SSE events can update the stored node’s metadata; the alarm call would then write back the old metadata, silently reverting wiring changes or snapshot updates that arrived during the retry window.
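
The sequence above can be reproduced in a few lines. This is a minimal sketch, not the project's code: Node is reduced to one hypothetical meta field, and Store and alarmFromSnapshot are made-up names standing in for the manager and its pre-fix alarm().

```kotlin
enum class NodeState { NONE, WARN, DELETING }

// Simplified node: `meta` is a hypothetical stand-in for wiring/snapshot metadata.
data class Node(val id: String, val state: NodeState, val meta: String)

// Hypothetical in-memory store; the real manager is more involved.
class Store {
    private val nodes = mutableMapOf<String, Node>()
    fun update(node: Node) { nodes[node.id] = node }
    fun read(id: String): Node? = nodes[id]

    // Pre-fix behaviour: the write is derived from the caller's snapshot.
    fun alarmFromSnapshot(node: Node) = update(node.copy(state = NodeState.WARN))
}

fun main() {
    val store = Store()
    store.update(Node("n1", NodeState.NONE, meta = "wiring-v1"))

    val captured = store.read("n1")!!                // captured at connection start
    store.update(captured.copy(meta = "wiring-v2"))  // update arrives during backoff
    store.alarmFromSnapshot(captured)                // alarm fires with the old snapshot

    println(store.read("n1")) // Node(id=n1, state=WARN, meta=wiring-v1)
}
```

The final read shows meta reverted to wiring-v1 even though wiring-v2 was stored before the alarm fired.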

Root cause

alarm() originally ran readNodeStateOrNull(node.id).value?.let { original -> if (original.state == NodeState.DELETING) return } as a guard and then wrote update(node.copy(state = NodeState.WARN, ...)) using the passed-in node. This had two problems:

The write used the stale caller snapshot instead of the current stored node, clobbering any metadata updates that arrived between capture and alarm.
The DELETING guard was ineffective: readNodeStateOrNull filters out DELETING nodes (via nodeAvailable), so for a DELETING node its value was null, the let block was skipped, and the guard never fired. A DELETING node could therefore be resurrected as WARN.
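
To make the second problem concrete, here is a sketch of the original shape of alarm(). It rests on assumptions: Holder is a hypothetical stand-in for whatever observable type the real map holds, the bodies of nodeAvailable and readNodeStateOrNull are guesses that only reproduce the filtering described above, and the elided copy() arguments are left out.

```kotlin
enum class NodeState { NONE, WARN, DELETING }
data class Node(val id: String, val state: NodeState, val meta: String = "")

// Hypothetical holder exposing `.value`, standing in for the real observable type.
class Holder<T>(var value: T)

class BuggyManager {
    val nodes = mutableMapOf<String, Holder<Node>>()

    // Assumed behaviour: a node is "available" only if stored and not DELETING.
    private fun nodeAvailable(id: String): Boolean {
        val state = nodes[id]?.value?.state
        return state != null && state != NodeState.DELETING
    }

    // Assumed shape: yields a null value for absent AND for DELETING nodes.
    fun readNodeStateOrNull(id: String): Holder<Node?> =
        Holder(if (nodeAvailable(id)) nodes[id]?.value else null)

    fun update(node: Node) {
        val holder = nodes[node.id]
        if (holder == null) {
            nodes[node.id] = Holder(node)
        } else {
            holder.value = node
        }
    }

    fun alarm(node: Node) {
        readNodeStateOrNull(node.id).value?.let { original ->
            // Never true: DELETING nodes were already filtered to null above.
            if (original.state == NodeState.DELETING) return
        }
        update(node.copy(state = NodeState.WARN)) // writes the caller's snapshot
    }
}

fun main() {
    val manager = BuggyManager()
    val captured = Node("n1", NodeState.NONE)
    manager.update(captured)
    manager.update(captured.copy(state = NodeState.DELETING))
    manager.alarm(captured)
    println(manager.nodes["n1"]?.value?.state) // WARN
}
```

Because the guard only ever sees nodes that passed nodeAvailable, the DELETING branch cannot be reached, and the stored DELETING node is overwritten with WARN.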

Both reset() and startPairing() had already been fixed to read the current stored node before writing; alarm() was missed.

Fix

shared/src/commonMain/kotlin/krill/zone/shared/node/manager/ClientNodeManager.kt: rewrite alarm() to read nodes[node.id]?.value ?: node (bypassing nodeAvailable so DELETING nodes are still guarded), guard on current.state == NodeState.DELETING, then write current.copy(state = NodeState.WARN, ...).
shared/src/jvmTest/kotlin/krill/zone/shared/node/manager/ClientNodeManagerAlarmTest.kt: two regression tests — one asserting metadata is preserved when alarm fires with a stale snapshot, one asserting DELETING nodes are not resurrected.
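
A sketch of the fixed alarm() together with checks that mirror the two regression tests follows. It uses the same simplified model as an assumption: Holder, SketchManager, and the meta field are hypothetical, and the real tests in ClientNodeManagerAlarmTest.kt may be structured differently.

```kotlin
enum class NodeState { NONE, WARN, DELETING }
data class Node(val id: String, val state: NodeState, val meta: String = "")

// Hypothetical holder exposing `.value`, standing in for the real observable type.
class Holder<T>(var value: T)

class SketchManager {
    val nodes = mutableMapOf<String, Holder<Node>>()

    fun update(node: Node) {
        val holder = nodes[node.id]
        if (holder == null) {
            nodes[node.id] = Holder(node)
        } else {
            holder.value = node
        }
    }

    fun alarm(node: Node) {
        // Read the stored node directly so a DELETING node is still visible;
        // fall back to the caller's snapshot only when nothing is stored.
        val current = nodes[node.id]?.value ?: node
        if (current.state == NodeState.DELETING) return
        // Any further copy() arguments in the real code are elided in the source.
        update(current.copy(state = NodeState.WARN))
    }
}

fun main() {
    val manager = SketchManager()

    // Case 1: metadata stored after capture survives the alarm.
    val stale = Node("n1", NodeState.NONE, meta = "wiring-v1")
    manager.update(stale)
    manager.update(stale.copy(meta = "wiring-v2"))
    manager.alarm(stale)
    check(manager.nodes["n1"]?.value == Node("n1", NodeState.WARN, "wiring-v2"))

    // Case 2: a DELETING node is not resurrected as WARN.
    val doomed = Node("n2", NodeState.NONE)
    manager.update(doomed)
    manager.update(doomed.copy(state = NodeState.DELETING))
    manager.alarm(doomed)
    check(manager.nodes["n2"]?.value?.state == NodeState.DELETING)
}
```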

Prevention

Any ClientNodeManager method that makes a lifecycle/state transition (alarm, reset, startPairing, run, editing, edited) should read the current stored node (nodes[id]?.value ?: node) before writing, not use the caller-supplied snapshot. The “read current first” comment on reset() is the canonical pattern — propagate it to every similar method.
readNodeStateOrNull is not suitable for DELETING guards because nodeAvailable filters DELETING nodes out, so they look the same as absent nodes. Use nodes[id]?.value directly when the code must distinguish DELETING from absent.
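
One possible way to keep the pattern from being missed again is to route every transition through a single helper that does the read. The sketch below is only an illustration under assumptions: withCurrent does not exist in the codebase as far as this note shows, and Holder, SketchManager, and the meta field are the same hypothetical stand-ins as in a simplified model of the manager.

```kotlin
enum class NodeState { NONE, WARN, DELETING }
data class Node(val id: String, val state: NodeState, val meta: String = "")

// Hypothetical holder exposing `.value`, standing in for the real observable type.
class Holder<T>(var value: T)

class SketchManager {
    val nodes = mutableMapOf<String, Holder<Node>>()

    fun update(node: Node) {
        val holder = nodes[node.id]
        if (holder == null) {
            nodes[node.id] = Holder(node)
        } else {
            holder.value = node
        }
    }

    // Hypothetical helper: resolves the current stored node (or the snapshot if
    // nothing is stored) and writes whatever the transition returns.
    // Returning null from the transition means "write nothing".
    private fun withCurrent(snapshot: Node, transition: (Node) -> Node?) {
        val current = nodes[snapshot.id]?.value ?: snapshot
        transition(current)?.let { update(it) }
    }

    fun alarm(node: Node) = withCurrent(node) { current ->
        if (current.state == NodeState.DELETING) null
        else current.copy(state = NodeState.WARN)
    }
}

fun main() {
    val manager = SketchManager()
    val stale = Node("n1", NodeState.NONE, meta = "wiring-v1")
    manager.update(stale)
    manager.update(stale.copy(meta = "wiring-v2"))
    manager.alarm(stale)
    println(manager.nodes["n1"]?.value) // Node(id=n1, state=WARN, meta=wiring-v2)
}
```

With this shape, a transition never receives the caller's snapshot unless the store has no entry, so the stale-write mistake is harder to make by omission. Whether other transitions need their own DELETING guard is left open here.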