Nightly architectural scan flagged ClientNodeManager.nodes as a non-thread-safe
LinkedHashMap accessed from multiple coroutines on Dispatchers.Default. The
existing try/catch around nodes() (returning emptyList() on exception) was
a silent indicator that the race had been observed before. collectNodeIdsToRemove
could throw ConcurrentModificationException and leave zombie nodes in the swarm.

nodes: MutableMap<String, MutableStateFlow<Node>> = mutableMapOf() is a plain
LinkedHashMap. The shared CoroutineScope uses Dispatchers.Default, which
schedules coroutines on a thread pool. Concurrent calls to update() (SSE handler)
and remove() (DELETING-state processor) could interleave map structural mutations
with iterations in collectNodeIdsToRemove, nodes(), and children().
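
The failure mode described above can be reproduced deterministically on a
single thread, because a LinkedHashMap iterator fails fast on any structural
change made behind its back. A minimal sketch, assuming a hypothetical map of
Int values in place of the real MutableStateFlow<Node> entries (the function
name is also made up for illustration):

```kotlin
import java.util.ConcurrentModificationException

// Hypothetical stand-in for ClientNodeManager.nodes; values simplified to Int.
fun iterationThrowsOnStructuralChange(): Boolean {
    val nodes: MutableMap<String, Int> = mutableMapOf("a" to 1, "b" to 2, "c" to 3)
    return try {
        for (id in nodes.keys) {
            // Simulates remove() interleaving with an iteration in progress.
            if (id == "a") nodes.remove("b")
        }
        false
    } catch (e: ConcurrentModificationException) {
        // The iterator detected the structural change on its next step.
        true
    }
}

fun main() {
    println(iterationThrowsOnStructuralChange()) // prints "true"
}
```

With real threads the interleaving is nondeterministic, so the exception would
surface only intermittently, which fits a try/catch fallback going unnoticed.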

Changes:

- Added a Mutex, nodesMutex, to serialise all structural mutations (getOrPut,
  remove).
- @Volatile var nodeValues: List<MutableStateFlow<Node>> is refreshed after
  every mutation inside the lock; non-suspend readers (nodes(), children())
  iterate this stable snapshot instead of the live map.
- update() dispatches the map mutation and swarm update together into
  scope.launch { mutateNodes { ... } } so both are atomic relative to
  concurrent readers.
- remove() made suspend (all callers were already in coroutine contexts); it
  uses mutateNodes {} for the batch removal.
- delete() moves its nodes.remove call into an async
  scope.launch { mutateNodes { ... } }.
- collectNodeIdsToRemove() reads from nodeValues (stable snapshot) instead of
  iterating nodes.values directly.
- updateHandMadeServerWithRealData() made suspend to allow calling remove().
- LocalSwarmHost.deleteNode() updated to dispatch remove() via scope.launch {}.

Lessons:

- Plain mutable collections (mutableMapOf, mutableListOf) must not be shared
  across coroutines that run on Dispatchers.Default or Dispatchers.IO without
  synchronisation. Treat them the same as var shared state.
- Iteration (.values, .filter {}) is the most dangerous pattern:
  ConcurrentModificationException (CME) is thrown by iterators, not by
  single-key .get(). If you see a try/catch around a map iteration returning a
  fallback, it's a concurrency smell.
- Pattern used here: Mutex + @Volatile snapshot. Mutations hold the lock and
  refresh the snapshot; reads use the snapshot outside the lock.
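
The Mutex plus @Volatile snapshot pattern can be sketched roughly as follows.
This is a simplified illustration under stated assumptions, not the actual
ClientNodeManager: NodeRegistry, the Node fields, the signature of mutateNodes,
and runDemo are hypothetical, and update() is shown as a plain suspend function
rather than a scope.launch dispatch.

```kotlin
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.coroutineScope
import kotlinx.coroutines.flow.MutableStateFlow
import kotlinx.coroutines.launch
import kotlinx.coroutines.runBlocking
import kotlinx.coroutines.sync.Mutex
import kotlinx.coroutines.sync.withLock

// Hypothetical stand-in for the real Node type.
data class Node(val id: String, val state: String)

class NodeRegistry {
    private val nodesMutex = Mutex()
    private val nodes: MutableMap<String, MutableStateFlow<Node>> = mutableMapOf()

    // Immutable copy of nodes.values, republished after every mutation.
    @Volatile
    private var nodeValues: List<MutableStateFlow<Node>> = emptyList()

    // Assumed shape of mutateNodes: run the block under the lock, then refresh
    // the snapshot before the lock is released.
    private suspend fun <T> mutateNodes(
        block: (MutableMap<String, MutableStateFlow<Node>>) -> T
    ): T = nodesMutex.withLock {
        val result = block(nodes)
        nodeValues = nodes.values.toList()
        result
    }

    suspend fun update(node: Node) {
        mutateNodes { map ->
            val flow = map.getOrPut(node.id) { MutableStateFlow(node) }
            flow.value = node
        }
    }

    suspend fun remove(ids: Collection<String>) {
        mutateNodes { map -> ids.forEach { map.remove(it) } }
    }

    // Non-suspend readers only touch the snapshot, never the live map.
    fun nodes(): List<Node> = nodeValues.map { it.value }

    fun collectNodeIdsToRemove(): List<String> =
        nodeValues.map { it.value }.filter { it.state == "DELETING" }.map { it.id }
}

fun runDemo(): Int = runBlocking {
    val registry = NodeRegistry()
    coroutineScope {
        repeat(100) { i ->
            val state = if (i % 2 == 0) "DELETING" else "ACTIVE"
            launch(Dispatchers.Default) { registry.update(Node("n$i", state)) }
            launch(Dispatchers.Default) { registry.nodes() } // concurrent reader
        }
    }
    // All writers have finished here; drop the 50 nodes marked DELETING.
    registry.remove(registry.collectNodeIdsToRemove())
    registry.nodes().size
}

fun main() {
    println(runDemo()) // prints "50"
}
```

Readers may observe a snapshot that is one mutation behind, but they never
iterate a map that is being structurally modified, so the iteration itself
cannot throw ConcurrentModificationException.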