After a node’s collection job was killed by a CancellationException thrown from dispatch, the node became permanently un-observable: a subsequent observe() call for the same node id silently skipped setup, no collection job was started, and the UI received no further state updates for that node.
The finally block inside DefaultNodeObserver’s inner collection coroutine only logged on exit; it never removed the dead job from the jobs map:
    jobs[node.value.id] = observerScope.launch {
        try {
            node.collect(collector)
        } finally {
            logger.w("exited node observing job ${node.details()}")
            // jobs[id] is NOT removed here, so the stale entry persists
        }
    }
When dispatch threw a CancellationException, the inner job was cancelled and exited via finally, but jobs[id] still held the dead Job reference. The idempotency check at the top of observe():
    if (!jobs.containsKey(node.value.id)) { … }
found the stale entry and returned without creating a new collection job. All subsequent observations for that node id were silently dropped.
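The failure mode can be reproduced in isolation. The following is a minimal sketch, assuming kotlinx.coroutines is on the classpath; LeakyRegistry, jobFor, starts and staleEntryDemo are hypothetical names invented for illustration and are not part of DefaultNodeObserver.

```kotlin
import kotlinx.coroutines.*
import kotlinx.coroutines.sync.Mutex
import kotlinx.coroutines.sync.withLock

// Hypothetical stand-in for the observer: at most one collection job per node id.
class LeakyRegistry(private val scope: CoroutineScope) {
    private val mutex = Mutex()
    private val jobs = mutableMapOf<String, Job>()

    // Number of collection jobs that were actually launched.
    var starts = 0
        private set

    suspend fun observe(id: String, collect: suspend () -> Unit) {
        mutex.withLock {
            if (!jobs.containsKey(id)) { // idempotency check
                starts++
                jobs[id] = scope.launch {
                    try {
                        collect()
                    } finally {
                        // Mirrors the bug: nothing removes jobs[id] on exit.
                    }
                }
            }
        }
    }

    suspend fun jobFor(id: String): Job? = mutex.withLock { jobs[id] }
}

fun staleEntryDemo(): Pair<Int, Boolean> = runBlocking {
    val scope = CoroutineScope(SupervisorJob() + Dispatchers.Default)
    val registry = LeakyRegistry(scope)

    // The first observation dies with a CancellationException.
    registry.observe("node-1") { throw CancellationException("dispatch cancelled") }
    val dead = registry.jobFor("node-1")!!
    dead.join()

    // The second observation hits the stale entry and is skipped.
    registry.observe("node-1") { awaitCancellation() }

    val result = registry.starts to dead.isActive
    scope.cancel()
    result
}
```

In this sketch staleEntryDemo() returns (1, false): only one job was ever launched, and the job still stored under the id is no longer active.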
Capture the node id in a val before the inner launch (cleaner closure).
In the finally block, remove the stale entry under withContext(NonCancellable) so the mutex can be acquired even while the job is in a cancelling state:
    val id = node.value.id
    jobs[id] = observerScope.launch {
        try {
            node.collect(collector)
        } finally {
            logger.w("exited node observing job ${node.details()}")
            withContext(NonCancellable) {
                mutex.withLock { jobs.remove(id) }
            }
        }
    }
Added regression test "re-observe after CancellationException restarts dispatch" in DefaultNodeObserverDispatchTest.
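The behaviour that such a test pins down can be sketched outside the project. This is not the actual test from DefaultNodeObserverDispatchTest; SelfCleaningRegistry and reobserveDemo are hypothetical names, and the sketch assumes only kotlinx.coroutines. Here observe() returns the launched Job, or null when the idempotency check skips setup, purely so the demo can wait on it.

```kotlin
import kotlinx.coroutines.*
import kotlinx.coroutines.sync.Mutex
import kotlinx.coroutines.sync.withLock

// Hypothetical registry that applies the fix: each job removes its own entry on exit.
class SelfCleaningRegistry(private val scope: CoroutineScope) {
    private val mutex = Mutex()
    private val jobs = mutableMapOf<String, Job>()

    // Returns the new job, or null if one is already tracked for this id.
    suspend fun observe(id: String, collect: suspend () -> Unit): Job? = mutex.withLock {
        if (jobs.containsKey(id)) return@withLock null
        val job = scope.launch {
            try {
                collect()
            } finally {
                // Suspending cleanup is wrapped so it also runs while cancelling.
                withContext(NonCancellable) {
                    mutex.withLock { jobs.remove(id) }
                }
            }
        }
        jobs[id] = job
        job
    }
}

fun reobserveDemo(): List<Boolean> = runBlocking {
    val scope = CoroutineScope(SupervisorJob() + Dispatchers.Default)
    val registry = SelfCleaningRegistry(scope)

    // 1. Collection dies with a CancellationException thrown from its body.
    val first = registry.observe("node-1") { throw CancellationException("dispatch cancelled") }
    first?.join()

    // 2. Re-observing the same id starts a fresh job.
    val second = registry.observe("node-1") { awaitCancellation() }

    // 3. Cancelling that job from outside also frees the id.
    second?.cancelAndJoin()
    val third = registry.observe("node-1") { awaitCancellation() }

    val result = listOf(first != null, second != null, third != null)
    scope.cancel()
    result
}
```

All three observe() calls are expected to start a job here, because join() and cancelAndJoin() only return after the finally block has removed the entry.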
A job registry must remove its own entry in finally. If it only removes on the explicit “cancel this specific job” path, the stale-entry case is always lurking.
A finally in a coroutine must use withContext(NonCancellable) for any suspending cleanup (mutex acquire, channel send, etc.). Without it, the first cleanup call that actually suspends throws CancellationException and the rest of the cleanup is skipped silently.
When a finally block runs in a cancelled coroutine without NonCancellable, it appears to succeed from the caller’s perspective (no exception), but the suspending operations inside are skipped. This is the same shape as the earlier close() async-teardown race (krill#465); the pattern recurs whenever teardown suspends without opting into NonCancellable.
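The difference between the two kinds of finally block can be observed directly. Below is a small sketch, assuming kotlinx.coroutines; cleanupDemo and the log messages are made up for illustration, and delay(10) stands in for any suspending cleanup call.

```kotlin
import kotlinx.coroutines.*

fun cleanupDemo(): List<String> = runBlocking {
    // Everything runs on the runBlocking thread, so a plain list is safe here.
    val log = mutableListOf<String>()

    val bare = launch {
        try {
            awaitCancellation()
        } finally {
            log += "bare: finally entered"
            delay(10) // throws CancellationException: the job is already cancelled
            log += "bare: cleanup done" // never reached
        }
    }
    yield() // let the child start and suspend
    bare.cancelAndJoin() // returns normally; the caller sees no exception

    val guarded = launch {
        try {
            awaitCancellation()
        } finally {
            withContext(NonCancellable) {
                log += "guarded: finally entered"
                delay(10) // suspends and resumes normally
                log += "guarded: cleanup done"
            }
        }
    }
    yield()
    guarded.cancelAndJoin()

    log
}
```

The returned log contains "bare: finally entered", "guarded: finally entered" and "guarded: cleanup done". The missing "bare: cleanup done" entry is the silently skipped cleanup.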