On the very first app launch (FTUE: accept TOS, enter PIN), a discovered server appears on ClientScreen but with no child nodes. Restarting the app downloads the nodes correctly. Desktop logs showed, immediately after the first peer connection:
[NodeHttp] 0 Server kraken PAIRING: read node
[NodeHttp] Error getting node 0 Server kraken PAIRING
kotlinx.coroutines.JobCancellationException: Parent job is Completed; job=SupervisorJobImpl{Completed}
Not an app-scope race — the completed SupervisorJob belonged to the Ktor HttpClient, not
to IO_SCOPE. NodeHttp is a Koin single constructed with the resolved client instance
(SharedModule.kt: NodeHttp(httpClient, ...)), so it captures whatever client exists at DI time.
On first launch the trust store is empty, so that captured client (client A) trusts no peers.
The first time we contact a peer, DefaultTrustHttpClient.fetchPeerCert downloads and writes the
peer cert, then — because the cert is new — calls rebuildHttpClient(). The old
HttpClientProvider.rebuild() built a new client (B) and closed the old one (A). Closing a
Ktor client completes its internal SupervisorJob. The in-flight connection coroutine then called
nodeHttp.readNode()/readNodes() on the still-captured client A, which fails instantly with
Parent job is Completed. The server node was already registered, but its children never loaded.
Second launch works because the cert is already on disk (certChanged == false), so no rebuild,
no close — client A stays alive. It reproduces exactly once per peer (the first connection ever).
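The failure sequence above can be modelled without Ktor. The following is a minimal sketch, not the real code: FakeClient, SwapAndCloseProvider and CapturingHolder are hypothetical stand-ins for the Ktor HttpClient, the old HttpClientProvider and NodeHttp, and the error message only imitates the one in the logs.

```kotlin
// Hypothetical stand-ins; none of these are the real Ktor or krill classes.
class FakeClient(val name: String) {
    var closed = false
        private set

    fun close() { closed = true }

    // Imitates the observed behaviour: a closed client rejects new work.
    fun get(path: String): String {
        check(!closed) { "Parent job is Completed (client $name)" }
        return "$name:$path"
    }
}

// Old provider behaviour: build a new client, then close the previous one.
class SwapAndCloseProvider {
    var current = FakeClient("A")
        private set

    fun rebuild() {
        val old = current
        current = FakeClient("B")
        old.close()
    }
}

// Like NodeHttp: captures whichever instance exists at construction time.
class CapturingHolder(private val client: FakeClient) {
    fun readNode(): String = client.get("/node")
}

// Returns true when the read after rebuild() fails, as on first launch.
fun firstLaunchFails(): Boolean {
    val provider = SwapAndCloseProvider()
    val holder = CapturingHolder(provider.current) // captures client A at "DI time"
    provider.rebuild()                             // new cert: A is closed, B is live
    return runCatching { holder.readNode() }.isFailure
}

fun main() {
    println(firstLaunchFails())
}
```

If rebuild() is never called (the second-launch case, where the cert is already on disk), the holder's client stays open and readNode() succeeds.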
NodeHttp lives in krill-sdk (krill-oss) and takes a final HttpClient, so it cannot be made
to re-resolve the live client — the fix has to keep the same instance alive.
HttpClientProvider (JVM + Android HttpClientContainer.*.kt) now builds one client whose
trust material is reloaded in place via a new ReloadableX509TrustManager (a delegating
X509TrustManager with a @Volatile inner delegate rebuilt from the trust dir on reload()).
rebuild() now calls trustManager.reload() instead of swap-and-close, so the client instance —
and every reference captured by long-lived holders (NodeHttp, SSE, camera) — is never closed.
CIO consults the trust manager on each new handshake, so a connection opened after the reload
trusts the new cert. On iOS the Darwin client accepts server trust directly in handleChallenge
(TOFU) and never reads the cert, so rebuild() is a deliberate no-op (it never closes the client).
The trust dir is injectable so tests use a temp dir.
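A delegating trust manager of the kind described above might look roughly like the sketch below. It is an illustration under assumptions, not the actual ReloadableX509TrustManager: the class name ReloadableTrustManagerSketch, the ".crt" file filter and the one-certificate-per-file layout are all invented for the example.

```kotlin
import java.io.File
import java.security.KeyStore
import java.security.cert.CertificateFactory
import java.security.cert.X509Certificate
import javax.net.ssl.TrustManagerFactory
import javax.net.ssl.X509TrustManager

// Sketch only: the real class may load, name, and filter certs differently.
class ReloadableTrustManagerSketch(private val trustDir: File) : X509TrustManager {

    // Replaced as a whole on reload(); each handshake reads the current delegate.
    @Volatile
    private var delegate: X509TrustManager = build()

    fun reload() {
        delegate = build()
    }

    private fun build(): X509TrustManager {
        // Start from an empty in-memory key store and add every cert found on disk.
        val store = KeyStore.getInstance(KeyStore.getDefaultType()).apply { load(null, null) }
        val factory = CertificateFactory.getInstance("X.509")
        // Assumption: one certificate per file, with a ".crt" extension.
        trustDir.listFiles { f -> f.isFile && f.extension == "crt" }
            ?.sortedBy { it.name }
            ?.forEach { file ->
                file.inputStream().use { input ->
                    store.setCertificateEntry(
                        file.nameWithoutExtension,
                        factory.generateCertificate(input)
                    )
                }
            }
        val tmf = TrustManagerFactory.getInstance(TrustManagerFactory.getDefaultAlgorithm())
        tmf.init(store)
        return tmf.trustManagers.filterIsInstance<X509TrustManager>().first()
    }

    override fun checkClientTrusted(chain: Array<X509Certificate>, authType: String) =
        delegate.checkClientTrusted(chain, authType)

    override fun checkServerTrusted(chain: Array<X509Certificate>, authType: String) =
        delegate.checkServerTrusted(chain, authType)

    override fun getAcceptedIssuers(): Array<X509Certificate> = delegate.acceptedIssuers
}
```

With the Ktor CIO engine, an instance like this would presumably be installed once through the engine's https { trustManager = ... } setting, so that later reload() calls change what the existing client trusts without touching the client itself. Taking the directory as a constructor parameter is what lets a test point it at a temp dir.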
HttpClientProviderConcurrencyTest.rebuildPreservesClientInstanceAndKeepsItActive
asserts rebuild() preserves the same HttpClient instance and leaves its job active. The
pre-existing test asserted the opposite (assertNotSame — “rebuild must produce a distinct
client”); that encoded invariant was the bug. Lesson: a test that pins
“this factory returns a fresh object” is suspect when long-lived consumers capture the object.
Do not close() a shared resource that DI consumers captured by value. If trust
material/config must change at runtime, mutate it behind a stable handle (reloadable delegate),
don’t swap the handle.
NodeHttp taking a concrete HttpClient rather than a () -> HttpClient provider is the underlying
fragility; the krill-side reload fully fixes the bug but the SDK seam invites the same class of
mistake.
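For comparison, a holder that takes a provider lambda instead of a concrete instance could be sketched as follows. This is hypothetical and is not a proposed krill-sdk API: FakeClient, SwappingProvider and ProviderBackedHolder are invented stand-ins for HttpClient, a swap-and-close provider and NodeHttp.

```kotlin
// Hypothetical stand-ins; not the real Ktor HttpClient or krill-sdk NodeHttp.
class FakeClient(val name: String) {
    var closed = false
        private set

    fun close() { closed = true }

    fun get(path: String): String {
        check(!closed) { "client $name is closed" }
        return "$name:$path"
    }
}

// A provider that still swaps and closes, as the old rebuild() did.
class SwappingProvider {
    var current = FakeClient("A")
        private set

    fun rebuild() {
        val old = current
        current = FakeClient("B")
        old.close()
    }
}

// Resolves the client on every call instead of capturing one instance.
class ProviderBackedHolder(private val client: () -> FakeClient) {
    fun readNode(): String = client().get("/node")
}

// Reads through the holder after a rebuild; the lambda resolves to client B.
fun readAfterRebuild(): String {
    val provider = SwappingProvider()
    val holder = ProviderBackedHolder { provider.current }
    provider.rebuild()
    return holder.readNode()
}

fun main() {
    println(readAfterRebuild()) // prints "B:/node"
}
```

A seam like this would make a swapped client visible to the next call, but it would not by itself protect a request already running on the old instance, which is one reason to keep the handle stable instead.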