2026-06-18-stale-httpclient-on-first-cert-fetch

Symptom

On the very first app launch (first-time user experience, FTUE: accept TOS, enter PIN), a discovered server appears on ClientScreen but with no child nodes. Restarting the app downloads the nodes correctly. Desktop logs showed, immediately after the first peer connection:

[NodeHttp] 0 Server kraken PAIRING: read node
[NodeHttp] Error getting node 0 Server kraken PAIRING
kotlinx.coroutines.JobCancellationException: Parent job is Completed; job=SupervisorJobImpl{Completed}

Root cause

This was not an app-scope race: the completed SupervisorJob belonged to the Ktor HttpClient, not to IO_SCOPE. NodeHttp is a Koin single constructed with the resolved client instance (SharedModule.kt: NodeHttp(httpClient, ...)), so it captures whatever client exists at DI time. On first launch the trust store is empty, so that captured client (client A) trusts no peers.

The first time we contact a peer, DefaultTrustHttpClient.fetchPeerCert downloads and writes the peer cert and then, because the cert is new, calls rebuildHttpClient(). The pre-fix HttpClientProvider.rebuild() built a new client (B) and closed the old one (A). Closing a Ktor client completes its internal SupervisorJob. The in-flight connection coroutine then called nodeHttp.readNode()/readNodes() on the still-captured client A, which fails instantly with "Parent job is Completed". The server node was already registered, but its children never loaded.
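
The sequence above can be reproduced in miniature without Ktor or Koin. The sketch below is illustrative only: FakeClient, FakeNodeHttp, and SwapAndCloseProvider are hypothetical stand-ins, not the real classes, and the "closed client rejects work" behaviour is modelled with a plain flag rather than a coroutine Job.

```kotlin
// Hypothetical stand-ins; none of these are the real Ktor or krill types.
class FakeClient(val name: String) {
    var closed = false
        private set

    fun close() {
        closed = true
    }

    // Models the observed behaviour: a closed client rejects any new request.
    fun get(path: String): String {
        check(!closed) { "Parent job is Completed ($name)" }
        return "$name:$path"
    }
}

// Captures the client instance once, like a DI single built from a resolved value.
class FakeNodeHttp(private val client: FakeClient) {
    fun readNode(id: String): String = client.get("/node/$id")
}

// Pre-fix behaviour: rebuild() swaps in a new client and closes the old one.
class SwapAndCloseProvider {
    var current = FakeClient("A")
        private set

    fun rebuild() {
        val old = current
        current = FakeClient("B")
        old.close()
    }
}

// Returns the result of the first read and the failure message of the second.
fun demoStaleCapture(): Pair<String, String?> {
    val provider = SwapAndCloseProvider()
    val nodeHttp = FakeNodeHttp(provider.current) // captures client A
    val before = nodeHttp.readNode("0")           // succeeds while A is open
    provider.rebuild()                            // A is closed, B is now current
    val failure = runCatching { nodeHttp.readNode("0") }.exceptionOrNull()?.message
    return before to failure
}
```

Calling demoStaleCapture() yields a successful first read followed by a failure on the same holder: the holder never sees client B, because it captured A by value.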

The second launch works because the cert is already on disk (certChanged == false), so there is no rebuild and no close, and client A stays alive. The bug therefore reproduces exactly once per peer, on the first connection ever. NodeHttp lives in krill-sdk (krill-oss) and takes a final HttpClient, so it cannot be made to re-resolve the live client; the fix has to keep the same instance alive.

Fix

HttpClientProvider (JVM + Android HttpClientContainer.*.kt) now builds one client whose trust material is reloaded in place via a new ReloadableX509TrustManager: a delegating X509TrustManager whose @Volatile inner delegate is rebuilt from the trust dir on reload(). rebuild() now calls trustManager.reload() instead of swap-and-close, so the client instance, along with every reference captured by long-lived holders (NodeHttp, SSE, camera), is never closed. CIO consults the trust manager on each new handshake, so a connection opened after the reload trusts the new cert. On iOS the Darwin client accepts server trust directly in handleChallenge (TOFU) and never reads the cert, so rebuild() is a deliberate no-op there and likewise never closes the client. The trust dir is injectable so that tests can use a temp dir.
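
For readers who have not seen the pattern, here is a minimal JVM sketch of a reloadable delegating trust manager. It is not the actual ReloadableX509TrustManager: the class name, the "one certificate per *.pem file" layout, and the absence of error handling are all assumptions made for illustration. How the manager is handed to the HTTP engine is not shown.

```kotlin
import java.io.File
import java.security.KeyStore
import java.security.cert.CertificateFactory
import java.security.cert.X509Certificate
import javax.net.ssl.TrustManagerFactory
import javax.net.ssl.X509TrustManager

// Sketch only; naming, file layout, and error handling are assumptions.
class ReloadableTrustManagerSketch(private val trustDir: File) : X509TrustManager {

    // @Volatile so handshake threads see the delegate published by reload().
    @Volatile
    private var delegate: X509TrustManager = build()

    // Rebuild the inner delegate from whatever certs are in trustDir right now.
    fun reload() {
        delegate = build()
    }

    override fun checkClientTrusted(chain: Array<X509Certificate>, authType: String) {
        delegate.checkClientTrusted(chain, authType)
    }

    override fun checkServerTrusted(chain: Array<X509Certificate>, authType: String) {
        delegate.checkServerTrusted(chain, authType)
    }

    override fun getAcceptedIssuers(): Array<X509Certificate> = delegate.acceptedIssuers

    private fun build(): X509TrustManager {
        // Start from an empty in-memory key store.
        val store = KeyStore.getInstance(KeyStore.getDefaultType()).apply { load(null, null) }
        val factory = CertificateFactory.getInstance("X.509")
        // Assumption: the trust dir holds one certificate per *.pem file.
        trustDir.listFiles { f -> f.extension == "pem" }.orEmpty().forEach { file ->
            file.inputStream().use { input ->
                store.setCertificateEntry(file.nameWithoutExtension, factory.generateCertificate(input))
            }
        }
        val tmf = TrustManagerFactory.getInstance(TrustManagerFactory.getDefaultAlgorithm())
        tmf.init(store)
        return tmf.trustManagers.filterIsInstance<X509TrustManager>().first()
    }
}
```

The point of the design is that the outer object is the stable handle: the TLS layer and the client keep one reference for their whole lifetime, and only the inner delegate is replaced when a new peer cert lands on disk.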

Prevention

Regression test HttpClientProviderConcurrencyTest.rebuildPreservesClientInstanceAndKeepsItActive asserts that rebuild() preserves the same HttpClient instance and leaves its job active. The pre-existing test asserted the opposite (assertNotSame: “rebuild must produce a distinct client”); that encoded invariant was the bug. Lesson: a test that pins “this factory returns a fresh object” is suspect when long-lived consumers capture the object.

General rule: never close() a shared resource that DI consumers captured by value. If trust material or config must change at runtime, mutate it behind a stable handle (a reloadable delegate) rather than swapping the handle.

Flag for follow-up at krill-oss: NodeHttp taking a concrete HttpClient rather than a () -> HttpClient provider is the underlying fragility. The krill-side reload fully fixes the bug, but the SDK seam invites the same class of mistake.
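
To make the follow-up concrete, the sketch below contrasts the two constructor shapes. It uses hypothetical stand-ins (StubClient, ValueCapturingNodeHttp, ProviderBackedNodeHttp), not the krill-sdk API, and only shows which client each shape ends up talking to after a swap.

```kotlin
// Hypothetical stand-ins for HttpClient and NodeHttp; not the krill-sdk API.
class StubClient(val id: Int) {
    fun get(path: String): String = "client-$id:$path"
}

// Captures the instance: later swaps are invisible to this class.
class ValueCapturingNodeHttp(private val client: StubClient) {
    fun readNode(path: String): String = client.get(path)
}

// Captures a provider: every call resolves whichever client is live at that moment.
class ProviderBackedNodeHttp(private val clientProvider: () -> StubClient) {
    fun readNode(path: String): String = clientProvider().get(path)
}

fun demoProviderSeam(): List<String> {
    var live = StubClient(1)
    val byValue = ValueCapturingNodeHttp(live)
    val byProvider = ProviderBackedNodeHttp { live }
    live = StubClient(2) // simulate a swap of the live client
    return listOf(byValue.readNode("/nodes"), byProvider.readNode("/nodes"))
}
```

After the swap, the value-capturing holder still talks to client 1 while the provider-backed holder reaches client 2. A provider seam alone would not protect requests already in flight on the old client, so it complements the reload-in-place fix rather than replacing it.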