Krill Peer Mesh Network Architecture

Deep dive into Krill's peer-to-peer mesh networking including beacon discovery, cluster PIN trust, SSE real-time updates, and server settings bootstrap

Overview

This document describes how Krill’s peer-to-peer mesh networking operates. It covers beacon discovery with rolling TOTP cluster tokens, PIN-based cluster trust, the SSE (Server-Sent Events) real-time update lifecycle, and the /trust endpoint used for certificate download.

Core Components

Key Classes and Their Responsibilities

| Class | Location | Purpose |
|---|---|---|
| BeaconSupervisor | shared | Manages beacon listening and broadcasting lifecycle |
| BeaconProcessor | shared | Processes incoming beacons, validates cluster token, detects new/reconnected peers |
| BeaconSender | shared | Sends beacon signals with rate limiting; includes rolling TOTP cluster token |
| PeerSessionManager | shared | Tracks known peers by installId/sessionId for deduplication |
| ServerHandshakeProcess | shared | Handles trust establishment (cert download, health check) and node synchronization |
| CertificateCache | shared | Caches validated connections to avoid redundant cert downloads |
| SSEBoss | shared | Manages SSE connections to servers for real-time updates |
| ServerNodeManager | shared | Server-side node management with nodeUpdates SharedFlow for SSE broadcasting |
| ServerServerProcessor | shared | Processes Server node state changes, triggers connection flow |

NodeWire - Beacon Data Structure

data class NodeWire(
    val timestamp: Long,      // When beacon was sent
    val installId: String,    // Stable peer identity (UUID)
    val host: String,         // Hostname or IP
    val port: Int,            // Server port (0 for apps)
    val sessionId: String,    // Current session ID (changes on restart)
    val clusterToken: String  // Rolling TOTP token derived from cluster PIN
)

Key Identity Rules:

  • installId is the primary peer identity (stable across restarts, IP changes)
  • sessionId changes on each restart (used to detect peer restarts)
  • clusterToken is a time-based rolling token derived from the cluster PIN; used to validate cluster membership before any trust is established
  • host:port is used for network connectivity but NOT for identity keying
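
The post does not spell out how the rolling cluster token is computed. As a rough illustration only, the sketch below assumes an HMAC-SHA256 over a 30-second window counter keyed by the cluster PIN, and accepts the neighbouring windows to tolerate clock skew. The function names, the window length, and the hex encoding are hypothetical and are not taken from Krill's code. The sketch also assumes a JVM target.

```kotlin
import java.security.MessageDigest
import javax.crypto.Mac
import javax.crypto.spec.SecretKeySpec

// Assumption: 30-second windows. The real window length is not stated in the post.
const val STEP_MILLIS = 30_000L

// Derive a hex token for the time window that contains nowMillis.
// Assumed scheme: HMAC-SHA256 keyed by the PIN, computed over the window counter.
fun clusterToken(pin: String, nowMillis: Long): String {
    val counter = nowMillis / STEP_MILLIS
    val mac = Mac.getInstance("HmacSHA256")
    mac.init(SecretKeySpec(pin.toByteArray(Charsets.UTF_8), "HmacSHA256"))
    val digest = mac.doFinal(counter.toString().toByteArray(Charsets.UTF_8))
    return digest.joinToString("") { "%02x".format(it) }
}

// Accept the current window and one window either side, using a
// constant-time comparison so timing does not leak how much of the token matched.
fun isValidClusterToken(pin: String, token: String, nowMillis: Long): Boolean =
    (-1L..1L).any { offset ->
        val expected = clusterToken(pin, nowMillis + offset * STEP_MILLIS)
        MessageDigest.isEqual(expected.toByteArray(), token.toByteArray())
    }
```

Under these assumptions a beacon sent late in one window still validates early in the next, while a token from a different PIN or from several windows ago is rejected.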

Beacon Discovery Flow

Beacon Classification

  • Server beacons: port > 0 (servers listen on a specific port)
  • App beacons: port == 0 (apps don’t run a server)

Startup Beacon Flow

sequenceDiagram
    participant App as App/Server
    participant BS as BeaconSupervisor
    participant MC as Multicast
    participant BP as BeaconProcessor
    participant PSM as PeerSessionManager
    participant SHP as ServerHandshakeProcess

    App->>BS: startBeaconProcess()
    BS->>MC: receiveBeacons(callback)
    BS->>BS: sentStartupBeacon = true
    BS->>MC: sendBeacon(wire + clusterToken)
    Note over MC: Beacon broadcast on startup

    MC-->>BS: Incoming beacon received
    BS->>BS: handleIncomeWire(wire)

    alt wire.port > 0 (Server beacon)
        BS->>BP: processWire(wire)
        BP->>BP: validateClusterToken(wire.clusterToken)?

        alt Invalid cluster token
            Note over BP: Skip - not a cluster member
        else Valid cluster token
            BP->>PSM: isKnownSession(wire)?

            alt Duplicate beacon (same session)
                PSM-->>BP: true
                Note over BP: Skip - already known
            else Known host, new session (restart)
                PSM-->>BP: false (but hasKnownHost = true)
                BP->>PSM: add(wire)
                BP->>SHP: trustServer(wire)
            else New host
                PSM-->>BP: false
                BP->>PSM: add(wire)
                BP->>SHP: trustServer(wire)
            end
        end
    else wire.port == 0 (App beacon)
        alt SystemInfo.isServer()
            BS->>MC: sendBeacon(ourWire)
            Note over BS: Server responds to app beacon
        end
    end
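
The decision tree in the diagram above can be sketched as a single function. This is a simplified stand-in, not the actual BeaconSupervisor or BeaconProcessor code: `BeaconAction`, `PeerSessions`, `handleBeacon`, and the `tokenIsValid` callback are hypothetical names introduced for illustration, and the real classes trigger side effects rather than returning a label.

```kotlin
// Same shape as the NodeWire shown earlier in this post.
data class NodeWire(
    val timestamp: Long,
    val installId: String,
    val host: String,
    val port: Int,
    val sessionId: String,
    val clusterToken: String
)

// Hypothetical outcome labels, used only in this sketch.
enum class BeaconAction { IGNORE, TRUST_NEW_PEER, TRUST_RESTARTED_PEER, REPLY_WITH_OWN_BEACON }

// Minimal stand-in for PeerSessionManager: keyed by installId, never by host:port.
class PeerSessions {
    private val sessionByInstallId = mutableMapOf<String, String>()
    fun isKnownSession(wire: NodeWire) = sessionByInstallId[wire.installId] == wire.sessionId
    fun hasKnownHost(wire: NodeWire) = wire.installId in sessionByInstallId
    fun add(wire: NodeWire) { sessionByInstallId[wire.installId] = wire.sessionId }
}

fun handleBeacon(
    wire: NodeWire,
    sessions: PeerSessions,
    weAreServer: Boolean,
    tokenIsValid: (String) -> Boolean
): BeaconAction {
    if (wire.port == 0) {
        // App beacon: only servers answer, so the app can discover them.
        return if (weAreServer) BeaconAction.REPLY_WITH_OWN_BEACON else BeaconAction.IGNORE
    }
    if (!tokenIsValid(wire.clusterToken)) return BeaconAction.IGNORE // not a cluster member
    if (sessions.isKnownSession(wire)) return BeaconAction.IGNORE    // duplicate beacon
    val restarted = sessions.hasKnownHost(wire) // same installId, new sessionId
    sessions.add(wire)
    return if (restarted) BeaconAction.TRUST_RESTARTED_PEER else BeaconAction.TRUST_NEW_PEER
}
```

Because the lookup is keyed by installId, a peer that restarts on a new IP address is still recognised as the same peer with a new session.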

Connection Pipeline Flow

Connection Result States

enum class ConnectionResult {
    SUCCESS,            // Connected and synced
    CERTIFICATE_ERROR,  // SSL/TLS issue - need cert download
    NETWORK_ERROR,      // Network unreachable
    AUTH_ERROR          // Bearer token rejected (PIN mismatch)
}
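
One way to arrive at these states is to classify the outcome of the health check. The sketch below is an assumption about how such a mapping could look on the JVM, not Krill's actual logic: `resultFor` is a hypothetical helper, and treating any other status code as NETWORK_ERROR is a choice made here for illustration.

```kotlin
import javax.net.ssl.SSLException

// Repeated here so the sketch is self-contained.
enum class ConnectionResult { SUCCESS, CERTIFICATE_ERROR, NETWORK_ERROR, AUTH_ERROR }

// Classify one request attempt from its HTTP status or the exception it threw.
fun resultFor(statusCode: Int?, failure: Throwable?): ConnectionResult = when {
    // SSLException is an IOException subtype, so it must be checked before the generic case.
    failure is SSLException -> ConnectionResult.CERTIFICATE_ERROR
    failure != null -> ConnectionResult.NETWORK_ERROR
    statusCode == 401 -> ConnectionResult.AUTH_ERROR
    statusCode != null && statusCode in 200..299 -> ConnectionResult.SUCCESS
    else -> ConnectionResult.NETWORK_ERROR // assumption: unexpected statuses count as unreachable
}
```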

Beacon Validated to SSE Connection

sequenceDiagram
    participant BP as BeaconProcessor
    participant SHP as ServerHandshakeProcess
    participant CC as CertificateCache
    participant HTTP as HTTP Client
    participant SSE as SSEBoss

    BP->>SHP: trustServer(wire)
    Note over BP,SHP: Beacon cluster token already validated

    Note over SHP: Mutex lock with installId-based job key
    SHP->>SHP: Check existing job for installId

    alt Job already running
        Note over SHP: Skip - idempotent
    else New job
        SHP->>CC: hasValidConnection(installId, host, port, sessionId)?

        alt No cached connection
            SHP->>HTTP: GET /trust (download certificate)
            HTTP-->>SHP: Self-signed certificate
            SHP->>SHP: Store certificate in trust store

            SHP->>HTTP: GET /health (Bearer: PIN-derived token)

            alt Success (200 OK)
                SHP->>HTTP: GET /nodes (Bearer: PIN-derived token)
                HTTP-->>SHP: Node data
                SHP->>SHP: Sync all nodes
                SHP->>CC: markValid(installId, ...)
                SHP->>SSE: connect(peer)
                Note over SSE: SSE stream with Bearer: PIN-derived token
                SHP->>SHP: return SUCCESS
            else SSL Error
                SHP->>CC: invalidate(installId)
                SHP->>SHP: Re-download cert, retry
            else Unauthorized (401)
                SHP->>SHP: return AUTH_ERROR
                Note over SHP: PIN mismatch - not same cluster
            end
        else Cached connection valid
            SHP->>SSE: connect(peer)
            SHP->>SHP: return SUCCESS
        end
    end
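
The "skip if a job is already running" step above can be illustrated without coroutines. The diagram describes a Mutex guarding a job map. The sketch below swaps in a concurrent set from the JVM standard library to show the same idempotency idea, so it is not the ServerHandshakeProcess implementation. `HandshakeJobs` and `runOnce` are hypothetical names.

```kotlin
import java.util.concurrent.ConcurrentHashMap

class HandshakeJobs {
    // Keys of handshakes that are currently in progress.
    private val running = ConcurrentHashMap.newKeySet<String>()

    // Runs block unless a job with the same key is already in progress.
    // Returns null when the call was skipped.
    fun <T> runOnce(jobKey: String, block: () -> T): T? {
        if (!running.add(jobKey)) return null // add() is atomic: only one caller wins
        try {
            return block()
        } finally {
            running.remove(jobKey) // always release the key, even on failure
        }
    }
}
```

A burst of beacons from the same peer then results in one handshake, and later beacons can start a fresh one once the first has finished.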

/trust Endpoint - Certificate Download

The /trust endpoint serves the server’s self-signed TLS certificate. Certificate download happens only after the beacon’s rolling TOTP cluster token has been validated, ensuring that only cluster members download each other’s certificates.

/trust Flow

sequenceDiagram
    participant ServerA as Server A
    participant ServerB as Server B

    Note over ServerA: Received beacon from B<br/>with valid cluster token
    ServerA->>ServerB: GET /trust
    ServerB-->>ServerA: Self-signed certificate (krill.crt)
    ServerA->>ServerA: Store cert in trust store
    ServerA->>ServerB: GET /health (Bearer: PIN-derived token)
    ServerB-->>ServerA: 200 OK
    ServerA->>ServerB: SSE /sse (Bearer: PIN-derived token)
    Note over ServerA,ServerB: Real-time sync established
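
For the "store cert in trust store" step, a JVM-only sketch using the standard java.security APIs might look like the following. The helper names and the `krill-peer-` alias prefix are hypothetical, and the post does not say how Krill persists the trust store, so this version keeps it in memory.

```kotlin
import java.io.ByteArrayInputStream
import java.security.KeyStore
import java.security.cert.CertificateFactory
import java.security.cert.X509Certificate

// Create an empty in-memory trust store.
fun emptyTrustStore(): KeyStore =
    KeyStore.getInstance(KeyStore.getDefaultType()).apply { load(null, null) }

// Parse a PEM or DER certificate body; throws CertificateException if it is not valid X.509.
fun parseCertificate(bytes: ByteArray): X509Certificate =
    CertificateFactory.getInstance("X.509")
        .generateCertificate(ByteArrayInputStream(bytes)) as X509Certificate

// Add the downloaded certificate under a per-peer alias so a later download replaces it.
fun trustPeer(store: KeyStore, installId: String, certificateBytes: ByteArray) {
    store.setCertificateEntry("krill-peer-$installId", parseCertificate(certificateBytes))
}
```

Parsing before storing means a malformed response from /trust fails early and leaves the trust store unchanged.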

SSE Real-Time Updates

Architecture Overview

Real-time node updates flow from server to clients using Server-Sent Events (SSE). This replaces the previous WebSocket-based approach with a simpler, more reliable unidirectional stream.

sequenceDiagram
    participant Client as App Client
    participant SSEBoss as SSEBoss
    participant Server as Ktor Server
    participant NM as ServerNodeManager
    participant Actor as Actor Channel

    Client->>SSEBoss: connect(serverNode)
    SSEBoss->>Server: GET /sse (HTTPS, Bearer: PIN-derived token)
    Server-->>SSEBoss: Initial server state

    Note over NM,Actor: Node state change occurs
    NM->>Actor: update(node)
    Actor->>Actor: updateInternal(node)
    Actor->>NM: emit to _nodeUpdates SharedFlow

    NM-->>Server: nodeUpdates.collect { node }
    Server-->>SSEBoss: SSE event: node JSON
    SSEBoss->>Client: nodeManager.update(node)

SSE Connection Flow

stateDiagram-v2
    [*] --> Disconnected

    Disconnected --> Connecting: SSEBoss.connect(node)
    Connecting --> Connected: SSE stream established
    Connecting --> Disconnected: Connection error

    Connected --> Receiving: Receive node update
    Receiving --> Connected: Update local NodeManager

    Connected --> Disconnected: Connection closed
    Connected --> Disconnected: Error

    Disconnected --> [*]: Server removed

    note right of Disconnected: Reconnection attempted on next beacon
    note right of Connected: Real-time updates flow via SSE

Key Components

Server-Side (Routes.kt):

sse("/sse", serialize = { _, node ->
    fastJson.encodeToString<Node>(node as Node)
}) {
    // Send current server state on connect
    send(nodeManager.readNodeState(installId()).value)

    // Collect from the nodeUpdates SharedFlow for real-time updates
    nodeManager.nodeUpdates.collect { node ->
        send(node)
    }
}

Client-Side (SSEBoss.kt):

client.sse(urlString = sseUrl.toString()) {
    incoming.collect { event ->
        deserialize<Node>(event.data)?.let { node ->
            nodeManager.update(node)
        }
    }
}

NodeManager SharedFlow Integration

The ServerNodeManager uses the actor pattern for thread-safe updates and emits to a SharedFlow for SSE broadcasting:

// In ServerNodeManager.updateInternal()
private fun updateInternal(node: Node) {
    // Update the StateFlow (triggers NodeObserver → Processor)
    val f = nodes.getOrPut(node.id) { MutableStateFlow(node).also { observe(it) } }
    f.value = node

    // Broadcast to SSE clients (only for owned nodes)
    if (node.isMine() && node.state != NodeState.DELETING) {
        scope.launch { _nodeUpdates.emit(node) }
    }
}

Disconnect Handling

When an SSE connection is lost:

  1. SSEBoss catches the exception and logs it
  2. The job is removed from the active connections map
  3. On next beacon from peer, a new SSE connection is attempted
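
The three steps above can be sketched as a small registry of active streams. This is an illustration under assumptions, not SSEBoss itself: `ActiveStreams`, `claim`, and `onStreamFailed` are hypothetical names, and the log text is invented for the example.

```kotlin
import java.util.concurrent.ConcurrentHashMap

class ActiveStreams(private val log: (String) -> Unit = { println(it) }) {
    // installIds of peers that currently have an SSE stream.
    private val active = ConcurrentHashMap.newKeySet<String>()

    // Step 3: called when a validated beacon arrives.
    // Returns true when the caller should open a new stream.
    fun claim(installId: String): Boolean = active.add(installId)

    // Steps 1 and 2: log the failure, then forget the connection.
    fun onStreamFailed(installId: String, cause: Throwable) {
        log("SSE stream to $installId closed: ${cause.message}")
        active.remove(installId)
    }
}
```

There is no retry timer in this design: recovery depends on the peer continuing to send beacons.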

Peer Server Node State Transitions

stateDiagram-v2
    [*] --> NONE: Beacon validated + connection success

    NONE --> USER_EDIT: User updates settings
    USER_EDIT --> NONE: Connection success
    USER_EDIT --> ERROR: Connection failure

    NONE --> ERROR: SSE disconnect
    ERROR --> NONE: Beacon + successful resync

    [*] --> ERROR: Beacon validated + connection failure (network/cert error)

    note right of ERROR: ERROR state does NOT trigger network work
    note right of USER_EDIT: Triggers connection pipeline
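
The transitions in the diagram can be written as a pure function, which makes the "ERROR does not trigger network work" rule easy to check. `PeerState`, `PeerEvent`, and both functions are hypothetical names for this sketch. Krill's own NodeState has more values (DELETING, for example), and leaving the state unchanged for unlisted combinations is an assumption.

```kotlin
enum class PeerState { NONE, USER_EDIT, ERROR }

enum class PeerEvent {
    USER_UPDATED_SETTINGS, CONNECTION_SUCCESS, CONNECTION_FAILURE,
    SSE_DISCONNECT, BEACON_RESYNC_SUCCESS
}

// Transition table taken from the diagram; anything not listed keeps the current state.
fun next(state: PeerState, event: PeerEvent): PeerState = when (state to event) {
    PeerState.NONE to PeerEvent.USER_UPDATED_SETTINGS -> PeerState.USER_EDIT
    PeerState.USER_EDIT to PeerEvent.CONNECTION_SUCCESS -> PeerState.NONE
    PeerState.USER_EDIT to PeerEvent.CONNECTION_FAILURE -> PeerState.ERROR
    PeerState.NONE to PeerEvent.SSE_DISCONNECT -> PeerState.ERROR
    PeerState.ERROR to PeerEvent.BEACON_RESYNC_SUCCESS -> PeerState.NONE
    else -> state
}

// Only USER_EDIT starts the connection pipeline; ERROR waits for the next beacon.
fun triggersConnectionPipeline(state: PeerState): Boolean = state == PeerState.USER_EDIT
```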

Invariants and Guarantees

Startup Invariants

  1. Client Node EXECUTE once: On app start, client node is created/loaded and executed exactly once
  2. Server Node EXECUTE once: On server start, server node is created/loaded and executed exactly once
  3. Beacon listening starts: Both apps and servers start beacon listening on startup

Beacon Handling Invariants

  1. Server beacons identified by port > 0
  2. Cluster token validated first: Beacons with invalid TOTP cluster tokens are discarded before any further processing
  3. Duplicate beacons ignored: Same installId + sessionId = already known
  4. Peer restarts detected: Same installId + different sessionId = reconnect flow
  5. Server responds to app beacons: Enables discovery

Connection Pipeline Invariants

  1. Idempotent operations: Job keys use installId-sessionId to prevent duplicates
  2. Beacon validates cluster membership: Invalid cluster token = discard beacon, no connection attempted
  3. ERROR state = STOP: ERROR nodes don’t trigger new network work
  4. Servers don’t broadcast deleting nodes: DELETING states are not sent over SSE

Identity Invariants

  1. installId is primary identity: Used for all peer tracking
  2. sessionId detects restarts: New session = peer restarted
  3. host:port for connectivity only: NOT used for identity keying

Troubleshooting Guide

Symptom: Server appears in ERROR state

Likely Causes:

  1. PIN mismatch (server not in same cluster)
  2. Network unreachable
  3. Certificate trust issue

Log Lines to Look For:

Received beacon from ${wire.host()}:${wire.port}
Invalid cluster token from beacon - ignoring peer
Bearer token rejected (401) - cluster PIN may not match
SSL/Certificate error for peer ${installId}

Symptom: Duplicate connections

Likely Causes:

  1. Using host:port as key instead of installId (fixed in this update)
  2. Beacon storm without proper deduplication

Log Lines to Look For:

Connection job already in progress for ${jobKey}
Connection already active for peer ${installId}

Symptom: Node updates not received

Likely Causes:

  1. SSE connection disconnected
  2. ERROR state on peer node
  3. Server not emitting to nodeUpdates SharedFlow

Log Lines to Look For:

SSEBoss connecting failed
SSE sending update: ${type} ${state}
node updated ${details}

Diagram Source Map

These diagrams were derived from the actual code in the repository:

| Diagram | Source Classes/Functions |
|---|---|
| Beacon Discovery Flow | BeaconSupervisor.startBeaconListener(), BeaconSupervisor.handleIncomeWire(), BeaconProcessor.processWire(), PeerSessionManager.isKnownSession() |
| Connection Pipeline Flow | ServerHandshakeProcess.trustServer(), ServerHandshakeProcess.attemptConnection(), ServerHandshakeProcess.downloadAndSyncServerData(), CertificateCache.hasValidConnection() |
| /trust Flow | Routes.kt GET /trust, ServerHandshakeProcess.trustServer() |
| SSE Connection Flow | SSEBoss.connect(), Routes.kt /sse endpoint, ServerNodeManager.updateInternal(), NodeManager.nodeUpdates |
| Peer Node State Transitions | ServerHandshakeProcess connection results, NodeManager.setErrorState(), NodeManager.complete() |


Last verified: 2026-04-03

This post is licensed under CC BY 4.0 by Sautner Studio, LLC.