The app that demos itself — and files its own bug reports

How a headless Compose Multiplatform desktop app, driven entirely through MCP, records its own narrated product-demo videos in my cloned voice, publishes them to YouTube — and flushes out real, high-value defects every time it runs.

Posted Jun 26, 2026

By ben@krill.zone

A demo video is just an integration test you’re allowed to watch

I have written before about the staff of agents that runs CI/CD for Krill (Ghost tests, Blue fixes) and about Kraken, the third agent and the machine that remembers. This post is about a fourth thing the same machinery now does, one I didn’t see coming when I started: Krill records its own product-demo videos, narrates them in my voice, ships them to YouTube, and along the way finds bugs that nothing else was finding.

The pitch sounds like marketing automation. It isn’t, or it isn’t only that. The interesting part is the second half of the title. When you drive the real application through its real control surface and then watch the recording, you are running the most honest integration test there is: a human-shaped pass over the whole stack, except the human is a script and the eyeballs are a frontier model reading frames. If a feature doesn’t work, it doesn’t work on camera. You can’t hand-wave a video.

The architecture is the prompt. It turns out the architecture is also the test harness.

What the pipeline actually does

The whole thing lives in scripts/demo/ on the kraken box — a small uv project, one stage per file:

Read a demo script. Each demo is a single declarative YAML — an intro card, a list of scenes, a narration line per scene, an outro. The first one is the 5-second counter loop: a Cron timer, a counter DataPoint, and a Calculation of counter + 1, wired so the counter feeds its own calculation and ticks up every five seconds. The smallest possible swarm that shows Krill’s observer pattern. Every later demo is the same pipeline with a different YAML.
Narrate it. Each scene’s line goes to ElevenLabs in my cloned voice and gets cached. (Same voice the Kraken post talks about — except here it’s the real ElevenLabs clone reading copy, not the local QLoRA adapter writing it.)
Build the swarm, live, through MCP. The orchestrator starts a headless display — Xvfb :99 — launches the actual Compose Multiplatform desktop binary (krill-desktop.jar, the same fat jar a human downloads), and then drives a real Krill server by speaking to the Krill MCP over JSON-RPC. It creates the nodes, wires them, kicks them, and the app on the virtual screen renders every step as it happens. ffmpeg captures the window with x11grab.
Frame it like a product, not a screenshot. Krill is a portrait, mobile-first app; stretched across a 1080p frame the canvas sprawls and the nodes read tiny. So the recorder sizes the window to a 10″ landscape-tablet panel, centers it, captures only that region, and the muxer pads it onto a Krill-dark card background with a simulated tablet device frame — “a device on a backdrop.” It looks like Krill running on a tablet, with legible nodes and a generous dark margin.
Mux the finished film. ffmpeg assembles [intro card] + [screen recording] + [outro card], lays the narration over the recording, and normalizes loudness to YouTube’s reference level of roughly -14 LUFS so it doesn’t play back quiet. Intro and outro are rendered locally with drawtext, so the assembly step has zero cloud dependencies. Out comes an H.264/AAC 1080p MP4.
Publish it. A GitHub Actions workflow (Demo Render) runs the whole thing on the kraken self-hosted runner — the box that already has the GPUs, the Krill server, and the MCP. The exact same code path runs on my laptop and in CI; the only difference is where the ElevenLabs key comes from. The MP4 plus a YouTube metadata sidecar (SEO description, chapter markers built from the real per-scene timings) lands on the cms.krill.systems CDN, and a second workflow syncs it to the YouTube channel. From there it embeds back here on the blog.
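
As an illustration of the last step, chapter markers can be derived from per-scene timings in a few lines. This is a minimal sketch, not the pipeline's actual code: the `Scene` fields, the "Intro" chapter title, and the example durations are all assumptions. The one external fact it relies on is that YouTube expects the first chapter timestamp to be 0:00.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Scene:
    """One scene from a demo script (field names are hypothetical)."""
    title: str
    duration_s: float  # measured length of the scene in the final recording


def format_timestamp(seconds: float) -> str:
    """Render seconds as M:SS, or H:MM:SS once past the hour."""
    total = int(seconds)
    hours, rem = divmod(total, 3600)
    minutes, secs = divmod(rem, 60)
    if hours:
        return f"{hours}:{minutes:02d}:{secs:02d}"
    return f"{minutes}:{secs:02d}"


def chapter_lines(scenes: List[Scene], intro_s: float) -> List[str]:
    """Build description lines; the intro card becomes the 0:00 chapter."""
    lines = ["0:00 Intro"]
    cursor = intro_s  # the first scene starts when the intro card ends
    for scene in scenes:
        lines.append(f"{format_timestamp(cursor)} {scene.title}")
        cursor += scene.duration_s
    return lines


if __name__ == "__main__":
    demo = [
        Scene("Create the counter", 12.5),
        Scene("Wire the calculation", 20.0),
        Scene("Watch it tick", 31.0),
    ]
    print("\n".join(chapter_lines(demo, intro_s=4.0)))
```

Because the timestamps come from measured durations rather than the script, the chapters stay aligned even when a narration line runs long.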

The counter ticks 0, 1, 2, 3 on its own, narrated, because the swarm wires its own feedback loop — invoked, not executed. No human touched a keyboard while that recorded.
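
The feedback loop itself is small enough to model outside Krill. The sketch below is a toy simulation in Python, not Krill's Kotlin: the class names mirror the node types mentioned above, but their fields, the `invoke` method, and the wiring semantics are assumptions made for illustration. The timer invokes the calculation, the calculation reads the counter and writes counter + 1 back to it.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DataPoint:
    """A stored value; here just an integer counter."""
    value: int = 0


@dataclass
class Calculation:
    """Reads a source DataPoint and writes source + 1 to a target."""
    source: DataPoint
    target: DataPoint

    def invoke(self) -> None:
        self.target.value = self.source.value + 1


@dataclass
class CronTimer:
    """Invokes its observers on each tick; the 5-second schedule is not simulated."""
    observers: List[Calculation] = field(default_factory=list)

    def tick(self) -> None:
        for observer in self.observers:
            observer.invoke()


def run_counter_loop(ticks: int) -> List[int]:
    """Wire the counter to its own calculation and record its value per tick."""
    counter = DataPoint(0)
    increment = Calculation(source=counter, target=counter)  # the feedback wire
    timer = CronTimer(observers=[increment])
    history = [counter.value]
    for _ in range(ticks):
        timer.tick()
        history.append(counter.value)
    return history


if __name__ == "__main__":
    print(run_counter_loop(3))
```

Nothing in the loop calls the calculation directly; it only runs because the timer it observes fired, which is the "invoked, not executed" distinction in miniature.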

The same MCP the apps use — that’s the whole trick

Here’s the architectural fact that makes this cheap instead of impossible: the demo pipeline is just another MCP client. It doesn’t poke a private test API or a mock. It talks to the same krill-mcp server that Claude Desktop uses to drive a swarm and that the apps themselves are built around. There is no shadow data model and no second code path to keep in sync. create_node, wire it, record_snapshot, read_series: the demo speaks the product’s own language.
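
On the wire, "just another MCP client" means ordinary JSON-RPC 2.0 messages. The sketch below builds a `tools/call` request, which is the method the MCP specification defines for invoking a tool. The tool name `create_node` comes from the list above; the argument names (`type`, `name`) are guesses for illustration and may not match the real krill-mcp schema.

```python
import itertools
import json
from typing import Any, Dict

# Monotonic request ids so responses can be matched to requests.
_request_ids = itertools.count(1)


def tool_call(name: str, arguments: Dict[str, Any]) -> str:
    """Serialize an MCP tools/call request as a JSON-RPC 2.0 message."""
    request = {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }
    return json.dumps(request)


if __name__ == "__main__":
    # Hypothetical arguments: the real create_node schema is not shown in this post.
    print(tool_call("create_node", {"type": "DataPoint", "name": "counter"}))
```

A recorder, an agent, and a desktop assistant can all emit this same message, which is why no separate test API is needed.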

That’s not an accident of this project; it’s a dividend of a year of foundation work. In Krill, everything is a typed Node — DataPoints, Triggers, Filters, Executors, Pins — with one consistent shape, one processor pattern, one MCP surface over the top. Because the control surface was designed as a first-class interface and not bolted on, a brand-new consumer — a video recorder, of all things — could drive the entire system without a single new seam. You cannot do this on a codebase where “how you make the app do a thing” is a different mechanism on every screen.

The Raspberry Pi GPIO demo makes the point most forcefully: it builds two switches and an AND gate that drives BCM pin 17 high and low on a real Pi 5 (not the sandbox, an actual board on the LAN) through the same MCP, with the same desktop binary watching it happen. KMP all the way down: the shared Kotlin that renders on the tablet panel is the shared Kotlin that runs the server on the Pi.

Why the recordings find bugs nothing else does

Now the part the title promised. Every demo is an integration test, and the friction rule is absolute: if the pipeline needs the app to do something and it can’t, the orchestrator files a GitHub issue assigned to Blue and aborts — no silent workarounds. No quietly reaching past the MCP, no faking the node, no “good enough for the video.” If the product can’t do it cleanly, that’s a defect, and the demo’s job is to surface it, not paper over it.
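
A friction rule like this could be enforced with one helper that files the issue and then raises, so there is no code path that continues after a missing capability. This is a sketch under assumptions: the repository name, issue wording, and function names are hypothetical, and the assignee `krill-blue-bot` is the account Blue's tickets go to. The endpoint and JSON fields are GitHub's documented "create an issue" REST call.

```python
import json
import urllib.request
from typing import Callable


class FrictionError(RuntimeError):
    """Raised after a friction ticket is filed; the demo run must stop."""


def friction_issue_request(repo: str, token: str, capability: str,
                           detail: str) -> urllib.request.Request:
    """Build (but do not send) a GitHub 'create an issue' request."""
    payload = {
        "title": f"Demo friction: {capability}",
        "body": detail,
        "assignees": ["krill-blue-bot"],
    }
    return urllib.request.Request(
        f"https://api.github.com/repos/{repo}/issues",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def abort_with_friction(repo: str, token: str, capability: str, detail: str,
                        send: Callable = urllib.request.urlopen) -> None:
    """File the ticket, then abort. There is deliberately no 'continue' branch."""
    send(friction_issue_request(repo, token, capability, detail))
    raise FrictionError(capability)
```

Passing `send` as a parameter keeps the helper testable without network access; in a real run it defaults to `urllib.request.urlopen`.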

Two kinds of failure fall out of this, and both are valuable:

Missing capability → a friction ticket for Blue. The first time a demo needed to reconfigure a node after creating it, there was no MCP tool for it — you could set metadata at create_node time but not after. That’s a real gap a real user would hit. So the pipeline filed it, assigned to krill-blue-bot, and the usual loop took it from there — Blue reproduces, writes a test, fixes, records a lesson, opens a PR; Ghost verifies on real hardware; I make the call. The demo didn’t route around the missing feature. It demanded the feature exist.

Visible wrongness → a real correctness bug, caught on camera. This is the one I didn’t expect. I recorded an early counter-loop and the swarm built correctly, the counter ticked correctly — and there were no connection arcs between the nodes. The video just looked dead. The feature wasn’t subtly off; it was visibly off, and you only see “visibly” when something is actually rendering frames.

It turned out to be a genuine bug in the client, not the recorder. When a second client (my MCP build) edits a node’s wiring, the server fans that out over SSE as a state change carrying only state, no metadata — and the app’s handler for that case just logged it. So the app’s stored node kept empty sources, and both arc systems stayed dark: the persistent source arcs and the cyan interaction-flash. A human clicking around in a single app never triggers it, because they’re the one client making the edit. It takes a second observer to expose it — which is exactly what a recorded demo is. I root-caused it, filed it, and the fix shipped through the same agent pipeline.
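
The failure mode is easy to reproduce in miniature. The sketch below models a client-side store receiving a state-only event: the first handler only "logs" (returns the store unchanged), so the stored node keeps its empty sources; the second treats a missing metadata payload as a cue to refetch the node. All names here are hypothetical, and the refetch approach is one plausible fix, not necessarily the one that shipped.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple


@dataclass(frozen=True)
class Node:
    node_id: str
    state: str
    sources: Tuple[str, ...] = ()


@dataclass(frozen=True)
class StateChange:
    node_id: str
    state: str
    metadata: Optional[dict] = None  # None models the "state only" event


def apply_logging_only(store: Dict[str, Node], event: StateChange) -> Dict[str, Node]:
    """Models the buggy path: the event is noted and the store is left as-is."""
    return store


def apply_with_refetch(store: Dict[str, Node], event: StateChange,
                       fetch_node: Callable[[str], Node]) -> Dict[str, Node]:
    """Assumed fix: a state-only event triggers a refetch of the full node."""
    if event.metadata is None:
        fresh = fetch_node(event.node_id)
    else:
        sources = tuple(event.metadata.get("sources", ()))
        fresh = Node(event.node_id, event.state, sources)
    updated = dict(store)
    updated[event.node_id] = fresh
    return updated
```

With a single client the stale copy never appears, because the editing client already holds the new wiring; it takes a second writer to make the two handlers diverge.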

Same story with same-host server discovery (the app filtered out its own server’s beacon by hostname instead of install ID, so a local server never appeared on the canvas), and with a cluster of UX gaps — Calculation and CronTimer nodes rendering no name label, no automation hook to focus a node’s editor for a second client. Every one of them is a real issue, in the tracker, fixed by Blue and verified by Ghost. Building the demo machine became one of the better QA passes Krill has ever had, because it exercises the real app, end to end, and then forces a frontier model and me to look at the result.
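
The discovery bug comes down to which key the self-filter uses. In the sketch below, the `Beacon` fields and function names are assumptions, as is the premise that the app and a server on the same machine share a hostname but carry different install IDs.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Beacon:
    hostname: str
    install_id: str


def visible_by_hostname(beacons: List[Beacon], own_hostname: str) -> List[Beacon]:
    """Models the bug: anything on this host is treated as 'myself' and hidden."""
    return [b for b in beacons if b.hostname != own_hostname]


def visible_by_install_id(beacons: List[Beacon], own_install_id: str) -> List[Beacon]:
    """Filters out only beacons from this exact install."""
    return [b for b in beacons if b.install_id != own_install_id]


if __name__ == "__main__":
    seen = [
        Beacon("kraken", "app-install"),     # the desktop app's own beacon
        Beacon("kraken", "server-install"),  # a server on the same host
        Beacon("pi5", "pi-install"),         # a server elsewhere on the LAN
    ]
    print(visible_by_hostname(seen, "kraken"))
    print(visible_by_install_id(seen, "app-install"))
```

Filtering by hostname hides the local server along with the app itself, while filtering by install ID hides only the app.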

A unit test tells you a function returns the right value. A recorded demo tells you the product looks like it works. Those are very different guarantees, and the second one is the one customers actually experience.

End-to-end agentic development for Kotlin Multiplatform

Step back and look at the loop in full:

A demo YAML          →  a declarative product story
The demo pipeline    →  drives the real KMP app through real MCP, records it
A frontier model     →  watches the frames, narrates, judges what's wrong
Friction / defects   →  filed as issues, assigned to Blue
Blue                 →  reproduces, fixes, writes a lesson, opens a PR
Ghost                →  verifies the fix on real hardware
Ben                  →  makes the call, merges
The video            →  ships to YouTube and back onto this page

Authoring, recording, narrating, judging, ticketing, fixing, verifying, publishing — and the only human decision in the chain is the one I keep on purpose: the merge. That’s end-to-end agentic development for a Kotlin Multiplatform project, and it closes a loop the earlier posts left open. Ghost and Blue handled fixes. Kraken handled oversight. The demo pipeline handles the thing a user sees first — and feeds whatever’s broken about it straight back into the fix loop.

None of this would work without the foundation, and I’ll keep saying it because it’s the only part that matters: you cannot agent your way out of bad architecture. The reason a video recorder could drive the whole app is that there was one clean way to drive the app. The reason the arc bug was findable is that the wiring is a typed, observable contract and not a pile of ad-hoc callbacks. The reason every demo is the same pipeline with a different YAML is that every node is the same shape with a different type. Agents — and recorders, and judges — are amplifiers. Point them at a clean signal and they make it cleaner. Point them at a swamp and they help you sink on film.

Build the foundation. Then let the app demo itself. Then go to the garden.

— Ben

Last verified: 2026-06-26

This post is licensed under CC BY 4.0 by Sautner Studio, LLC.