Shipping a voice agent on Gemini Live, and fixing the barge-in

I’ve been embedding voice agents in a few apps lately, both on the phone and in the browser.

The phone ones answer calls on a dedicated number. The browser ones live inside the app’s own UI, behind a mic button next to the chat composer.

Different surfaces, same hard problem: the agent has to stop talking the moment the user starts. If it doesn’t, the whole thing feels broken in a way that no amount of model quality or latency tuning can rescue. Getting that right on Gemini Live turned out to be the entire game.

I also tried OpenAI’s Realtime model and was genuinely impressed — the prosody is excellent and barge-in works out of the box. For this client I stayed on Gemini for cost and an existing preference, which meant I had to solve barge-in myself.

UX improves a lot when you can speak to the app. Users told me it makes them reach for it more, and it’s an effective way to consume information that’s normally read.

That’s the why. The rest of this post is the how.

What we built

Voice mode inside a document-style web app. A user opens a document, taps the mic, and starts talking. The agent decides what to do — scroll to a section, answer a factual question from the document, or look something up on the web through existing tools — and speaks the answer back.

The architecture is three actors (browser, backend, Gemini Live) and two connections, only one of which is long-lived:

Browser ↔ Backend (HTTPS). For minting an ephemeral session token and for proxying to a heavier text agent when needed.
Browser ↔ Gemini Live (WebSocket). For the audio stream and the tool-call protocol.

The mint flow matters. The Google API key never leaves the server. The browser asks the backend for a short-lived token bound to a specific model, voice, modality, and tool set, and connects to Gemini with that. A leaked token is useless outside the document it was minted for.
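A minimal sketch of what the backend might assemble before minting, assuming the request shape follows the authTokens.create config in @google/genai. The field names are my reading of that SDK and should be checked against the version in use; buildTokenRequest, MintInput, and the tool name in the usage note are hypothetical.

```typescript
// Hypothetical helper: builds the constraints for one short-lived Live token.
// Field names are assumptions modeled on the @google/genai authTokens.create
// config; verify them against the SDK version you run.
interface MintInput {
  model: string;        // Live model id, supplied by the caller
  voiceName: string;    // prebuilt voice name, supplied by the caller
  toolNames: string[];  // tools this session is allowed to call
  ttlSeconds: number;   // how long the token stays valid
  now?: Date;           // injectable clock, for testing
}

function buildTokenRequest(input: MintInput) {
  const now = input.now ?? new Date();
  const expireTime = new Date(now.getTime() + input.ttlSeconds * 1000);
  return {
    config: {
      uses: 1, // one session per token
      expireTime: expireTime.toISOString(),
      liveConnectConstraints: {
        model: input.model,
        config: {
          responseModalities: ["AUDIO"],
          speechConfig: {
            voiceConfig: {
              prebuiltVoiceConfig: { voiceName: input.voiceName },
            },
          },
          tools: [
            {
              functionDeclarations: input.toolNames.map((name) => ({ name })),
            },
          ],
        },
      },
    },
  };
}
```

The backend would pass this object to the SDK's token call and return only the resulting token to the browser. A call might look like buildTokenRequest({ model, voiceName, toolNames: ["scroll_to_section"], ttlSeconds: 60 }).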

A few decisions worth flagging because they’re easy to get wrong:

Direct WebSocket, not a backend proxy. Most serverless backends are request/response, not streaming. Proxying would have killed the whole point. Ephemeral tokens make the direct connection safe.
Continuous mic with server VAD, not tap-to-talk. The product is “speak to speak,” which is a conversation, not a walkie-talkie. The browser just streams; Gemini decides where the turn boundaries are.
Client speaks first. Gemini Live never opens a turn on its own — onopen only means the socket is up. We send a one-shot "[greeting]" cue via sendClientContent and let the system prompt turn that into a short spoken intro. As a side effect, this warms the playback pipeline so the first real turn has no cold-start lag.
Two AudioContexts, not one. 16kHz for mic capture, 24kHz for playback. Web Audio fixes the sample rate per context, so a single context can't serve both without resampling.
ScriptProcessorNode, not AudioWorklet. Deprecated but universally supported. Thirty lines of inline PCM conversion versus a separate worklet file that has to be served alongside the SPA bundle. When Chrome actually drops it, swap.
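The inline PCM conversion mentioned above is small enough to sketch. This is an illustration, not the code from the project: Web Audio hands you Float32 samples in the range -1 to 1, while the audio stream carries 16-bit PCM, so both directions need a conversion. The function names are hypothetical.

```typescript
// Mic side: Web Audio float samples (-1..1) to 16-bit signed PCM.
function floatTo16BitPCM(input: Float32Array): Int16Array {
  return Int16Array.from(input, (sample) => {
    const s = Math.max(-1, Math.min(1, sample)); // clamp out-of-range input
    // Negative and positive halves have different maximum magnitudes.
    return s < 0 ? Math.round(s * 0x8000) : Math.round(s * 0x7fff);
  });
}

// Playback side: 16-bit signed PCM back to floats for an AudioBuffer.
function pcm16ToFloat(input: Int16Array): Float32Array {
  return Float32Array.from(input, (sample) => sample / 0x8000);
}
```

In a ScriptProcessorNode's onaudioprocess handler, the mic side would be fed from event.inputBuffer.getChannelData(0). The playback side produces samples you can copy into an AudioBuffer created at 24kHz.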

That part shipped clean. The barge-in did not.

The barge-in bug

Symptom: the agent is speaking, the user talks over it to interrupt, and it just keeps reading. For ten, sometimes twenty seconds.

Everything else about voice mode could be perfect and this single behavior would still make it feel broken.

The real cause was a desync between two clocks.

The naive playback function scheduled every incoming audio chunk immediately, back-to-back, on the playback AudioContext. That sounds harmless, but Gemini streams a 25-second answer’s worth of audio in roughly a 3-second burst.

So within 3 seconds of the answer starting, 25 seconds of audio is sitting scheduled in the Web Audio graph. The user is hearing it, but as far as the server is concerned, the turn is done.
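To make the gap concrete, here is a small simulation of the naive scheduler. The chunk size (0.5s) and the even arrival spacing are illustrative assumptions; only the 25-second and 3-second figures come from what I observed.

```typescript
// Simulates naive scheduling: every chunk is scheduled the moment it arrives,
// back-to-back, so the schedule cursor runs far ahead of the playback clock.
interface Chunk {
  arrivesAtSec: number; // when the chunk comes off the socket
  durationSec: number;  // how much audio it holds
}

function naiveLeadAfterBurst(chunks: Chunk[]): number {
  let cursor = 0;      // time at which the next chunk would start playing
  let lastArrival = 0; // when the server finished sending
  for (const chunk of chunks) {
    // Start at the cursor, or right now if playback had run dry.
    const startAt = Math.max(cursor, chunk.arrivesAtSec);
    cursor = startAt + chunk.durationSec;
    lastArrival = chunk.arrivesAtSec;
  }
  // Audio still queued on the client when the last chunk has arrived.
  return cursor - lastArrival;
}

// 50 chunks of 0.5s (25s of audio) arriving evenly over 3 seconds.
const burst: Chunk[] = Array.from({ length: 50 }, (_, i) => ({
  arrivesAtSec: (i * 3) / 50,
  durationSec: 0.5,
}));

const leadSec = naiveLeadAfterBurst(burst); // about 22 seconds
```

With these numbers, roughly 22 seconds of audio is still queued on the client at the moment the server considers the turn finished.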

Now the timing falls apart:

Server emits turnComplete at second 3.
User starts talking at second 10, over audio they’re still hearing.
But the server’s turn finished 7 seconds ago.

Gemini’s docs are explicit: “When VAD detects an interruption, the ongoing generation is canceled.” The catch is the words ongoing generation. There isn’t one.

The user’s speech is read as the start of a fresh turn, not a barge-in. No interrupted event ever lands. stopPlayback never runs. The buffered audio plays out to the end of those 25 seconds, no matter how loudly the user is yelling at it.

The code wired to interrupted: true was correct. The server VAD was working. Every layer was right in isolation. The bug existed only in the relationship between the playback queue and the server’s turn clock — a class of bug that a type check or a single-file review will never catch.

What we tried

Two things, in order:

Hand-rolled RMS gate on the mic input. Tried twice, reverted both times. It dropped real speech (low-volume words, sentence starts) and was generally unreliable across mics and rooms. Not worth it.
Cap the playback queue depth. This was the fix.
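For reference, an RMS gate of the kind described in the first item looks roughly like this. It is a sketch of the general technique, not the reverted code, and the threshold value in the usage note is a made-up example.

```typescript
// Root-mean-square level of one mic frame (samples in -1..1).
function rms(frame: Float32Array): number {
  if (frame.length === 0) return 0;
  const sumSquares = frame.reduce((acc, s) => acc + s * s, 0);
  return Math.sqrt(sumSquares / frame.length);
}

// A frame is forwarded only when its level clears a fixed threshold.
function passesGate(frame: Float32Array, threshold: number): boolean {
  return rms(frame) >= threshold;
}
```

The weakness is visible in the signature: a single fixed threshold. With a threshold of, say, 0.02, a frame of quiet speech at a level of 0.01 is dropped exactly like background noise, and the right threshold changes with every mic and room.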

Why we didn’t reach for a “real” client-side VAD model like Silero: the bug wasn’t on the mic side at all, it was on the playback side. A second VAD on the input wouldn’t have changed the symptom.

The fix

Don’t schedule audio more than about half a second ahead of realtime.

Concretely:

Chunks coming off the WebSocket land in a plain JS array (pendingChunksRef) instead of going straight into start() calls on the AudioContext.
A small scheduler — call it pumpPlayback — drains that queue just-in-time. It schedules the next chunk only if the current playback lead is below MAX_PLAYBACK_LEAD_SEC (0.5s), and otherwise stops.
It’s edge-triggered, not polled. It gets re-invoked when a new chunk arrives, and again when each scheduled AudioBufferSourceNode fires onended. The queue tops itself up as it drains, with no timer keeping the loop alive.
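The three steps above can be sketched as follows. This is a simplified version under stated assumptions, not the hook from the repo: the clock and the audio sources are injected through a hypothetical PlaybackDeps interface so the logic runs outside a browser, and all type names are mine.

```typescript
const MAX_PLAYBACK_LEAD_SEC = 0.5;

// Minimal shape the scheduler needs from an audio source. In the browser
// this would be backed by an AudioBufferSourceNode.
interface PlaybackSource {
  onended: (() => void) | null;
  start(when: number): void;
  stop(): void;
}

// Injected dependencies (hypothetical interface, for illustration).
interface PlaybackDeps<C> {
  now(): number;                        // e.g. () => audioContext.currentTime
  durationOf(chunk: C): number;         // seconds of audio in the chunk
  createSource(chunk: C): PlaybackSource;
}

interface PlaybackState<C> {
  pending: C[];                   // chunks received but not yet scheduled
  scheduled: Set<PlaybackSource>; // sources handed to the audio graph
  cursor: number;                 // time at which the next chunk should start
}

// Drain the queue just-in-time: schedule chunks only while the playback lead
// is below the cap. Called on every new chunk and from every onended.
function pumpPlayback<C>(state: PlaybackState<C>, deps: PlaybackDeps<C>): void {
  while (state.pending.length > 0) {
    const now = deps.now();
    if (state.cursor - now >= MAX_PLAYBACK_LEAD_SEC) return; // enough queued
    const chunk = state.pending.shift() as C;
    const source = deps.createSource(chunk);
    const startAt = Math.max(state.cursor, now);
    source.onended = () => {
      state.scheduled.delete(source);
      pumpPlayback(state, deps); // top up as the queue drains
    };
    source.start(startAt);
    state.scheduled.add(source);
    state.cursor = startAt + deps.durationOf(chunk);
  }
}
```

Because the check happens before each chunk is scheduled, the actual lead can exceed the cap by up to one chunk's duration. In the browser, createSource would wrap ctx.createBufferSource(), assign the decoded AudioBuffer, and connect to ctx.destination.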

The behavior change after that flip: perceived playback now tracks the server’s turn within half a second.

Now, when the user talks over the bot, the turn is still alive on the server. Server-side VAD fires interrupted: true, and the hook stops every scheduled source, clears the pending queue, and resets the cursor. The room goes quiet.
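A minimal sketch of that interrupt path, assuming the server message exposes the flag as serverContent.interrupted. The state shape and function names here are hypothetical and kept self-contained; they are not lifted from the hook.

```typescript
// Minimal shape of a scheduled source; in the browser, an AudioBufferSourceNode.
interface StoppableSource {
  onended: (() => void) | null;
  stop(): void;
}

interface InterruptibleState {
  pending: unknown[];              // chunks not yet scheduled
  scheduled: Set<StoppableSource>; // sources already in the audio graph
  cursor: number;                  // next scheduled start time
}

// Silence everything: scheduled audio, queued audio, and the schedule cursor.
function stopPlayback(state: InterruptibleState): void {
  state.scheduled.forEach((source) => {
    source.onended = null; // stopping must not re-trigger the scheduler
    try {
      source.stop();
    } catch {
      // The source may have already ended; nothing to do.
    }
  });
  state.scheduled.clear();
  state.pending.length = 0;
  state.cursor = 0;
}

// Assumed message shape: only the fields this handler reads.
interface ServerMessage {
  serverContent?: { interrupted?: boolean; turnComplete?: boolean };
}

function handleServerMessage(msg: ServerMessage, state: InterruptibleState): void {
  if (msg.serverContent?.interrupted) stopPlayback(state);
}
```

Clearing onended before calling stop() matters because stopping a source also fires its ended event, and you don't want the scheduler waking up in the middle of a teardown.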

About thirty lines of code. No new dependencies. The actual audio routing is untouched.

Honest caveat: this makes barge-in work, it does not make it instant. You still pay one mic → server → VAD → interrupted → client round trip on top of the 0.5s lead cap, so the stop takes roughly 0.7 to 1 second. For a conversational product that’s fine.

Lessons I want to keep

Two things from this build worth carrying to the next voice agent.

When something feels broken at the integration layer, resist the urge to swap transports. The transport is almost never the problem. The bug usually lives in the relationship between two pieces of state that look fine on their own — here, the playback queue and the server’s turn clock. Drawing the timing on paper, with both clocks side by side, surfaces the bug in about a minute.

And: ship voice features behind a real person using them like a person, not behind unit tests. The whole bug was invisible to the type system, to the agent loop, to the deploy pipeline. The only thing that catches “the bot won’t shut up when I talk over it” is talking over the bot.

A template repo

I lifted the working parts of this into a small, self-contained React template so I don’t have to rebuild it for the next project — and so you can skip the queue-cap detour.

github.com/Nitzan94/gemini-live-voice-react

What’s in it:

src/useGeminiLiveVoice.ts — the hook itself, around 400 lines, zero dependencies beyond @google/genai. The queue-cap scheduler from this post is in here.
example/ — a complete demo with a Bun server, showing the right pattern for production: API key stays server-side, browser mints an ephemeral token through your own endpoint.
playground/ — a static variant deployable to Vercel. Users paste their own key and try it in-browser with no install; the key lives only in the browser and mints tokens directly against Google over CORS. Useful for a quick demo or for letting a teammate kick the tires before you wire up the server side.

If you build something on top of it, I’d love to hear what you used it for and where it broke.