Generating two-host podcasts from app output with Gemini

I added a podcast generation engine to one of my apps, alongside the live voice agent I wrote about last week.

The idea is simple: the app already produces a long, structured, fact-checked document. Reading it is one way to consume that. Listening to two people talk through it on a walk is another. So the feature takes any output the app generated and turns it into a roughly ten-minute conversation between two hosts — a script, then real audio, no human in the loop after the click.

What surprised me building it is how little human authorship it takes. There’s exactly one artifact a person wrote, and it’s about thirty lines long.

The only thing a human wrote

The entire authorial contribution is a single prompt. It does four things, and three of them are about saying no.

It casts two characters. “You are writing a two-host podcast script.” Host A leads the conversation and asks the sharp follow-up; Host B is the analytical one who brings specific numbers, dates, and names, and pushes back when a claim feels thin. So the dynamic — interviewer versus skeptic — is human-designed. The words they actually say are not.

It sets guardrails against AI slop. This is most of the prompt, and it’s all negative space: don’t open with “welcome back,” don’t use filler like “great point” or “what’s fascinating,” no sign-off, no music cues, no stage directions. Anyone who’s heard an AI podcast knows the disease — the breathless “Wow, that’s fascinating!” between every exchange. The prompt is a list of vaccinations against it.

It pins everything to facts. “Do not invent facts. If the source omits something, omit it from the script.” Same discipline as the document it’s summarizing. The model is allowed to restage knowledge as dialogue, not to add any.

It forces a machine-readable shape. The output isn’t prose, it’s strict JSON — [{"speaker": "Alex", "text": "..."}]. That last constraint looks like a formatting nicety but it’s load-bearing, and I’ll come back to why.

Who writes the content

Gemini Pro writes it. The model gets that prompt plus the body of the app’s output and returns the dialogue. Temperature 0.4 — deliberately low, because the job is “restage these facts as a conversation,” not “be creative.” The output is around 1,100–1,500 words, which lands at roughly eight to ten minutes of audio.
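As a rough sanity check on that estimate, the arithmetic is just word count divided by speaking rate. The 150 words-per-minute default below is my assumption for conversational speech, not something measured from the model, and `estimate_minutes` is a hypothetical helper rather than anything in the app.

```python
def estimate_minutes(word_count: int, words_per_minute: float = 150.0) -> float:
    """Estimate spoken duration in minutes for a script of `word_count` words.

    The default rate is an assumed conversational pace, not a measured one.
    """
    if words_per_minute <= 0:
        raise ValueError("words_per_minute must be positive")
    return word_count / words_per_minute


low = estimate_minutes(1100)   # a little over 7 minutes at the assumed rate
high = estimate_minutes(1500)  # 10 minutes at the assumed rate
```

At a slightly slower assumed pace of 140 words per minute, the same word counts come out nearer eight to eleven minutes, so the quoted range is sensitive to how fast the voices actually speak.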

So the lineage of a single sentence in the finished podcast is: the app produced a verified claim, then the model rephrased that claim as something a host says out loud. Zero human sentences anywhere in the audio.

A few things I learned

gemini-2.5-flash-preview-tts is a generative speech model, not the flat, robotic kind of text-to-speech. It doesn’t stitch together pre-recorded syllables. It generates the audio waveform from scratch, and to do that well it has to read the sentence and decide how a human would say it: where to put the emphasis, when to pause, whether a line is a genuine question or a jab. “You said what?” comes out differently from “you said what.” It interprets delivery. What it does not do is reason about the content: it can’t add a fact or reorder the conversation. One model thinks about what to say, a second model thinks only about how to say it. Two narrow jobs, not one model doing both.

The contract between the two models is a single JSON field. The script model labels every line with a "speaker". The TTS step is configured with exactly two voice mappings — "Alex" → one voice, "Maya" → another — and that mapping is the only thing it knows. It’s also a hard limit: multi-speaker TTS supports a maximum of two speakers.
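To make the two-voice mapping concrete, here is a minimal sketch of how a validated script might be turned into a multi-speaker request body. The field names follow my reading of the Gemini API's REST shape for speech generation and should be checked against the current docs. The voice names are placeholders, and `build_tts_request` is a hypothetical helper, not the code from the app.

```python
TTS_MODEL = "gemini-2.5-flash-preview-tts"

# Hypothetical mapping: speaker label in the script -> prebuilt voice name.
VOICES = {"Alex": "Kore", "Maya": "Puck"}


def build_tts_request(script: list[dict], voices: dict[str, str]) -> dict:
    """Build a request body for multi-speaker synthesis (assumed REST shape)."""
    if len(voices) != 2:
        raise ValueError("this sketch expects exactly two voice mappings")

    # Render the script as "Speaker: text" lines, one per turn.
    transcript = "\n".join(f"{line['speaker']}: {line['text']}" for line in script)

    return {
        "contents": [{"parts": [{"text": transcript}]}],
        "generationConfig": {
            "responseModalities": ["AUDIO"],
            "speechConfig": {
                "multiSpeakerVoiceConfig": {
                    "speakerVoiceConfigs": [
                        {
                            "speaker": speaker,
                            "voiceConfig": {
                                "prebuiltVoiceConfig": {"voiceName": voice}
                            },
                        }
                        for speaker, voice in voices.items()
                    ]
                }
            },
        },
    }
```

Because the voices arrive as a plain dictionary keyed by speaker label, swapping a voice means changing a value in `VOICES` and nothing else.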

This is where that “strict JSON” constraint pays off, and it’s the most interesting failure mode in the whole pipeline. The synthesis step does no validation of its own: it blindly renders whatever "speaker" string each line carries. If the script model ever hallucinated a third host, or misspelled "Alex" as "Alexx", there would be no voice to assign that line. The synthesis step would forward garbage.

The only thing standing between a hallucinated name and a broken audio call is an allow-list check at the parse boundary: before any audio is generated, every speaker label is checked against the two allowed names, and anything else throws an error and triggers a retry. The design stance is prevent, don’t recover: validate hard once, at the seam, then let every downstream step trust the data. The two models never talk to each other; they cooperate through one validated field.
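A minimal sketch of that parse-boundary check might look like the following. The function names, the exception class, and the retry count of three are all hypothetical, and `generate` stands in for whatever call returns the script model's raw text.

```python
import json
from typing import Callable

ALLOWED_SPEAKERS = frozenset({"Alex", "Maya"})


class ScriptValidationError(ValueError):
    """Raised when the model's output breaks the script contract."""


def parse_script(raw: str, allowed: frozenset = ALLOWED_SPEAKERS) -> list[dict]:
    """Parse raw model output and reject anything outside the contract."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ScriptValidationError(f"not valid JSON: {exc}") from exc

    if not isinstance(data, list) or not data:
        raise ScriptValidationError("expected a non-empty JSON array")

    for i, line in enumerate(data):
        if not isinstance(line, dict):
            raise ScriptValidationError(f"line {i}: expected an object")
        speaker = line.get("speaker")
        text = line.get("text")
        # The allow-list check: exact match only, so "Alexx" fails here.
        if not isinstance(speaker, str) or speaker not in allowed:
            raise ScriptValidationError(f"line {i}: unknown speaker {speaker!r}")
        if not isinstance(text, str) or not text.strip():
            raise ScriptValidationError(f"line {i}: missing or empty text")
    return data


def generate_validated(generate: Callable[[], str], max_attempts: int = 3) -> list[dict]:
    """Call `generate` until its output validates, up to `max_attempts` times."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return parse_script(generate())
        except ScriptValidationError as exc:
            last_error = exc
    raise ScriptValidationError(f"gave up after {max_attempts} attempts: {last_error}")
```

Everything downstream of `generate_validated` can then treat the speaker label as trusted, which is the point of validating once at the seam.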

Voices are swappable. In the app I can pick which two voices a podcast uses, and audition pairs before committing. The host names in the script and the voice mappings in the synthesis step are joined by that same speaker string, so changing a voice is a config change, not a code change.

The template repo

Once it worked, I pulled the reusable parts into a small standalone repo so I don’t rebuild this for the next project.

github.com/Nitzan94/gemini-podcast

What it gives you is less “a podcast generator” and more a pattern: structured multi-speaker orchestration, a validated contract between two models, a TTS routing layer, and a speaker abstraction that makes the rendering pipeline voice-swappable.

That pattern generalizes well past podcasts. The same script-then-render shape, with a validated speaker contract in the middle, covers training simulations, AI debates, educational explainers, customer briefings, financial summaries, multilingual narration, synthetic interview generation, and accessibility layers — anywhere you want structured content performed as multi-voice audio.

A playground

There’s also a hosted playground if you just want to hear it. Paste your own Gemini API key, give it some text, pick two voices, and it generates the conversation in-browser. The key stays in your browser.

gemini-podcast.vercel.app

If you build something on top of it, I’d like to hear what you used it for and where it broke.