How a Language Model Gets Stuck in a Loop

I hit a repetition loop bug in an app: the model entered a loop, got stuck on the same Hebrew sentence, and kept repeating itself endlessly.

I had heard about this failure mode before, but this time I needed to understand it deeply: what exactly happened, why it happens, and how to respond to it in a real system.

I went into the details with Claude, and this is what came out: what happened, why language models can fall into repetition loops, how to tell whether an incident is a one-off or part of a recurring pattern, and what defense layers you can add.

What happened

A 7-minute spinner. Around 65,000 tokens of the same Hebrew sentence, over and over.

Query ID: 63de2625, encyclopedia book ilana5
Model: gemini-3.1-flash-lite-preview
Input: around 260,000 tokens of Hebrew encyclopedia text
Output: the same sentence ending in …, repeated thousands of times

That was the symptom. The rest of this post is the why, the data, and the right engineering response.

How language models generate text

A language model does not “write a sentence.” It chooses one token at a time.

At each step it asks, in effect: given everything before this, what comes next? It samples from a probability distribution over the vocabulary, appends the result to the context, and repeats.

The important consequence: the model’s own output becomes part of its input for the next token.

Under normal conditions, that is fine. It is how coherent paragraphs get built. But it also creates a vulnerability.

If the distribution peaks sharply around one token or phrase, each new output reinforces the distribution for the next step. That is a repetition loop.

This is positive feedback. Not “positive” as in good, but positive as in feedback that amplifies the current direction. The model emits X X X, the new context makes X even more likely, the distribution peaks harder, and the model digs in.

Three conditions that amplify loops

Not every model under every configuration loops. This failure needs a few amplifiers present at the same time.

Low-entropy decoding

Decoding is the policy that turns probabilities into actual text. Greedy decoding, for example, always picks the most likely token. It is stable and focused, but if the model starts repeating itself, there is almost no noise to help it escape. The closer your decoding is to greedy, the sharper the distribution and the less escape pressure.

Weaker models

Small or undertrained models produce sharper, less-calibrated distributions. They have not seen enough diverse continuations to spread probability evenly across plausible next tokens. They fall into attractors more easily, and once stuck, they have less probability mass on escape tokens.

Long contexts with repeated structure

When the input is a large body of text with a consistent style — a book index, an encyclopedia, repeated section headers — attention latches onto the pattern. The input teaches the model that repetition is normal.

The classic paper

Holtzman et al., 2020 — “The Curious Case of Neural Text Degeneration.”

This is the foundational reference for this failure mode. The paper proposed nucleus sampling, or top-p, as a fix. That fix became standard in most major LLM APIs.

It is not a complete solution, as this bug demonstrated, but it is why greedy decoding in real systems is now considered risky.

Nucleus sampling, or top-p, in plain English

Instead of always taking the most likely token, keep the smallest set of tokens whose probabilities sum to p — typically 0.9 — and randomly sample from that set.

The “nucleus” is the high-probability core of the distribution.

This gives the model enough randomness to escape weak attractors, while still keeping most of the weight on plausible continuations.

Where it shows up:

Common default in Anthropic, OpenAI, and Google Vertex
Usually you do not set it manually, because it is already on
You tune it only when you want a noticeably different output character: lower = more focused and more loop-prone; higher = more diverse

The lesson from this bug: even default top-p cannot necessarily save a weak model when the input distribution is already sharply peaked.

Top-p reduces the risk. It does not eliminate it when the amplifiers are fully engaged.

What caused this specific incident

All three amplifiers were present at once.

Weak model.
gemini-3.1-flash-lite-preview — the cheapest tier, chosen deliberately for cost. Sharper and less-calibrated than stronger models.

Long, repetitive input.
260,000 tokens of Hebrew encyclopedia text. Large body, consistent section structure. A more varied input would have kept the distribution flatter.

Default temperature.
No special low-temperature setting was applied, but under a weak model, with a context that already produced a sharp peak, defaults were not enough to prevent the attractor from forming.

The giveaway: every repeated sentence ended in a Hebrew ellipsis.

Once the model emitted one sentence ending in ..., the training distribution heavily weighted “another similar sentence ending in ...” as the continuation. Self-reinforcing from there.

What the data showed

Before treating this as a one-off, I looked at the app data itself: 30 days of encyclopedia_queries.

SELECT
  percentile_cont(0.50) WITHIN GROUP (ORDER BY output_tokens) AS p50,
  percentile_cont(0.90) WITHIN GROUP (ORDER BY output_tokens) AS p90,
  percentile_cont(0.99) WITHIN GROUP (ORDER BY output_tokens) AS p99
FROM encyclopedia_queries
WHERE created_at > now() - interval '30 days'
  AND output_tokens IS NOT NULL;

The result:

Percentile	Tokens	Meaning
p50	592	Half of queries returned fewer than this. The “typical” answer.
p90	1,074	90% of queries were under this. The “long but normal” boundary.
p99	35,737	99% of queries were under this. Only the top 1% reaches this area.

In a healthy distribution, p99 should be around 2–3x the p90. Here it was 33x.

That is not “a long answer.” It is a contaminated tail: a few repetition loops from the last 30 days sitting at the edge and inflating p99.

In other words, this does not mean bad answers are happening all the time. The opposite: they are rare. But the outliers are so extreme that they no longer look like normal answers. They look like failure states.

Recurring, not one-off

Flash-Lite’s susceptibility to loops on long Hebrew contexts is a real system risk worth taking seriously.

Based on the data, the cadence is weeks-to-months, not days. It will happen again.

Layered safeguards are the right response to a recurring failure, not a postmortem for a freak event.

What the cache has to do with it

Nothing direct. The cache is innocent.

Prompt caching delivers a long prompt cheaply. That correlates with long-context queries, and long-context queries correlate with loops. But the cache itself does not touch the model’s output distribution.

The same loop could have happened without caching. It just would have cost more to get there.

Three layers of defense

No single fix eliminates loops. The right answer is defense in depth: layer cheap protections, and accept that each has a small failure mode the next layer covers.

Layer 1: Vendor-side token cap

Set maxOutputTokens on every model call. Hard ceiling: once the model hits it, the stream stops regardless of what it is doing.

This does not detect the loop. The model is still looping when it gets cut off. But it bounds the damage: a 7-minute spinner becomes a 30-second spinner.

One line of code. Free to add, catches everything.

Layer 2: Server-side rolling-buffer detector

As tokens stream out, maintain a rolling buffer of recent text. If the same short window appears N times in the last M characters, abort the stream.

This catches a real loop in roughly one second, instead of the roughly 30 seconds Layer 1 needs.

The detector has to be content-agnostic enough to avoid false positives on legitimate repetition: numbered lists, refrains, code with repeated structure. Tuning window size and threshold is the main concern.

And you can do something better than returning an error: stop the stream, run an internal retry with a safer configuration, and return a new answer to the user if the retry succeeds. But that has to be designed carefully around cost, quota, and ops logging.

Layer 3: Frontend “this looks broken” detector

If the rendered answer contains the same sentence more than three times, swap the message for a retry button.

This does not prevent the bad answer. It fixes the UX for the user who sees it. Backstop if Layers 1 and 2 both miss.

Is Layer 2 enough on its own?

Mostly yes, but you still want Layer 1.

Layer 1 is one line of code, and it catches two cases Layer 2 can miss:

Drifting loops — each repetition is slightly different: X1 Y1 X2 Y2 X3. An exact n-gram detector can miss this.
Buggy detector — the detector itself misses a real loop, or false-positives on real content.

The token cap is content-agnostic. It cannot misfire on a real answer, because real answers do not reach the cap: p90 is around 1,000, and the cap is 3,000.

That makes it uniquely trustworthy as a backstop. No failure mode on valid output, only on looping output. That is why you ship both.

The real fix

All three layers are damage control. The quality fix is upgrading the answer-tier model.

Better models do not eliminate loops. They push them from “every few weeks” to “very rare.”

The tradeoff is cost. Running both router and answer on a cheaper model was a deliberate product decision. That can be the right call, but then loops have to be treated as an engineering risk with hard boundaries.

Bottom line: a repetition loop is not just “the model freaked out.” It is a system failure created by the interaction between decoding, weaker models, long repetitive context, and missing hard caps. The fix is not to hope it will not happen again. The fix is to build the system as if it will.