Based on InfiAgent research (January 2026) and insights from building production voice agent systems


The Problem No One Talks About

Anyone who’s built an AI agent knows the story: the system works great on short tasks, but the moment you give it something longer, it starts “forgetting,” gets confused, and produces erratic results.

This isn’t a bug. It’s an architectural problem called Unbounded Context Growth.

What Actually Happens

When an agent runs a task, it accumulates history:

  • Every action it performed
  • Every response from tools
  • Every intermediate decision
  • Every error and correction

The LLM’s context window fills up. Then two bad things happen:

1. Error Accumulation - Small mistakes compound. An agent that forgets a small detail at step 3 makes a wrong decision at step 10, which causes failure at step 20.

2. Memory Loss - The LLM loses important information from the beginning of the conversation because it gets pushed out by newer information.

Steps 1-10: Works great
Steps 11-30: Starts getting confused
Steps 31+: Accumulated errors, irrelevant results

Why This Is an “Architectural Problem”

When I say “architectural problem that requires an architectural solution” - I mean this isn’t something you’ll fix with a better prompt or a stronger model.

It’s like traffic jams: you won’t solve them with a faster car. You need to change the roads.

Similarly, unbounded context growth is a problem with how the system is built. Even GPT-5 with a million-token context window will face the same issue - just later.

Common Solutions - And Why They Don’t Work

Context Compression - Compress history into a shorter summary. The problem: You lose critical information. What seems unimportant at step 5 might be critical at step 50.

RAG (Retrieval-Augmented Generation) - Store everything in a vector DB and retrieve by relevance. The problem: The agent doesn’t always know what’s relevant. Semantic search doesn’t capture logical connections.

Sliding Window - Keep only the last N actions. The problem: You lose the broader context. The agent forgets why it’s doing what it’s doing.

The New Approach: State Externalization

New research called InfiAgent (January 2026, Hong Kong Polytechnic University) proposes a completely different approach:

Don’t try to maintain a long context - externalize the state.

Instead of cramming everything into the prompt, the agent maintains a workspace of files representing the task state.

How Does the Context Stay Fixed?

The key trick: Every step, the context is rebuilt from scratch.

Step 47:

+------------------------------------+
| System prompt (fixed)              |
| Workspace snapshot (current state) |
| Steps 42-46 only (recent window)   |
+------------------------------------+

Step 48:

+------------------------------------+
| System prompt (same)               |
| Workspace snapshot (updated)       |
| Steps 43-47 only (window shifts)   |
+------------------------------------+

The full history? Saved in files or DB - but doesn’t enter the prompt.

The agent “remembers” through the workspace snapshot, not through conversation history.
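The rebuild-from-scratch loop above can be sketched in a few lines. This is a minimal illustration, not the paper’s implementation; the function and variable names are my own.

```python
# Sketch: fixed-size context, rebuilt from scratch every step.
# Names (build_context, WINDOW) are illustrative, not from InfiAgent.

SYSTEM_PROMPT = "You are a long-horizon agent."
WINDOW = 5  # only the last 5 actions ever enter the prompt


def build_context(snapshot: dict, history: list[str]) -> str:
    """Rebuild the prompt from a fixed system prompt, the current
    workspace snapshot, and a small recent-action window."""
    recent = history[-WINDOW:]  # the window shifts every step
    return "\n".join([
        SYSTEM_PROMPT,
        f"Workspace snapshot: {snapshot}",
        "Recent actions:",
        *recent,
    ])


# The full history keeps growing, but the prompt does not:
history = [f"step {i}: ..." for i in range(1, 48)]
snapshot = {"stage": "analysis", "papers_done": 41, "todo": ["summarize"]}
ctx = build_context(snapshot, history)
assert "step 47" in ctx        # recent window is present
assert "step 1:" not in ctx    # old history never enters the prompt
```

Note that the prompt size depends only on the snapshot and the window, never on how many steps the agent has taken.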

Traditional Agent:

+-------------------------------------+
| System Prompt                       |
| + Step 1 result                     |
| + Step 2 result                     |
| + Step 3 result                     |
| + … (grows forever)                 |
| + Step N result                     |
+-------------------------------------+
  Context explodes

InfiAgent Architecture:

+-------------------------------------+
| System Prompt                       |
| + Workspace Snapshot (current)      |
| + Last 5 actions only               |
+-------------------------------------+
  Fixed size always
                  |
                  v
+-------------------------------------+
| External Workspace (files/DB)       |
| - Full history                      |
| - Intermediate results              |
| - Decision logs                     |
+-------------------------------------+

External Attention Pipeline - Sub-Agents for Reading

InfiAgent adds a component called External Attention Pipeline.

If you’re familiar with Claude Code - it’s exactly the same principle.

When Claude Code needs to read a large file, it doesn’t load everything into context. It uses a sub-agent that reads the file and returns only what’s relevant to the question.

InfiAgent works the same way: when the agent needs to read an 80-page PDF, it doesn’t load it into its context. Instead:

  1. Calls a dedicated tool (answer_from_pdf)
  2. The tool runs a separate LLM just on that document
  3. Returns only the relevant answer
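The three steps above can be sketched as follows. The `answer_from_pdf` tool name comes from the article; `call_llm` is a placeholder for whatever model client you use, not a real API.

```python
# Sketch of the external-attention idea: a sub-agent reads the full
# document, and the main agent receives only the short answer.


def call_llm(prompt: str) -> str:
    # Placeholder: in practice, call your model provider here.
    return "The main conclusion is that method X outperforms the baselines."


def answer_from_pdf(pdf_text: str, question: str) -> str:
    """Sub-agent: the 80-page document enters *this* call's context,
    never the main agent's context."""
    return call_llm(f"Document:\n{pdf_text}\n\nQuestion: {question}")


# The main agent only ever sees the short answer:
pdf_text = "... 80 pages of paper text ..."
answer = answer_from_pdf(pdf_text, "What's the main conclusion of X?")
```

The design choice here is that the document is an argument to a tool call, not part of the conversation, so its tokens are paid once and then discarded.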

+-------------------------------------+
| Main Agent                          |
| “What’s the main conclusion of X?”  |
+------------------+------------------+
                   |
                   v
+-------------------------------------+
| Sub-Agent (answer_from_pdf)         |
| Reads the entire PDF                |
| Returns: “The conclusion is…”       |
+------------------+------------------+
                   |
                   v
+-------------------------------------+
| Main Agent                          |
| Gets only the short answer          |
| Continues to next step              |
+-------------------------------------+

The main agent stays “light” - it doesn’t carry the weight of every document it read.

The Proof: 80 Academic Papers

InfiAgent was tested on a real task: read 80 academic papers and write a comprehensive literature review.

Results:

  • Regular agents crashed after ~30 papers
  • InfiAgent finished all 80 - maintaining consistent quality

And here’s the impressive part: this works without task-specific fine-tuning. The same architecture works on any domain.

A 20B-parameter model with InfiAgent competed with much larger proprietary models - purely because of the architecture.

Practical Implementation: What to Actually Do

The Principle

Instead of accumulating the entire history in the prompt, at each step you rebuild the context from just two things:

  1. Snapshot of current state - what the agent knows now, what it already decided, what’s left to do
  2. Small window of recent actions - say 5 steps back, no more

The full history is saved on the side (DB or files) - but doesn’t enter the prompt.
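A minimal way to keep the full history “on the side” is an append-only log on disk next to a small snapshot dict. This is a sketch under my own assumptions (JSONL file, in-memory snapshot), not a prescribed storage layer; any DB works the same way.

```python
# Sketch: full history goes to an append-only JSONL log on disk;
# only the compact snapshot ever feeds the prompt.
import json
import pathlib
import tempfile

log_path = pathlib.Path(tempfile.mkdtemp()) / "history.jsonl"
snapshot = {"stage": "start", "decided": [], "todo": ["read papers"]}


def record_step(step: dict, snapshot: dict) -> None:
    """Append the full step record to the log; update only the
    small prompt-facing snapshot."""
    with log_path.open("a") as f:
        f.write(json.dumps(step) + "\n")
    snapshot["stage"] = step["stage"]


# Run 100 steps with bulky outputs:
for i in range(100):
    record_step({"i": i, "stage": f"step-{i}", "output": "..." * 50}, snapshot)

# The snapshot stays tiny; the log carries everything else.
assert snapshot["stage"] == "step-99"
assert len(log_path.read_text().splitlines()) == 100
```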

Four Principles:

1. Separate state from history - Current state is small and focused: “what stage am I at, what did I decide, what’s left.” The full history of how I got there - saved separately.

2. Structure over text - JSON with clear fields beats free text. Easier to read, easier to update, easier to query.

3. Save the “why” - Not just what the agent decided, but why. This helps it continue in the right direction even without remembering the full history.

4. Sub-agents for reading - Like Claude Code, when you need to process a large document - let a separate agent read it and return only what’s relevant. The main agent stays light.
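Principles 1-3 together suggest a snapshot shape like the one below. The field names are my own illustration, not a schema from the paper; the point is structured fields plus a recorded “why” per decision.

```python
# Illustrative snapshot combining the principles above: structured
# JSON-style fields, and each decision stored with its rationale.
snapshot = {
    "stage": "drafting_review",
    "progress": {"papers_read": 53, "papers_total": 80},
    "decisions": [
        {
            "what": "group papers by method, not by year",
            "why": "three method families emerged; chronology hid them",
        }
    ],
    "todo": ["read remaining 27 papers", "write synthesis section"],
}
```

Because every decision carries its “why,” the agent can stay on course at step 60 without replaying steps 1-59.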

When Is This Relevant?

Use state externalization when:

  • Long tasks (dozens of steps and up)
  • Multi-agent systems passing information between them
  • Processes that need to maintain consistency over time
  • Agents processing many documents

Not necessary when:

  • Short, focused conversations
  • Simple one-off tasks
  • Context window is big enough for the task

Bottom Line

The unbounded context problem isn’t a bug you can fix with a better prompt or stronger model. It’s how the system is built. Just like you won’t solve traffic jams with a faster car - you need to change the roads.

InfiAgent proved this with 80 papers. Now the question is just how to implement it in your system.


Further Reading: