Based on InfiAgent research (January 2026) and insights from building production voice agent systems
The Problem No One Talks About
Anyone who’s built an AI agent knows the story: the system works great on short tasks, but the moment you give it something longer - it starts “forgetting,” gets confused, and produces weird results.
This isn’t a bug. It’s an architectural problem called Unbounded Context Growth.
What Actually Happens
When an agent runs a task, it accumulates history:
- Every action it performed
- Every response from tools
- Every intermediate decision
- Every error and correction
The LLM’s context window fills up. Then two bad things happen:
1. Error Accumulation - Small mistakes compound. An agent that forgot a small detail at step 3 makes a wrong decision at step 10, which causes failure at step 20.
2. Memory Loss - The LLM loses important information from the beginning of the conversation because it gets pushed out by newer information.
Steps 1-10: Works great
Steps 11-30: Starts getting confused
Steps 31+: Accumulated errors, irrelevant results
Why This Is an “Architectural Problem”
When I say “architectural problem that requires an architectural solution” - I mean this isn’t something you’ll fix with a better prompt or a stronger model.
It’s like traffic jams: you won’t solve them with a faster car. You need to change the roads.
Similarly, unbounded context growth is a problem with how the system is built. Even GPT-5 with a million-token context window will face the same issue - just later.
Common Solutions - And Why They Don’t Work
Context Compression - Compress history into a shorter summary. The problem: You lose critical information. What seems unimportant at step 5 might be critical at step 50.
RAG (Retrieval-Augmented Generation) - Store everything in a vector DB and retrieve by relevance. The problem: The agent doesn’t always know what’s relevant. Semantic search doesn’t capture logical connections.
Sliding Window - Keep only the last N actions. The problem: You lose the broader context. The agent forgets why it’s doing what it’s doing.
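A minimal sketch of the sliding-window failure mode (the messages here are made up for illustration): once the window slides past the opening message, the agent keeps acting but no longer knows what the overall goal is.

```python
# Naive sliding window: keep only the last n messages.
def sliding_window(history, n=3):
    """Return only the n most recent messages."""
    return history[-n:]

history = [
    {"role": "user", "content": "Goal: summarize 80 papers on agent memory"},
    {"role": "assistant", "content": "Read paper 1"},
    {"role": "assistant", "content": "Read paper 2"},
    {"role": "assistant", "content": "Read paper 3"},
    {"role": "assistant", "content": "Read paper 4"},
]

window = sliding_window(history)
# The original goal is no longer in the window: the agent
# still acts, but has forgotten why it is acting.
assert all("Goal" not in m["content"] for m in window)
```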
The New Approach: State Externalization
New research called InfiAgent (January 2026, Hong Kong Polytechnic University) proposes a completely different approach:
Don’t try to maintain a long context - externalize the state.
Instead of cramming everything into the prompt, the agent maintains a workspace of files representing the task state.
How Does the Context Stay Fixed?
The key trick: Every step, the context is rebuilt from scratch.
Step 47:

+------------------------------------+
| System prompt (fixed)              |
| Workspace snapshot (current state) |
| Steps 42-46 only (recent window)   |
+------------------------------------+

Step 48:

+------------------------------------+
| System prompt (same)               |
| Workspace snapshot (updated)       |
| Steps 43-47 only (window shifts)   |
+------------------------------------+
The full history? Saved in files or DB - but doesn’t enter the prompt.
The agent “remembers” through the workspace snapshot, not through conversation history.
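In code, the per-step rebuild might look like this; the function and section names are my own illustration, not InfiAgent's API:

```python
def build_context(system_prompt, workspace_snapshot, history, window=5):
    """Rebuild the prompt from scratch each step: a fixed system prompt,
    the current workspace snapshot, and only the last `window` actions.
    The full `history` lives elsewhere (files/DB) and never enters the
    prompt beyond this window, so prompt size stays roughly constant."""
    recent = history[-window:]
    return (
        f"{system_prompt}\n\n"
        f"## Workspace snapshot\n{workspace_snapshot}\n\n"
        f"## Recent actions\n" + "\n".join(recent)
    )
```

Step 47 and step 48 produce prompts of about the same size, no matter how long the history has grown.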
Traditional Agent:

+-------------------------------------+
| System Prompt                       |
| + Step 1 result                     |
| + Step 2 result                     |
| + Step 3 result                     |
| + ... (grows forever)               |
| + Step N result                     |  <-- Context explodes
+-------------------------------------+

InfiAgent Architecture:

+-------------------------------------+
| System Prompt                       |
| + Workspace Snapshot (current)      |
| + Last 5 actions only               |  <-- Fixed size always
+-------------------------------------+
                   |
+-------------------------------------+
| External Workspace (files/DB)       |
|  - Full history                     |
|  - Intermediate results             |
|  - Decision logs                    |
+-------------------------------------+
External Attention Pipeline - Sub-Agents for Reading
InfiAgent adds a component called External Attention Pipeline.
If you’re familiar with Claude Code - it’s exactly the same principle.
When Claude Code needs to read a large file, it doesn’t load everything into context. It uses a sub-agent that reads the file and returns only what’s relevant to the question.
InfiAgent works the same way: when the agent needs to read an 80-page PDF, it doesn’t load it into its context. Instead:
- Calls a dedicated tool (answer_from_pdf)
- The tool runs a separate LLM just on that document
- Returns only the relevant answer
+-------------------------------------+
| Main Agent                          |
| “What’s the main conclusion of X?”  |
+------------------+------------------+
                   |
                   v
+-------------------------------------+
| Sub-Agent (answer_from_pdf)         |
| Reads the entire PDF                |
| Returns: “The conclusion is…”       |
+------------------+------------------+
                   |
                   v
+-------------------------------------+
| Main Agent                          |
| Gets only the short answer          |
| Continues to next step              |
+-------------------------------------+
The main agent stays “light” - it doesn’t carry the weight of every document it read.
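A sketch of the pattern, assuming a generic `llm()` callable; the tool name answer_from_pdf comes from the paper, but everything else here is illustrative:

```python
def answer_from_pdf(pdf_text: str, question: str, llm) -> str:
    """Sub-agent: a separate LLM call scoped to one document.
    Only its short answer returns to the main agent's context."""
    prompt = (
        "Answer the question using only this document.\n\n"
        f"Document:\n{pdf_text}\n\n"
        f"Question: {question}"
    )
    # The full document exists only inside this call.
    return llm(prompt)

def main_agent_step(question: str, pdf_text: str, llm) -> str:
    # The main agent sees only the condensed answer, never the PDF.
    answer = answer_from_pdf(pdf_text, question, llm)
    return f"Noted: {answer}"
```

The design choice is the isolation boundary: the document enters the sub-agent's prompt, but only the answer string crosses back.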
The Proof: 80 Academic Papers
InfiAgent was tested on a real task: read 80 academic papers and write a comprehensive literature review.
Results:
- Regular agents crashed after ~30 papers
- InfiAgent finished all 80 - maintaining consistent quality
And here’s the impressive part: this works without task-specific fine-tuning. The same architecture works on any domain.
A 20B-parameter model running InfiAgent competed with much larger proprietary models - purely because of the architecture.
Practical Implementation: What to Actually Do
The Principle
Instead of accumulating all history in the prompt, at each step you rebuild the context from just two things:
- Snapshot of current state - what the agent knows now, what it already decided, what’s left to do
- Small window of recent actions - say 5 steps back, no more
The full history is saved on the side (DB or files) - but doesn’t enter the prompt.
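Putting these pieces together, one step of such a loop might look like the sketch below. The class, field names, and JSONL history store are my own stand-ins, not InfiAgent's implementation:

```python
import json

class ExternalizedAgent:
    """Sketch: current state lives in `snapshot`, the full history is
    appended to a JSONL file on disk, and the prompt is rebuilt fresh
    every step from snapshot + a small recent-actions window."""

    def __init__(self, system_prompt, history_path, window=5):
        self.system_prompt = system_prompt
        self.history_path = history_path
        self.window = window
        self.snapshot = {"stage": "start", "decisions": [], "todo": []}
        self.recent = []  # only the last `window` actions

    def record(self, action):
        # Full history goes to disk, not into the prompt.
        with open(self.history_path, "a") as f:
            f.write(json.dumps(action) + "\n")
        self.recent = (self.recent + [action])[-self.window:]

    def build_prompt(self):
        # Rebuilt from scratch each step: fixed size regardless of
        # how many actions have been recorded.
        return (
            f"{self.system_prompt}\n\n"
            f"State: {json.dumps(self.snapshot)}\n\n"
            f"Recent: {json.dumps(self.recent)}"
        )
```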
Four Principles:
1. Separate state from history - Current state is small and focused: “what stage am I at, what did I decide, what’s left.” The full history of how I got there - saved separately.
2. Structure over text - JSON with clear fields beats free text. Easier to read, easier to update, easier to query.
3. Save the “why” - Not just what the agent decided, but why. This helps it continue in the right direction even without remembering the full history.
4. Sub-agents for reading - Like Claude Code, when you need to process a large document - let a separate agent read it and return only what’s relevant. The main agent stays light.
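Principles 2 and 3 together suggest a state record along these lines; the field names are illustrative, not taken from the paper:

```python
# Structured state beats free text: each decision carries its "why",
# so the agent can stay on course without replaying the full history.
state = {
    "stage": "writing_review",
    "papers_done": 34,
    "papers_total": 80,
    "decisions": [
        {
            "what": "group papers by method, not by year",
            "why": "methods overlap across years; grouping by year duplicated themes",
        }
    ],
    "todo": ["read papers 35-80", "draft related-work section"],
}
```

Because the fields are explicit, the agent (or any tool) can query "what stage am I at" or "why did I decide X" directly, instead of parsing prose.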
When Is This Relevant?
Use state externalization when:
- Long tasks (dozens of steps and up)
- Multi-agent systems passing information between them
- Processes that need to maintain consistency over time
- Agents processing many documents
Not necessary when:
- Short, focused conversations
- Simple one-off tasks
- Context window is big enough for the task
Bottom Line
The unbounded context problem isn’t a bug you can fix with a better prompt or stronger model. It’s how the system is built. Just like you won’t solve traffic jams with a faster car - you need to change the roads.
InfiAgent proved this with 80 papers. Now the question is just how to implement it in your system.
Further Reading: