VZ editorial frame
Read this piece through one operating lens: AI does not automate first; it amplifies first. If the underlying decision architecture is clear, AI scales clarity. If it is noisy, AI scales noise and cost.
VZ Lens
From a VZ lens, this piece is not passive trend tracking; it is a strategic decision input. Zhang and Khattab's 2025 RLM study argues that simply expanding the context window is a dead end. The solution isn't training your eyes; it's reorganizing the work. Strategic value emerges when insight becomes execution protocol.
TL;DR
The context window of large language models is finite, and simply expanding it doesn't solve anything, because attention gets diluted. Recursive Language Models (RLMs) don't increase the context; instead, they teach the model to call itself: it breaks the task down and builds the whole from partial answers. This isn't a prompt trick, but a way of organizing work that brings surprisingly human strategies out of machines.
The Librarian Who Sees Thirty Pages
The context window of large language models is finite, and expanding it alone solves nothing, because attention is diluted. Recursive Language Models (RLMs), introduced in a 2025 paper by Zhang and Khattab, don't increase the context; instead, they teach the model to call itself: it breaks the task down, builds a whole from partial answers, and in the process adopts surprisingly human strategies.
There’s an image that won’t leave me alone. A librarian sits in an infinite library—Borges’s The Library of Babel, where every book that has ever been written or will ever be written exists. The librarian can read anything, but there is a peculiar limitation: he can see only thirty pages at a time. If he tries to take in more, the previous lines fade, blur together, and finally dissolve into blankness.
In Budapest, on the third floor of the Szabó Ervin Library, I once experienced exactly this, only in a human version. I was sitting at a table with three months’ worth of research notes, and I caught myself realizing that by the time I opened the fourth stack, I couldn’t remember a thing from the first. Not because I have a bad memory. Because human attention—and machine attention—doesn’t work by holding everything at once. It works by filtering. And when the cost of filtering crosses a threshold, attention would rather start over.
This librarian is the large language model (LLM). The thirty pages are the context window. Researchers call this phenomenon context rot—and it’s not a bug, but a deeper fact about the nature of the system.
Why does the language model “dumb down” during long sessions?
Context rot is a peculiar phenomenon. It often goes unnoticed in tests. Modern models perform excellently on “needle in a haystack” type tasks—yet every researcher knows that moment when, during a longer session, the system simply seems to dumb down. Responses become more general, details disappear, and information mentioned earlier in the context no longer significantly affects the output.
This is not a memory problem in the traditional sense of the word. The tokens are there; in a technical sense, they are “visible.” The question is rather what it means for a system to see something when the number of things seen exceeds a threshold. The transformer’s attention mechanism does not select like a strict librarian who bookmarks the important lines. Rather, it acts like an overly polite conference moderator who gives everyone a small microphone—and then everyone falls silent at once.
Attention dilutes like ink in water.
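The dilution is easy to see in miniature. The following toy sketch (my illustration, not from the paper) computes softmax attention weights for a single genuinely relevant token competing with an ever-growing crowd of distractors; the score values and the helper name `relevant_weight` are assumptions chosen for the example:

```python
import math

def softmax(scores):
    # Standard softmax: exponentiate, then normalize to sum to 1.
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def relevant_weight(n_tokens, relevant_score=3.0, noise_score=1.0):
    # One relevant token competing with (n_tokens - 1) distractors.
    scores = [relevant_score] + [noise_score] * (n_tokens - 1)
    return softmax(scores)[0]

# The relevant token's share of attention shrinks as context grows,
# even though its own score never changes.
for n in (10, 100, 1000, 10_000):
    print(n, relevant_weight(n))
```

The relevant token's score never changes; only the size of the crowd does. That is the "ink in water" effect in four lines of arithmetic.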
In another Borges story, “Funes the Memorious,” the protagonist remembered everything perfectly—but he couldn’t think, because abstraction requires forgetting. Context rot is the mirror image of this: the problem isn’t forgetting, but the lack of selection. Both extremes lead to dysfunction.
The idea of self-referentiality—when the mirror reflects an infinite number of mirrors
Zhang and Khattab’s 2025 paper offers a surprisingly elegant solution to this problem. Their solution is called Recursive Language Models, or RLM for short. The method allows the language model to call itself—or other models—repeatedly, as many times as necessary, until the response is formed.
The librarian didn’t train his eyes to see better. He learned how to send other librarians to the shelves in his place and how to piece the report together from the notes they bring back.
The user asks a question and provides the context—whether it’s a hundred documents, a thousand pages of notes, or an entire research project. The root model (the instance at depth 0) does not try to see the whole thing at once. Instead, the context lives in a variable; the model can write code, run it, look into the text, search within it, slice it up—and, crucially, make recursive calls to handle smaller details. The submodels receive a text segment and a subtask, return a partial answer, and the root model assembles these, verifies them, and, if necessary, initiates new rounds.
From the user’s perspective, this still gives the illusion of a single model call. The difference happens under the hood.
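The recursive pattern itself fits in a few lines. The sketch below is my minimal illustration of the idea, not the paper's implementation: `llm()` is a stub standing in for a real model call, and the chunking scheme is an assumption for demonstration:

```python
def llm(prompt: str, snippet: str) -> str:
    # Stand-in for a real model API call; here it only reports snippet size.
    return f"[answer about {len(snippet)} chars]"

def rlm(question: str, context: str, chunk_size: int = 1000, depth: int = 0) -> str:
    # Base case: the context fits in the window, so answer directly.
    if len(context) <= chunk_size:
        return llm(question, context)
    # Recursive case: slice the context, delegate each slice to a
    # sub-call (depth + 1), then aggregate the partial answers with
    # one more call at this level.
    chunks = [context[i:i + chunk_size] for i in range(0, len(context), chunk_size)]
    partials = [rlm(question, c, chunk_size, depth + 1) for c in chunks]
    return llm(f"Combine these partial answers for: {question}", "\n".join(partials))
```

In a real RLM the root model decides for itself how to slice and what to delegate; the fixed `chunk_size` here only makes the structure visible. From the caller's side, `rlm(...)` still looks like one function call.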
Four Strategies Nobody Taught
One exciting aspect of RLM’s behavior is that the root model adopts strategies that are surprisingly human-like. Not because it is “human,” but because the structure of the task forces it to do so.
Peeking. The model looks at a small sample to understand the structure of the context. Is there a header, does the pattern repeat, is the row tabular, is the text narrative?
Pattern-search narrowing. Keywords, regular patterns, simple heuristics. This step is important because it’s cheap. The model doesn’t “understand” semantically yet—it just narrows the scope so that the expensive processing isn’t applied to the entire dataset.
Divide and conquer. Breaking down the context, assigning subtasks to subroutines, and finally aggregating the results. This logic bears a striking resemblance to the classic map-reduce approach—only here, map is semantic tagging or mini-aggregation, and reduce is the assembly of the response.
Programmatic processing. If the task can be solved deterministically—applying diffs, performing calculations, or carrying out regular transformations—then the model sometimes simply writes code and lets the code do the work. In such cases, it is not “thinking” that is enhanced, but rather work organization that becomes more professional.
These four strategies also highlight why we’re not just talking about mere prompt tricks. Prompting transforms into process design. The question isn’t how elegant an instruction you write, but what workspace and operational rules you provide to the model.
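The four moves can be sketched as ordinary code. This is a toy illustration of mine, not the paper's method: the function name `answer_over_large_text` and the line-based heuristics are assumptions, and no real model is called:

```python
import re

def answer_over_large_text(question: str, context: str, keyword: str) -> dict:
    # 1. Peeking: sample the first line to guess the structure.
    first_line = context.splitlines()[0] if context else ""
    looks_tabular = ("\t" in first_line) or ("," in first_line)

    # 2. Pattern-search narrowing: a cheap regex filter runs over
    #    everything so the expensive step doesn't have to.
    hits = [line for line in context.splitlines()
            if re.search(keyword, line, re.IGNORECASE)]

    # 3. Divide and conquer: "map" a per-line extraction over the hits,
    #    then "reduce" the partial results into one summary.
    partials = [line.strip() for line in hits]
    summary = f"{len(partials)} lines mention {keyword!r}"

    # 4. Programmatic processing: the counting is deterministic code,
    #    not model reasoning.
    return {"tabular": looks_tabular, "hits": partials, "summary": summary}
```

In an actual RLM, steps 2 and 3 would hand their slices to recursive model calls; the point is that the skeleton is process design, not prompt wording.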
How does RLM differ from RAG and agents?
RAG (Retrieval-Augmented Generation) is a cataloging system for large document collections. First, you organize and index the data; then, when a query is made, you retrieve only the relevant pieces and feed them to the model. In contrast, RLM builds a “working catalog” for itself on the fly, temporarily and contextually—small lists, hits, and partial results. RAG is the shelving system; RLM is the pagination strategy.
Agent-based systems break down tasks, use tools, and take action. RLM has a different focus: it does not open up to the world, but delves deeper into the input data. Agents use a problem-centered breakdown, while RLM uses a context-centered one. The two approaches are not opposed to each other—in a good system, the agent provides the search and the schedule, while RLM is the component that interprets the retrieved material.
Chain-of-Thought encourages step-by-step thinking within a single call. RLM allows for multiple calls, and the steps are not only conceptual but also operational. CoT scales thought, while RLM scales workflow. In research, this is often a bigger difference than it first appears.
Key Takeaways
- Expanding the context window is necessary but not sufficient. The dilution of attention is a real constraint—RLM does not expand, but organizes.
- Recursion is not new. It has always been present in mathematics, linguistics, and computer science. What is new is that when applied to language models, emergent strategies—ones that no one has taught—appear.
- RAG, agents, RLM—it’s not an either-or question. The three approaches cover different layers, and in the best systems, they will work together.
- The infinite does not exist in the sense that we keep everything in mind at once. It exists in the sense that there is a next step, and the next step is accessible. This way of thinking has always been a hallmark of good research.
Frequently Asked Questions
What is context rot, and why is it a real problem?
Context rot is the phenomenon where a language model appears to “dumb down” during longer sessions: responses become more general, details disappear, and information mentioned early in the context no longer significantly influences the output. This is not a memory error in the traditional sense, but rather a dilution of attention. The larger the context window, the less attention each token receives, and this leads to a decline in the quality of responses.
How does RLM differ from RAG and agent-based systems?
RAG pre-indexes documents and provides the model with the retrieved segments when a question is asked. RLM dynamically builds its own working catalog on the fly. Agent-based systems interact with the world, use tools, and take actions. RLM delves into the input material, not the outside world. In the best systems, all three work together: the agent searches, RAG indexes, and RLM interprets the found material.
How can the recursive approach produce human thinking patterns from a machine?
The structure of the task forces it. When the model encounters too much data, it spontaneously adopts strategies: peeking (a small sample to understand the structure), pattern-search narrowing, divide-and-conquer segmentation, and programmatic processing. No one explicitly taught it these strategies. In RLM, the transformation of prompting into process design gives rise to these emergent behaviors.
Related Thoughts
- RAG Architecture Layers — 24 Patterns in a Cognitive Stack
- The Architecture of Thought — Prompting as a Mirror of Thought
- The Age of Collective Intelligence
Zoltán Varga - LinkedIn | Knowledge Systems Architect | Enterprise RAG Architect | PKM • AI Ecosystems | Neural Awareness • Consciousness & Leadership
Infinite context is not a matter of scale—it is a matter of navigation.
Strategic Synthesis
- Translate the thesis into one operating rule your team can apply immediately.
- Use explicit criteria for success, not only output volume.
- Use a two-week cadence to update priorities from real outcomes.
Next step
If you want your brand to be represented with context quality and citation strength in AI systems, start with a practical baseline and a priority sequence.