Recursive Language Models

Stop cramming. Start recursing.

Context windows are a leash. Every LLM has one, and every LLM eventually chokes on it. Recursive Language Models cut the leash entirely -- they let a model process inputs two orders of magnitude beyond its context window by treating the prompt as an external environment and recursively calling itself over snippets.

RLMs come from the MIT OASYS lab (Zhang, Kraska, Khattab). An 8B model post-trained with this approach outperforms vanilla GPT-5 on long-context tasks. Start with how it works, or see the benchmark results.


The problem

Context windows are a lie

GPT-5 advertises a 272K token context window. Sounds generous. But feed it a task that requires dense reasoning over all 272K tokens -- not just finding a needle, but actually processing every line -- and performance falls off a cliff. This is called context rot, and every model suffers from it.

The industry response has been to make windows bigger. 1M tokens. 10M tokens. But bigger windows don't solve the fundamental issue: Transformers degrade on long, information-dense inputs regardless of how many tokens technically fit.

RLMs take a different approach. Instead of forcing the entire prompt through the neural network at once, they let the model programmatically examine, decompose, and recursively call itself over pieces of the input. The prompt lives in a REPL environment as a variable. The model writes code to slice it, process the slices, and aggregate results.
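
To make that concrete, here is a minimal sketch of a single RLM step in Python. The `call_model` helper, the prompt wording, and the one-shot `exec` are simplifying assumptions for illustration -- the actual system runs an iterative, sandboxed REPL session -- but the core move is the same: the root model sees metadata about `context` and the results of code it runs, never the raw context itself.

```python
from typing import Callable

# Illustrative sketch only -- not the authors' implementation.
# `call_model` stands in for any chat-completion API.

def rlm_step(query: str, context: str, call_model: Callable[[str], str]) -> str:
    # 1. The huge prompt lives in the REPL environment as an ordinary variable.
    env = {"context": context, "llm": call_model}

    # 2. The root model is told the variable exists and is asked to write
    #    Python that inspects it (len, slicing, regex, recursive `llm` calls)
    #    and stores its conclusion in `answer`.
    code = call_model(
        "A Python variable `context` holds the full prompt "
        f"({len(context):,} characters), and `llm(prompt)` calls a sub-model. "
        "Write Python that answers the query and assigns the result to `answer`.\n\n"
        f"Query: {query}"
    )

    # 3. Execute the generated code against the environment. A production
    #    system sandboxes this and iterates until the model signals it is done.
    exec(code, env)
    return str(env.get("answer", ""))
```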

The result: effective processing of 10M+ token inputs. Not with summarization hacks or retrieval tricks. With actual dense semantic work across the entire input.

The key difference

LLM vs RLM

Two fundamentally different approaches to processing long inputs. Read the full comparison →

Large Language Model
The entire input (up to 272K tokens) goes through a single forward pass, with attention over all tokens at once. Output quality degrades on long inputs.

Recursive Language Model
The input (10M+ tokens) is stored as an environment variable in a REPL. The model decomposes it, calls an LLM over each chunk (chunk 1 through chunk N), recurses where needed, and merges the partial results. Output quality holds on long inputs.
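
The right-hand pattern maps directly onto code. Below is one decompose/recurse/aggregate strategy an RLM might write inside its REPL, shown here as a standalone sketch: the chunk size, the prompt wording, and the `call_model` helper are assumptions for illustration, not the paper's recipe. A real run can also recurse further within a chunk or skip chunks it judges irrelevant.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

# Illustrative decompose -> recurse -> aggregate sketch; parameters are assumptions.

def recurse_over_chunks(
    query: str,
    context: str,
    call_model: Callable[[str], str],
    chunk_chars: int = 200_000,   # assumed size a sub-call can handle densely
) -> str:
    # Decompose: slice the environment variable into non-overlapping chunks.
    chunks: List[str] = [
        context[i : i + chunk_chars] for i in range(0, len(context), chunk_chars)
    ]

    # Recurse: each chunk gets its own sub-call (run in parallel threads here).
    def answer_chunk(indexed_chunk):
        idx, chunk = indexed_chunk
        return call_model(
            f"Chunk {idx + 1}/{len(chunks)} of a larger document:\n{chunk}\n\n"
            f"Extract everything relevant to: {query}"
        )

    with ThreadPoolExecutor(max_workers=8) as pool:
        partials = list(pool.map(answer_chunk, enumerate(chunks)))

    # Aggregate: merge the partial findings with one final call.
    merged = "\n\n".join(partials)
    return call_model(
        f"Combine these partial findings into a final answer to: {query}\n\n{merged}"
    )
```
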
Start here

The complete picture, in five parts.

The headline numbers

This isn't theoretical

RLM-Qwen3-8B -- an 8-billion parameter model post-trained on just 1,000 samples -- outperforms the base Qwen3-8B by 28.3% on average across four diverse long-context benchmarks. It approaches the quality of vanilla GPT-5 on three of them.

At the frontier scale, RLM(GPT-5) maintains strong performance on inputs up to 2^18 tokens (262K+), while vanilla GPT-5 degrades sharply as inputs grow. On OOLONG-Pairs -- a task requiring quadratic-complexity reasoning -- GPT-5 scores less than 0.1% F1. The RLM version scores 58%.

The cost? Comparable. At the median, RLM runs are actually cheaper than base model calls on GPT-5, because the model selectively examines context rather than ingesting everything at once.

The ecosystem is already moving. DSPy (v3.1.2+) ships with built-in RLM support. Google's Agent Development Kit has an enterprise-ready implementation with lazy file loading and parallel sub-calls. VentureBeat, InfoQ, and Towards Data Science have all published deep dives in the last month. This isn't a paper that got filed away -- it's being adopted.

"It's a partially observable problem that you're giving the LM, where it can make logical decisions based on the structure of the task and context." -- Alex Zhang, MIT CSAIL