Context windows are a leash. Every LLM has one, and every LLM eventually chokes on it. Recursive Language Models cut the leash entirely -- they let a model process inputs two orders of magnitude beyond its context window by treating the prompt as an external environment and recursively calling itself over snippets.
RLMs come from the MIT OASYS lab (Zhang, Kraska, Khattab). An 8B model post-trained with this approach outperforms vanilla GPT-5 on long-context tasks. Start with how it works, or see the benchmark results.
GPT-5 advertises a 272K token context window. Sounds generous. But feed it a task that requires dense reasoning over all 272K tokens -- not just finding a needle, but actually processing every line -- and performance falls off a cliff. This is called context rot, and every model suffers from it.
The industry response has been to make windows bigger. 1M tokens. 10M tokens. But bigger windows don't solve the fundamental issue: Transformers degrade on long, information-dense inputs no matter how many tokens technically fit.
RLMs take a different approach. Instead of forcing the entire prompt through the neural network at once, they let the model programmatically examine, decompose, and recursively call itself over pieces of the input. The prompt lives in a REPL environment as a variable. The model writes code to slice it, process the slices, and aggregate results.
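The loop is easier to see in code. Below is a minimal, runnable sketch of the decompose-recurse-aggregate pattern under one big simplifying assumption: the `llm` function is a hypothetical stand-in for a real sub-model call (here it just counts matching lines so the example executes), and `rlm_answer` is an illustrative name, not an API from the paper.

```python
def llm(prompt: str) -> str:
    # Hypothetical stub for a recursive sub-call. A real RLM would
    # invoke the model itself here; this stand-in answers the toy
    # question "how many lines mention ERROR?" so the sketch runs.
    return str(sum("ERROR" in line for line in prompt.splitlines()))

def rlm_answer(context: str, chunk_size: int = 1000) -> int:
    # 1. Decompose: the prompt lives in the environment as a plain
    #    variable and is never fed to the model all at once.
    lines = context.splitlines()
    chunks = ["\n".join(lines[i:i + chunk_size])
              for i in range(0, len(lines), chunk_size)]
    # 2. Recurse: issue one sub-call per slice.
    partials = [int(llm(chunk)) for chunk in chunks]
    # 3. Aggregate: combine sub-call results programmatically.
    return sum(partials)

# A 10,000-line "log" far larger than any single sub-call sees.
log = "\n".join(f"line {i}: {'ERROR' if i % 7 == 0 else 'ok'}"
                for i in range(10_000))
print(rlm_answer(log))
```

The point of the pattern is that the aggregation step is ordinary code, so the answer is exact across the whole input regardless of how it was sliced, and only a chunk-sized window ever passes through the model at a time.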
The result: effective processing of 10M+ token inputs. Not with summarization hacks or retrieval tricks. With actual dense semantic work across the entire input.
Two fundamentally different approaches to processing long inputs.
This piece covers the decompose-recurse-aggregate pattern: how prompts become environment variables, and why symbolic recursion changes everything. It walks through programmatic examination, decomposition strategies, REPL environments, and sub-call patterns; how RLM-Qwen3-8B was post-trained; and how RLMs compare to RAG and sliding windows. Finally, it dissects the arXiv paper: benchmark results across S-NIAH, OOLONG, BrowseComp-Plus, and CodeQA, with RLM-Qwen3-8B vs GPT-5, head to head.
RLM-Qwen3-8B -- an 8-billion parameter model post-trained on just 1,000 samples -- outperforms the base Qwen3-8B by 28.3% on average across four diverse long-context benchmarks. It approaches the quality of vanilla GPT-5 on three of them.
At the frontier scale, RLM(GPT-5) maintains strong performance on inputs up to 2^18 tokens (262K+), while vanilla GPT-5 degrades sharply as inputs grow. On OOLONG-Pairs -- a task requiring quadratic-complexity reasoning -- GPT-5 scores less than 0.1% F1. The RLM version scores 58%.
The cost? Comparable. At the median, RLM runs are actually cheaper than base model calls on GPT-5, because the model selectively examines context rather than ingesting everything at once.
The ecosystem is already moving. DSPy (v3.1.2+) ships with built-in RLM support. Google's Agent Development Kit has an enterprise-ready implementation with lazy file loading and parallel sub-calls. VentureBeat, InfoQ, and Towards Data Science have all published deep dives in the last month. This isn't a paper that got filed away -- it's being adopted.
"It's a partially observable problem that you're giving the LM, where it can make logical decisions based on the structure of the task and context." -- Alex Zhang, MIT CSAIL