When Moonshot AI demonstrated its Kimi K2 model tackling a PhD-level mathematics problem in hyperbolic geometry (an example published in the company's technical documentation), the AI didn't just compute an answer. It embarked on a 23-step journey: searching academic literature, running calculations, reconsidering its approach based on results, querying databases again, and iterating until it found the solution. Each step informed the next. Each tool call triggered fresh reasoning.
This capability—what researchers call interleaved reasoning (also known as interleaved thinking)—represents a qualitative shift in how AI systems operate. Unlike traditional models that generate a complete chain of thought before acting, interleaved reasoning lets models alternate between thinking and acting: reason a bit, call a tool (e.g., API calls, browser sessions, code execution), ingest the result, update the plan, repeat. The key innovation: the internal chain-of-thought is preserved across tool calls and fed back into subsequent steps, keeping the entire reasoning trace live as context.
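In pseudocode, the loop itself is simple; what distinguishes interleaved reasoning is that the thinking from earlier iterations stays in the context the model sees next. A minimal sketch in Python, where `call_model` and `run_tool` are hypothetical stand-ins for a real model API and tool executor:

```python
# Minimal interleaved-reasoning loop (illustrative; call_model and run_tool
# are hypothetical stand-ins for a real model API and a real tool executor).
def solve(task: str, max_steps: int = 300) -> str:
    context = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        step = call_model(context)          # returns thinking plus either a tool call or an answer
        # The crucial part: the thinking is appended to the context, not discarded.
        context.append({"role": "assistant",
                        "thinking": step.thinking,
                        "content": step.content,
                        "tool_call": step.tool_call})
        if step.tool_call is None:          # model produced a final answer
            return step.content
        result = run_tool(step.tool_call)   # execute the requested tool
        context.append({"role": "tool", "content": result})
    raise RuntimeError("step budget exhausted without a final answer")
```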
Models such as MiniMax M2 and Kimi K2 Thinking claim to support 200 to 300 sequential tool calls without human intervention, solving problems that would have been impossible just months ago. But as the AI industry rushes to implement this approach, a messy reality is emerging: most of today's infrastructure was built for stateless chatbots, not for agents that stream their internal monologue.
More Than Chain-of-Thought: When State Becomes Memory
In practice, this "state" is just more tokens—plans, hypotheses, constraints, intermediate conclusions—that later steps can attend to. But that simple technical fact has profound implications.
Classic chain-of-thought (CoT) prompting asks a model to produce a reasoning trace, but typically in one shot, before the answer. Researchers have bolted external tools around CoT for years; the difference with interleaved reasoning is first-class support at both the API and model level for (a minimal message-shape sketch follows the list):
- Multiple visible thinking segments within a single turn
- Tool calls interspersed between those segments
- Each thinking segment fed back as state for subsequent steps
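Concretely, a multi-step turn in such an API might look something like the following. The field names are schematic, loosely modeled on OpenAI- and Anthropic-style message formats rather than copied from any one vendor's schema:

```python
# Illustrative shape of a multi-step turn with preserved thinking.
# Field names are schematic, not a specific vendor's schema.
history = [
    {"role": "user", "content": "Why is the nightly build failing?"},
    {"role": "assistant",
     "thinking": "Likely a dependency bump; check the CI log first.",
     "tool_call": {"name": "read_ci_log", "arguments": {"job": "nightly"}}},
    {"role": "tool", "content": "ImportError: cannot import name 'parse' from 'yamllib'"},
    {"role": "assistant",
     "thinking": "The import error confirms the dependency theory; "
                 "pin the previous version and re-run.",
     "tool_call": {"name": "run_shell", "arguments": {"cmd": "pip install yamllib==2.3"}}},
    # ...and so on: every earlier "thinking" entry is resent with the next request.
]
```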
"Agents require interleaved thinking. Complex agent tasks have extremely long contexts. A single thought process at the start isn't enough to maintain instruction-following and coherence."
— MiniMax AI team, "Aligning to What?"
"Agents require interleaved thinking," writes the MiniMax team in their technical post-mortem "Aligning to What? Rethinking Agent Generalization in MiniMax M2." "Complex agent tasks have extremely long contexts. A single thought process at the start isn't enough to maintain instruction-following and coherence."
More crucially, agent tasks introduce constant, unpredictable perturbations from the outside world—tool outputs, API responses, file system changes. Swap the toolset, change a system prompt, break network connectivity mid-task: the model must handle these perturbations, diagnose errors, and extract useful information. Interleaved reasoning allows the model to constantly re-evaluate and adapt.
When prior state is dropped, cumulative understanding breaks down. State drift increases. Self-correction weakens. Planning degrades, especially on long-horizon tool chains and run-and-fix loops.
The Benchmark Case: When Memory Actually Matters
MiniMax ran a controlled experiment to quantify this effect, comparing their M2 model's performance with thinking state preserved versus discarded across multiple benchmarks. The results, published in their November 2025 technical announcement "Interleaved Thinking Unlocks Reliable MiniMax-M2 Agentic Capability," show dramatic differences:
| Benchmark | Thinking State Preserved | Thinking State Discarded | Relative Improvement |
|---|---|---|---|
| SWE-Bench Verified | 69.4% | 67.2% | +3.3% |
| Tau² | 87 | 64 | +35.9% |
| BrowseComp | 44.0% | 31.4% | +40.1% |
| GAIA | 75.7% | 67.9% | +11.5% |
| xBench | 72.0% | 66.0% | +9.1% |
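Note that the improvement column is a relative gain over the without-state score, not an absolute point difference: SWE-Bench Verified, for instance, moves 2.2 points (67.2 to 69.4), which is a 3.3% relative improvement. A quick check of the figures above:

```python
# Relative gain = (with_state - without_state) / without_state
scores = {"SWE-Bench Verified": (69.4, 67.2), "Tau²": (87, 64),
          "BrowseComp": (44.0, 31.4), "GAIA": (75.7, 67.9), "xBench": (72.0, 66.0)}
for name, (with_state, without_state) in scores.items():
    gain = (with_state - without_state) / without_state * 100
    print(f"{name}: +{gain:.1f}%")
# SWE-Bench Verified: +3.3%, Tau²: +35.9%, BrowseComp: +40.1%, GAIA: +11.5%, xBench: +9.1%
```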
That 40% improvement on BrowseComp—a benchmark testing AI's ability to continuously browse, search, and reason over hard-to-find web information—is particularly striking. As MiniMax puts it: "Reliability isn't just about what the model has inferred so far; it's about whether the model can revisit and revise what it thought before."
The performance extends beyond controlled benchmarks. Anthropic reports that Claude Sonnet 4.5, using an agentic setup with extended thinking and parallel test-time compute enabled, achieves strong performance on SWE-bench Verified—a benchmark requiring models to fix real bugs in real GitHub repositories. Moonshot claims Kimi K2 Thinking achieves 44.9% on Humanity's Last Exam, 60.2% on BrowseComp, and 71.3% on SWE-Bench Verified, according to their published results following the Artificial Analysis methodology.
As always, caution is needed—benchmarks differ in evaluation methodology, prompt engineering, and test-time compute settings, making cross-model comparisons approximate at best.
Who's Shipping What: The Interleaved Timeline
Anthropic broke ground in May 2025 with Claude 4 Opus and Sonnet. Extended thinking with tool use—documented in their official Claude API documentation—allowed Claude to alternate between reasoning and tool use, with thinking blocks preserved across tool calls. The feature is enabled through a beta API header.
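In practice, enabling this looks roughly like the sketch below, using Anthropic's Python SDK. The beta flag name, thinking parameters, and tool schema follow Anthropic's published extended-thinking documentation at the time of writing, and the model id is illustrative; treat this as a sketch and check the current docs before relying on it:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.beta.messages.create(
    model="claude-sonnet-4-5",                      # model id is illustrative
    max_tokens=4096,
    betas=["interleaved-thinking-2025-05-14"],      # beta flag per Anthropic's docs
    thinking={"type": "enabled", "budget_tokens": 8000},
    tools=[{
        "name": "run_query",
        "description": "Run a read-only SQL query against the analytics database.",
        "input_schema": {"type": "object",
                         "properties": {"sql": {"type": "string"}},
                         "required": ["sql"]},
    }],
    messages=[{"role": "user", "content": "Which table is growing fastest this month?"}],
)
# The response interleaves thinking blocks with tool_use blocks; to continue the
# turn, send the tool result back together with the unmodified thinking blocks.
```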
In September 2025, Anthropic released Claude Sonnet 4.5, which, company materials report, further refined these capabilities.
But the technology didn't stay proprietary. MiniMax's M2, released in October 2025, became the first major open-weight model to expose interleaved reasoning officially. The 230-billion-parameter mixture-of-experts architecture (activating just 10 billion parameters at inference) wraps reasoning in <think>...</think> tags and expects applications to feed those back on every turn, as detailed in Simon Willison's technical analysis and MiniMax's own documentation.
Then in November 2025 came Moonshot AI's Kimi K2 Thinking. The model was reported to cost around $4.6 million to train (per CNBC via an anonymous source—unverified but widely cited), representing a tiny fraction of what frontier labs typically spend. The model claims to match or exceed GPT-5 and Claude Sonnet 4.5 on selected reasoning and agentic benchmarks, according to Moonshot's published results, though direct comparisons remain challenging given different evaluation protocols.
Kimi K2 Thinking uses INT4 quantization-aware training, achieving a roughly 2× generation-speed improvement while maintaining competitive benchmark performance. All reported scores use this optimized inference, so there is no bait-and-switch between evaluation and production.
OpenAI took a different philosophical path. Their o-series reasoning models use what they call a "private chain of thought"—the reasoning process remains invisible to users, handled entirely server-side. You see the final answer and tool uses, but not the thinking between them.
The Infrastructure Trap: When Frameworks Lobotomize Agents
Here's an uncomfortable detail about the current AI stack: nearly all of it assumes a stateless chatbot on the other end, not an agent streaming its internal monologue.
For o-series and GPT-5 reasoning models, OpenAI hides reasoning tokens entirely. The Chat Completions API never sees them; reasoning happens server-side in what OpenAI documents as a "private chain of thought." This is an architectural and policy choice: because the reasoning never leaves OpenAI's servers, there is nothing for client-side middleware to preserve, or to lose.
But for models that do emit visible thinking—M2, K2, some Qwen variants—most popular OpenAI-style stacks assume that reasoning is either hidden or irrelevant. Visible thinking tags often get discarded by middleware built for /chat/completions, so prior reasoning never makes it back into the context.
MiniMax's M2, as noted earlier, expects its <think>...</think> blocks to come back with every request. If your framework helpfully strips "non-user-facing" content, you've just lobotomized your agent. Long-horizon benchmarks like BrowseComp crater when prior thoughts vanish, as MiniMax documents in their implementation guide.
One developer working with M2 described the experience: "I built a 120-step agent workflow for code analysis, and performance mysteriously collapsed. Took me two days to realize my UI framework was stripping out the <think> blocks. The model wasn't broken—my infrastructure was throwing away its memory."
The fix is mundane—preserve and replay the thinking trace—but MiniMax reports that much of the community feedback about M2's performance gaps stems from accidentally discarding this vital context. Applications blame the model instead of the middleware.
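The anti-pattern and its fix fit in a few lines. The sketch below is generic rather than MiniMax-specific; the point is simply that whatever carries the model's thinking (here, <think> tags in the raw text) must be replayed verbatim to the model and stripped only for display:

```python
import re

THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)

# The anti-pattern: middleware that "cleans" assistant output before storing it.
def store_turn_broken(history: list[dict], assistant_text: str) -> None:
    history.append({"role": "assistant",
                    "content": THINK_RE.sub("", assistant_text)})  # memory gone

# The fix: replay the full text to the model...
def store_turn(history: list[dict], assistant_text: str) -> None:
    history.append({"role": "assistant", "content": assistant_text})

# ...and strip the tags only when rendering for the user.
def display_text(assistant_text: str) -> str:
    return THINK_RE.sub("", assistant_text).strip()
```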
To address this, MiniMax has built dedicated support into their OpenAI-compatible API: the model's reasoning process now returns in a separate reasoning_details field rather than mixed with content. Passing this field back in subsequent requests maintains the complete chain of thought across multiple tool calls. They're working with ecosystem partners, including OpenRouter, Ollama, Vercel, and Cline, to test and implement interleaved reasoning correctly, aiming to establish unified protocol standards.
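Against that API, the integration burden mostly reduces to "echo the field back." The sketch below uses plain HTTP against an OpenAI-compatible endpoint; the reasoning_details field name comes from MiniMax's announcement, while the URL, model id, and tool stubs are placeholders to be checked against their current documentation:

```python
import requests

API_URL = "https://api.example-minimax-endpoint.com/v1/chat/completions"  # placeholder URL
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}
TOOLS = []  # your OpenAI-style tool definitions go here

def run_tool(call: dict) -> str:
    """Execute one tool call and return its textual result (stubbed here)."""
    return "tool output"

messages = [{"role": "user", "content": "Find and fix the failing unit test."}]

for _ in range(10):  # a few tool-call rounds
    msg = requests.post(API_URL, headers=HEADERS, json={
        "model": "MiniMax-M2",   # placeholder model id
        "messages": messages,
        "tools": TOOLS,
    }).json()["choices"][0]["message"]

    # Echo the assistant message back *including* reasoning_details,
    # so the chain of thought survives into the next request.
    messages.append({"role": "assistant",
                     "content": msg.get("content"),
                     "tool_calls": msg.get("tool_calls"),
                     "reasoning_details": msg.get("reasoning_details")})
    if not msg.get("tool_calls"):
        break
    for call in msg["tool_calls"]:
        messages.append({"role": "tool",
                         "tool_call_id": call["id"],
                         "content": run_tool(call)})
```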
The Cost of Thinking: Token Economics and Cache Invalidation
This power comes with a price tag. Extended thinking with interleaved tool use can consume 5–10× more tokens than standard mode, according to MiniMax's documentation and analysis of reasoning model economics. When you're paying per million tokens, that arithmetic matters.
The budget_tokens parameter can exceed the max_tokens limit when using interleaved reasoning with tools—it represents the total budget across all thinking blocks within one assistant turn, potentially reaching the entire 200K token context window, as detailed in Claude's extended thinking documentation.
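Concretely, that means a request like the following is legal when interleaved thinking is enabled, even though the thinking budget dwarfs the per-response cap. Parameter names follow Anthropic's extended-thinking docs; the specific values are illustrative:

```python
# With interleaved thinking enabled, budget_tokens caps the *total* thinking
# across the whole multi-tool turn, so it may legitimately exceed max_tokens
# (which limits a single response). Values here are illustrative.
request_params = {
    "max_tokens": 8192,                                        # per-response output cap
    "thinking": {"type": "enabled", "budget_tokens": 60000},   # whole-turn thinking budget
    "betas": ["interleaved-thinking-2025-05-14"],              # same beta flag as above
}
```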
Then there's cache invalidation. Changes to thinking parameters invalidate message cache breakpoints, as documented in Anthropic's AWS Bedrock integration guide. Interleaved reasoning amplifies this problem, as thinking blocks can occur between multiple tool calls. Every time you adjust how deeply the model should think, you potentially lose the performance benefits of prompt caching.
Test-time scaling—the broader category encompassing interleaved reasoning—can approach one to two orders of magnitude more compute on hard queries compared to a single inference pass, according to NVIDIA's analysis of AI scaling laws and research on test-time compute methodologies.
Yet for complex agentic tasks—the kind where you want AI to independently research, plan, code, debug, and iterate—the investment pays off. The difference isn't just quantitative but qualitative: these systems can handle genuinely longer workflows without losing coherence.
True Generalization: Beyond Tool Scaling
MiniMax's engineering team discovered something unexpected during M2's development: scaling the number and variety of tools wasn't enough. Their benchmark scores climbed to respectable levels, but when they changed the environment even slightly—swapping to a different agent framework—performance would plummet.
"Agent generalization is not just about adapting to new tools," they write in "Aligning to What? Rethinking Agent Generalization in MiniMax M2." "It's about adapting to perturbations across the model's entire operational space."
Everything can change in a single agent task:
- The tool information and available toolset
- The system prompt defining the agent's persona and rules
- The user prompt and its specific goal
- The environment itself (files, codebases, APIs)
- The tool responses returned at each step
Their original "tool scaling" approach only addressed the first item. It ignored perturbations in everything else. The solution required building a comprehensive data pipeline designed for full-trajectory generalization—training the model to remain stable against perturbations at every step.
The results proved encouraging. In internal tests, MiniMax threw obscure, "cold-start" frameworks at M2—agent scaffolds they'd barely considered during training. Performance exceeded expectations, with both tool-calling and instruction-following abilities generalizing robustly.
This insight helps explain why interleaved reasoning matters so much for real-world deployment. It's not just about maintaining state; it's about having the reasoning machinery to adapt when the external world doesn't behave as expected.
When 300 Steps Can Still Lead Nowhere
Yet for all this promise, interleaved reasoning systems remain surprisingly fragile. Current implementations require that the entire sequence of consecutive thinking blocks match the outputs generated during the original request. You cannot rearrange or modify the sequence without potentially breaking the reasoning chain.
This brittleness creates practical problems. Want to optimize conversation history by summarizing earlier parts? You can't—not easily—without disrupting the accumulated state. Want to inject additional context mid-conversation? You need to be extremely careful about where and how.
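One defensive pattern, given that constraint, is to treat the transcript as append-only: never rewrite or summarize earlier assistant turns that contain thinking blocks, and inject new context as fresh messages instead. A minimal sketch, with the block structure kept schematic:

```python
def inject_context(history: list[dict], note: str) -> list[dict]:
    """Add new information without disturbing earlier thinking blocks.

    The history is treated as append-only: earlier assistant turns (which may
    contain thinking blocks) are never edited, reordered, or summarized.
    """
    return history + [{"role": "user", "content": f"[context update] {note}"}]

# What NOT to do: rewriting or compressing old turns in place, e.g.
#   history[3]["content"] = summarize(history[3]["content"])
# which violates the requirement that replayed thinking blocks match the originals.
```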
There's also the question of what happens when reasoning leads astray. Say the model mis-hypothesizes that a performance issue stems from database queries in step 5; hundreds of steps later, it's still optimizing SQL and adding indexes, completely missing the real culprit—a memory leak in an unrelated caching layer. The model confidently built an entire strategy on a wrong premise, carrying that error forward.
With traditional models, a wrong answer is just a wrong answer. With interleaved reasoning, a wrong conclusion early can poison everything downstream, compounding the error. Models can get stuck in local optima, convinced by their own earlier reasoning. Self-correction is powerful but not foolproof.
This is why visible thinking matters for debugging. When Claude shows its reasoning trace (described by Anthropic as a "research preview"), developers can spot where the model's understanding diverged from reality. OpenAI took the opposite approach with o-series models, keeping reasoning hidden. Both philosophies have merit; the choice involves tradeoffs around transparency, competitive advantage, and potential misuse.
Human-in-the-loop oversight remains necessary, at least for now. The autonomy is impressive, but it's not infallible.
The Test-Time Scaling Economics
Interleaved reasoning is one manifestation of a broader trend: test-time scaling. For decades, AI development focused almost exclusively on training-time compute—bigger models, more data, larger clusters. But labs are approaching practical limits on model size, both technical and economic.
Test-time scaling offers a new dimension: instead of training a smarter model, you let the model allocate more compute to difficult problems during inference—through extended chain-of-thought generation, self-verification loops, or tree-search over reasoning paths. Recent reasoning models like OpenAI's o-series and DeepSeek-R1 use what researchers call "internal test-time scaling": they're trained to allocate inference compute strategically.
Training Smarter vs. Thinking Longer
- Traditional approach: Spend $100M training a massive model that answers quickly
- New paradigm: Spend $5M training a capable model and let it think 10× longer on hard problems
The optimal strategy likely involves both—but the balance is shifting toward test-time compute as training costs plateau.
The economics are shifting. When inference compute can substitute for training compute, you get different optimization problems. Do you train one massive model, or a smaller model that thinks longer on hard queries? Do you charge per token, per compute-minute, or per quality-adjusted output?
These questions will shape the industry's near future. Skyler Miao, MiniMax's head of engineering, stated in company communications that they're working with partners to enable interleaved reasoning across all capable models, not just M2. If that vision materializes—if interleaved reasoning becomes a standard feature rather than a proprietary advantage—we'll see an explosion of agentic applications that simply weren't feasible before.
Where This Leaves Autonomous AI
Models that can execute 200–300 tool calls while maintaining coherent reasoning across the entire sequence open up qualitatively new applications:
Autonomous research: An AI that formulates research questions, searches for relevant papers, reads and synthesizes findings, identifies gaps, searches for more specific information, and iterates until it has a comprehensive answer—without human guidance at each step.
Complex debugging: Systems that reproduce a bug, hypothesize about causes, instrument code to test hypotheses, analyze results, refine the hypothesis, and repeat until the root cause is found—all autonomously.
Strategic planning: Models that analyze a business situation, research market conditions, evaluate multiple strategic options, model scenarios, identify risks, propose mitigations, and iterate based on new information discovered during the process.
The common thread: tasks requiring not just intelligence but adaptability. The ability to realize mid-task that your approach isn't working and pivot. To discover information that changes your understanding of the problem. To build complex mental models across dozens or hundreds of steps.
But the infrastructure has to catch up. Until agent frameworks, API middleware, and development tools properly handle preserved reasoning state, many deployments will underperform—not because the models are incapable, but because the plumbing is throwing away the very thing that makes them work.
The question isn't whether AI will become more autonomous. With models now capable of maintaining coherent reasoning across hundreds of tool interactions, autonomy has arrived. The question is whether we can build the infrastructure, safety mechanisms, and development practices to make that autonomy reliable, debuggable, and aligned with human intentions.
The technical building blocks exist. What's missing is the operational discipline, the toolchain maturity, and the human oversight frameworks to deploy these capabilities responsibly. Because when your AI can reason 300 steps ahead, carrying its accumulated understanding all the way through, you'd better make sure it's reasoning about the right things—and that you can see when it's not.
References and Further Reading:
- MiniMax AI: "Interleaved Thinking Unlocks Reliable MiniMax-M2 Agentic Capability" (November 2025)
- MiniMax AI: "Aligning to What? Rethinking Agent Generalization in MiniMax M2" (October 2025)
- Anthropic: "Building with extended thinking" - Claude API Documentation
- Moonshot AI: Kimi K2 Thinking Model Card
- Willison, Simon: "On MiniMax M2 and LLMs with Interleaved Thinking Steps" (MacStories)
- NVIDIA AI Blog: "How Scaling Laws Drive Smarter, More Powerful AI" (May 2025)
- VentureBeat: "Moonshot's Kimi K2 Thinking emerges as leading open source AI" (November 2025)