The latest generation of AI models from OpenAI, Anthropic, and others promises something revolutionary: machines that can "think" before they answer. These Large Reasoning Models (LRMs) generate detailed chains of thought, reflect on their own reasoning, and supposedly tackle complex problems better than their predecessors. But new research from Apple throws cold water on these claims, revealing that when problems get genuinely difficult, these models don't just struggle: they essentially give up.
The Puzzle Test That Broke the Models
Apple researchers put frontier reasoning models through their paces using classic logic puzzles: Tower of Hanoi, Checker Jumping, River Crossing, and Blocks World. Unlike typical benchmarks that might be contaminated by training data, these puzzles offer something crucial: controllable complexity. You can make a Tower of Hanoi puzzle harder simply by adding more disks, allowing researchers to systematically probe where reasoning breaks down.
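To see why disk count works as such a clean complexity dial, consider the classic recursive solver below. This is a minimal sketch for illustration, not the paper's evaluation code: the optimal solution roughly doubles in length with every disk added, growing as 2^n - 1 moves.

```python
# Minimal sketch (not the paper's evaluation harness): the classic recursive
# Tower of Hanoi solver. Each extra disk roughly doubles the optimal solution
# length (2**n - 1 moves), which is what makes disk count a clean complexity dial.

def hanoi_moves(n, source="A", target="C", spare="B"):
    """Return the optimal move sequence for n disks as (from_peg, to_peg) pairs."""
    if n == 0:
        return []
    return (
        hanoi_moves(n - 1, source, spare, target)    # park the top n-1 disks on the spare peg
        + [(source, target)]                         # move the largest disk to the target
        + hanoi_moves(n - 1, spare, target, source)  # restack the n-1 disks on top of it
    )

for disks in (3, 5, 7, 10):
    print(disks, "disks:", len(hanoi_moves(disks)), "moves")  # 7, 31, 127, 1023
```

By ten disks the shortest solution is already 1,023 moves long, so even a correct strategy has to be executed flawlessly over a very long sequence.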
The results paint a fascinating picture of three distinct performance regimes. On simple tasks, standard LLMs actually outperform their "thinking" cousins: they're faster, more efficient, and just as accurate. It's only at medium complexity that reasoning models shine, leveraging their extended thinking to navigate trickier problems. But push the difficulty higher and both model types collapse entirely, dropping to zero percent accuracy.
What's particularly damning is that this isn't a graceful degradation. The models don't slowly get worse—they hit a wall and stop functioning entirely. Even more puzzling: as problems approach this critical threshold, reasoning models actually reduce their thinking effort, using fewer reasoning tokens despite having ample computational budget available.
The Overthinking Paradox
The study's deep dive into reasoning traces reveals another counterintuitive finding. On easier problems, these models often find the correct answer early in their thinking process but then continue exploring, frequently talking themselves into wrong solutions—a phenomenon researchers dub "overthinking."
Consider Claude 3.7 Sonnet tackling a simple Tower of Hanoi puzzle. The model might correctly identify the solution within the first 20% of its thinking, then spend the remaining 80% second-guessing itself and exploring incorrect alternatives. This isn't efficient reasoning; it's computational wheel-spinning.
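One way to make "overthinking" measurable is to extract every candidate solution a model writes down while thinking, verify each one, and note how far into the trace the first correct answer appears. The sketch below assumes a hypothetical, already-extracted list of candidates and a verifier function; it mirrors the kind of trace analysis described, not the authors' code.

```python
# Hedged sketch of this kind of trace analysis. The trace format and names are
# illustrative assumptions, not the authors' code: we assume the candidate
# solutions proposed during "thinking" have been extracted in order of
# appearance, and that a verifier can check each one.

def first_correct_fraction(candidates, is_correct):
    """Fraction of the candidate sequence consumed before the first correct one (None if never)."""
    for i, candidate in enumerate(candidates):
        if is_correct(candidate):
            return (i + 1) / len(candidates)
    return None

# Toy example: the right answer is the first of five candidates, i.e. it appears
# after 20% of the candidates, yet the model keeps exploring afterwards.
trace = ["move A->C ...", "move A->B ...", "move C->B ...", "move A->C ...", "move B->A ..."]
print(first_correct_fraction(trace, lambda c: c == "move A->C ..."))  # 0.2
```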
At medium complexity, the pattern reverses. Models explore many wrong paths before eventually finding the right answer, suggesting their extended thinking does serve a purpose—but only within a narrow band of problem difficulty. Once complexity exceeds a model-specific threshold, the thinking process produces no correct solutions at all, regardless of how long the model "thinks."
Algorithm Execution: A Fundamental Failure
Perhaps the most striking finding concerns what should be the easiest scenario: following explicit instructions. Researchers provided models with complete, step-by-step algorithms for solving puzzles. Surely, executing a given algorithm requires less sophisticated reasoning than discovering a solution from scratch?
Apparently not. Even with the solution method handed to them on a silver platter, models failed at roughly the same complexity levels. This suggests the limitation isn't in problem-solving strategy or search; it's in the fundamental ability to maintain logical consistency through extended sequences of operations.
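To appreciate how mechanical that task is, here is what executing a Tower of Hanoi move sequence looks like as ordinary code. This is an illustrative sketch in the spirit of the puzzle simulators used for evaluation, not the paper's implementation: it just tracks the three pegs and stops at the first illegal move.

```python
# Illustrative move executor for Tower of Hanoi (an assumption in the spirit of
# the paper's puzzle simulators, not its code). It tracks the three pegs and
# reports how many moves of a proposed sequence are legal before the first failure.

def execute_moves(n_disks, moves):
    """Apply (from_peg, to_peg) moves to pegs A/B/C; return the count of legal moves executed."""
    pegs = {"A": list(range(n_disks, 0, -1)), "B": [], "C": []}  # bottom-to-top, largest disk first
    for count, (src, dst) in enumerate(moves):
        if not pegs[src]:
            return count                       # no disk to take from the source peg
        if pegs[dst] and pegs[dst][-1] < pegs[src][-1]:
            return count                       # cannot place a larger disk on a smaller one
        pegs[dst].append(pegs[src].pop())
    return len(moves)

# The optimal 3-disk sequence (7 moves) passes in full.
optimal_3 = [("A", "C"), ("A", "B"), ("C", "B"), ("A", "C"), ("B", "A"), ("B", "C"), ("A", "C")]
print(execute_moves(3, optimal_3))  # 7
```

A dozen lines of deterministic bookkeeping suffice; the striking result is that reasoning models cannot sustain equivalent bookkeeping once the sequence grows long, even with the algorithm spelled out for them.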
Training Data Can't Explain Everything
The study uncovered telling disparities between puzzle types that hint at training data effects. Claude 3.7 Sonnet could handle Tower of Hanoi puzzles requiring over 100 moves but failed on River Crossing puzzles needing just 11 moves. The likely culprit? Tower of Hanoi is a computer science staple that appears frequently online, while complex river-crossing variants are comparatively rare.
This raises uncomfortable questions about what these models have actually learned. Are they reasoning through problems, or are they sophisticated pattern-matchers that break down when pushed beyond their training distribution?
The Scaling Wall
For years, the AI industry has operated on a simple principle: more compute, more data, bigger models, better performance. But as traditional scaling shows diminishing returns, companies like OpenAI have bet heavily on "inference-time compute"—letting models think longer to solve harder problems.
Apple's research suggests this approach faces fundamental limits. Current reasoning models don't develop what the researchers call "generalizable problem-solving capabilities." Instead of learning robust strategies that scale with complexity, they've learned to navigate a specific range of problems through pattern matching and retrieval.
The counterintuitive reduction in thinking effort at high complexity levels is particularly telling. It's as if the models recognize they're out of their depth and don't even try. This behavior, consistent across different model families, points to architectural limitations rather than training deficiencies.
What This Means for AI's Future
These findings arrive at a crucial moment. As traditional model scaling plateaus, reasoning methods represent one of the few remaining avenues for capability improvements. If Apple's results generalize beyond logic puzzles—and there's reason to believe they might—the implications are sobering.
The good news is that within their comfort zone, reasoning models do provide value. For problems of moderate complexity, the extended thinking genuinely helps. The bad news is that this comfort zone appears frustratingly narrow, and we don't yet know how to expand it.
The research also highlights the danger of anthropomorphizing AI systems. When we see models producing "chains of thought," it's tempting to imagine them reasoning like humans. But as the researchers note, these outputs might be better understood as "statistical calculations" dressed up in human-like language.
The Path Forward
So where does this leave us? The Apple team suggests that achieving robust machine reasoning may require fundamental architectural innovations, not just incremental improvements to current approaches. The consistent failure modes across different model families point to limits that may be inherent in the transformer architecture underlying modern LLMs.
For practitioners, the message is clear: reasoning models can be valuable tools, but their limitations are real and predictable. Understanding where they excel (moderate complexity, familiar problem types) and where they fail (high complexity, novel challenges) is crucial for appropriate deployment.
The study serves as a reality check for an industry prone to hype. Yes, we've made remarkable progress in AI capabilities. But genuine reasoning—the kind that can tackle arbitrary problems through logical thinking rather than pattern matching—remains frustratingly out of reach. These models don't think; they perform increasingly sophisticated approximations of thinking that work until they don't.
As we push toward artificial general intelligence, studies like this remind us how far we still have to go. The illusion of thinking is powerful, but illusions have a way of shattering when tested against reality. And reality, as these puzzles demonstrate, can be puzzlingly hard.