The Illusion of Thinking: What Apple’s Latest Study Reveals About the Limits of AI Reasoning


Image: Biorobot / Freepik

HC Editorial Staff
11/09/2025

Artificial intelligence (AI) has long been celebrated for its ability to mimic human reasoning, especially with the rise of so-called Large Reasoning Models (LRMs). These models do more than provide final answers: they generate detailed “chains of thought” that appear to show a step-by-step reasoning process. But is this truly thinking—or just an illusion of thought?

A recent paper from Apple Machine Learning Research, The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity, takes a fresh approach to this debate. Published on arXiv in June 2025, the study systematically tests how reasoning models behave when faced with puzzles of increasing complexity. The results raise important questions about what AI can—and cannot—do when it comes to genuine problem-solving.

Why Reasoning Models Are Under the Microscope

Modern LRMs are designed to simulate human-like reasoning by producing long, explicit thought traces. This makes them especially appealing in sensitive domains such as medicine, law, or education, where transparency is critical. However, as the Apple researchers point out, “standard benchmarks often measure only the final answer, leaving unanswered whether models are genuinely reasoning or merely exploiting learned correlations.”

This tension—between apparent reasoning and actual reasoning—is the starting point for Apple’s investigation. The central question: What happens when these models are pushed beyond simple tasks into problems of escalating compositional complexity?

A Puzzle-Based Methodology

To minimize bias and avoid benchmark contamination, the Apple team designed controlled puzzle environments where complexity could be increased step by step. Think of puzzles like the Tower of Hanoi: the shortest solution for n disks requires 2^n - 1 moves, so every added disk roughly doubles the number of moves and decision branches.

In their experiments, researchers compared LRMs against traditional large language models (LLMs) under the same computational budget. They measured two main outputs:

  1. Final accuracy—whether the model reached the correct solution.
  2. Internal reasoning traces—the step-by-step “thoughts” generated before producing an answer.

The methodology is notable because it goes beyond black-box evaluation. By observing not just the outcome but the reasoning process, the researchers were able to identify systematic strengths and weaknesses that would be invisible in conventional benchmarks.

Key Findings: Strengths and Limitations

The results reveal a clear three-phase pattern in how reasoning models behave as problems grow harder:

  • Low Complexity
    In simple tasks, traditional LLMs sometimes matched—or even outperformed—LRMs. According to the study, “extra reasoning steps can introduce errors, leading to worse performance in trivial cases.” In other words, overthinking hurts.
  • Medium Complexity
    This is where LRMs shine. Their extended reasoning traces enable them to navigate intermediate tasks more reliably than LLMs, proving the value of their design.
  • High Complexity
    At a certain threshold, both LRMs and LLMs collapse. Accuracy plummets, and strikingly, LRMs begin to shorten their reasoning traces—even when plenty of tokens are available. Instead of working harder, the models seem to “give up” once the challenge exceeds their operational limits.

The authors describe this as a “counterintuitive scaling limit.” Up to a point, increasing complexity produces more reasoning effort. Beyond that point, reasoning effort actually decreases.

This collapse highlights a fundamental fragility: LRMs do not apply exact algorithms or reliably reproduce computational steps. Instead, they rely on heuristics that break down in the face of deep compositionality.

Practical and Societal Implications

The findings carry significant implications for policymakers, developers, and industries considering AI deployment. The lesson is twofold:

  1. Do not assume general reasoning ability just because a model produces a chain of thought. Explanations may look convincing without guaranteeing correctness.
  2. Deploy selectively: LRMs add real value in intermediate tasks but cannot be trusted as problem complexity grows.

For high-stakes domains such as medical diagnosis, legal analysis, or resource planning, relying solely on reasoning traces could be dangerous. Instead, the authors recommend hybrid approaches that combine statistical models with exact algorithms or symbolic reasoning systems.

This caution is particularly relevant in regions like Latin America, where LRMs could be deployed in agriculture support systems or public service platforms. Controlled pilots, rather than mass rollouts, are the safest way to test whether these tools genuinely add value in complex real-world environments.

Conclusion

Apple’s study underscores that what looks like “thinking” in AI may not be thinking at all. LRMs provide clear benefits in moderate-complexity tasks, but they falter when problems become too difficult—sometimes even retreating from reasoning altogether.

The broader message is simple yet profound: we should not mistake the appearance of thought for actual reasoning. Understanding this distinction is crucial for safe, effective, and trustworthy AI deployment.


Topics of interest

Technology

Reference: Shojaee P, Mirzadeh I, Alizadeh K, Horton M, Bengio S, Farajtabar M. The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity [Internet]. arXiv; 2025. Available at: https://doi.org/10.48550/arXiv.2506.06941

