HC Editorial Team
11/09/2025
Artificial intelligence (AI) has long been celebrated for its ability to mimic human reasoning, especially with the rise of so-called Large Reasoning Models (LRMs). These models do more than provide final answers: they generate detailed “chains of thought” that appear to show a step-by-step reasoning process. But is this truly thinking—or just an illusion of thought?
A recent paper from Apple Machine Learning Research, The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity, takes a fresh approach to this debate. Published on arXiv in June 2025, the study systematically tests how reasoning models behave when faced with puzzles of increasing complexity. The results raise important questions about what AI can—and cannot—do when it comes to genuine problem-solving.
Modern LRMs are designed to simulate human-like reasoning by producing long, explicit thought traces. This makes them especially appealing in sensitive domains such as medicine, law, or education, where transparency is critical. However, as the Apple researchers point out, “standard benchmarks often measure only the final answer, leaving unanswered whether models are genuinely reasoning or merely exploiting learned correlations.”
This tension—between apparent reasoning and actual reasoning—is the starting point for Apple’s investigation. The central question: What happens when these models are pushed beyond simple tasks into problems of escalating compositional complexity?
To minimize bias and avoid benchmark contamination, the Apple team designed controlled puzzle environments where complexity could be increased step by step. Think of puzzles like the Tower of Hanoi: every added disk doubles the minimum number of moves required, and the space of possible decision branches grows even faster.
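As a quick illustration of that exponential growth (the snippet below is ours, not code from the study), the minimum number of moves to solve an n-disk Tower of Hanoi is 2^n − 1:

```python
# Minimum number of moves to solve an n-disk Tower of Hanoi is 2**n - 1:
# each extra disk doubles the amount of work required.
def min_moves(n_disks: int) -> int:
    return 2 ** n_disks - 1

for n in range(1, 11):
    print(f"{n:2d} disks -> {min_moves(n):4d} moves minimum")
```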
In their experiments, researchers compared LRMs against traditional large language models (LLMs) under the same computational budget. They measured two main outputs: whether the final answer was correct, and what the reasoning trace itself looked like, including how much thinking effort the model spent as the puzzles grew harder.
The methodology is notable because it goes beyond black-box evaluation. By observing not just the outcome but the reasoning process, the researchers were able to identify systematic strengths and weaknesses that would be invisible in conventional benchmarks.
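In spirit, the setup can be pictured as an evaluation loop that sweeps over complexity levels and records both signals, the final answer and the trace. The sketch below uses hypothetical helper names (puzzle_family.generate, puzzle_family.check, model.solve); the paper's actual harness is more elaborate.

```python
# A minimal sketch of this kind of evaluation loop (hypothetical helper names;
# the study's real harness differs). For each complexity level we record both
# final-answer correctness and a crude proxy for reasoning effort.
def evaluate(model, puzzle_family, max_complexity: int):
    results = []
    for complexity in range(1, max_complexity + 1):
        instance = puzzle_family.generate(complexity)          # e.g. n-disk Hanoi
        trace, answer = model.solve(instance)                  # reasoning text + final answer
        results.append({
            "complexity": complexity,
            "correct": puzzle_family.check(instance, answer),  # exact simulator check
            "reasoning_tokens": len(trace.split()),            # rough effort measure
        })
    return results
```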
The results reveal a clear three-phase pattern in how reasoning models behave as problems grow harder. At low complexity, standard LLMs often match or even outperform LRMs while using fewer tokens. At medium complexity, the extra "thinking" of LRMs pays off and gives them a real advantage. At high complexity, the accuracy of both model types collapses toward zero.
The authors describe this as a “counterintuitive scaling limit.” Up to a point, increasing complexity produces more reasoning effort. Beyond that point, reasoning effort actually decreases.
This collapse highlights a fundamental fragility: LRMs do not apply exact algorithms or reliably reproduce computational steps. Instead, they rely on heuristics that break down in the face of deep compositionality.
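The contrast is striking because the exact procedure itself is tiny. A few lines of textbook recursion solve any Tower of Hanoi instance correctly, which is precisely the kind of compositional, step-exact computation the study finds LRMs cannot reliably reproduce:

```python
# The classic exact solution: move n disks from 'src' to 'dst' using 'aux'.
# Every step is fully determined; there is no heuristic guesswork involved.
def hanoi(n: int, src: str, dst: str, aux: str, moves: list) -> None:
    if n == 0:
        return
    hanoi(n - 1, src, aux, dst, moves)   # clear the top n-1 disks out of the way
    moves.append((src, dst))             # move the largest remaining disk
    hanoi(n - 1, aux, dst, src, moves)   # re-stack the n-1 disks on top of it

moves = []
hanoi(4, "A", "C", "B", moves)
print(len(moves), "moves:", moves)       # 15 moves for 4 disks (2**4 - 1)
```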
The findings carry significant implications for policymakers, developers, and industries considering AI deployment. The lesson is twofold: a detailed reasoning trace is not, by itself, evidence of sound problem-solving, and the usefulness of an LRM depends strongly on how complex the target task actually is.
For high-stakes domains such as medical diagnosis, legal analysis, or resource planning, relying solely on reasoning traces could be dangerous. Instead, the authors recommend hybrid approaches that combine statistical models with exact algorithms or symbolic reasoning systems.
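One simple shape such a hybrid can take is "propose, then verify": the model drafts a candidate solution and an exact checker replays every step before anything is acted on. The sketch below is only illustrative; get_model_moves stands in for whatever LRM call a system would use.

```python
# A minimal "propose, then verify" hybrid, as a sketch (get_model_moves is a
# hypothetical stand-in for an LRM call returning a list of (src, dst) moves).
def verify_hanoi(n_disks: int, moves: list) -> bool:
    """Replay the proposed moves on an exact simulator; reject any illegal step."""
    pegs = {"A": list(range(n_disks, 0, -1)), "B": [], "C": []}
    for src, dst in moves:
        if not pegs[src]:
            return False                                  # moving from an empty peg
        if pegs[dst] and pegs[dst][-1] < pegs[src][-1]:
            return False                                  # larger disk onto smaller
        pegs[dst].append(pegs[src].pop())
    return len(pegs["C"]) == n_disks                      # all disks on the target peg

# proposed = get_model_moves(n_disks=4)                   # hypothetical LRM call
# if verify_hanoi(4, proposed): use it; otherwise fall back to an exact solver
```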
This caution is particularly relevant in regions like Latin America, where LRMs could be deployed in agriculture support systems or public service platforms. Controlled pilots, rather than mass rollouts, are the safest way to test whether these tools genuinely add value in complex real-world environments.
Apple’s study underscores that what looks like “thinking” in AI may not be thinking at all. LRMs provide clear benefits in moderate-complexity tasks, but they falter when problems become too difficult—sometimes even retreating from reasoning altogether.
The broader message is simple yet profound: we should not mistake the appearance of thought for actual reasoning. Understanding this distinction is crucial for safe, effective, and trustworthy AI deployment.
Reference: Shojaee P, Mirzadeh I, Alizadeh K, Horton M, Bengio S, Farajtabar M. The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity [Internet]. arXiv; 2025. Available from: https://doi.org/10.48550/arXiv.2506.06941