Large Language Models (LLMs) like GPT-4 or LLaMA have amazed the world with their ability to perform seemingly intelligent tasks, including basic arithmetic. But do they actually understand math, or are they just stringing together clever shortcuts?
A provocative new study—Arithmetic Without Algorithms: Language Models Solve Math With a Bag of Heuristics, by researchers from Technion and Northeastern University—dives deep into the internal workings of LLMs and finds something surprising: these models don’t use algorithms to solve arithmetic. Instead, they rely on a scattered toolkit of simple heuristics—rules of thumb encoded in specific neurons.
This discovery has deep implications for how we interpret, design, and regulate AI systems, especially as we increasingly trust them with educational, scientific, and technical tasks.
The study explored how LLMs perform elementary addition and subtraction, focusing on models like LLaMA 3-8B, GPT-J, and Pythia. The researchers uncovered a “circuit”, a small set of neurons responsible for arithmetic behavior, and examined it using causal interventions (techniques known as activation “patching” and component ablation).
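To make “patching” concrete, here is a minimal sketch in the spirit of such causal interventions, not the paper’s actual code: activations from a clean arithmetic prompt are cached and swapped into a second run at one layer, to test whether that layer carries the answer. The model name, layer index, and prompts below are illustrative assumptions.

```python
# Minimal activation-patching sketch (illustrative; the model, layer, and prompts
# are assumptions, not the setup used in the paper).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # small stand-in for the larger models studied
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

clean_prompt = "226 + 68 ="    # run whose activations we cache
corrupt_prompt = "226 + 11 ="  # run we patch (chosen to tokenize to the same length)

layer = 6        # which transformer block's MLP output to patch (an assumption)
cached = {}

def cache_hook(module, inputs, output):
    cached["mlp_out"] = output.detach()

def patch_hook(module, inputs, output):
    # Replace this run's MLP output with the cached activations from the clean run.
    return cached["mlp_out"]

mlp = model.transformer.h[layer].mlp

with torch.no_grad():
    # 1) Clean run: cache the MLP activations at the chosen layer.
    handle = mlp.register_forward_hook(cache_hook)
    model(**tok(clean_prompt, return_tensors="pt"))
    handle.remove()

    # 2) Patched run: does the prediction move toward the clean answer?
    handle = mlp.register_forward_hook(patch_hook)
    logits = model(**tok(corrupt_prompt, return_tensors="pt")).logits
    handle.remove()

print(tok.decode([logits[0, -1].argmax().item()]))  # next-token prediction after patching
```

If patching a component changes the prediction, that component is causally involved in producing the answer; ablating it (zeroing or replacing its output) asks the complementary question.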
Rather than finding a clean, algorithmic process, they discovered the opposite: each neuron encodes a simple rule, like “if the sum is around 70, prefer this digit.” Together, these neurons act like a bag of heuristics, loosely collaborating to produce the final answer.
This mechanism is fragile. When researchers selectively removed even one heuristic neuron, accuracy on arithmetic tasks dropped by up to 29%—a clear sign that these rules are essential, but brittle.
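To see how a handful of rules can work most of the time yet break when one is removed, here is a toy, invented illustration (it is not the paper’s actual neurons): a few narrow, range-based rules jointly answer two-digit additions, and deleting a single rule destroys accuracy on exactly the sums it covered.

```python
# Toy "bag of heuristics" (invented illustration, not the paper's neurons):
# each rule fires only on a narrow range of sums and votes for one piece of the answer.

heuristics = [
    # (name, firing condition, vote it contributes)
    ("ones_digit", lambda a, b: True,             lambda a, b: ("ones", (a + b) % 10)),
    ("sum_in_60s", lambda a, b: 60 <= a + b < 70, lambda a, b: ("tens", 6)),
    ("sum_in_70s", lambda a, b: 70 <= a + b < 80, lambda a, b: ("tens", 7)),
    ("sum_in_80s", lambda a, b: 80 <= a + b < 90, lambda a, b: ("tens", 8)),
]

def answer(a, b, active):
    # Combine the votes of every heuristic whose condition fires on this input.
    votes = dict(vote(a, b) for _, fires, vote in active if fires(a, b))
    return votes.get("tens", 0) * 10 + votes.get("ones", 0)

def accuracy(active, pairs):
    return sum(answer(a, b, active) == a + b for a, b in pairs) / len(pairs)

pairs = [(a, b) for a in range(30, 45) for b in range(30, 45)]   # sums between 60 and 88
ablated = [h for h in heuristics if h[0] != "sum_in_70s"]        # "remove one heuristic"

print("full bag:", accuracy(heuristics, pairs))       # 1.0 on the inputs the rules cover
print("one rule ablated:", accuracy(ablated, pairs))  # drops sharply on sums in the 70s
```

Nothing in this toy understands addition; it merely covers the inputs it was built for, which mirrors the paper’s explanation for why ablating a few neurons hurts so much.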
The models didn’t learn generalizable arithmetic procedures. Instead, they pieced together task-specific patterns—ways to match input-output examples seen during training.
“They’re not calculating; they’re imitating,” the authors note. “Each neuron applies a small heuristic, and the model layers them like duct tape until the result looks right.”
Heuristics emerged early in the training of the Pythia-6.9B model and solidified over time. Yet, even in later training stages, these heuristics didn’t evolve into robust algorithms.
This suggests a fundamental limitation: current LLMs lack an inductive bias toward algorithmic structure. They’re pattern recognizers, not mathematicians.
When an answer requires a rule the model hasn’t learned—or if multiple heuristics activate incorrectly—the model fails. This explains erratic errors in LLMs: they’re not bugs, but missing or misfiring heuristics.
This study provides crucial insight for policymakers, AI ethicists, and engineers. The fact that a model gets the right answer does not mean it understands the problem; its success may hinge on a few arbitrary neuron-level heuristics.
In high-stakes applications—like medical calculations, financial modeling, or education—this could lead to silent failures that users can't easily predict or detect.
Rather than expecting LLMs to "learn math" on their own, the authors suggest blending them with algorithmic components—for example, integrating calculators or symbolic computation modules.
Hybrid architectures could bring the best of both worlds: LLM flexibility with algorithmic accuracy.
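As a sense of what that hybrid could look like, here is a minimal routing sketch under stated assumptions: the generate_text() function is a hypothetical stand-in for whatever LLM is in use, and the regex only catches bare two-operand expressions. Arithmetic is computed deterministically; everything else falls through to the model.

```python
# Minimal sketch of a hybrid setup: arithmetic spans are delegated to ordinary code,
# and only non-arithmetic questions reach the language model. The generate_text()
# stub and the routing regex are assumptions for illustration, not a real framework.
import re

def generate_text(prompt: str) -> str:
    """Hypothetical wrapper around an LLM call (hosted or local)."""
    raise NotImplementedError

ARITH = re.compile(r"(-?\d+)\s*([+\-*/])\s*(-?\d+)")

def solve_arithmetic(m: re.Match) -> str:
    a, op, b = int(m.group(1)), m.group(2), int(m.group(3))
    result = {"+": a + b, "-": a - b, "*": a * b, "/": a / b if b else float("nan")}[op]
    return str(result)

def hybrid_answer(question: str) -> str:
    # Route: exact computation for arithmetic, the model for everything else.
    m = ARITH.search(question)
    return solve_arithmetic(m) if m else generate_text(question)

print(hybrid_answer("What is 226 + 68?"))  # -> 294, computed rather than predicted
```

Production systems do this with tool calling or code execution, but the principle is the same: let the model decide what to say and let deterministic code decide what the numbers are.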
For regulators, the takeaway is clear: models that appear competent may arrive at answers through opaque, ad hoc heuristics rather than principled procedures. Auditing such systems requires understanding their internal causal mechanisms, not just performance metrics.
For educators and technologists developing AI-powered learning tools, this study is a red flag. If a virtual tutor can’t reliably calculate, it shouldn’t be trusted to grade student math work or teach concepts.
Similarly, for the Global South—where LLMs are increasingly being explored for automated education and public services—this finding warns against over-reliance without robust validation.
The authors recommend several paths forward, among them the hybrid, tool-augmented designs described above and closer scrutiny of models’ internal mechanisms.
“This isn’t just a bug—it’s a feature of current LLMs,” says co-author Yonatan Belinkov. “If we want real reasoning, we need models that go beyond patterns.”
The study dismantles the myth that language models “understand” math. They don’t compute—they approximate. They don’t reason—they cobble together rules that worked before.
This doesn’t make them useless, but it does make them unreliable—especially in edge cases or unfamiliar problems. And that has major implications for how we build and use AI systems going forward.
Reference: Nikankin Y, Reusch A, Mueller A, Belinkov Y. Arithmetic Without Algorithms: Language Models Solve Math With a Bag of Heuristics. arXiv preprint arXiv:2410.21272. 2025 May 20. doi:10.48550/arXiv.2410.21272