Large Language Models (LLMs) like GPT-4 or LLaMA have amazed the world with their ability to perform seemingly intelligent tasks, including basic arithmetic. But do they actually understand math, or are they just stringing together clever shortcuts?
A provocative new study—Arithmetic Without Algorithms: Language Models Solve Math With a Bag of Heuristics, by researchers from Technion and Northeastern University—dives deep into the internal workings of LLMs and finds something surprising: these models don’t use algorithms to solve arithmetic. Instead, they rely on a scattered toolkit of simple heuristics—rules of thumb encoded in specific neurons.
This discovery has deep implications for how we interpret, design, and regulate AI systems, especially as we increasingly trust them with educational, scientific, and technical tasks.
The study explored how LLMs perform elementary addition and subtraction, focusing on models like LLaMA 3-8B, GPT-J, and Pythia. The researchers uncovered a “circuit”, a small set of neurons responsible for arithmetic behavior, and examined it using causal interventions (techniques known as activation “patching” and component ablation).
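To make “patching” concrete, here is a minimal sketch in the spirit of such causal interventions, not the paper’s actual code: activations from a clean arithmetic prompt are cached and swapped into a second run at one layer, to test whether that layer carries the answer. The model name, layer index, and prompts below are illustrative assumptions.

```python
# Minimal activation-patching sketch (illustrative; the model, layer, and prompts
# are assumptions, not the setup used in the paper).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # small stand-in for the larger models studied
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

clean_prompt = "226 + 68 ="    # run whose activations we cache
corrupt_prompt = "226 + 11 ="  # run we patch (chosen to tokenize to the same length)

layer = 6        # which transformer block's MLP output to patch (an assumption)
cached = {}

def cache_hook(module, inputs, output):
    cached["mlp_out"] = output.detach()

def patch_hook(module, inputs, output):
    # Replace this run's MLP output with the cached activations from the clean run.
    return cached["mlp_out"]

mlp = model.transformer.h[layer].mlp

with torch.no_grad():
    # 1) Clean run: cache the MLP activations at the chosen layer.
    handle = mlp.register_forward_hook(cache_hook)
    model(**tok(clean_prompt, return_tensors="pt"))
    handle.remove()

    # 2) Patched run: does the prediction move toward the clean answer?
    handle = mlp.register_forward_hook(patch_hook)
    logits = model(**tok(corrupt_prompt, return_tensors="pt")).logits
    handle.remove()

print(tok.decode([logits[0, -1].argmax().item()]))  # next-token prediction after patching
```

If patching a component changes the prediction, that component is causally involved in producing the answer; ablating it (zeroing or replacing its output) asks the complementary question.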
Rather than finding a clean, algorithmic process, they discovered the opposite: each neuron encodes a simple rule, like “if the sum is around 70, prefer this digit.” Together, these neurons act like a bag of heuristics, loosely collaborating to produce the final answer.
This mechanism is fragile. When researchers selectively removed even one heuristic neuron, accuracy on arithmetic tasks dropped by up to 29%—a clear sign that these rules are essential, but brittle.
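To see how a handful of rules can work most of the time yet break when one is removed, here is a toy, invented illustration (it is not the paper’s actual neurons): a few narrow, range-based rules jointly answer two-digit additions, and deleting a single rule destroys accuracy on exactly the sums it covered.

```python
# Toy "bag of heuristics" (invented illustration, not the paper's neurons):
# each rule fires only on a narrow range of sums and votes for one piece of the answer.

heuristics = [
    # (name, firing condition, vote it contributes)
    ("ones_digit", lambda a, b: True,             lambda a, b: ("ones", (a + b) % 10)),
    ("sum_in_60s", lambda a, b: 60 <= a + b < 70, lambda a, b: ("tens", 6)),
    ("sum_in_70s", lambda a, b: 70 <= a + b < 80, lambda a, b: ("tens", 7)),
    ("sum_in_80s", lambda a, b: 80 <= a + b < 90, lambda a, b: ("tens", 8)),
]

def answer(a, b, active):
    # Combine the votes of every heuristic whose condition fires on this input.
    votes = dict(vote(a, b) for _, fires, vote in active if fires(a, b))
    return votes.get("tens", 0) * 10 + votes.get("ones", 0)

def accuracy(active, pairs):
    return sum(answer(a, b, active) == a + b for a, b in pairs) / len(pairs)

pairs = [(a, b) for a in range(30, 45) for b in range(30, 45)]   # sums between 60 and 88
ablated = [h for h in heuristics if h[0] != "sum_in_70s"]        # "remove one heuristic"

print("full bag:", accuracy(heuristics, pairs))       # 1.0 on the inputs the rules cover
print("one rule ablated:", accuracy(ablated, pairs))  # drops sharply on sums in the 70s
```

Nothing in this toy understands addition; it merely covers the inputs it was built for, which mirrors the paper’s explanation for why ablating a few neurons hurts so much.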
The models didn’t learn generalizable arithmetic procedures. Instead, they pieced together task-specific patterns—ways to match input-output examples seen during training.
“They’re not calculating; they’re imitating,” the authors note. “Each neuron applies a small heuristic, and the model layers them like duct tape until the result looks right.”
Heuristics emerged early in the training of the Pythia-6.9B model and solidified over time. Yet, even in later training stages, these heuristics didn’t evolve into robust algorithms.
This suggests a fundamental limitation: current LLMs lack an inductive bias toward algorithmic structure. They’re pattern recognizers, not mathematicians.
When an answer requires a rule the model hasn’t learned—or if multiple heuristics activate incorrectly—the model fails. This explains erratic errors in LLMs: they’re not bugs, but missing or misfiring heuristics.
This study provides crucial insight for policymakers, AI ethicists, and engineers. The fact that a model gets the right answer does not mean it understands the problem; its success may hinge on a few arbitrary neuron-level heuristics.
In high-stakes applications—like medical calculations, financial modeling, or education—this could lead to silent failures that users can't easily predict or detect.
Rather than expecting LLMs to "learn math" on their own, the authors suggest blending them with algorithmic components—for example, integrating calculators or symbolic computation modules.
Hybrid architectures could bring the best of both worlds: LLM flexibility with algorithmic accuracy.
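As a sense of what that hybrid could look like, here is a minimal routing sketch under stated assumptions: the generate_text() function is a hypothetical stand-in for whatever LLM is in use, and the regex only catches bare two-operand expressions. Arithmetic is computed deterministically; everything else falls through to the model.

```python
# Minimal sketch of a hybrid setup: arithmetic spans are delegated to ordinary code,
# and only non-arithmetic questions reach the language model. The generate_text()
# stub and the routing regex are assumptions for illustration, not a real framework.
import re

def generate_text(prompt: str) -> str:
    """Hypothetical wrapper around an LLM call (hosted or local)."""
    raise NotImplementedError

ARITH = re.compile(r"(-?\d+)\s*([+\-*/])\s*(-?\d+)")

def solve_arithmetic(m: re.Match) -> str:
    a, op, b = int(m.group(1)), m.group(2), int(m.group(3))
    result = {"+": a + b, "-": a - b, "*": a * b, "/": a / b if b else float("nan")}[op]
    return str(result)

def hybrid_answer(question: str) -> str:
    # Route: exact computation for arithmetic, the model for everything else.
    m = ARITH.search(question)
    return solve_arithmetic(m) if m else generate_text(question)

print(hybrid_answer("What is 226 + 68?"))  # -> 294, computed rather than predicted
```

Production systems do this with tool calling or code execution, but the principle is the same: let the model decide what to say and let deterministic code decide what the numbers are.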
For regulators, the takeaway is clear: models that appear competent may arrive at answers through opaque, ad hoc heuristics rather than principled procedures. Auditing such systems requires understanding their internal causal mechanisms, not just performance metrics.
For educators and technologists developing AI-powered learning tools, this study is a red flag. If a virtual tutor can’t reliably calculate, it shouldn’t be trusted to grade student math work or teach concepts.
Similarly, for the Global South—where LLMs are increasingly being explored for automated education and public services—this finding warns against over-reliance without robust validation.
The authors recommend several paths forward, among them the hybrid, tool-augmented designs described above and closer scrutiny of models’ internal mechanisms.
“This isn’t just a bug—it’s a feature of current LLMs,” says co-author Yonatan Belinkov. “If we want real reasoning, we need models that go beyond patterns.”
The study dismantles the myth that language models “understand” math. They don’t compute—they approximate. They don’t reason—they cobble together rules that worked before.
This doesn’t make them useless, but it does make them unreliable—especially in edge cases or unfamiliar problems. And that has major implications for how we build and use AI systems going forward.
Reference: Nikankin Y, Reusch A, Mueller A, Belinkov Y. Arithmetic Without Algorithms: Language Models Solve Math With a Bag of Heuristics. arXiv preprint arXiv:2410.21272. 2025 May 20. doi:10.48550/arXiv.2410.21272