HC Editorial Staff
22/05/2025
Language models are rapidly shaping how we interact with technology—from chatbots and legal assistants to tools in medicine and education. But one fundamental question persists: how do these models actually compute answers? While techniques like attention heatmaps and activation visualizations offer surface-level insight, they often miss the deeper, causal dynamics driving a model's response.
A new study by researchers at Anthropic, published on Transformer-Circuits.pub, introduces a novel method for mapping causal computation graphs inside large language models (LLMs). This method, called Circuit Tracing, offers a powerful framework for understanding how information flows through the layers of a model, potentially opening the door to auditable and explainable AI.
Despite advances in AI interpretability, most current methods fall short of showing how specific inputs translate into outputs via a clear, step-by-step process. The challenge is not just theoretical. If models are to be trusted in high-stakes applications—like finance, justice, or healthcare—they must be explainable, debuggable, and accountable.
The central research question posed by the authors is: Can we reconstruct a "causal attribution graph" that clearly maps the computational path a model takes to arrive at a given output?
To tackle this question, the research team constructed a simplified but interpretable proxy model called a Cross-Layer Transcoder (CLT). This neural network is trained to replicate the output of an LLM's multilayer perceptrons (MLPs), but with a crucial twist: its internal features (nodes) are designed to be understandable by humans.
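The CLT's training objective can be pictured in a few lines. Everything below is a toy stand-in rather than the authors' implementation: the dimensions, the plain-numpy gradient descent, and the L1 sparsity penalty are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_hidden, n_feat = 16, 64, 128

# Frozen toy MLP standing in for one layer of the original model.
W_in = rng.normal(size=(d_model, d_hidden)) / np.sqrt(d_model)
W_out = rng.normal(size=(d_hidden, d_model)) / np.sqrt(d_hidden)
mlp = lambda x: np.maximum(x @ W_in, 0.0) @ W_out

# Transcoder: linear encoder -> sparse ReLU features -> linear decoder.
W_enc = rng.normal(size=(d_model, n_feat)) * 0.1
W_dec = rng.normal(size=(n_feat, d_model)) * 0.1

def loss_and_grads(x, y, lam=1e-3):
    pre = x @ W_enc
    f = np.maximum(pre, 0.0)          # sparse, human-inspectable features
    pred = f @ W_dec                  # reconstruction of the MLP's output
    err = pred - y
    loss = (err ** 2).mean() + lam * np.abs(f).mean()
    g_pred = 2 * err / err.size
    g_dec = f.T @ g_pred
    g_f = g_pred @ W_dec.T + lam * (f > 0) / f.size
    g_enc = x.T @ (g_f * (pre > 0))
    return loss, g_enc, g_dec

# Train the transcoder to mimic the frozen MLP on random inputs.
x = rng.normal(size=(256, d_model))
y = mlp(x)
lr, history = 0.05, []
for _ in range(500):
    loss, g_enc, g_dec = loss_and_grads(x, y)
    history.append(loss)
    W_enc -= lr * g_enc
    W_dec -= lr * g_dec

print(history[0], history[-1])
```

The sparsity penalty is what pushes the learned features toward being individually interpretable: each input activates only a handful of them.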
Here's how the method unfolds:
1. Train a CLT whose human-interpretable features reproduce the outputs of the LLM's MLP layers.
2. Substitute the CLT into the model and trace how activation flows between features on a given prompt, yielding a causal attribution graph.
3. Validate the graph by intervening on individual nodes (inhibiting or amplifying them) and checking that the model's output changes as the graph predicts.
4. Score how faithfully the reconstructed graph captures the original model's behavior.
In tasks like addition or acronym decoding, the CLT discovered nodes that align with recognizable cognitive functions: recognizing digit placement, tracking capitalization, or mapping characters. These nodes could be grouped into "supernodes" that represent modular sub-tasks, such as processing tens vs. units in a numerical operation.
"Inhibiting a single node corresponding to the unit digit turned the model's answer from 95 to 98," the authors report—clear evidence of causal influence.
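That kind of causal test is easy to picture in miniature. The feature vector, readout matrix, and token vocabulary below are invented for illustration; the point is the procedure itself: zero out one feature, re-run the forward pass, and see whether the answer flips.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: 8 transcoder features feeding a readout over 10 tokens.
n_feat, n_tokens = 8, 10
features = rng.uniform(0.0, 1.0, size=n_feat)   # activations on some prompt
W_read = rng.normal(size=(n_feat, n_tokens))

def predict(f):
    return int(np.argmax(f @ W_read))           # predicted token id

baseline = predict(features)

# Ablate (inhibit) each feature in turn; record which ones flip the answer.
causal = [j for j in range(n_feat)
          if predict(np.where(np.arange(n_feat) == j, 0.0, features)) != baseline]

print(baseline, causal)
```

Features whose ablation changes the prediction are causally implicated in the output, which is exactly the evidence the authors report for the unit-digit node.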
For arithmetic prompts, the graphs revealed distinct paths for processing tens and units, often passing through lookup-like nodes that serve as internal references. These pathways resemble modular computation circuits, suggesting that LLMs may build reusable routines to solve structured tasks.
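A pathway like the one described might be represented as a weighted graph of labeled features. The node names and attribution weights below are all invented for illustration (loosely following the 36+59 → 95 shape of the paper's arithmetic examples); the search simply follows the chain of strongest attributions from prompt to output.

```python
# Toy attribution graph: nodes are labeled features, edge weights are
# attribution strengths.  All values here are illustrative, not measured.
edges = {
    "prompt:36+59": [("units:6+9", 0.9), ("tens:30+50", 0.8)],
    "units:6+9":    [("lookup:sum15", 0.95)],
    "tens:30+50":   [("lookup:sum80", 0.9)],
    "lookup:sum15": [("carry:1", 0.7), ("output:5", 0.85)],
    "carry:1":      [("output:9_tens", 0.8)],
    "lookup:sum80": [("output:9_tens", 0.75)],
}

def strongest_path(node, target, w=1.0):
    """Depth-first search for the highest-weight chain of attributions."""
    if node == target:
        return w, [node]
    best = (0.0, [])
    for nxt, ew in edges.get(node, []):
        score, path = strongest_path(nxt, target, w * ew)
        if score > best[0]:
            best = (score, [node] + path)
    return best

score, path = strongest_path("prompt:36+59", "output:9_tens")
print(path)  # -> ['prompt:36+59', 'tens:30+50', 'lookup:sum80', 'output:9_tens']
```

Separate units and tens branches converging through lookup-like nodes is the modular structure the graphs revealed.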
The team introduced three key metrics to quantify how faithfully the attribution graphs reflect the original model's computation. On simple tasks, all three metrics performed well. But fidelity decreased on more complex or out-of-distribution prompts, where the CLT's approximations struggled to capture the full behavior of the original model.
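One plausible fidelity-style score, sketched here with synthetic predictions (this article does not name the paper's own three metrics, so this is an assumption about their flavor, not a reproduction of them), is simply how often the replacement model reproduces the original model's top prediction.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical next-token predictions on 200 prompts: the original model
# vs. the CLT-based replacement model (all values invented for illustration).
original = rng.integers(0, 50, size=200)
replacement = original.copy()
flip = rng.random(200) < 0.15             # replacement disagrees ~15% of the time
replacement[flip] = rng.integers(0, 50, size=flip.sum())

# Fidelity as top-prediction agreement between the two models.
fidelity = float((original == replacement).mean())
print(round(fidelity, 3))
```

A score like this dropping on out-of-distribution prompts is what "fidelity decreased" means in practice: the proxy stops predicting what the real model does.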
This framework could become the backbone for AI auditing tools. Regulatory bodies or developers could use attribution graphs to verify that a model's decisions stem from logical computation, not spurious correlations or hidden biases.
"Graph tracing can expose whether a model's answer reflects real reasoning or superficial pattern matching," explains lead author Joshua Batson.
In fields like healthcare or criminal justice, showing that a model relies on identifiable patterns rather than black-box correlations is essential. Tools like Circuit Tracing make it possible to intervene, test, and correct AI behaviors before they reach the user.
The authors suggest several directions for future work, chief among them improving the CLT's fidelity on the complex, out-of-distribution prompts where it currently falls short.
These efforts align with a growing movement to move from interpretability to inspectability—building models that can be rigorously analyzed, modified, and explained.
Circuit Tracing represents a major step forward in AI transparency. By reconstructing and validating causal graphs inside language models, researchers are moving from vague visualizations to structured, testable mechanisms. This approach not only advances the science of interpretability but lays the groundwork for trustworthy, verifiable AI systems.
As AI becomes more embedded in our daily lives, understanding how and why it arrives at its decisions will be just as important as the decisions themselves. Tools like the CLT and causal graphs are showing us that the mind of a model—once considered inscrutable—can, in fact, be mapped.
Reference: Ameisen E, Lindsey J, Pearce A, et al. Circuit Tracing: Revealing Computational Graphs in Language Models. Transformer-Circuits.pub. 2025 Mar 27. Available from: https://transformer-circuits.pub/2025/attribution-graphs/methods.html