Mechanistic interpretability seeks to reverse-engineer the internal logic of neural networks by uncovering human-understandable circuits, algorithms, and causal structures that drive model behavior. Unlike post hoc explanations that describe what models do, this paradigm focuses on why and how they compute, tracing information flow through neurons, attention heads, and activation pathways. This survey provides a high-level synthesis of the field—highlighting its motivation, conceptual foundations, and methodological taxonomy rather than enumerating individual techniques. We organize mechanistic interpretability across three abstraction layers—neurons, circuits, and algorithms—and three evaluation perspectives: behavioral, counterfactual, and causal. We further discuss representative approaches and toolchains that enable structural analysis of modern AI systems, outlining how mechanistic interpretability bridges theoretical insights with practical transparency. Despite rapid progress, challenges persist in scaling these analyses to frontier models, resolving polysemantic representations, and establishing standardized causal benchmarks. By connecting historical evolution, current methodologies, and emerging research directions, this survey aims to provide an integrative framework for understanding how mechanistic interpretability can support transparency, reliability, and governance in large-scale AI.
Interpretability is not one technique but a family of goals and access patterns. The figure below contrasts three paradigms: post hoc methods that explain a trained black box from the outside (feature attribution, saliency, surrogates), explainable-by-design models whose structure is intended to be read directly (trees, rules, symbolic components), and mechanistic interpretability—often grouped under post hoc in surveys, but requiring white-box access to recover internal structure such as circuits, subspaces, and causal pathways.

Figure 1: Comparison of the three paradigms by access (black-box vs. white-box), source of explanation (external vs. inherent vs. internal), example techniques, and the trade-off between causal insight and scalability.
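To make the access distinction concrete, here is a minimal PyTorch sketch; the toy model and layer names are illustrative assumptions, not from the survey. A post hoc method sees only the input-output map, while a mechanistic analysis reads hidden activations directly, e.g., via a forward hook.

```python
# Toy contrast between black-box and white-box access.
# ToyMLP and its layer sizes are illustrative assumptions, not from the survey.
import torch
import torch.nn as nn

class ToyMLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.hidden = nn.Linear(4, 8)
        self.out = nn.Linear(8, 2)

    def forward(self, x):
        return self.out(torch.relu(self.hidden(x)))

model = ToyMLP()
x = torch.randn(1, 4)

# Black-box access: post hoc methods see only the input-output map.
logits = model(x)

# White-box access: mechanistic methods read internal state directly,
# here via a standard PyTorch forward hook on the hidden layer.
activations = {}
def save_activation(module, inputs, output):
    activations["hidden"] = output.detach()

handle = model.hidden.register_forward_hook(save_activation)
model(x)
handle.remove()
print(activations["hidden"].shape)  # torch.Size([1, 8])
```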
The ACM Computing Surveys article synthesizes mechanistic interpretability at three abstraction layers—neurons, circuits, and algorithms—and discusses evaluation from behavioral, counterfactual, and causal angles. That structure helps readers move from local probes (what activates?) to compositional hypotheses (which subgraph implements a subtask?) without losing sight of validation discipline.
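As a toy illustration of the lowest abstraction layer, here is a hedged sketch of a local probe that asks "what activates?" by ranking inputs according to a single unit's activation. The random model, data, and unit index are placeholders for illustration, not drawn from the survey.

```python
# A local "what activates?" probe: rank inputs by a single unit's activation.
# The random model, data, and unit index are placeholders for illustration.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU())
inputs = torch.randn(100, 4)
unit = 3  # hypothetical neuron of interest

with torch.no_grad():
    acts = model(inputs)[:, unit]  # activation of one unit per input

top = torch.topk(acts, k=5)
print("top-activating input indices:", top.indices.tolist())
print("their activation values:", [round(v, 3) for v in top.values.tolist()])
```

In practice, the same move is repeated at higher layers of abstraction: from single units, to subgraphs of components, to hypotheses about the algorithm those subgraphs implement.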
Modern work combines activation patching, causal tracing, sparse autoencoders, and related toolchains to make internal features more legible, especially in transformers and LLMs. The survey emphasizes that progress must be weighed against open challenges: scaling to frontier models, resolving polysemantic units, and establishing standardized causal and robustness benchmarks so that claims about "what the model is doing" can be compared across studies.
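For intuition about activation patching, here is a minimal sketch on a toy network. The model, inputs, and patched site are assumptions chosen for brevity; real analyses patch specific transformer components (attention heads, MLP blocks) rather than an entire hidden layer. The sketch only illustrates the clean-cache, corrupted-run, patch-and-measure loop.

```python
# Activation patching on a toy model: cache a "clean" activation, splice it
# into a "corrupted" run, and check how much of the clean output it restores.
# The model, inputs, and patched site are illustrative assumptions; real work
# patches specific transformer components (attention heads, MLP blocks).
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
clean, corrupted = torch.randn(1, 4), torch.randn(1, 4)

# 1. Run on the clean input and cache the hidden activation.
cache = {}
def save(module, inputs, output):
    cache["h"] = output.detach()
handle = model[1].register_forward_hook(save)
clean_logits = model(clean)
handle.remove()

# 2. Run on the corrupted input, but overwrite the hidden activation
#    with the cached clean one (a forward hook may return a new output).
def patch(module, inputs, output):
    return cache["h"]
handle = model[1].register_forward_hook(patch)
patched_logits = model(corrupted)
handle.remove()

corrupted_logits = model(corrupted)

# 3. If patched_logits moves toward clean_logits, the patched site is
#    causally implicated in the behavior under study.
print("clean:    ", clean_logits)
print("corrupted:", corrupted_logits)
print("patched:  ", patched_logits)
```

In this toy version, patching the full hidden layer restores the clean output exactly; in real studies the loop is repeated per component so that only the sites whose patches recover the behavior are credited with a causal role.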
The definitive reference is the journal version on the ACM Digital Library (DOI 10.1145/3787104). For papers, libraries, and tutorials that track the field, see awesome-mechanistic-interpretability.