The Challenges of Superposition in LLMs
Published on August 5, 2025
Part I: Defining the Phenomenon
The internal workings of large language models (LLMs) present one of the most significant “black box” problems in modern science. As these models grow in capability, the need to understand their decision-making processes becomes paramount for safety, reliability, and scientific advancement. Central to this challenge is the concept of superposition, a fundamental principle of neural representation that is both a source of computational efficiency and a primary obstacle to mechanistic interpretability. This report provides a comprehensive analysis of superposition in LLMs, synthesizing foundational theory, seminal research, and the latest findings from the interpretability frontier. It will deconstruct the mechanisms of superposition, explore its systemic implications for model behavior, detail the tools used for its analysis, and outline the open problems that define the future of this critical research area.
Section 1: The Principle of Superposition in Neural Representations
At its core, superposition describes a representational strategy employed by neural networks to overcome inherent architectural limitations. It is a phenomenon where the abstract concepts a model has learned are not neatly compartmentalized but are instead compressed and overlapped within the network’s finite dimensional space. Understanding this principle is the first step toward reverse-engineering the complex algorithms embedded within LLMs. 🧠
1.1 Conceptual Framework: Representing More Features Than Dimensions
Superposition is formally defined as the phenomenon where a neural network represents a set of m features within a representational space of d dimensions, where the number of features is significantly greater than the number of dimensions, or m > d [1]. A “feature” can be considered any property of the input that is recognizable and meaningful, such as the semantic concept of a “cat,” the syntactic property of a word being a verb, or a specific pattern in source code [1]. When a model must learn to recognize millions of such features but has a limited number of neurons and dimensions in its hidden layers, it cannot afford to dedicate a unique neuron to each feature. Superposition is the network’s solution to this storage problem [2]. It is a fundamental mechanism for achieving computational efficiency in large-scale networks, allowing them to pack an immense amount of information into a compact parameter space [3].
This process is best understood as a form of learned, lossy compression [2]. Unlike standard compression algorithms like ZIP or JPEG, which use explicit, predefined rules for packing and unpacking data, superposition is a compression scheme that the neural network discovers on its own through training. The network learns to pack features together in a way that preserves the information it needs to perform its task while fitting within its architectural constraints. This compression, however, comes at a cost: ”interference” between the overlapping features, which requires nonlinear filtering (typically via activation functions like ReLU) to resolve [5]. The central challenge for interpretability is that there is no clean, human-readable “decompression algorithm.” When multiple features are superimposed in the same neurons, separating them becomes a complex inference problem, akin to trying to extract individual items from a brilliantly space-efficient but unlabeled storage system [2].
1.2 The Duality of Superposition and Polysemanticity: Cause and Effect
The concepts of superposition and polysemanticity are deeply intertwined, representing the underlying mechanism and its observable consequence, respectively. While superposition describes the network’s strategy of overlapping feature representations in a high-dimensional space, polysemanticity is the manifestation of this strategy at the level of individual neurons. A polysemantic neuron is one that activates for a set of multiple, often conceptually unrelated, features [7]. For example, early interpretability work on the InceptionV1 image model found a single neuron that responded to cat faces, the fronts of cars, and cat legs—a classic example of polysemanticity [1].
Superposition is therefore the cause, and polysemanticity is the effect. Because features are stored in a superposed, overlapping fashion, any given neuron will participate in the representation of many different features. When one of these features is present in the input, the neuron will fire, making it appear polysemantic to an observer trying to assign it a single, ”monosemantic” meaning [4]. This distinction is not merely semantic; it is critical for guiding research. The central problem for interpretability is not polysemanticity itself, but the underlying superposition that causes it. Consequently, the most promising tools for making models interpretable, such as Sparse Autoencoders, are designed not just to identify polysemantic neurons but to resolve the underlying superposition by “decompressing” the features into a space where they are no longer overlapped.
1.3 Distinguishing Neural Superposition from Quantum Superposition: A Necessary Clarification
The term “superposition” is famously borrowed from quantum mechanics, and this has led to significant conceptual confusion and “semantic inflation” [12]. It is crucial to distinguish between the two concepts, as there is currently no theoretical or empirical justification for the assertion that LLMs exhibit emergent quantum mechanical behavior [13].
In quantum mechanics, superposition is a fundamental principle where a quantum system, such as a qubit, can exist in a linear combination of multiple distinct basis states (e.g., both 0 and 1) simultaneously [15]. This quantum parallelism, along with entanglement, allows quantum computers to process an exponential number of possibilities concurrently, providing their computational power [18]. When a quantum system is measured, its state “collapses” into one of the definite basis states with a certain probability [16].
In the context of neural networks and LLMs, “superposition” refers to the classical mathematical superposition principle, which simply describes a linear combination of vectors [14]. Features are represented as vectors (i.e., directions in the high-dimensional activation space of the network). When multiple features co-occur in an input, the network’s representation of this combination is the linear sum of the vectors corresponding to the individual features [21]. The expression y = sum(f_i * x_i) captures this principle, where y is the resulting activation vector, the f_i are the feature vectors, and the x_i are scalars indicating the presence or intensity of each feature. While some researchers have found it useful to employ quantum-inspired formalisms as a tool for modeling semantic spaces in LLMs, this is an analogy and does not imply the systems are physically quantum [13]. This report will strictly adhere to the classical, linear-algebraic definition of superposition as it applies to neural networks.
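To make this concrete, the following sketch (an illustrative toy setup in NumPy, not drawn from any particular model) builds an activation vector as a weighted sum of feature directions and then reads features back out with dot products, showing both the recovered signal and the small interference contributed by the other features:

```python
import numpy as np

rng = np.random.default_rng(0)

d, m = 64, 512                    # 64 dimensions, 512 features (m > d)
F = rng.normal(size=(m, d))
F /= np.linalg.norm(F, axis=1, keepdims=True)   # unit-norm feature directions f_i

x = np.zeros(m)
x[[3, 41, 200]] = [1.0, 0.5, 2.0]               # a sparse input: three active features

y = x @ F                                        # y = sum_i f_i * x_i (classical superposition)

print(y @ F[3])   # ~1.0 plus a little interference from the other active features
print(y @ F[7])   # an inactive feature still reads out a small, nonzero value
```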
1.4 Superposition within the Spectrum of Distributed Representations
The idea that concepts are represented by patterns of activity across many neurons, known as distributed representations, is a foundational principle of connectionist AI dating back to the 1980s [21]. Superposition is a modern, specific, and particularly efficient form of distributed representation. Research has helped delineate a spectrum of coding schemes, each with distinct trade-offs.
A key development in this area is the realization that two primary strategies for distributed representation, composition and superposition, are fundamentally in tension with one another [23]. Compositional codes excel at representing novel combinations of known features, which is key to generalization. For example, if a model has separate features for “green” and “square,” it can compose them to understand a “green square,” even if it has never seen one before, by generalizing from its knowledge of green circles and red squares. This allows for non-local generalization and is highly extensible. Superpositional codes, in contrast, excel at packing a massive number of unrelated features into a small dimensional space. They are maximally efficient for storage but limit the ability to compose features.
This tension reveals a deep organizing principle of neural representations. A representational space cannot be optimized for both maximal compositionality and maximal superposition simultaneously. The key factor that mediates this trade-off is feature sparsity [23]. When the features in the data are sparse—meaning that only a small number of them are active at any given time and complex combinations are rare—the network can employ a hybrid strategy. It can use a compositional code for the features it knows how to combine, and then use the “holes” or low-probability regions in this compositional space to store other, unrelated features in superposition. This explains how a network can be both generalizable and incredibly information-dense. It suggests that different layers or components of an LLM might adopt different representational strategies based on the statistical properties of the information they process.
The following table provides a structured comparison of these coding schemes, clarifying the unique properties and trade-offs of superposition.
Table 1: A Comparative Taxonomy of Neural Representations
Representation Type | Core Principle | Neuron Usage | Capacity | Generalization Capability | Interpretability | Key Trade-off |
---|---|---|---|---|---|---|
Local Code (“Grandmother Cell”) | One neuron represents one distinct concept. | Inefficient: requires one neuron per feature. | Low: n features require n neurons. | Poor: cannot represent novel combinations of features. | High: each neuron has a clear, singular meaning. | Simplicity vs. Scalability & Generalization. |
Compositional Code (Sparse Distributed) | Concepts are represented by the composition of independent, reusable features (e.g., color + shape). | More efficient than local codes. n features can represent exp(n) combinations [23]. | High (for combinations). | High: enables non-local generalization to unseen combinations. | Moderate: features are interpretable, but their combinations can be complex. | Generalization vs. Representational Density. |
Superposition (Dense Distributed) | More features are represented than neurons (m > d ) by mapping them to nearly-orthogonal directions in activation space. | Maximally efficient: can store exp(d) features in d dimensions if they are non-composing [23]. | Very High (for unrelated features). | Poor (for composition): limits the ability to combine features arbitrarily. | Low: neurons are highly polysemantic, requiring tools like SAEs to disentangle. | Density vs. Interference & Compositionality. |
Part II: Theoretical Foundations and Seminal Research
The emergence of superposition in neural networks is not an accident but a predictable outcome of training models with specific properties on data with particular statistical structures. Foundational research, particularly through the use of “toy models,” has illuminated the core mechanics of why and how superposition arises, revealing a phenomenon governed by principles of optimization, geometry, and information theory. ⚙️
Section 2: The Mechanics of Superposition: Why and How It Emerges
Superposition is a learned strategy that balances the need to represent a vast number of features against the constraints of a finite architecture. This balancing act is guided by several key factors.
2.1 The Critical Role of Feature Sparsity and Importance
The primary condition that makes superposition a viable and efficient representational strategy is feature sparsity [5]. Sparsity means that, for any given input, only a small subset of the total possible features are active or non-zero [1]. When features are dense, meaning many of them co-occur frequently, a model trained to compress and reconstruct them will typically learn an orthogonal basis for only the most important features and ignore the rest. The interference from representing many dense features simultaneously is too high to be worthwhile. However, when features are sparse, the probability of any two given features being active at the same time is low. This drastically reduces the expected interference, making it an acceptable cost for the benefit of representing many more features than the model has dimensions [6].
The network’s compression scheme is further guided by feature importance. In typical training setups, the model’s objective is to minimize a reconstruction loss, which is often weighted by how important each feature is for the model’s downstream task [25]. This means that features that are more critical for reducing the overall loss will be allocated more representational “fidelity.” The model will learn an embedding that represents these important features with minimal interference, potentially arranging their corresponding vectors to be more orthogonal to other features. In contrast, for less important features, the model may tolerate a higher degree of interference to save representational capacity [11].
Finally, non-linearity is a necessary condition for superposition to provide its compressive benefits. Non-linear activation functions, such as the Rectified Linear Unit (ReLU), are essential for filtering the interference that arises from overlapping features [1]. A network without non-linearities collapses into a single linear transformation; it has no way to filter interference, and so gains little from packing extra features into a lower-dimensional space via superposition. The ReLU function, f(x) = max(0, x), is particularly effective because it can clip negative interference to zero, effectively making some interference “free” under certain conditions (e.g., when only one feature in a superposed set is active) [25].
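A short continuation of the toy NumPy setup above (again purely illustrative, with an arbitrary bias of 0.1) shows how a ReLU read-out clips the negative part of this interference and suppresses much of the rest:

```python
import numpy as np

rng = np.random.default_rng(1)
d, m = 32, 128
F = rng.normal(size=(m, d))
F /= np.linalg.norm(F, axis=1, keepdims=True)

y = 1.0 * F[0]                      # only feature 0 is active

raw = F @ y - 0.1                   # linear read-out for every feature, with a small negative bias
filtered = np.maximum(raw, 0.0)     # ReLU: f(x) = max(0, x)

print(raw[:5])       # feature 0 reads ~0.9; the others read small positive or negative noise
print(filtered[:5])  # ReLU zeroes the negative noise, and the bias suppresses most of the rest
```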
2.2 Mathematical and Geometric Intuitions
The core mathematical idea behind superposition is the representation of features as nearly orthogonal directions in the high-dimensional vector space of neuron activations [6]. Perfect orthogonality would mean no interference, but would require one dimension per feature. By relaxing this constraint to “nearly orthogonal,” the network can fit more feature vectors into the space than there are dimensions. The “cost” is that the activation of one feature will have a small projection onto the directions of other features, creating the “noise” or “interference” that the network must learn to tolerate or filter out.
Remarkably, the process of learning these nearly orthogonal representations via gradient descent does not lead to messy or random arrangements. Seminal research from Anthropic revealed that these optimization problems converge to highly structured and often elegant geometric solutions. In their toy models, when embedding n features into a two-dimensional hidden space, the learned feature vectors (i.e., the columns of the encoding weight matrix) consistently arranged themselves into the vertices of a regular n-gon—a type of uniform polytope [11]. This surprising discovery demonstrated that the solutions found by neural networks are not arbitrary but are principled geometric configurations that optimally trade off representational fidelity and interference.
This phenomenon is deeply connected to established mathematical principles. From one perspective, superposition can be viewed as a learned version of a Bloom filter, a probabilistic data structure that efficiently tests whether an element is a member of a set, with a possibility of false positives but no false negatives [4]. From another, it is a practical application of the Johnson-Lindenstrauss Lemma, a result from high-dimensional geometry which states that a set of points in a high-dimensional space can be embedded into a much lower-dimensional space in such a way that the distances between the points are nearly preserved [4]. Superposition is, in effect, the network learning such a dimensionality-reducing projection on its own.
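A quick numerical check (an illustrative sketch, not taken from the cited sources) confirms the geometric fact these arguments rely on: random unit vectors in a d-dimensional space are nearly orthogonal, with typical pairwise overlaps shrinking roughly like 1/sqrt(d):

```python
import numpy as np

rng = np.random.default_rng(0)

for d in (64, 256, 1024, 4096):
    V = rng.normal(size=(2000, d))                  # 2,000 random "feature" directions
    V /= np.linalg.norm(V, axis=1, keepdims=True)
    G = V @ V.T                                     # all pairwise dot products (cosines)
    off_diag = np.abs(G[~np.eye(len(G), dtype=bool)])
    print(d, off_diag.mean())                       # mean overlap shrinks roughly like 1/sqrt(d)
```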
2.3 Incidental vs. Necessary Polysemanticity: Two Origins of a Singular Phenomenon
While polysemanticity is the direct result of superposition, research has uncovered two distinct causal pathways for its emergence, with profoundly different implications for interpretability and AI safety.
The classic origin story is that of necessary polysemanticity. This view holds that polysemanticity is an unavoidable consequence of model capacity limitations. When a network is tasked with representing more features than it has neurons (m > d), it is mathematically forced to co-allocate multiple, unrelated features to the same neurons in order to minimize its loss function [7]. This perspective suggests that polysemanticity is a problem of insufficient resources.
However, a more recent and non-mutually exclusive theory introduces incidental polysemanticity [7]. This research shows that polysemantic neurons can arise even when there are ample neurons to represent all features distinctly (m < d). This “incidental” emergence is a product of the training dynamics themselves. It begins with random weight initialization, which can, by chance alone, make a single neuron slightly more correlated with two or more unrelated features than any other neuron. During training, dynamics such as those encouraged by L1 regularization or simply the nature of gradient descent can create a “winner-take-all” effect, where the neuron that starts with this slight random advantage gets progressively strengthened for all of those features, ultimately “winning” the responsibility for representing them and becoming polysemantic [29].
The distinction between these two origins is not merely academic; it has critical implications for how we might solve the problem of polysemanticity. The classic view of necessary polysemanticity suggests a brute-force, if expensive, solution: simply increase the model’s width until the number of neurons is sufficient to represent each feature monosemantically. However, the theory of incidental polysemanticity reveals the inadequacy of this approach. Mathematical analysis suggests that these incidental “collisions” of features onto single neurons can be expected to persist until the number of neurons is significantly larger than the square of the number of features [29]. This quadratic relationship makes the “just add more neurons” strategy computationally infeasible for any system with a realistically large number of features.
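A rough birthday-paradox simulation illustrates why widening helps so slowly. This is a deliberate caricature, assuming each feature simply "claims" a uniformly random winning neuron at initialization, which is not the actual analysis in [29], but it reproduces the quadratic flavor of the bound:

```python
import numpy as np

rng = np.random.default_rng(0)
n_features = 100

for n_neurons in (200, 1_000, 10_000, 100_000):     # note: n_features**2 == 10_000
    collisions = []
    for _ in range(500):
        winners = rng.integers(0, n_neurons, size=n_features)  # each feature claims a random neuron
        collisions.append(n_features - len(np.unique(winners)))
    # Expected collisions ~ n_features**2 / (2 * n_neurons): still ~0.5 when neurons == features**2,
    # so collisions only become rare once the neuron count far exceeds the square of the feature count.
    print(n_neurons, np.mean(collisions))
```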
This reframes polysemanticity from being a static architectural problem to being a dynamic, path-dependent outcome of the training process. It implies that robust solutions will likely not come from architectural scaling alone. Instead, they may require novel training methodologies designed to “nudge” the learning trajectory away from these incidental collisions. Such methods could include new regularization techniques, carefully designed noise injections during training, or alternative weight initialization schemes that discourage the initial random correlations that seed incidental polysemanticity [9]. This is a crucial consideration for AI safety, as it suggests that even extremely large, seemingly overparameterized models are not inherently immune to this phenomenon that fundamentally hinders our ability to interpret them [29].
Section 3: Anthropic’s Toy Models: A Foundational Research Paradigm
Much of the contemporary understanding of superposition is built upon a series of foundational papers from the research lab Anthropic. They pioneered the use of “toy models”—small, simplified neural networks trained on synthetic data—to study complex phenomena in a controlled and fully understandable environment [5]. This research paradigm has been instrumental in moving superposition from a vague hypothesis to a concrete, experimentally verifiable theory. 🔬
3.1 Deconstructing the Toy Model: Synthetic Data and Experimental Setup
The core experimental setup in this line of research involves a simple autoencoder model [25]. The task is for the network to take a sparse, high-dimensional input vector representing a set of features, compress it through a linear layer into a low-dimensional hidden space (the “bottleneck”), and then decompress it using another linear layer and a ReLU activation to reconstruct the original input vector as accurately as possible [5].
The mathematical formulation of the model is straightforward. Given an input feature vector x with n dimensions, a hidden dimension d < n, an encoding weight matrix W_enc of size d x n, and a decoding weight matrix W_dec of size n x d, the reconstruction x_hat is given by:

x_hat = ReLU(W_dec * (W_enc * x) + b)

The model’s loss function L is the mean squared error between the input x_i and the reconstruction x_hat_i, weighted by a vector of feature importances I. It can be expressed as L = average(I * (x_i - x_hat_i)^2) [25]. By systematically varying the key parameters of this setup—such as the input dimension n, the hidden dimension d, the sparsity of the input vectors x, and the distribution of feature importances I—researchers can create a controlled environment to observe precisely when and how the model chooses to represent features in superposition [25].
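This setup translates almost line for line into PyTorch. The sketch below is a minimal illustration; the dimensions, sparsity level, importance decay, and optimizer settings are arbitrary choices for the example, not the values used in [25]:

```python
import torch

n, d = 20, 5                                   # n features, d hidden dimensions (d < n)
sparsity = 0.95                                # probability that any given feature is zero
importance = 0.9 ** torch.arange(n, dtype=torch.float32)   # decaying feature importances I

W_enc = torch.nn.Parameter(torch.randn(d, n) * 0.1)        # encoding matrix, d x n
W_dec = torch.nn.Parameter(torch.randn(n, d) * 0.1)        # decoding matrix, n x d
b = torch.nn.Parameter(torch.zeros(n))
opt = torch.optim.Adam([W_enc, W_dec, b], lr=1e-3)

for step in range(10_000):
    # Synthetic sparse features: uniform in [0, 1], zeroed out with probability `sparsity`.
    x = torch.rand(1024, n) * (torch.rand(1024, n) > sparsity)
    x_hat = torch.relu(x @ W_enc.T @ W_dec.T + b)           # x_hat = ReLU(W_dec (W_enc x) + b)
    loss = (importance * (x - x_hat) ** 2).mean()           # importance-weighted reconstruction error
    opt.zero_grad(); loss.backward(); opt.step()

# After training, the columns of W_enc (one per feature) can be inspected for superposition geometry.
```

Sweeping the sparsity level and inspecting the learned columns of W_enc is the kind of experiment that exposes the phase transition and polytope geometry described in the next subsection.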
3.2 Key Findings: Phase Transitions and the Geometry of Feature Embedding
This toy model paradigm led to several landmark findings. The most significant was the discovery of a sharp phase transition between two distinct representational regimes [3]. When input features were dense, the model learned to represent only the few most important features, mapping them to an orthogonal basis in the hidden space. However, as the sparsity of the input features was increased past a critical point, the model underwent a phase transition and began to store many more features in superposition, accepting a small amount of interference as a trade-off for higher representational capacity.
As described previously, this research also uncovered the surprising geometric regularity of these superposition solutions. The learned weight matrices were not random but formed the vertices of uniform polytopes, such as a pentagon for five features in two dimensions [26]. This provided a clear, visualizable demonstration of the precise, optimal solutions that gradient descent finds to solve the superposition problem.
3.3 Computation in Superposition: Beyond Simple Representation
While the initial work focused on representational superposition—the passive storage of information being passed through a bottleneck—a key frontier of research is computational superposition [4]. This is the question of whether models can actively perform computations on features while they are still in their compressed, superposed state [4].
Early theoretical work has shown that this is possible and highly efficient. Researchers have presented explicit, provably correct constructions that allow a single-layer network to compute functions like the pairwise logical AND of many input features while those features are represented in superposition [4]. These constructions demonstrate an exponential efficiency gap compared to performing the same computation on non-superposed features. More recent analysis of trained models on similar tasks has found that they learn dense, binary-weighted circuits that approximate these theoretical constructions, suggesting that superposition is not merely a static storage mechanism but can function as a dynamic computational substrate within neural networks [32]. This opens up the possibility that complex algorithms within LLMs are being executed directly on these compressed, superposed representations.
The following table summarizes the foundational papers that have defined the field of superposition research, tracing the evolution of key ideas and findings.
Table 2: Foundational Papers in Superposition Research
Paper Title & Link | Primary Contribution | Model(s) / Task | Core Finding |
---|---|---|---|
Toy Models of Superposition [5] | Introduced the toy model paradigm and provided the first concrete evidence of superposition, linking it to feature sparsity and polysemanticity. | Small ReLU autoencoders on synthetic sparse data. | Demonstrated a phase transition to superposition driven by feature sparsity and revealed the emergence of uniform polytope geometry in learned weights. |
Superposition, Memorization, and Double Descent [33] | Connected superposition to the phenomena of memorization, generalization, and double descent. | Simple neural networks on limited datasets. | Proposed that models store features in superposition to generalize and data points in superposition to memorize, with double descent marking the transition between these regimes. |
Knowledge in Superposition [35] | Identified knowledge superposition as the fundamental reason for the failure of lifelong knowledge editing in LLMs. | Mathematical analysis and experiments on GPT-2, Llama-2, Pythia, etc. | Showed that editing introduces an “interference term” due to non-orthogonal (superposed) knowledge representations, which accumulates and causes catastrophic forgetting. |
Sparse Autoencoders Find Highly Interpretable Features [39] | Demonstrated that Sparse Autoencoders (SAEs) can successfully disentangle superposed features in language models into monosemantic, interpretable components. | SAEs trained on LLM activations. | Showed that SAEs can find interpretable and less polysemantic features, providing a powerful tool for mechanistic interpretability. |
Sparse Autoencoders Reveal Universal Feature Spaces [41] | Investigated the universality of features across different LLMs using SAEs and representational similarity analysis. | SAEs trained on Pythia and Gemma model families. | Provided evidence for “Analogous Feature Universality,” suggesting that the spaces spanned by SAE features are similar across models, especially in middle layers. |
Part III: System-Level Implications of Superposition
The existence of superposition is not a niche academic curiosity; it has profound and far-reaching consequences for the most fundamental behaviors of large language models. It offers a mechanistic explanation for why models scale as they do, provides a new lens through which to understand the relationship between memorization and generalization, and presents a formidable challenge to the long-term maintenance and reliability of AI systems, particularly in the context of knowledge editing.
Section 4: Superposition, Scaling, and Generalization
Three of the most important empirical phenomena in deep learning are neural scaling laws, the trade-off between memorization and generalization, and double descent. Superposition provides a single, coherent, mechanistic hypothesis that links all three, suggesting it may be a fundamental principle governing how deep learning models work.
4.1 A Mechanistic Hypothesis for Neural Scaling Laws
The remarkable success of modern LLMs is built on neural scaling laws: the empirical observation that model performance (measured by loss) improves predictably as a power-law function of increases in model size, dataset size, or computational budget [44]. While these laws are used to train models like Chinchilla optimally, their theoretical origin has remained largely unclear [44].
A compelling new theory posits that superposition is a key mechanism underlying these scaling laws [44]. Researchers constructed a toy model based on two empirical principles: (1) LLMs represent more features than they have dimensions (i.e., they use superposition), and (2) concepts in language occur with varying frequencies (e.g., following a power law like Zipf’s law). This model was able to quantitatively reproduce the observed power-law scaling of loss with model size. The proposed geometric explanation is elegant: as more feature vectors are packed into a fixed-dimensional space, the average interference between them (measured by the squared dot product of their vector representations) scales inversely with the dimension of that space. If this interference is a primary contributor to the model’s loss, then the loss itself will decrease as the model dimension (i.e., width) increases [44]. Crucially, analysis of four families of open-source LLMs found that they exhibit the characteristics of ”strong superposition” predicted by the toy model, suggesting this mechanism is active in real-world systems and provides a plausible origin for the observed scaling laws [44].
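The central geometric claim, that average interference measured by the squared dot product between feature directions falls off as 1/d, is easy to verify numerically for randomly packed unit vectors (an illustrative sketch, not the actual toy model of [44]):

```python
import numpy as np

rng = np.random.default_rng(0)
n_features = 4000

for d in (128, 256, 512, 1024):
    W = rng.normal(size=(n_features, d))
    W /= np.linalg.norm(W, axis=1, keepdims=True)
    sq_overlaps = (W @ W.T) ** 2                              # squared dot products between all pairs
    mean_interference = sq_overlaps[~np.eye(n_features, dtype=bool)].mean()
    print(d, mean_interference, 1 / d)                        # mean squared overlap tracks 1/d
```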
4.2 The Duality of Memorization and Generalization: Storing Datapoints vs. Features
The classic dichotomy in machine learning is between memorization (overfitting to the training data) and generalization (learning patterns that apply to unseen data). Superposition offers a new, mechanistic perspective on this duality [33]. The hypothesis is that these two modes of learning correspond to the model using superposition to store two different kinds of information.
In the generalizing regime, the model uses superposition to store abstract, reusable features. For example, instead of memorizing thousands of sentences about cats, it learns a single, compressed representation for the concept of a cat, which can be efficiently combined with other features. This is an efficient way to learn the general patterns and structures present in the data distribution [33].
In the memorizing regime, which characterizes overfitting, the model uses superposition to store specific data points. The naive idea of a model dedicating one neuron to each training example it wishes to memorize is incredibly inefficient. However, the hypothesis that it stores these mutually exclusive data points in superposition is a far more plausible mechanism [33]. This allows the model to create a highly compressed “lookup table” that can perfectly reproduce answers for training examples it has seen before, but which fails to generalize to new inputs.
4.3 The Link to Double Descent: Transitioning Between Regimes
This duality provides a mechanistic explanation for the puzzling phenomenon of double descent. In many deep learning settings, as the size of the model or dataset is increased, the test loss doesn’t decrease monotonically. Instead, it first decreases (in the underparameterized regime), then increases to a peak around the ”interpolation threshold” (where the model has just enough capacity to fit the training data), and then decreases again (in the overparameterized regime) [26].
The superposition framework explains this behavior as the result of the model transitioning between its two representational strategies [27].
- Small Data / Model Regime (First Descent): The model is in the generalizing regime, learning a small number of important features in superposition.
- Interpolation Threshold (The “Bump”): As the model gains just enough capacity to memorize the training set, it begins to transition from a feature-storing strategy to a data-point-storing strategy. This intermediate phase is representationally messy. The model is trying to do both at once, and does neither well. The dimensionality allocated to both features and data points is low, and the resulting confusion and interference cause the test loss to increase [34].
- Large Data / Model Regime (Second Descent): The model has fully transitioned into its overparameterized mode. In the case of increasing data, it settles back into a more robust generalizing regime, using its large capacity to store many features cleanly in superposition. In the case of increasing model size on a fixed dataset, it becomes very effective at memorizing data points in superposition.
This framework elevates superposition from an interpretability-specific concept to a potential unifying theory for some of the most fundamental empirical observations in deep learning. It suggests that understanding how models scale, why they generalize, and the dynamics of their training curves may be inseparable from understanding the principles of superposition.
Section 5: The Challenge of Knowledge Editing in Superposed Systems
While superposition provides benefits for computational efficiency and generalization, it creates a formidable challenge for the long-term maintenance and reliability of LLMs. This is most apparent in the domain of knowledge editing, the task of updating or correcting specific facts within a trained model without resorting to a full, costly retrain.
5.1 The Interference Term: Why Lifelong Editing Fails
A significant body of research has focused on developing methods to perform single, targeted edits on LLMs (e.g., changing the model’s belief from “The Eiffel Tower is in Rome” to “The Eiffel Tower is in Paris”). Methods like ROME and MEMIT have shown success in this single-edit setting [37]. However, these methods fail catastrophically when applied sequentially in a lifelong editing scenario, where the model must be continuously updated over time. After just a few dozen or a hundred edits, the model’s overall performance degrades severely, and it begins to forget both the original knowledge and previously edited facts [35].
Recent theoretical work has identified knowledge superposition as the “fundamental reason for the failure of lifelong editing” [35]. Rigorous mathematical derivations of the update rules used in popular editing methods reveal that each edit introduces an interference term [35]. This term represents the unintended, collateral impact of an edit on all other knowledge representations stored in the model.
The magnitude of this interference is directly determined by the degree of superposition. Lossless knowledge editing, where an edit affects only the target fact, is only theoretically possible if all knowledge representations in the model are perfectly orthogonal to one another [35]. Since superposition, by its very definition, stores features and knowledge in nearly-but-not-quite-orthogonal directions, a non-zero interference term is inevitable. With each subsequent edit, this interference accumulates, leading to a linear increase in model degradation and, eventually, catastrophic forgetting of both old and new knowledge [35].
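A small linear-algebra sketch, a deliberately simplified stand-in for the derivations in [35], shows where the interference term comes from: a rank-one edit that rewrites the output for one key also shifts the output for every key that is not orthogonal to it:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 256

W = rng.normal(size=(d, d)) * 0.02               # stand-in "knowledge" weight matrix
k1, k2 = rng.normal(size=d), rng.normal(size=d)  # two stored "facts" (keys)
k1 /= np.linalg.norm(k1); k2 /= np.linalg.norm(k2)

v_new = rng.normal(size=d)                       # desired new output for key k1
delta = np.outer(v_new - W @ k1, k1)             # rank-one edit so that (W + delta) @ k1 == v_new
W_edited = W + delta

print(np.linalg.norm(W_edited @ k1 - v_new))     # ~0: the target fact is updated exactly
print(np.linalg.norm(W_edited @ k2 - W @ k2))    # nonzero: collateral shift, proportional to |k1 . k2|
print(abs(k1 @ k2))                              # the interference vanishes only if k1 and k2 are orthogonal
```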
5.2 Theoretical and Empirical Evidence of Universal Knowledge Superposition
To confirm this theoretical link, researchers have conducted extensive empirical investigations across a wide range of LLM families, including GPT-2, Llama-2, Pythia, and GPT-J [35]. These experiments have consistently found that knowledge superposition is a universal and widespread phenomenon in real-world language models [35].
Statistical analysis of the interference term across model layers reveals a distribution with high kurtosis and heavy tails [35]. This indicates that while the model attempts to store knowledge orthogonally (the peak of the distribution is near zero), its finite capacity forces it to resort to storing most knowledge in nearly-orthogonal, superposed directions.
This line of research has also uncovered scaling laws for knowledge superposition. As models become larger, the measured degree of superposition tends to decrease [35]. This suggests that larger models, with their greater capacity, are better able to organize knowledge in a more orthogonal, less entangled manner, which correlates directly with their improved performance on knowledge-intensive tasks.
5.3 Emerging Mitigation Strategies: Towards Lossless Knowledge Editing
The identification of superposition as the root cause of editing failure has catalyzed a new wave of research focused on developing mitigation strategies [35]. While no perfect solution currently exists, several promising directions are emerging that aim to make knowledge editing more robust and scalable.
- Orthogonal Subspace Editing (O-Edit): This proposed algorithm directly tackles the interference problem by attempting to orthogonalize the direction of each knowledge update relative to previous ones. By ensuring each new edit is made in a subspace that is orthogonal to past edits, it aims to minimize the cumulative interference [46].
- Adaptive Smoothing (OVERTONE): This technique addresses a related problem called “heterogeneous token overfitting” (HTO), where the model overfits to certain tokens in the edited fact more rigidly than others. OVERTONE is a token-level smoothing method that adaptively refines the target distribution during editing, encouraging the model to learn the new fact more flexibly and reducing overfitting [48].
- Fine-grained Neuron-level Editing (FiNE): Instead of applying broad parameter updates, this method seeks to improve editing locality. It uses causal tracing to identify the specific, individual neurons within the feed-forward networks that are most responsible for recalling a piece of knowledge and then modifies only those neurons, minimizing collateral damage to unrelated knowledge [50].
- Superposition-Aware Architectures: A more radical approach involves designing architectures that explicitly manage superposition. One novel method uses autoencoders to superimpose the hidden states of a base model and a fine-tuned model within a shared parameter space. This allows the model to dynamically switch between the base and specialized knowledge states, effectively mitigating catastrophic forgetting by keeping the knowledge domains reconstructable but separate [51].
Part IV: Deconstructing Superposition: Tools and Techniques
The abstract nature of superposition necessitates specialized tools and methodologies to make it concrete and analyzable. The field of mechanistic interpretability has developed a sophisticated toolkit for this purpose, with Sparse Autoencoders (SAEs) emerging as the primary instrument for deconstructing superposed representations. This is supported by a growing ecosystem of open-source libraries and collaborative platforms that are accelerating research. 🧩
Section 6: Sparse Autoencoders (SAEs) as the Primary Interpretability Tool
Sparse Autoencoders are the cornerstone of modern efforts to resolve superposition. They function as a kind of computational microscope, allowing researchers to take the dense, tangled activations from an LLM and disentangle them into a sparse set of meaningful, monosemantic features.
6.1 Architectural Deep Dive: The Encoder-Decoder and Sparsity Constraints
An SAE is not part of the original LLM architecture; it is a separate, smaller neural network that is trained post-hoc on the internal activations of a pre-trained LLM [53]. For example, an SAE might be trained on the output of the residual stream of a specific transformer layer.
The core architecture of an SAE is simple, consisting of two main components [53]:
- An Encoder: This component takes a dense activation vector from the LLM (e.g., of dimension d_model = 768) and projects it into a much higher-dimensional latent space (e.g., of dimension d_sae = 24,576). This is typically implemented as a single linear layer followed by a ReLU activation function.
- A Decoder: This component takes the sparse, high-dimensional representation from the latent space and projects it back down, attempting to reconstruct the original LLM activation vector. This is typically a single linear layer.
The crucial element that makes this architecture work is a strong sparsity constraint applied to the SAE’s hidden (latent) layer. This constraint forces most of the neurons in the latent space to be zero for any given input. The goal is to learn a representation where the LLM’s complex activation can be explained as a linear combination of just a few active features from a very large dictionary of possible features. This sparsity can be enforced in several ways, most commonly through an L1 penalty on the latent activations added to the loss function, or more directly using a Top-K activation function that explicitly sets all but the k largest activations to zero [53]. It is this sparsity that compels the SAE to learn a “disentangled” or “monosemantic” feature basis, where each dimension in its latent space corresponds to a single, interpretable concept [10].
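The following PyTorch sketch shows the basic shape of such an SAE. It is a minimal illustration of the architecture described above, with illustrative dimensions; production implementations (for example, those in SAELens) add details such as decoder-weight normalization, bias handling, and resampling of dead features:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE: dense LLM activation -> sparse latent features -> reconstruction."""

    def __init__(self, d_model: int = 768, d_sae: int = 24_576):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_sae)   # projects up into the overcomplete dictionary
        self.decoder = nn.Linear(d_sae, d_model)   # columns of decoder.weight are the learned features

    def forward(self, activation: torch.Tensor):
        features = torch.relu(self.encoder(activation))   # sparse, non-negative feature activations
        reconstruction = self.decoder(features)
        return reconstruction, features


sae = SparseAutoencoder()
llm_activation = torch.randn(32, 768)     # stand-in for a batch of residual-stream vectors
recon, feats = sae(llm_activation)
print(recon.shape, feats.shape)           # torch.Size([32, 768]) torch.Size([32, 24576])
```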
6.2 The Training Process: From Activation Capture to Feature Disentanglement
The process of using an SAE to interpret an LLM involves several distinct steps:
- Activation Capture: The first step is to generate a large dataset of activations from the target LLM. This is done by running the LLM over a massive and diverse text corpus, such as The Pile or a subset like OpenWebText, and saving the activation vectors from a specific, chosen layer (e.g., the residual stream of layer 8) for each token or sentence [53].
- Overcomplete Dictionary Learning: The SAE training process is a form of dictionary learning. The decoder’s weight matrix can be viewed as the “dictionary,” where each column is a vector representing a single learned feature. The encoder’s job is to learn how to represent any given input activation as a sparse linear combination of these dictionary features (i.e., by outputting the coefficients for the linear combination) [40]. Because the latent dimension of the SAE is much larger than the input dimension, this is known as an overcomplete dictionary.
- Balancing Reconstruction and Sparsity: The SAE is trained by optimizing a loss function that balances two competing objectives: minimizing the reconstruction loss (typically the mean squared error between the original LLM activation and the SAE’s reconstructed activation) and maximizing sparsity (typically by minimizing the L1 norm of the latent activations) [53]. This trade-off is critical: too much emphasis on reconstruction will lead to a dense, non-interpretable solution, while too much emphasis on sparsity will lead to poor reconstruction and information loss.
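Continuing the SparseAutoencoder sketch from the previous subsection, a single hypothetical training step makes this balance concrete; the L1 coefficient below is an arbitrary illustrative value, and in practice it is tuned or replaced by a Top-K constraint:

```python
import torch

l1_coefficient = 5e-4
optimizer = torch.optim.Adam(sae.parameters(), lr=3e-4)

activations = torch.randn(4096, 768)           # stand-in for cached LLM activations
recon, feats = sae(activations)

reconstruction_loss = ((activations - recon) ** 2).mean()   # fidelity: reproduce the LLM activation
sparsity_loss = feats.abs().sum(dim=-1).mean()              # L1 penalty on the latent feature activations
loss = reconstruction_loss + l1_coefficient * sparsity_loss

optimizer.zero_grad()
loss.backward()
optimizer.step()
```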
6.3 From Polysemantic Neurons to Monosemantic Features: Evaluating SAE Success
The ultimate goal of training an SAE is to transform the LLM’s opaque, polysemantic neuron activations into a sparse vector of monosemantic features. A perfectly successful SAE would produce a feature dictionary where each latent dimension corresponds to a single, distinct, and human-interpretable concept [2].
Evaluating the success of an SAE is a significant challenge in itself. The primary method involves qualitative analysis. Researchers identify the text examples from the training corpus that cause the highest activation for a given SAE feature. They then inspect these “maximally activating examples” to see if they share a common, coherent theme. For instance, a feature might consistently activate on text related to legal documents, another on Python code involving lists, and another on sentences expressing sadness [39]. This process can be scaled up by using other powerful LLMs, like Claude or GPT-4, to automatically generate natural language descriptions of what each feature appears to represent [55]. The goal is to build a dictionary that maps the opaque internal state of an LLM to a sparse vector of understandable concepts, which can then be used to analyze, debug, and even control the model’s behavior.
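In code, the core of this analysis is a simple ranking loop. The sketch below is hypothetical: `get_llm_activations` stands in for whatever activation-capture pipeline is in use, and `sae` is a trained autoencoder like the one sketched earlier:

```python
import torch

def top_activating_examples(sae, texts, get_llm_activations, feature_idx, k=10):
    """Return the k text snippets whose tokens most strongly activate one SAE feature."""
    scores = []
    for text in texts:
        acts = get_llm_activations(text)          # (n_tokens, d_model) from the chosen layer
        _, feats = sae(acts)                      # (n_tokens, d_sae) sparse feature activations
        scores.append(feats[:, feature_idx].max().item())   # strongest activation within this snippet
    top = torch.tensor(scores).topk(min(k, len(scores)))
    return [(texts[i], scores[i]) for i in top.indices.tolist()]
```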
Section 7: The Open-Source Interpretability Ecosystem
The rapid progress in understanding superposition has been fueled by a vibrant open-source ecosystem of tools, libraries, and collaborative platforms. This ecosystem allows researchers to build on each other’s work, share models and findings, and standardize methodologies.
7.1 Libraries for Mechanistic Analysis
A number of key libraries form the foundation of modern mechanistic interpretability research:
- TransformerLens: Developed by Neel Nanda, this library is designed for general-purpose mechanistic interpretability of transformer models. Its core feature is the ability to easily load open-source LLMs and gain access to all their internal activations via “hooks.” These hooks allow a researcher to cache, view, edit, or ablate any activation during a forward pass. This makes it an indispensable tool for techniques like activation patching and circuit analysis, which are used to determine the causal role of different model components [57]. A minimal usage sketch follows this list.
- SAELens: This is a specialized library focused exclusively on sparse autoencoders. It is designed to help researchers train new SAEs, download and analyze pre-trained SAEs, and generate feature dashboards and visualizations using the integrated SAE-Vis library. It aims to streamline the entire SAE research workflow [57].
- Other Key Repositories: The broader ecosystem, primarily hosted on GitHub, includes many other valuable tools. These range from the official SAE implementations released by major labs like OpenAI and EleutherAI [40] to complete end-to-end pipelines like llama3_interpretability_sae, which provides a fully reproducible workflow from activation capture to SAE training and analysis on the Llama 3 model [55]. General-purpose intervention libraries like nnsight and pyvene also provide powerful frameworks for conducting causal experiments on model internals [57].
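The usage sketch referenced above: a typical TransformerLens session might look like the following. This is a minimal, hedged example; the model choice, layer index, and hook name are illustrative assumptions rather than recommendations:

```python
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")        # load an open-source LLM with hooks attached

prompt = "The Eiffel Tower is in the city of"
logits, cache = model.run_with_cache(prompt)             # cache every intermediate activation

resid = cache[utils.get_act_name("resid_post", 8)]       # residual stream after layer 8: (1, n_tokens, d_model)
print(resid.shape)

# Hooks can also edit activations during the forward pass, e.g. zero-ablating the last position:
def ablate_last_position(value, hook):
    value[:, -1, :] = 0.0
    return value

ablated_logits = model.run_with_hooks(
    prompt,
    fwd_hooks=[(utils.get_act_name("resid_post", 8), ablate_last_position)],
)
```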
7.2 Platforms and Repositories for Collaborative Research
Beyond code libraries, several platforms and communities are crucial for the dissemination and discussion of research:
- GitHub: This is the central hub for sharing code and data. Curated “Awesome” lists, such as Awesome-Interpretability-in-Large-Language-Models and Awesome-Sparse-Autoencoders, serve as invaluable, continuously updated directories of the latest papers, tools, and tutorials in the field [40].
- Neuronpedia: This is an open, collaborative platform designed for sharing the results of SAE analysis. It allows researchers to upload the interpretable features they have discovered from various models. Other researchers can then browse these features, view their maximally activating examples, and read the human-generated explanations of what they represent. This creates a shared, growing encyclopedia of the concepts learned by LLMs [57].
- Research Forums: Online communities, particularly LessWrong and the AI Alignment Forum, play a vital role in the intellectual life of the field. They provide a venue for researchers to post initial findings, float new theories, and engage in rapid, informal discussion and debate that moves much faster than the traditional academic peer-review cycle [57].
7.3 A Survey of Available Pre-trained SAEs and Activation Datasets
A major catalyst for the field has been the public release of pre-trained SAEs for popular open-source models like GPT-2, Pythia, Gemma, and Llama [40]. Training SAEs is computationally expensive, and the availability of these pre-trained artifacts allows a much wider range of researchers to participate in analysis without needing access to large-scale compute resources. Furthermore, the standardization of activation datasets, which are often derived from large, open corpora like The Pile or OpenWebText, allows for more direct and meaningful comparisons of results across different research groups and methodologies [55]. This shared infrastructure is critical for building a cumulative science of interpretability.
Part V: The Research Frontier and Open Problems
While the study of superposition has yielded profound insights, it has also revealed the immense scale of the challenges that lie ahead. The research frontier is now focused on pushing the boundaries of these findings, seeking universal principles that govern all models and tackling the formidable open problems that must be solved to make interpretability a scalable, practical engineering discipline.
Section 8: The Quest for Universality
A central ambition of mechanistic interpretability is to move beyond analyzing bespoke phenomena in individual models and discover universal principles of neural representation. If different models trained on similar data learn similar internal structures, then insights and safety techniques could be generalized, transforming interpretability from a craft into a science.
8.1 Defining and Testing for Universal Features Across Different LLMs
The Universality Hypothesis posits that different LLMs, particularly those with similar architectures, converge to learn similar internal feature representations when trained on large, diverse datasets [41]. Proving this would be a monumental breakthrough. It would mean that the specific features discovered in a model like GPT-2 Small could provide a map for understanding the representations in a much larger model like Llama 3.
A more recent and nuanced version of this idea is Analogous Feature Universality [43]. This hypothesis suggests that even if the exact feature vectors learned by two different models are not identical, the high-dimensional spaces spanned by their SAE features are similar up to some transformation (like a rotation). If true, this would imply that interpretability techniques that operate on the geometry of the latent space, such as finding “steering vectors” to control model behavior, could be transferred from one model to another with a simple linear mapping.
8.2 Using SAEs and Representational Similarity Metrics to Compare Feature Spaces
The primary methodology for testing these hypotheses leverages the tools developed for superposition analysis [39]. The process involves:
- Training SAEs on the activations of two or more different LLMs to obtain their respective disentangled feature dictionaries.
- Matching corresponding features across the models, typically by finding pairs of features that have the highest correlation in their activation patterns on a shared dataset.
- Applying representational similarity metrics to quantify the similarity between the matched feature spaces. Key metrics include Singular Value Canonical Correlation Analysis (SVCCA), which finds common directions of variance between two spaces, and Representational Similarity Analysis (RSA), which compares the geometric structure of relationships between features within each space [39].
Initial findings from this line of research are promising. Experiments comparing models within the Pythia and Gemma families have provided evidence for feature universality, revealing statistically significant similarities in their SAE feature spaces [41]. This similarity appears to be strongest in the middle layers of the models, suggesting a pattern where early layers process low-level information, middle layers converge on a more universal set of abstract concepts, and later layers specialize again for the final prediction task. Furthermore, when analyzing specific semantic subspaces (e.g., the set of all features related to “emotions” or “programming”), the similarity across models is even higher, reinforcing the idea that models are learning analogous conceptual representations [39].
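A simplified version of steps 2 and 3 of this pipeline might look like the sketch below. It is illustrative only: published studies use SVCCA and more careful matching procedures, which are replaced here by plain correlation matching and a Spearman-based RSA comparison:

```python
import numpy as np
from scipy.stats import spearmanr

def match_and_compare(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    """feats_a, feats_b: (n_tokens, n_features) SAE activations from two models on the same text."""
    # Step 2: match each feature in model A to its most correlated feature in model B.
    a = (feats_a - feats_a.mean(0)) / (feats_a.std(0) + 1e-8)
    b = (feats_b - feats_b.mean(0)) / (feats_b.std(0) + 1e-8)
    corr = a.T @ b / len(a)                  # (n_feat_a, n_feat_b) correlation matrix
    match = corr.argmax(axis=1)              # best partner in model B for each feature in model A

    # Step 3 (RSA-style): compare the geometry of feature-to-feature relationships in each space.
    sim_a = np.corrcoef(a.T)                 # how model A's features relate to one another
    sim_b = np.corrcoef(b[:, match].T)       # the same structure, built from the matched B features
    iu = np.triu_indices_from(sim_a, k=1)
    rho, _ = spearmanr(sim_a[iu], sim_b[iu]) # high rho suggests similar representational geometry
    return rho
```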
8.3 Implications of a Universal Feature Space for Generalizable Interpretability
The discovery of universal features or feature spaces would have transformative implications for the field.
- A Predictive Science of Deep Learning: It would be a major step toward a more rigorous and predictive science of AI. Instead of being limited to post-hoc explanations of individual model behaviors, researchers could begin to form generalizable theories about how neural networks represent and process information.
- Transferable Safety and Alignment Techniques: The most significant practical benefit would be for AI safety. If harmful capabilities like deception, bias, or the pursuit of dangerous goals are implemented by specific, identifiable features, and if those features are universal, then safety techniques could become transferable. A method developed to detect and ablate a “deception feature” in one open-source model could potentially be adapted to find and neutralize the analogous feature in a closed-source, frontier model, greatly improving our ability to audit and secure powerful AI systems.
Section 9: Open Problems and Future Directions
Despite rapid progress, the field of mechanistic interpretability is still in its infancy, and many fundamental challenges remain. The path from our current understanding to a complete, scalable science of interpretability is paved with difficult open problems that are the focus of intense research at top conferences like ICLR and NeurIPS.
9.1 Scaling Interpretability: The Challenge of Immense Models and Distributed Representations
The most pressing challenge is scalability. Current interpretability methods, including both circuit analysis and SAE training, often focus on small models (e.g., GPT-2 Small) and do not scale gracefully to frontier models with hundreds of billions or even trillions of parameters [62]. This is sometimes referred to as the “scaling paradox”: as models become larger and more capable, they also become exponentially more complex and opaque, making them harder to interpret [65].
The computational cost of training SAEs or running the vast number of forward passes required for causal analysis on state-of-the-art models is immense, creating a significant bottleneck for research [64]. The ultimate scalability of methods that rely on learning transformations of the activation space is bounded by the hidden dimension size, which can be enormous in modern LLMs [62].
Furthermore, validating interpretations at scale is an unsolved problem. It is difficult to rigorously verify that a discovered explanation for a model’s behavior is truly faithful to its internal mechanism and not just a plausible-sounding, cherry-picked narrative [63]. This challenge is compounded by the risk of “hallucinated explanations,” where an LLM used to explain another LLM’s behavior may simply generate a plausible but incorrect rationale [66].
9.2 Beyond Representation: Understanding Circuits and Computation in Superposition
A major theme in recent research is the push to move beyond understanding what features are represented to understanding how the model computes with them [4]. This involves identifying circuits: subgraphs of specific neurons and attention heads that work together to implement recognizable algorithms, such as indirect object identification or factual recall [66].
A fundamental open problem in this domain is understanding feature evolution across layers. How are simple features in early layers composed and transformed into more abstract features in later layers? How does the model perform computations on these features while they are in a superposed state? Recent work like SAE-Match, which proposes a data-free method for aligning SAE features between different layers of a network, represents an early step toward tackling this problem [69].
Given the complexity of these circuits, automation is essential. Manually reverse-engineering the algorithm for even a simple behavior can take months of effort. A key research priority is therefore the development of automated or semi-automated methods for circuit discovery that can scale to the complexity of modern LLMs [57].
9.3 The Path Forward: Key Questions from ICLR, NeurIPS, and the Broader Research Community
The future of the field will be defined by the pursuit of answers to several key open questions that are actively being explored by the research community:
- How can we improve SAEs? Current research is exploring architectural variations like Gated SAEs and JumpReLU SAEs, which aim to achieve a better trade-off between reconstruction fidelity and feature sparsity, leading to more interpretable and useful feature dictionaries [40].
- Are all features linear? The current SAE paradigm is based on the Linear Representation Hypothesis, which assumes that concepts are represented as linear combinations of neuron activations. Some recent work suggests that important features may in fact be non-linear, which would pose a major challenge to existing methods and may require entirely new tools to discover [40].
- How can we connect interpretability to concrete engineering goals? A critical goal for the field is to translate insights into practical applications. This includes using interpretability to audit models for bias and safety [63], to debug and improve model performance, and even to derive formal, provable guarantees about model behavior on specific tasks [71].
- What are the next paradigms for interpretability? While SAEs and circuit analysis are the dominant paradigms today, researchers are exploring other approaches. These include using LLMs themselves to generate interactive, natural language explanations of their own reasoning [66] and designing models that are more inherently interpretable by incorporating transparent, rule-based components [70].
Conclusion
Superposition is far more than a technical curiosity; it is a fundamental organizing principle of modern large language models. It stands as a powerful, learned compression strategy that enables the remarkable information density and scaling properties of these systems. At the same time, it is the primary source of polysemanticity, the central obstacle to achieving true mechanistic interpretability. The research journey into superposition has revealed its deep connections to the core phenomena of deep learning—linking scaling laws, generalization, memorization, and double descent under a single, coherent mechanistic framework. It has also exposed the profound challenges that superposed knowledge representations pose to the long-term safety and maintainability of AI, particularly in the critical domain of lifelong knowledge editing.
The development of tools like Sparse Autoencoders has transformed the study of superposition from a theoretical exercise into an empirical science, allowing researchers to peer inside the black box and disentangle opaque activations into meaningful, monosemantic features. This has catalyzed a vibrant open-source ecosystem and an ambitious research agenda aimed at discovering universal principles of neural representation.
Yet, the path forward is defined by formidable open problems. Scaling these interpretability techniques to the immense size of frontier models, moving from understanding static representations to dynamic computations, and validating findings with scientific rigor remain profound challenges. The solutions will likely require not only conceptual and practical improvements to our current methods but also the exploration of entirely new paradigms for building and understanding intelligent systems. The continued effort to resolve the mystery of superposition is therefore not just a quest to explain our current models; it is integral to the broader scientific and engineering endeavor of building more capable, reliable, and ultimately, trustworthy artificial intelligence.
Works Cited
[1] “Superposition: What Makes it Difficult to Explain Neural Network”. Towards Data Science. Accessed: Aug. 5, 2025. Available: https://towardsdatascience.com/superposition-what-makes-it-difficult-to-explain-neural-network-565087243be4
[2] “Understanding Superposition in Neural Networks: A Guide Through Analogies”. Medium. Accessed: Aug. 5, 2025. Available: https://medium.com/@hvoecking/understanding-superposition-in-neural-networks-e6e029488284
[3] “Unraveling the Mystery of Superposition in Large Language Models”. Medium. Accessed: Aug. 5, 2025. Available: https://medium.com/@vedantgaur101/unraveling-the-mystery-of-superposition-in-large-language-models-ef8f1715809a
[4] “On the Complexity of Neural Computation in Superposition”. DSpace@MIT. Accessed: Aug. 5, 2025. Available: https://dspace.mit.edu/bitstream/handle/1721.1/157073/Superposition.pdf?sequence=1&isAllowed=y
[5] “Toy Models of Superposition”. Anthropic. Accessed: Aug. 5, 2025. Available: https://www.anthropic.com/research/toy-models-of-superposition
[6] “Superposition through Active Learning Lens”. arXiv. Accessed: Aug. 5, 2025. Available: https://arxiv.org/html/2412.16168v1
[7] “Incidental Polysemanticity: A New Obstacle for Mechanistic …”. OpenReview. Accessed: Aug. 5, 2025. Available: https://openreview.net/forum?id=OeHSkJ58TG
[8] “Understanding Polysemanticity in AI: Multiple Meanings in Neural Networks”. Alphanome.AI. Accessed: Aug. 5, 2025. Available: https://www.alphanome.ai/post/understanding-polysemanticity-in-ai-multiple-meanings-in-neural-networks
[9] “What Causes Polysemanticity? An Alternative Origin Story of Mixed Selectivity from Incidental Causes”. OpenReview. Accessed: Aug. 5, 2025. Available: https://openreview.net/forum?id=AHfE6WeJLQ
[10] “Formulation of Feature Circuits with Sparse Autoencoders in LLM”. Towards Data Science. Accessed: Aug. 5, 2025. Available: https://towardsdatascience.com/formulation-of-feature-circuits-with-sparse-autoencoders-in-llm/
[11] “[PDF] Toy Models of Superposition”. Semantic Scholar. Accessed: Aug. 5, 2025. Available: https://www.semanticscholar.org/paper/Toy-Models-of-Superposition-Elhage-Hume/9d125f45b1d2dea01f05281470bc08e12b6c7cba
[12] “LLMs, Quantum Physics, and the Language of the Unthinkable”. Psychology Today. Accessed: Aug. 5, 2025. Available: https://www.psychologytoday.com/us/blog/the-digital-self/202504/llms-quantum-physics-and-the-language-of-the-unthinkable
[13] “The Quantum LLM: Modeling Semantic Spaces with Quantum Principles”. arXiv. Accessed: Aug. 5, 2025. Available: https://arxiv.org/html/2504.13202v2
[14] “The output of an LLM is produced by a superposition of simulated entities…”. Hacker News. Accessed: Aug. 5, 2025. Available: https://news.ycombinator.com/item?id=35043613
[15] “Google Quantum Computer: Shaping Tomorrow’s Technology”. BlueQubit. Accessed: Aug. 5, 2025. Available: https://www.bluequbit.io/google-quantum-computer
[16] “Superposition and entanglement”. Quantum Inspire. Accessed: Aug. 5, 2025. Available: https://www.quantum-inspire.com/kbase/superposition-and-entanglement/
[17] “5 Concepts Can Help You Understand Quantum Mechanics and Technology — Without Math!”. NIST. Accessed: Aug. 5, 2025. Available: https://www.nist.gov/blogs/taking-measure/5-concepts-can-help-you-understand-quantum-mechanics-and-technology-without
[18] “Quantum Neural Networks (QNNs)”. Classiq. Accessed: Aug. 5, 2025. Available: https://www.classiq.io/insights/quantum-neural-networks-qnns
[19] “How do the phenomena of superposition and entanglement enable quantum computers…?”. EITCA Academy. Accessed: Aug. 5, 2025. Available: https://eitca.org/artificial-intelligence/eitc-ai-tfqml-tensorflow-quantum-machine-learning/introduction-eitc-ai-tfqml-tensorflow-quantum-machine-learning/introduction-to-google-ai-quantum/examination-review-introduction-to-google-ai-quantum/how-do-the-phenomena-of-superposition-and-entanglement-enable-quantum-computers-to-perform-certain-calculations-more-efficiently-than-classical-computers/
[20] “ELI5: How do quantum computers use superposition and entanglement…?”. Reddit. Accessed: Aug. 5, 2025. Available: https://www.reddit.com/r/explainlikeimfive/comments/1e1deup/eli5_how_do_quantum_computers_use_superposition/
[21] “From superposition to sparse codes: interpretable representations in neural networks”. arXiv. Accessed: Aug. 5, 2025. Available: https://arxiv.org/html/2503.01824v1
[22] “When Numbers Are Bits: How Efficient Are Distributed Representations?”. cgad.ski. Accessed: Aug. 5, 2025. Available: https://cgad.ski/blog/when-numbers-are-bits.html
[23] “Distributed Representations: Composition & Superposition”. Transformer Circuits. Accessed: Aug. 5, 2025. Available: https://transformer-circuits.pub/2023/superposition-composition/index.html
[24] “Mathematical Models of Computation in Superposition”. arXiv. Accessed: Aug. 5, 2025. Available: https://arxiv.org/html/2408.05451v1
[25] “Toy Models of Superposition: Simplified by Hand”. LessWrong. Accessed: Aug. 5, 2025. Available: https://www.lesswrong.com/posts/8CJuugNkH5FSS9H2w/toy-models-of-superposition-simplified-by-hand
[26] “Superposition, Memorization, and Double Descent”. Transformer Circuits Thread. Accessed: Aug. 5, 2025. Available: https://transformer-circuits.pub/2023/toy-double-descent/index.html
[27] “[1.3.1] Toy Models of Superposition & Sparse Autoencoders”. Transformer Interpretability. Accessed: Aug. 5, 2025. Available: https://arena-chapter1-transformer-interp.streamlit.app/[1.3.1]Toy_Models_of_Superposition&_SAEs
[28] “Toy Models of Superposition”. ResearchGate. Accessed: Aug. 5, 2025. Available: https://www.researchgate.net/publication/363766017_Toy_Models_of_Superposition
[29] “Incidental polysemanticity”. LessWrong. Accessed: Aug. 5, 2025. Available: https://www.lesswrong.com/posts/sEyWufriufTnBKnTG/incidental-polysemanticity
[30] “The Persian Rug: solving toy models of superposition using large-scale symmetries”. OpenReview. Accessed: Aug. 5, 2025. Available: https://openreview.net/forum?id=rapXZIfwbX
[31] “Mathematical Models of Computation in Superposition”. Squarespace. Accessed: Aug. 5, 2025. Available: https://static1.squarespace.com/static/663d1233249bce4815fe8753/t/68067911dfddce5181366c8a/1745254675242/mathematical+models+of+computation+in+superposition+-+Jake+Mendel.pdf
[32] “Compressed Computation: Dense Circuits in a Toy Model of the Universal-AND Problem”. arXiv. Accessed: Aug. 5, 2025. Available: https://arxiv.org/html/2507.09816v1
[33] “Superposition, Memorization, and Double Descent”. Anthropic. Accessed: Aug. 5, 2025. Available: https://www.anthropic.com/research/superposition-memorization-and-double-descent
[34] “Paper: Superposition, Memorization, and Double Descent (Anthropic)”. LessWrong. Accessed: Aug. 5, 2025. Available: https://www.lesswrong.com/posts/6Ks6p33LQyfFkNtYE/paper-superposition-memorization-and-double-descent
[35] “[Literature Review] Knowledge in Superposition: Unveiling the …”. The Moonlight. Accessed: Aug. 5, 2025. Available: https://www.themoonlight.io/en/review/knowledge-in-superposition-unveiling-the-failures-of-lifelong-knowledge-editing-for-large-language-models
[36] “Knowledge in Superposition: Unveiling the Failures of Lifelong Knowledge Editing for Large Language Models”. ResearchGate. Accessed: Aug. 5, 2025. Available: https://www.researchgate.net/publication/383119592_Knowledge_in_Superposition_Unveiling_the_Failures_of_Lifelong_Knowledge_Editing_for_Large_Language_Models
[37] “Unveiling the Failures of Lifelong Knowledge Editing for Large Language Models”. arXiv. Accessed: Aug. 5, 2025. Available: https://arxiv.org/html/2408.07413v1
[38] “Unveiling the Failures of Lifelong Knowledge Editing for Large Language Models”. AAAI. Accessed: Aug. 5, 2025. Available: https://ojs.aaai.org/index.php/AAAI/article/view/34583/36738
[39] “Sparse Autoencoders Reveal Universal Feature Spaces Across Large Language Models”. arXiv. Accessed: Aug. 5, 2025. Available: https://arxiv.org/html/2410.06981v1
[40] “chrisliu298/awesome-sparse-autoencoders”. GitHub. Accessed: Aug. 5, 2025. Available: https://github.com/chrisliu298/awesome-sparse-autoencoders
[41] “Sparse Autoencoders Reveal Universal Feature Spaces Across Large Language Models”. OpenReview. Accessed: Aug. 5, 2025. Available: https://openreview.net/forum?id=rbHOLX8OWh
[42] “[Literature Review] Sparse Autoencoders Reveal Universal Feature Spaces…”. The Moonlight. Accessed: Aug. 5, 2025. Available: https://www.themoonlight.io/en/review/sparse-autoencoders-reveal-universal-feature-spaces-across-large-language-models
[43] “[2410.06981] Quantifying Feature Space Universality Across Large Language Models…”. arXiv. Accessed: Aug. 5, 2025. Available: https://arxiv.org/abs/2410.06981
[44] “Superposition Yields Robust Neural Scaling”. arXiv. Accessed: Aug. 5, 2025. Available: https://arxiv.org/html/2505.10465v1
[45] “Superposition Yields Robust Neural Scaling”. arXiv. Accessed: Aug. 5, 2025. Available: https://arxiv.org/html/2505.10465v2
[46] “Knowledge in Superposition: Unveiling the Failures of Lifelong Knowledge Editing…”. ResearchGate. Accessed: Aug. 5, 2025. Available: https://www.researchgate.net/publication/390699417_Knowledge_in_Superposition_Unveiling_the_Failures_of_Lifelong_Knowledge_Editing_for_Large_Language_Models
[47] “[2408.07413] Knowledge in Superposition: Unveiling the Failures…”. arXiv. Accessed: Aug. 5, 2025. Available: https://arxiv.org/abs/2408.07413
[48] “[2502.00602] Mitigating Heterogeneous Token Overfitting in LLM Knowledge Editing”. arXiv. Accessed: Aug. 5, 2025. Available: https://arxiv.org/abs/2502.00602
[49] “Mitigating Heterogeneous Token Overfitting in LLM Knowledge Editing”. OpenReview. Accessed: Aug. 5, 2025. Available: https://openreview.net/forum?id=vOu5K93z4f&referrer=%5Bthe%20profile%20of%20Hui%20Liu%5D(%2Fprofile%3Fid%3D~Hui_Liu3
[50] “Precise Localization of Memories: A Fine-grained Neuron-level Knowledge Editing…”. OpenReview. Accessed: Aug. 5, 2025. Available: https://openreview.net/forum?id=5xP1HDvpXI
[51] “Superposition in Transformers: A Novel Way of Building Mixture of Experts”. Hugging Face. Accessed: Aug. 5, 2025. Available: https://huggingface.co/blog/BenChaliah/superposition-in-transformers
[52] “Superposition in Transformers: A Novel Way of Building Mixture of Experts”. arXiv. Accessed: Aug. 5, 2025. Available: https://arxiv.org/html/2501.00530v1
[53] “Sparse AutoEncoder: from Superposition to interpretable features”. Medium. Accessed: Aug. 5, 2025. Available: https://medium.com/data-science/sparse-autoencoder-from-superposition-to-interpretable-features-4764bb37927d
[54] “An Intuitive Explanation of Sparse Autoencoders for LLM Interpretability”. Adam Karvonen. Accessed: Aug. 5, 2025. Available: https://adamkarvonen.github.io/machine_learning/2024/06/11/sae-intuitions.html
[55] “PaulPauls/llama3_interpretability_sae: A complete end-to …”. GitHub. Accessed: Aug. 5, 2025. Available: https://github.com/PaulPauls/llama3_interpretability_sae
[56] “Local vs distributed representations: What is the right basis for interpretability?”. OpenReview. Accessed: Aug. 5, 2025. Available: https://openreview.net/forum?id=fmWVPbRGC4
[57] “ruizheliUOA/Awesome-Interpretability-in-Large-Language …”. GitHub. Accessed: Aug. 5, 2025. Available: https://github.com/ruizheliUOA/Awesome-Interpretability-in-Large-Language-Models
[58] “JShollaj/awesome-llm-interpretability”. GitHub. Accessed: Aug. 5, 2025. Available: https://github.com/JShollaj/awesome-llm-interpretability
[59] “jbloomAus/SAELens: Training Sparse Autoencoders on …”. GitHub. Accessed: Aug. 5, 2025. Available: https://github.com/jbloomAus/SAELens
[60] “hijohnnylin/mats_sae_training: Training Sparse Autoencoders on Language Models”. GitHub. Accessed: Aug. 5, 2025. Available: https://github.com/hijohnnylin/mats_sae_training
[61] “Quantifying Feature Space Universality Across Large Language Models…”. arXiv. Accessed: Aug. 5, 2025. Available: http://arxiv.org/pdf/2410.06981
[62] “Scaling interpretability with LLMs”. Stanford NLP Group. Accessed: Aug. 5, 2025. Available: https://nlp.stanford.edu/~wuzhengx/boundless_das/index.html
[63] “LLM Interpretability: Understanding What Models Learn and Why”. Medium. Accessed: Aug. 5, 2025. Available: https://medium.com/@rizqimulkisrc/llm-interpretability-understanding-what-models-learn-and-why-18da790629fb
[64] “[2402.01761] Rethinking Interpretability in the Era of Large Language Models”. arXiv. Accessed: Aug. 5, 2025. Available: https://arxiv.org/abs/2402.01761
[65] “The Explainability Challenge of Generative AI and LLMs”. OCEG. Accessed: Aug. 5, 2025. Available: https://www.oceg.org/the-explainability-challenge-of-generative-ai-and-llms/
[66] “Rethinking Interpretability in the Era of Large Language Models”. arXiv. Accessed: Aug. 5, 2025. Available: https://arxiv.org/html/2402.01761v1
[67] “[2501.16496] Open Problems in Mechanistic Interpretability”. arXiv. Accessed: Aug. 5, 2025. Available: https://arxiv.org/abs/2501.16496
[68] “Open Problems in Mechanistic Interpretability”. OpenReview. Accessed: Aug. 5, 2025. Available: https://openreview.net/forum?id=91H76m9Z94
[69] “ICLR Poster Mechanistic Permutability: Match Features Across Layers”. ICLR 2025. Accessed: Aug. 5, 2025. Available: https://iclr.cc/virtual/2025/poster/29946
[70] “Interpretable AI: Past, Present and Future”. Interpretable AI Workshop. Accessed: Aug. 5, 2025. Available: https://interpretable-ai-workshop.github.io/
[71] “Compact Proofs of Model Performance via Mechanistic Interpretability”. NeurIPS. Accessed: Aug. 5, 2025. Available: https://proceedings.neurips.cc/paper_files/paper/2024/hash/90e73f3cf1a6c84c723a2e8b7fb2b2c1-Abstract-Conference.html
[72] “LLMs for Explainable AI: A Comprehensive Survey”. arXiv. Accessed: Aug. 5, 2025. Available: https://arxiv.org/html/2504.00125v1
[73] “NeurIPS Poster Crafting Interpretable Embeddings for Language Neuroscience…”. NeurIPS. Accessed: Aug. 5, 2025. Available: https://neurips.cc/virtual/2024/poster/93720