TL;DR
- Dead neurons silently waste compute and reduce effective model capacity in foundation models.
- Simple visualizations of activation frequencies make neuron health measurable.
- Dead neurons can be prevented by choosing the right activation function and revived with synaptic stripping.
- Proactively monitoring neuron health with audits and alerts is crucial for successful foundation model training.
In neural networks, some neurons end up outputting near-zero activations across all inputs. These so-called “dead neurons” degrade model capacity because those parameters are effectively wasted, and they weaken generalization by reducing the diversity of learned features.
While this phenomenon is nothing new, it has become increasingly relevant with the emergence of large foundation models. In this article, we will discuss why that is the case and what the resulting impact is. We will also review methods for the detection and visualization of dead neurons, as well as strategies to prevent and fix them.
Dead neurons’ impact
Recent studies into dead neurons in the context of foundation models show interesting, albeit worrying, results. A 2020 paper by Qatari researchers Dalvi et al. shows that in BERT and XLNet, 85% of all neurons are redundant for the downstream task. A more recent 2023 study by Meta AI researchers Voita et al. looked at LLMs from the OPT family, ranging from 125M to 66B parameters, only to find that, in some layers, more than 70% of the neurons are dead.
These large reported fractions of redundant and dead neurons in foundation models are a concern from a computational perspective. While in a 100M-parameter CNN losing some neurons is a minor inefficiency, having 70-85% of neurons redundant or dead in a billion-parameter LLM means significant amounts of GPU-hours wasted, both at training and inference time. These dead neurons constitute a hidden form of compute tax, if you will.
Leaving the computational efficiency aside, dead neurons are likely to impede the model’s performance, too. With a large number of neurons unused, the effective model size becomes much smaller than its nominal size. Consequently, fewer features are learned, leading to impaired generalization as the model increasingly relies on memorizing the data.
Another consequence of having many dead neurons in the model is that it learns a more entangled data representation. Consider discrete feature detectors, or neurons that reliably activate for some interpretable pattern in the data. Think of a neuron that lights up whenever it sees a vertical edge in a vision model, or a neuron that fires strongly on HTML tags in an LLM. These types of neurons are quite valuable to have in a model as they make representations more disentangled: each dimension of the representation corresponds more cleanly to a specific factor of variation.
If a large fraction of neurons are dead, we lose the “slots” that could have been allocated to these specialized detectors. The model still has to encode the same amount of information, but with fewer working neurons. As a result, the remaining neurons each activate for a variety of patterns (e.g., one neuron might respond to numbers, capital letters, and dates alike). This reduces the model’s ability to learn clean, specialized representations, potentially affecting downstream performance.
Finally, and perhaps not surprisingly, dead neurons waste memory. Their parameters take up space without contributing anything, making it more challenging to load, fine-tune, and serve large foundation models.
Before we move on to discuss how to detect and fix dead neurons, let’s touch upon an important distinction between dead neurons and vanishing gradients. While these are distinct phenomena, they are intimately related. Vanishing gradients effectively prevent weight updates during training, which can “freeze” a neuron into inactivity. Conversely, once a neuron becomes permanently dead, no gradient flows through it, so the weights feeding it stop receiving updates. Thus, preventing gradients from vanishing is one of the strategies against dead neurons, as we will see later in the article.
Visualizing activation distributions
Is your foundation model suffering from dead neurons? A convenient way to find out is through visualization. We can plot activation histograms and heatmaps, as well as the percentage of dead neurons for different layers of the model, to get a sense of how large the issue is.
In this section, we will examine these visualization strategies using a version of OpenAI’s GPT-2 as an example. We use this relatively small model for computational efficiency. Note that in such a small model, we might not see as high a proportion of dead neurons as we would in a bigger, more recent model such as GPT-5. However, the techniques we will discuss are directly applicable to larger models, too.
💡 You can explore all charts interactively on this Neptune dashboard. The code used to produce the plots is available on GitHub.
I have sampled some data from the WikiText-2 dataset and passed it through Tiny GPT-2 from HuggingFace (see its model card for additional information). For each batch of tokens processed by the model, I collected a set of different activations from the transformer blocks at different layers:
- mlp_pre: Activations before the activation functions.
- mlp_post: Activations after the activation functions.
- attn_out: The outputs of the self-attention block.
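If you want to reproduce this kind of collection yourself, PyTorch forward hooks are one straightforward option. The snippet below is only a minimal sketch of that idea: the checkpoint name is a placeholder, and the module names (`h`, `mlp.c_fc`, `mlp.act`, `attn`) follow the HuggingFace GPT-2 implementation and may differ in other transformers versions, so the original collection script differs in detail.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Placeholder checkpoint: substitute the Tiny GPT-2 model card referenced above.
MODEL_NAME = "sshleifer/tiny-gpt2"

model = AutoModel.from_pretrained(MODEL_NAME)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model.eval()

activations = {}  # maps e.g. "layer0/mlp_post" to a list of tensors

def make_hook(name):
    def hook(module, inputs, output):
        out = output[0] if isinstance(output, tuple) else output  # attn returns a tuple
        activations.setdefault(name, []).append(out.detach().cpu())
    return hook

# In HuggingFace's GPT-2, the transformer blocks live in model.h; each MLP
# exposes c_fc (the pre-activation projection) and act (the activation module).
for i, block in enumerate(model.h):
    block.mlp.c_fc.register_forward_hook(make_hook(f"layer{i}/mlp_pre"))
    block.mlp.act.register_forward_hook(make_hook(f"layer{i}/mlp_post"))
    block.attn.register_forward_hook(make_hook(f"layer{i}/attn_out"))

# In the actual analysis, batches of WikiText-2 tokens would be fed here.
with torch.no_grad():
    batch = tokenizer("Dead neurons waste compute.", return_tensors="pt")
    model(**batch)
```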
I flattened and aggregated these activations to extract the following metrics:
- Activation frequency: The fraction of inputs where a neuron fires above an arbitrarily chosen threshold of 0.001.
- Activation histograms: The distribution of activation values.
- Dead neuron ratio: The percentage of neurons with an activation frequency below the same firing threshold as above.
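Computing these metrics from the captured activations boils down to a few tensor reductions. Here is a minimal sketch, assuming the `activations` dict from the previous snippet and the 0.001 firing threshold; the aggregation in the original script may differ in detail.

```python
import torch

FIRE_THRESHOLD = 0.001  # arbitrary magnitude above which a neuron counts as "firing"

def neuron_metrics(tensors, fire_threshold=FIRE_THRESHOLD):
    # Flatten batch and sequence dims so rows are tokens and columns are neurons.
    acts = torch.cat([t.reshape(-1, t.shape[-1]) for t in tensors], dim=0)
    fires = acts.abs() > fire_threshold
    # Activation frequency: fraction of tokens on which each neuron fires.
    freq = fires.float().mean(dim=0)
    # Dead neuron ratio: share of neurons that (almost) never fire.
    dead_ratio = (freq < fire_threshold).float().mean().item()
    # Activation histogram of the raw values (e.g., 100 bins).
    hist = torch.histc(acts.float(), bins=100)
    return freq, dead_ratio, hist

freq, dead_ratio, hist = neuron_metrics(activations["layer0/mlp_post"])
print(f"layer 0 mlp_post: {dead_ratio:.2%} dead neurons")
```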
Activation frequency
Let’s start by looking at the activation frequencies:

The six panes show the activation frequencies for two of the model’s layers (the first, with index 0, and the sixth, with index 5), shown across rows, for mlp_pre, mlp_post, and attn_out, shown across columns.
The horizontal axis shows consecutive neurons, sorted by how often they fire. Colors mark the fraction of inputs activating the corresponding neuron. Blue neurons basically never fire, while perfectly yellow neurons fire on every token.
Note that the color legend for mlp_pre and attn_out spans only very high values, all above 99%, meaning that those neurons are very much alive. The mlp_post outputs, however, look quite different. Their colormap covers a much broader dynamic range: some neurons fire almost constantly (close to yellow), but a substantial group sits at the low end, firing far less often (down to roughly 20% of tokens). This uneven distribution is expected because, after the non-linear activation (GELU, more on that later), many neurons are pushed close to zero most of the time.
The key takeaway from these heatmaps is that “dead” or underused neurons mostly appear after the nonlinearity (mlp_post). That’s exactly where we would expect it, since activations are being gated. The pre-activation and attention projections, in contrast, show high activity. This is a desired pattern for our foundation model.
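For completeness, here is roughly how such heatmap panes can be reproduced with matplotlib, assuming the `activations` dict and the `neuron_metrics` helper from the previous snippets (the published charts live in the Neptune dashboard, so the details differ):

```python
import matplotlib.pyplot as plt
import torch

# One row per layer, one column per activation type, neurons sorted by frequency.
layers, kinds = [0, 5], ["mlp_pre", "mlp_post", "attn_out"]
fig, axes = plt.subplots(len(layers), len(kinds), figsize=(12, 4))
for r, layer in enumerate(layers):
    for c, kind in enumerate(kinds):
        freq, _, _ = neuron_metrics(activations[f"layer{layer}/{kind}"])
        sorted_freq = torch.sort(freq).values.unsqueeze(0)  # 1 x n_neurons strip
        im = axes[r][c].imshow(sorted_freq, aspect="auto", cmap="viridis")
        axes[r][c].set_title(f"layer {layer} / {kind}")
        axes[r][c].set_yticks([])
        fig.colorbar(im, ax=axes[r][c])
plt.tight_layout()
plt.show()
```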
Activation histograms
Let’s now turn our attention to the distributions of activation values:

The three charts show very different patterns. Before activation (mlp_pre), the distribution is roughly Gaussian and centered not far from zero. This is a healthy shape; it means inputs are spread across both negative and positive values, allowing the activation function to “decide” which neurons to switch off. If this distribution were strongly shifted (far from zero), the nonlinearity could saturate, leading to more dead neurons. Luckily, this is not the case for our GPT-2.
The mlp_post histogram shows a strong spike at zero with a long right tail. This suggests that most activation outputs fall close to zero. Those that are too close are effectively dead, which corresponds to our insights from the heatmap analysis. A small fraction of inputs produce large positive activations (visible in the tail). These neurons fire selectively on rare but important contexts.
The sharp spike around zero in the self-attention outputs (attn_out) suggests that attention outputs are sparse: many tokens receive little signal from attention heads. Occasional larger and smaller values reflect strong attention weights when the model attends to a key token. This sparsity is consistent with how attention should behave: most queries ignore most keys, but a few connections dominate.
Dead neuron ratio
Let us now examine the ratio of dead neurons, visualized as a line chart:

The Y-axis on this chart indicates the percentage of neurons that are dead, while the X-axis corresponds to the six model layers, indexed from 0 to 5.
This visualization confirms our findings from the heatmap analysis. The dead ratios are very low overall. Even in mlp_post, 99.9% of neurons are doing something on at least some tokens, which is extremely healthy. In a larger foundation model, we would likely see higher dead ratios.
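If you want to reproduce this chart, the per-layer ratios can be assembled from the same helper; a minimal sketch, again assuming the `activations` dict and `neuron_metrics` from above and a six-layer model:

```python
import matplotlib.pyplot as plt

n_layers = 6
for kind in ["mlp_pre", "mlp_post", "attn_out"]:
    ratios = [neuron_metrics(activations[f"layer{i}/{kind}"])[1] * 100
              for i in range(n_layers)]
    plt.plot(range(n_layers), ratios, marker="o", label=kind)
plt.xlabel("layer index")
plt.ylabel("dead neurons (%)")
plt.legend()
plt.show()
```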
Equipped with a visualization toolbox to discover dead neurons, let’s discuss a few approaches to prevent them. The next section covers selecting activation functions, and the section after that looks at reviving inactive neurons.
Alternative activation functions
As we have mentioned before, if gradients in the network get too small, they tend to “vanish”, pushing the surrounding neurons into a state of inactivity. Consequently, one can prevent neurons from dying by ensuring the gradients do not vanish. One way to achieve this is with the right selection of activation functions.
Common activations
Those who pre-train or fine-tune foundation models have the freedom to select the activation functions to be used throughout the network. This choice typically constitutes a trade-off between computation speed and the ability of the activation to prevent neurons from dying.

ReLU is the fastest one to compute. However, it’s also very likely to produce dying neurons since it outputs zeros for any negative input. If the network’s weights end up in a state where the inputs to ReLU are consistently negative, then the entire ReLU-activated neuron keeps producing zeros. This is the main reason why ReLU is rarely used as anything other than a baseline.
Leaky ReLU adds a small but non-zero slope for negative values, decreasing the likelihood of neurons dying. The Exponential Linear Unit (ELU) offers another desirable characteristic. Just like Leaky ReLU, it has non-zero gradients for negative inputs. Unlike Leaky ReLU, however, ELU is smooth around zero, speeding up training convergence. The downside is that ELU is relatively slow to compute.
A couple of other activations inspired by ELU claim to improve on it. The Gaussian Error Linear Unit (GELU) weights its inputs by their value instead of simply thresholding by the sign, which has been found to lead to better model performance. Swish (also known as SiLU, e.g., in PyTorch) is similar to GELU in shape, but it has been specifically designed and evaluated to serve as a drop-in replacement for ReLU in any neural network.
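All of these are available as PyTorch modules. The snippet below lists them and runs a toy check of how often each maps standard-normal pre-activations to near-zero outputs; this is only an illustration of the gating behavior, not a substitute for measuring dead neurons on a real model.

```python
import torch
import torch.nn as nn

candidates = {
    "relu": nn.ReLU(),
    "leaky_relu": nn.LeakyReLU(negative_slope=0.01),
    "elu": nn.ELU(),
    "gelu": nn.GELU(),
    "swish": nn.SiLU(),  # Swish with beta=1 is called SiLU in PyTorch
}

# Toy comparison: fraction of outputs that land (almost) exactly at zero
# for standard-normal inputs. Real dead-neuron rates must be measured on
# the trained model, as in the next section.
x = torch.randn(100_000)
for name, act in candidates.items():
    zero_frac = (act(x).abs() < 1e-3).float().mean().item()
    print(f"{name:>10}: {zero_frac:.1%} near-zero outputs")
```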
A quick literature search reveals many more state-of-the-art activations, such as SELU or Mish. A natural question arises: how do we choose one in the context of large foundation models susceptible to dying neurons?
How to choose activation functions for foundation models
Training deep neural networks is a profoundly experimental endeavor. A typical approach to hyperparameter tuning in deep learning models is to perform a random or Bayesian search over the hyperparameter space and select a combination that results in the best outcome (such as accuracy, convergence speed, or whatever it is that we care the most about).
While the large amount of resources required to train a foundation model makes exploring a large hyperparameter space infeasible, we can still apply a somewhat similar approach to pick the activation function in foundation models, while optimizing for neuron liveness.
How do foundation model teams plan and budget their training runs?
The scale of infrastructure and amount of energy required to train a foundation model depend on its size and architecture. In turn, the specific hardware constrains size and architecture, with the GPU memory as a key restriction. Further, larger models generally need more training data, leading to longer training times.
Foundation model teams typically solve this chicken-and-egg problem by defining a compute budget beforehand. As a general rule of thumb, about a fifth of this budget can be spent on the main training run, with the remainder needed for experimentation and test runs.
The main run, which is training the model at full scale, often spans several weeks. Simultaneously, foundation model teams launch experimental runs on the side that are short and use a smaller model variant. The teams use these experimental runs to explore new architectures, hyperparameters, or training schedules. They closely monitor for promising early signals, and once they identify beneficial shifts in metrics, they incorporate these findings into the main training run.
Read more about how teams are implementing this iterative approach and other topics in Neptune’s 2025 State of Foundation Model Training Report.
Given a model that we wish to train, we can iteratively swap activation functions in its architecture and, for each, compare the rates of dead neurons empirically, as we did earlier using simple line charts. Consider the visualization below, which you can also explore interactively in this Neptune project. I used this Python script to swap the activations, collect dead neuron ratios, and log them into Neptune.
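For reference, here is a minimal sketch of the kind of swap such a script performs. It assumes the HuggingFace GPT-2 implementation, where each block’s MLP stores its activation in the `act` attribute; the actual script linked above also handles the measurement and Neptune logging.

```python
import torch.nn as nn
from transformers import AutoModel

def swap_activation(model, act_cls):
    # Replace the activation module inside every transformer block's MLP.
    for block in model.h:
        block.mlp.act = act_cls()
    return model

# Placeholder checkpoint, as in the earlier snippets.
model = AutoModel.from_pretrained("sshleifer/tiny-gpt2")
model = swap_activation(model, nn.SiLU)  # e.g., try Swish instead of GELU

# After swapping, (re)train or fine-tune before measuring: the pretrained
# weights were learned with the original activation, so dead-neuron ratios
# are only meaningful once the network has adapted to the new one.
```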

We are again looking at ratios of dead neurons in Tiny GPT-2, shown on the vertical axis. Each line corresponds to one of the activation functions described above. The horizontal axis corresponds to the subsequent model layers. Note that compared to the similar chart we have seen before, here the threshold for considering a neuron “dead” has been decreased slightly to show differences between the activations more prominently.
The comparison reveals substantial differences:
- Unsurprisingly, ReLU (orange) and Leaky ReLU (green) consistently show the highest dead neuron ratios, confirming their tendency to permanently silence neurons.
- GELU (blue) maintains much lower dead ratios across layers, reflecting why it has become a popular default in modern Transformers (starting with BERT; before that, Vaswani’s original transformer used ReLU).
- Swish (purple) and ELU (red) tend to work best in our experiment, with near-zero ratios of dead neurons.
This type of experiment makes the trade-offs concrete: while the original Tiny GPT-2 architecture uses GELU activations, this choice seems to be suboptimal as far as the dead neurons are concerned. Swapping the activations to Swish results in a smaller fraction of the network being silenced.
In practice, this means we don’t have to guess: by logging dead neuron ratios across different activations during pilot runs, we can quantitatively compare how much “neuron death” each option induces, and then choose the activation that works best.
Reviving inactive neurons
So far, we have discussed how to detect dying neurons and prevent the phenomenon. Let’s now take a look at how to bring neurons back to life once they are dead.
An interesting approach is the so-called synaptic stripping, a method introduced by Colorado State University researchers Whitaker and Whitley in their 2023 paper “Synaptic Stripping: How Pruning Can Bring Dead Neurons Back To Life”.
As we have seen before, dead neurons arise once their weights shift into a state where no reasonable input produces a non-zero output. Since the gradient is also zero in this regime, those neurons can’t recover through normal backpropagation, effectively reducing the model’s capacity.
The Synaptic Stripping method introduces a clever solution inspired by biology. In neuroscience, synaptic stripping describes a process where immune cells scan the brain, detect dysfunctional synapses, and remove them so that neurons can recover and reconnect. The paper’s authors propose a similar mechanism for deep learning. Here’s the key idea:
- Step 1: Detect dead neurons. After each training epoch, look at the activation outputs on a validation set. If a neuron produces a total activation of zero across the dataset, it’s considered dead.
- Step 2: Prune negative weights. For each dead neuron, remove (zero-out) a fraction of its most negative incoming weights. This shifts the neuron’s weight distribution toward positive values.
- Step 3: Resume training. With the problematic synapses stripped away, previously dead neurons regain the ability to fire and re-enter the optimization process. Training continues, with the cycle repeated after each epoch.
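Below is a simplified sketch of steps 1 and 2 for a single linear layer. This is my reading of the procedure rather than the authors’ reference implementation; `dead_mask` and `strip_fraction` are names introduced here for illustration.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def synaptic_strip(linear: nn.Linear, dead_mask: torch.Tensor, strip_fraction: float = 0.1):
    """Zero out a fraction of the most negative incoming weights of dead neurons.

    dead_mask: boolean tensor of shape (out_features,), True where the neuron
    produced zero total activation on the validation set.
    """
    for idx in torch.nonzero(dead_mask).flatten():
        weights = linear.weight[idx]                 # incoming weights of this neuron
        n_strip = int(strip_fraction * weights.numel())
        if n_strip == 0:
            continue
        # Indices of the most negative incoming weights.
        _, neg_idx = torch.topk(weights, n_strip, largest=False)
        # Only strip weights that are actually negative.
        neg_idx = neg_idx[weights[neg_idx] < 0]
        linear.weight[idx, neg_idx] = 0.0

# Example usage after an epoch, assuming `val_acts` holds the layer's
# post-activation outputs on the validation set, shaped (num_tokens, out_features):
# dead_mask = val_acts.abs().sum(dim=0) == 0
# synaptic_strip(mlp_layer, dead_mask)
```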

As the authors observe, paradoxically, removing parameters in this way can increase effective model capacity. Dead neurons are not contributing to the computation anyway, so pruning the connections that keep them locked in silence gives them a chance to become useful again.
In experiments on vision transformers and MLPs, Synaptic Stripping increased effective model capacity by up to 30%, improved generalization, and reduced model size. An important benefit of this approach is that it is easy to implement, and it can be slotted into any existing training loop.
What does this mean for foundation model training?
In a series of small-scale experiments, we explored the phenomenon of dead neurons in foundation models: what they are, why they matter, and how to both detect and mitigate them. We discussed how dead neurons not only waste computation and memory but also silently reduce effective model capacity.
Through simple visualization techniques, such as activation heatmaps, histograms, and dead neuron ratios, we can make the problem visible. From there, we compared activation functions to see which ones are more prone to killing neurons, and we examined Synaptic Stripping as a practical way to revive neurons that would otherwise stay permanently inactive.
An important takeaway from our discussion is that neuron health should be part of the standard toolkit when building and evaluating foundation models. Here are some concrete steps to integrate this into your workflow:
- Run regular neuron activity audits during training. Just like you track loss curves or learning rates, log dead neuron ratios per layer. This gives early visibility into whether parts of the model are shutting down.
- Set up automated alerts. For example, trigger a warning if more than some percentage of neurons in any layer are dead. This allows you to intervene, for instance, by adjusting activations or applying techniques like Synaptic Stripping.
- Benchmark neuron health across experiments. When testing new model variants, track dead neuron ratios alongside accuracy metrics. This makes “neuron liveness” a first-class metric for comparing design choices, not just an afterthought.
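As a starting point, such an audit-and-alert step can be a few lines wrapped around the training loop. In the sketch below, `compute_dead_ratios` is a hypothetical stand-in for whatever per-layer measurement you use (for example, the hook-based metrics from earlier), and the resulting values can be logged to Neptune or any other experiment tracker.

```python
import logging

DEAD_RATIO_ALERT = 0.05  # example alert level: 5% dead neurons in any layer

def audit_neuron_health(model, val_batch, step):
    # compute_dead_ratios is a placeholder for your own measurement, e.g. the
    # hook-based activation-frequency metrics shown earlier in this article.
    ratios = compute_dead_ratios(model, val_batch)  # {layer_name: dead_ratio}
    for layer_name, ratio in ratios.items():
        # Log ratio to your experiment tracker here, e.g. as a
        # "neuron_health/<layer_name>/dead_ratio" series.
        if ratio > DEAD_RATIO_ALERT:
            logging.warning(
                "step %d: %.1f%% dead neurons in %s - consider adjusting the "
                "activation function or applying synaptic stripping",
                step, 100 * ratio, layer_name,
            )
    return ratios
```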
Foundation models are expensive to train and serve. Making neuron health measurable and actionable is a way to get more out of every GPU-hour while also improving model robustness and generalization.