Neptune Blog

How to Optimize LLM Inference

12 min
29th December, 2025

TL;DR

The memory required to run a model with hundreds of billions of parameters far exceeds the capacity of even the largest available GPUs.

Maximizing GPU utilization throughout the inference process is key to efficient LLM operation.

The attention mechanism is the main focus of optimization efforts, as it scales the least favorably. While key-value caching reduces computational load, multi-query and grouped-query attention reduce both the number of parameters and the cache size.

By employing effective workload-parallelization strategies, we can efficiently run LLMs that are larger than a single GPU can handle.

Large Language Model (LLM) inference at scale is challenging, as it involves transferring massive amounts of model parameters and data and performing computations on large tensors. Coupled with the low-latency needs of many applications, this forces us to push the hardware to its limits, both in memory bandwidth (measured in bytes per second) and in compute capability (measured in FLOP/s, floating-point operations per second).

Have you ever wondered how LLM providers like OpenAI, Hugging Face, and Anthropic get an answer back to you this quickly, given that they are processing millions of requests concurrently? In this article, we’ll explore the characteristics of LLM inference as a computational workload and discuss approaches such as key-value caching, quantization, and various types of parallelization.

Understanding the LLM workload at inference

Generally, all LLMs follow the same schema: embedding the input tokens, processing the embeddings in N structurally identical transformer blocks, and finally projecting the output back to the vocabulary dimension and sampling from the resulting probability distribution.

In the following, we’ll use the Llama model family architecture as a specific example to understand the LLM workload at inference.

llama model architecture
Llama model architecture. The input tokens are converted into embedding vectors and run through N transformer blocks. In the end, the intermediate output is normalized and transformed again to match the vocabulary size. All N Llama transformer blocks are functionally the same, but have different weights. The blocks feature Rotary Positional Encodings and Grouped Multi-Query Attention. Key-value caching is used to optimize the attention mechanism. | Source

The following table shows the number of floating-point operations (FLOPs) required for computing the output of a Llama transformer block. s is the sequence length, b the batch size, and d_model the model’s hidden dimension. The feed-forward layer has an inner dimension d_FFN.

Operation | FLOPs
Q, K, V projections | 3 · b · s · d_model · d_model
Feed forward | 3 · b · s · d_model · d_FFN
Attention | 2 · b · s² · d_model

We see that the FLOPs of the Q, K, and V projections and of the feed-forward layers grow linearly with the sequence length s and dominate for short sequences (s < d_model, s < d_FFN). The attention block’s FLOPs are dominated by its two matrix multiplications (the softmax FLOPs are negligible and not shown), which scale quadratically with s, so attention dominates the computation for long sequences.
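
To make the formulas concrete, here is a quick back-of-the-envelope calculation using Llama 3 8B-like dimensions (d_model = 4096, d_FFN = 14336, batch size 1); the helper function below is just a sketch of the table above:

def per_block_flops(s, b=1, d_model=4096, d_ffn=14336):
    qkv = 3 * b * s * d_model * d_model      # Q, K, V projections
    ffn = 3 * b * s * d_model * d_ffn        # feed-forward layers
    attn = 2 * b * s**2 * d_model            # attention matrix multiplications
    return qkv, ffn, attn

for s in (1_000, 10_000, 100_000):
    qkv, ffn, attn = per_block_flops(s)
    print(f"s={s:>7,}: linear layers {(qkv + ffn) / 1e12:.1f} TFLOPs, "
          f"attention {attn / 1e12:.1f} TFLOPs")
# For these dimensions, attention overtakes the linear layers at roughly s ≈ 27,600.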

During autoregressive generation, to obtain the next token, we need to process the entire sequence. Thus, the Q, K, and V projections and the feed-forward layers scale as O(s²), whereas the attention scales as O(s³). The attention computation dominates the overall scaling and becomes intractable even for modest sequence lengths. Hence, it is the focus of optimization efforts.

The memory required to store the model weights depends on the precision at which they’re stored. Common floating point precisions are FP8 (8 bits), FP16 (16 bits), and FP32 (32 bits). Therefore, we need approximately 16 GB of memory to store the eight billion parameters of a Llama 3.1 8B model in FP16 precision. The 400-billion-parameter Llama 4 Maverick model requires 800 GB at the same precision, exceeding the capacity of the largest available GPUs by a wide margin. Hence, managing and potentially reducing memory demands is another important area of LLM inference optimization.
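
The arithmetic behind these numbers is simply the parameter count times the bytes per parameter (a rough estimate that ignores activations and the KV cache):

# Rough weight-memory estimate: parameters x bytes per parameter.
bytes_per_param = {"FP32": 4, "FP16": 2, "FP8": 1}

for name, params in [("Llama 3.1 8B", 8e9), ("Llama 4 Maverick", 400e9)]:
    for dtype, nbytes in bytes_per_param.items():
        print(f"{name} in {dtype}: {params * nbytes / 1e9:,.0f} GB")
# e.g., Llama 3.1 8B in FP16: 16 GB; Llama 4 Maverick in FP16: 800 GB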

These back-of-the-envelope numbers will suffice for our exploration of LLM inference optimization. For a far more detailed analysis of the LLM workload at inference, see the chapter All About Transformer Inference in the book How to Scale Your Model, published by Google DeepMind.

A quick primer on hardware for LLM inference

A typical LLM inference cluster consists of several nodes, each with a multi-core CPU and multiple accelerator devices, commonly GPUs. The GPUs perform the actual tensor computations, while the CPU handles data transfer and inter-node communication.

Each GPU executes instructions independently but can synchronize and communicate with others through collective operations such as AllReduce, Gather, or Scatter. The GPUs are connected with high-speed interconnects, enabling them to communicate directly without going through the CPU. The bandwidth varies across hardware generations; for example, fifth-generation NVLink connects Nvidia GPUs at up to 1.8 TB/s.

The primary building blocks of a GPU are streaming multiprocessors (SMs) that handle parallel computation. Each SM is designed to execute many threads concurrently. On Nvidia’s H100, which we’ll use as our reference, there are up to 132 SMs (the precise number depends on the board’s form factor).

Each SM comprises:

  • CUDA cores: Execute standard floating-point and integer arithmetic operations. An H100 SM contains 128 FP32 CUDA cores.
  • Tensor Cores: Specialized cores for matrix-multiply-and-accumulate operations; they handle the vast majority of the FLOPs in LLM inference. On the H100, there are four Tensor Cores per SM.
  • Warp schedulers: Manage groups of 32 threads called “warps” and issue instructions to the CUDA cores and Tensor Cores. The warp schedulers operate in a SIMT (Single Instruction, Multiple Threads) manner: within a warp, all threads execute the same instruction in a given cycle.
  • L1 Cache: Low-latency memory local to each SM. On the H100, the L1 cache per SM is roughly 256 KB.

All SMs share:

  • L2 Cache: Larger and slower than the L1 cache, but significantly faster than HBM, and shared between all SMs. The H100’s L2 cache is between 50 MB and 60 MB, with about 5.5 TB/s of full-duplex bandwidth (i.e., this bandwidth can be reached simultaneously in both directions).
  • High-Bandwidth Memory (HBM): Off-chip memory shared across all SMs. H100s have 80 GB of HBM with a memory bandwidth of 3.35 TB/s.

The HBM is connected to the CPU’s main memory, which can be substantially larger, but the communication bandwidth is about an order of magnitude smaller.

Again, for a more detailed analysis, see the chapter How to Think About GPUs in Google DeepMind’s How to Scale Your Model book.

simple gpu server
A diagram of a simple GPU server with two GPUs communicating through a high-speed interconnect, each with its own HBM. They are connected to a CPU through a bus.
gpus sram pyramid
The pyramid shows how much faster the GPU’s SRAM is compared to HBM or even DRAM on the CPU. Because the SRAM is small and fast, while HBM is big but relatively slow, we want to limit the amount of memory access to HBM. | Source

The main challenge when working with accelerators is maintaining their utilization. Low utilization often arises from data transfer overheads between CPU and GPU, limited GPU memory capacity restricting model size, and mismatched workloads that do not fully leverage the GPU’s parallel processing capabilities. Addressing these issues requires workload balancing, optimized memory management, and efficient communication pipelines.

Optimizing the attention mechanism

Since the attention mechanism scales quadratically with the sequence length s, it dominates the computation for long sequences. During autoregressive generation, we need to recompute the attention over all previous tokens in every iteration, leading to O(s³) scaling overall.

attention computation
Attention computation for an input with nine tokens. The query matrix Q is multiplied by the transposed key matrix Kᵀ, producing a large QKᵀ matrix of dimensions (s_query, s_key). We take the softmax of this matrix and multiply it by the value matrix V. The output is the attention scores tensor. | Source

Key-value caching

Let’s look at the attention computation in more detail: for every new token, the Q, K, and V matrices gain a new row, and the QKᵀ matrix gains an additional row and column. The important part: all other rows and columns stay the same because their queries and keys haven’t changed.

To generate new tokens, we only need to compute the attention of the latest query to all previous tokens, whose information is encoded in the K and V matrices. Only the last rows (tensors) in the K and V matrices are new, while all others have already been computed in previous iterations. Thus, we can cache these tensors at runtime, an optimization known as key-value caching (KV caching).

generating the 11th token
Generating the 11th token. The purple rectangles show new information compared to the previous iteration. The grayed-out upper triangular part of the QKᵀ matrix is masked out in causal attention because all queries attend only to the previous tokens, not the future ones. Softmax is performed row-wise. | Source (modified)

Furthermore, all data from previously generated tokens—except for the K and V matrices—is redundant. In every iteration, we only need to consider the latest token and compute its attention over all previous tokens.

self-attention
Self-attention using KV caching during the generation of the fourth token. Three tokens have already been processed, and their K and V entries can be reused (grayed-out tensors). Only the latest query is needed. | Source (modified)

If we load K and V from a cache, we can pass just the latest token into the model: only the latest query is needed to compute its attention over all cached keys and values. This improves the scaling of autoregressive generation to O(s²).
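
As a minimal illustration, a single-head decode step with a KV cache might look like this (shapes and names are illustrative, not taken from an actual serving engine):

import torch

def decode_step(x_new, w_q, w_k, w_v, k_cache, v_cache):
    """One KV-cached decode step for a single head. x_new: (1, d_model)."""
    q = x_new @ w_q                                       # only the newest token's query
    k_cache = torch.cat([k_cache, x_new @ w_k], dim=0)    # append the new key row
    v_cache = torch.cat([v_cache, x_new @ w_v], dim=0)    # append the new value row

    scores = (q @ k_cache.T) / k_cache.shape[-1] ** 0.5   # shape (1, s): O(s), not O(s^2)
    attn_out = torch.softmax(scores, dim=-1) @ v_cache    # shape (1, d_head)
    return attn_out, k_cache, v_cache

Each call appends one K row and one V row to the cache and only ever touches a single row of the score matrix.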

However, this does not come for free: KV caching increases memory usage linearly with the sequence length s, as we now store, rather than recompute, the K and V entries for all previous tokens.
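
For a concrete sense of scale, the cache grows as 2 (for K and V) × layers × KV heads × head dimension × sequence length × batch size × bytes per value. A rough estimate with Llama 3.1 8B-like figures (32 layers, 8 KV heads, head dimension 128, FP16; these numbers are assumptions for illustration):

layers, kv_heads, head_dim, bytes_fp16 = 32, 8, 128, 2    # assumed Llama 3.1 8B-like config

def kv_cache_gb(batch, seq_len):
    # 2 accounts for storing both K and V
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_fp16 / 1e9

print(kv_cache_gb(batch=1, seq_len=8_192))    # ~1.1 GB for one 8K-token sequence
print(kv_cache_gb(batch=64, seq_len=8_192))   # ~69 GB, close to an H100's 80 GB of HBM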

When using KV caching, we can distinguish two phases of LLMs’ operation:

  • Prefill phase: The model processes the initial input tokens (e.g., a user’s prompt). It computes the K and V matrices for all tokens in the input sequence simultaneously. During this phase, all input tokens are processed, and the KV cache is populated.

    In the prefill phase, we are usually compute-bound because we can compute the attention for all input tokens together in a single forward pass, leading to big matrix multiplications for which modern accelerators are optimized.
  • Decode phase: After the prefill phase, the model generates tokens one at a time autoregressively. At each decoding step, a single token comes in, and a single token is predicted. For all the previous tokens, we reuse the cached keys and values.

    Now, the query is an embedding of only a single token at a time, leading to a much lower computational intensity. Instead, we spend more time moving data around, e.g., loading K and V from the cache and moving the weights and activations from high-bandwidth memory (HBM) to GPU SRAM (the memory closest to the compute units). Thus, we are memory-bound.

For the overall application runtime, it is generally better to be compute-bound than memory-bound. Not fully utilizing the compute capacity wastes power, since idle cores still draw power. Also, if we are compute-bound, we can scale the number of devices to speed up.
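
One way to see why decode ends up memory-bound is to compare the workload’s arithmetic intensity (FLOPs per byte moved) with the hardware’s compute-to-bandwidth ratio. A back-of-the-envelope sketch, using approximate H100 SXM figures as assumptions:

# Approximate H100 SXM numbers (assumptions for illustration).
peak_bf16_flops = 990e12        # ~990 dense BF16 TFLOP/s
hbm_bandwidth = 3.35e12         # 3.35 TB/s
hardware_intensity = peak_bf16_flops / hbm_bandwidth   # ~295 FLOPs per byte

# Decode step of an 8B-parameter model at batch size 1: every token touches all weights once.
params = 8e9
flops_per_token = 2 * params    # ~2 FLOPs (multiply + add) per parameter
bytes_per_token = 2 * params    # FP16 weights streamed from HBM
workload_intensity = flops_per_token / bytes_per_token   # ~1 FLOP per byte

print(workload_intensity < hardware_intensity)   # True: decode is memory-bound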

Efficient attention mechanisms

With KV caching, we’ve shifted from compute-bound to memory-bound: it cuts the FLOPs per step, but the attention computation now spends most of its time moving and storing K/V states. The next wins come from reducing how much we keep in memory and how often we touch it, compared to vanilla Multi-Head Attention (MHA):

  • Multi-query attention (MQA) and grouped-query attention (GQA) reduce both the number of parameters and the size of the KV cache. MQA shares a single K/V head across all query heads, minimizing parameters and cache size (lowest memory consumption, with a possible quality hit). GQA shares K/V within groups of query heads, landing between MHA and MQA (a better quality/memory balance). A minimal sketch follows the figure below.
  • FlashAttention is an optimization for faster and leaner memory access. It reorganizes the attention computation into tiled blocks that live in on-chip memory, slashing reads and writes to HBM. It does the same math but causes far less memory traffic. FlashAttention is orthogonal to MQA/GQA and can be combined with either to further reduce memory access overhead.
visualization mha, gqa, mqa
Visualization of MHA, GQA, and MQA (left to right). In MHA, every head computes its own KV pair. In MQA, all heads share a single KV pair, and GQA sits in between: groups of attention heads share the same KV. | Source
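
The memory effect of GQA can be seen in a minimal sketch (the shapes, the use of repeat_interleave, and scaled_dot_product_attention are illustrative choices, not Llama’s actual implementation):

import torch
import torch.nn.functional as F

batch, seq, n_heads, n_kv_heads, head_dim = 1, 16, 32, 8, 128
q = torch.randn(batch, seq, n_heads, head_dim)
k = torch.randn(batch, seq, n_kv_heads, head_dim)   # 4x fewer KV heads than query heads,
v = torch.randn(batch, seq, n_kv_heads, head_dim)   # so the KV cache shrinks by 4x as well

# Each group of n_heads // n_kv_heads query heads shares one K/V head.
k = k.repeat_interleave(n_heads // n_kv_heads, dim=2)
v = v.repeat_interleave(n_heads // n_kv_heads, dim=2)

out = F.scaled_dot_product_attention(
    q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2), is_causal=True
)   # (batch, n_heads, seq, head_dim)
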
flash attention algorithm
The flash attention algorithm. The core problem of standard attention is the many accesses to the slow HBM memory. The pyramid on the left shows how much faster the GPU’s SRAM is compared to HBM or even DRAM on the CPU. Because the SRAM is small and fast, while HBM is big but relatively slow, we want to limit the number of accesses to HBM. The core of the flash attention algorithm is using tiling to fuse multiple operations and thereby reduce the slow HBM accesses. This is enabled by an online (tile-based) softmax algorithm. Tiles of the K and V matrices are loaded into SRAM in the outer loop (red arrows). They are reused for all rows of Q, which stream in the inner loop (blue arrows) to compute the softmax without materializing the full attention matrix in HBM. The plot on the right shows the runtime speedup of flash attention over regular attention. | Source

With FlashAttention, the large QKᵀ attention matrix never has to be fully materialized in HBM, leading to a substantial memory reduction.
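
The core idea can be sketched in plain PyTorch (a didactic, single-head reference of the tiling plus online-softmax trick; the function and variable names are ours, and a real kernel would keep the tiles in SRAM and fuse these loops into a single GPU kernel):

import torch

def flash_attention_reference(q, k, v, tile_size=128):
    # q, k, v: (seq_len, head_dim); simplified version without causal masking.
    s, d = q.shape
    scale = d ** -0.5
    out = torch.zeros_like(q)
    row_max = torch.full((s, 1), float("-inf"))
    row_sum = torch.zeros(s, 1)
    for start in range(0, s, tile_size):                 # outer loop over K/V tiles
        k_tile = k[start:start + tile_size]
        v_tile = v[start:start + tile_size]
        scores = (q @ k_tile.T) * scale                  # only an (s, tile_size) block
        tile_max = scores.max(dim=-1, keepdim=True).values
        new_max = torch.maximum(row_max, tile_max)
        correction = torch.exp(row_max - new_max)        # rescale what was accumulated so far
        p = torch.exp(scores - new_max)
        row_sum = row_sum * correction + p.sum(dim=-1, keepdim=True)
        out = out * correction + p @ v_tile
        row_max = new_max
    return out / row_sum                                 # equals softmax(QK^T * scale) @ V

The result matches standard softmax attention up to floating-point precision, but only one (s × tile_size) block of scores ever exists at a time.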

memory reduction graph
The memory reduction of the FlashAttention kernel compared to PyTorch’s standard attention (at the time of publication) for increasing sequence lengths. FlashAttention benefits both prefill and decode with long sequence lengths. During decode with KV caching, when we only compute the attention of one token, its benefits are less pronounced, but it still helps for sequence lengths that spill out of SRAM and for large batches. | Source

Parallelizing the inference workload

The LLM inference workload can be parallelized in many orthogonal ways across devices. They can be used together or individually, depending on the scenario and the infrastructure. 

The simplest kind of parallelism is data parallelism. We create multiple model replicas on different devices and feed different inputs to them. This approach is ideal for processing large datasets with smaller models that fit onto a single device. For example, in a chatbot application, different users’ chats can be sent to different model replicas.

The other two common parallelism techniques used in LLM training and at inference are tensor and pipeline parallelism, because they allow us to scale up large models that wouldn’t fit on a single GPU across many devices.

Combining X parallelism techniques at once is commonly dubbed “XD parallelism”; for example, using tensor, pipeline, and data parallelism together is called 3D parallelism.

Tensor parallelism

Tensor parallelism (TP, also known as model parallelism or horizontal parallelism) was introduced in the 2019 Megatron-LM paper to alleviate the memory bottlenecks of the large linear layers in transformer blocks.

The linear layers’ weights are split (“sharded”) across devices so that each device performs a subset of the computations. Tensor parallelism also reduces the memory capacity and bandwidth required per device, because each device only needs to load its slice of the weights.

parallelization of matrix multiplication
Row- and column-wise parallelization of matrix multiplication. In column parallelism, the full input X is multiplied by a subset of the columns of the second operand, each producing a subset of complete output columns. In row parallelism, a subset of the columns of X is multiplied with a subset of the rows of Y, each producing the partial results for all output channels, which must be added together for the full result. | Source

Generally, a linear layer (i.e., a matrix multiplication) can be parallelized column-wise or row-wise:

  • In column parallelism, the weights are split column-wise, and the input is copied to all devices. Performing the computation on the tiles produces output columns that must be concatenated together.
  • In row parallelism, the weights are split row-wise, and the input must be split column-wise. After the tiled matrix multiplications are finished, the output matrices must be summed up (“reduced”).

In LLMs, both row- and column-wise parallelisms are used together. For example, the feed-forward blocks in Llama 3 consist of three linear layers, w1, w2, w3, and an activation function (SiLU):

def forward(self, x):
    return self.w2(F.silu(self.w1(x)) * self.w3(x))

Matrices w1 and w3 project the input x into a higher intermediate dimension, and w2 projects the intermediate tensor back to the original dimension.

For example, a Llama3-8B model has a model dimension of 4096 and an intermediate dimension of 14336. To parallelize this computation, we can shard w1 and w3 column-wise, so that each device produces a subset of the intermediate channels and performs the SiLU activation and the elementwise multiplication on its shard. The w2 matrix is then sharded row-wise, so that each device down-projects its subset of the intermediate channels and produces a partial result for all output channels. In the end, the partial results from all devices are summed up.
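
To convince ourselves that the sharded computation reproduces the unsharded one, here is a small single-device simulation of 2-way tensor parallelism (scaled-down dimensions; in a real setup each shard would live on its own GPU and the final sum would be an all-reduce):

import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_model, d_ffn, tp = 512, 2048, 2                   # scaled-down dimensions, 2-way TP

w1 = torch.randn(d_ffn, d_model) / d_model**0.5     # up projection (column-parallel)
w3 = torch.randn(d_ffn, d_model) / d_model**0.5     # gate projection (column-parallel)
w2 = torch.randn(d_model, d_ffn) / d_ffn**0.5       # down projection (row-parallel)
x = torch.randn(1, d_model)

# Unsharded reference, matching the Llama forward() above.
ref = (F.silu(x @ w1.T) * (x @ w3.T)) @ w2.T

# Simulated sharding: w1/w3 are split along their output dim, w2 along its input dim.
partials = []
for rank in range(tp):
    rows = slice(rank * d_ffn // tp, (rank + 1) * d_ffn // tp)
    h = F.silu(x @ w1[rows].T) * (x @ w3[rows].T)   # this device's intermediate channels
    partials.append(h @ w2[:, rows].T)              # partial result for all output channels
out = sum(partials)                                 # the all-reduce across devices

print(torch.allclose(out, ref, atol=1e-4))          # True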

tensor parallelism example
tensor parallelism
Two examples of tensor parallelism. The upper figure shows the parallelization of the feed-forward block, and the lower one of the attention heads. f is an identity operation, and g is an all-reduce operation. The input X is distributed to each device, which, in the first step, calculates a subset of the output channels (Y1 and Y2). In the second step, these are used to compute partial results for all channels that are then combined by g. | Source

The degree of parallelism, which is the number of devices to parallelize over, has to be tuned to achieve maximum device utilization. TP=1 means no parallelism, and TP=4 (also called “4-way parallelism”) means that the matrices are split into four shards.

The decisive factor in optimizing the degree of tensor parallelism is the communication overhead between devices. The shards must first be distributed (“scattered”) across devices, and the results “gathered” or “reduced” at the end.

The guiding principle is keeping devices busy with computations: Scale TP until compute time dominates transfer time for the given batch size, memory capacity, and link bandwidth.

Pipeline parallelism

In pipeline parallelism (PP, also known as vertical parallelism), different layers are assigned to different devices. The intermediate activations flow from one device to another.

Like tensor parallelism, PP can be used to alleviate memory capacity issues. For example, Llama 3 405B (roughly 810 GB of parameters in 16-bit precision) can be split across 64 Nvidia T4 GPUs with just 16 GB of VRAM each, totaling 1 TB.

The main challenge of PP is scheduling the workload such that idle periods (called “bubbles”), where a device waits for the output of another device, are minimized. Such regions can be discovered by profiling the workload.

pipeline bubbles in model training
Example of pipeline bubbles in a 4-stage pipeline parallelism in model training. The model is split layerwise over 4 devices, represented by the colors (gray, yellow, blue, red). The squares that are in the same vertical line are computed at the same time, e.g., F1,0 and F0,1. F denotes the forward pass, and B the backpropagation (in training). In the top sketch, the pipelines are computed completely sequentially, leading to empty regions, called pipeline bubbles. We can reduce the size of the bubbles by splitting the input mini-batch into several micro-batches (four in this diagram). Different micro-batches are computed in parallel over the devices. While the example shown is for training, the concept applies all the same for inference. | Source

To reduce the idle time, the communication between devices has to be optimally overlapped with the independent computations that can run in parallel.

Other parallelisms

Beyond tensor and pipeline parallelism, two other types of parallelism are commonly utilized for LLM inference:

  • In “sequence parallelism,” long input sequences that require more memory than a single device provides are split across devices, so that each computes the attention scores for only a subset of the total input tokens. While this enables inference on longer sequences than a single device could handle and keeps most computations local, it requires substantial synchronization effort.
  • “Expert parallelism”, specific to the mixture-of-experts (MoE) architecture, distributes the “experts” across devices. At runtime, the model dynamically routes the inputs to the appropriate experts. For example, the DeepSeek-V3 model, with 256 routed experts per layer, uses 64-way expert parallelism across 8 nodes, so each of the 64 devices holds a subset of the experts.

Quantization

Another way of reducing the memory and compute bottlenecks is by using fewer bits for the weights and activations. This is called quantization. The lower the bitwidth, the more memory we save. However, this comes at the risk of degrading the model’s output accuracy.

The numeric data types used in neural networks are integers (INT), floating-point (FP) formats, and, less commonly, logarithmic data types.

IEEE FP16 and BF16 are two prominent 16-bit floating-point formats. BF16 (“brain float”) was developed by Google Brain (now part of Google DeepMind); it retains the full dynamic range of FP32 but has fewer mantissa bits than FP16, so it represents values less precisely.

The bit-width of the data type directly determines memory usage: an IEEE 754 FP32 value takes up 4 bytes, so switching to an FP16 data type immediately halves the memory needed. Furthermore, if we are memory-bound (e.g., in the decode phase), quantization frees up memory bandwidth, which translates directly into runtime improvements.

Beyond the memory savings, quantized data formats can also speed up the computation if the hardware supports it.

For example, matrix multiplication is a common bottleneck in LLMs. At its core, matrix multiplication is a series of multiplications and accumulations, which the hardware computes using multipliers and accumulators of a certain bit-width, e.g., 32 bits. Memory transfers and the compute capabilities of the hardware are optimized for this bit-width.

However, since 2017, when Nvidia introduced the Volta architecture, hardware vendors have added native support for the lower-bit-width matrix multiplication workloads common in ML models. AMD calls these units “Matrix Cores” and Nvidia “Tensor Cores”. The table below compares the theoretical TFLOPS of AMD’s MI300X and Nvidia’s H200 NVL (PCIe version) using these specialized cores. You can see that halving the bit-width doubles the available FLOPS.

Data type | AMD MI300X (TFLOPS) | NVIDIA H200 NVL (TFLOPS)
TF32 | 653 | 835
FP16 | 1,307 | 1,671
FP8 | 2,614 | 3,341

Quantization techniques

Model quantization can significantly improve efficiency, but it often comes with a tradeoff in output quality, as reducing bit-width means reducing the amount of information that can be represented. When applying quantization, it is essential to test its effects on realistic data to assess whether the increase in computational efficiency merits the drop in task performance.

Quantization techniques are distinguished by:

  • When quantization happens: during training (Quantization-Aware Training, QAT) or after training (Post-Training Quantization, PTQ).
  • How scaling and outliers are handled to avoid range clipping and reduce quantization errors.
  • How quantization parameters are determined: statically (offline, fixed) or dynamically (online, at runtime).

Quantization-Aware Training (QAT) simulates quantization during training, while the parameters are still being updated, so the model learns to compensate for the quantization error. (Training directly in a lower precision such as BF16 is related, but is usually considered mixed-precision training rather than QAT.) In Post-Training Quantization (PTQ), the model is already trained, and the process relies on a calibration dataset to quantize it, e.g., to set parameters such as scaling factors, per-layer bit-widths, and group sizes.

Scaling plays a critical role in avoiding range-clipping errors. For instance, the maximum representable value in FP16 is roughly 65,000, while a commonly used FP8 format tops out around 448. Converting directly from FP16 to FP8 would clamp anything above that limit, introducing large errors. Scaling the values before quantizing, performing the computation in FP8, and then rescaling afterwards preserves more of the model’s dynamic range.

The following example (adapted from this Gist by Nikita Shulga) shows how two FP16 tensors can be scaled and quantized before an FP8 matrix multiplication:

import torch

# a, b are FP16 tensors; FP8 (e4m3) tops out around 448
max_fp8 = torch.finfo(torch.float8_e4m3fn).max
scale_a = max_fp8 / a.abs().max().float()
scale_b = max_fp8 / b.abs().max().float()

fp8_a = (a * scale_a).clamp(-max_fp8, max_fp8).to(torch.float8_e4m3fn)
fp8_b = (b * scale_b).clamp(-max_fp8, max_fp8).to(torch.float8_e4m3fn)

# Multiply in FP8, then undo the scaling on the output. Note that the exact signature of
# the private _scaled_mm API (layout requirements, tuple vs. tensor return value)
# differs between PyTorch versions.
y = torch._scaled_mm(fp8_a, fp8_b, scale_a=1 / scale_a, scale_b=1 / scale_b,
                     out_dtype=torch.float16)

The timing of when quantization parameters are determined matters as well. In static quantization, parameters are computed offline using a calibration dataset. This has no runtime overhead, but the quality can degrade if the actual runtime data differs from what was seen during calibration. For example, larger runtime values can cause clipping if the scaling is insufficient. In dynamic quantization, parameters are computed at runtime, allowing the system to adapt to changing data distributions at the cost of extra computation. Using the earlier example, dynamic quantization would mean recalculating the scaling factors every time the tensors are quantized.

Making (activation) quantization work

Until now, we haven’t differentiated between weights and activations when discussing quantization.

It turns out that quantizing weights is much simpler than quantizing the activations. Weights are static, so we can quantize them offline. Furthermore, due to the use of regularization that penalizes large weights during training, weights typically have distributions with small amplitudes.

In contrast, LLM activation tensors have outliers: channels with much higher absolute values than the rest, which are difficult to quantize because they dominate the scaling factor. Quantization scales the values in a tensor relative to its maximum absolute value; if that maximum is far larger than the other values, the remaining values get squeezed into a tiny part of the representable range and lose precision (many are rounded to zero).

outliers in the channel and token dimension
Outliers in the channel and token dimension of an LLM layer. The figure shows the outlier values for some channels in a linear layer. The outliers have much higher absolute values than the rest, making them hard to quantize. Here, these are channels ~500, 2000, and 5000. The insight here is that channel-wise outliers occur for all tokens of that channel. | Source
percentage of layers or tokens
The percentage of layers or tokens with outliers as a function of the number of parameters. The figure shows that the bigger the model, the more such outliers there are. | Source

Outliers in activations can be handled by leveraging the observation that outliers aren’t random but occur in the same channel for all input tokens. We can split the channels into “outlier” and normal channels and use different scaling factors to quantize them. We can even split the layer and calculate the outliers in full precision, and only quantize the rest.
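
A minimal sketch of this mixed-precision idea (in the spirit of LLM.int8()-style decomposition; the threshold, shapes, and injected outliers are illustrative assumptions):

import torch

torch.manual_seed(0)
x = 0.1 * torch.randn(16, 4096)               # activations: 16 tokens, 4096 channels
x[:, [500, 2000, 4000]] *= 50                 # inject a few outlier channels

threshold = 6.0
outlier_cols = x.abs().max(dim=0).values > threshold

# Outlier channels stay in FP16; the rest are quantized to INT8 with per-channel scales.
x_outliers = x[:, outlier_cols].to(torch.float16)
x_rest = x[:, ~outlier_cols]
scales = x_rest.abs().max(dim=0).values.clamp(min=1e-8) / 127.0
x_int8 = torch.round(x_rest / scales).to(torch.int8)

# Dequantizing shows that the error stays small for the non-outlier channels.
max_err = (x_int8.float() * scales - x_rest).abs().max()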

Conclusion

In this article, we have explored ways of optimizing LLM inference. KV caching is used to avoid recomputing K and V matrices, while advanced attention mechanisms, like Flash Attention, accelerate the attention process. To alleviate memory bottlenecks, we can quantize the model’s parameters or parallelize it across devices in different ways. If our hardware supports calculation in lower bit widths, e.g., FP8 matrix multiplication, we get an additional speed-up. On top of all that, continuous batching and speculative decoding enable efficient deployment.

By combining these approaches, you can unlock faster and more resource-efficient LLM inference in your application, serving more users better.
