Do you struggle with monitoring and optimizing the training of Deep Neural Networks on multiple GPUs? If yes, you’re in the right place.
In this article, we will discuss multi-GPU training with PyTorch Lightning and cover the best practices for optimizing the training process. We will also see how to monitor the usage of all the GPUs during training.
Let’s start by covering some basics.
What is distributed training with GPUs?
For complex tasks such as those in Computer Vision or Natural Language Processing, training a Deep Neural Network (DNN) involves optimizing millions or even billions of parameters with gradient descent. This makes training a computationally expensive process that can take days or even weeks to complete.
For instance, Generative Pre-trained Transformer 3 (GPT-3) is an autoregressive language model with 175 billion parameters, and it would take approximately 355 years to train it on a single NVIDIA Tesla V100 GPU. If the same model is trained in parallel on 1024 NVIDIA A100 GPUs, the estimated training time drops to around 34 days. As a result, parallel training on multiple GPUs has become a widely used way of speeding up the process.
To get a clear understanding of how we are able to use multiple GPUs for training Deep Neural Networks, let us briefly recap how a neural network is trained (a minimal training loop illustrating these steps is sketched after the list):
- Deep Neural Network models are typically trained using Mini Batch Gradient Descent in which training data is randomly sampled into mini-batches.
- Mini-batches are fed into the model and traverse it in two phases:
- forward pass
- backward pass
- The forward pass generates predictions and calculates the loss between the prediction and the ground truth.
- The backward pass passes the errors through the layers of the network (known as backpropagation) to obtain gradients to update model weights.
- A mini-batch passing through the forward and backward phases is referred to as an iteration, and an epoch is defined as performing the forward-backward pass through the entire training dataset.
- The training process goes on for multiple epochs until the model converges.
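To make these steps concrete, here is a minimal single-GPU PyTorch training loop; the toy model, dataset, and hyperparameters below are placeholders chosen only for illustration.

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset and model, for illustration only
dataset = TensorDataset(torch.randn(1024, 32), torch.randint(0, 2, (1024,)))
loader = DataLoader(dataset, batch_size=64, shuffle=True)   # mini-batches
model = nn.Sequential(nn.Linear(32, 16), nn.ReLU(), nn.Linear(16, 2))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for epoch in range(5):                 # one epoch = full pass over the dataset
    for x, y in loader:                # one iteration = one mini-batch
        optimizer.zero_grad()
        loss = criterion(model(x), y)  # forward pass: predictions and loss
        loss.backward()                # backward pass: backpropagate gradients
        optimizer.step()               # update the model weights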
If this seems overwhelming to you, I would suggest this article for a more in-depth look at how a neural network is trained.
To speed up training, we parallelize it across multiple GPUs. Data parallelism and model parallelism are the two techniques used for this.
Data parallelism
In data parallelism, each GPU holds a copy of the model, and the data is divided into n partitions, where each partition is used to train the model copy on its GPU.
When asynchronous data parallelism is applied, the parameter server is in charge of weight updates. Each GPU sends its gradients to the parameter server, which then updates weights and sends the updated weights back to that GPU.
In this way, there is no synchronization among GPUs. This approach copes better with unstable networking in distributed computing environments, but it introduces an inconsistency issue between model replicas. Moreover, it does not reduce the number of data transfers among GPUs.
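As a rough sketch of the idea in plain PyTorch, torch.nn.DataParallel replicates a model on every visible GPU of a machine and splits each incoming batch between the replicas; the model and batch below are placeholders.

import torch
import torch.nn as nn

# Placeholder model for illustration
model = nn.Sequential(nn.Linear(32, 16), nn.ReLU(), nn.Linear(16, 2))

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

if torch.cuda.device_count() > 1:
    # Each forward call replicates the model on every visible GPU and
    # scatters the batch across them; outputs are gathered back on `device`.
    model = nn.DataParallel(model)

x = torch.randn(256, 32, device=device)
out = model(x)  # the batch of 256 is split among the available GPUs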
Model parallelism
Model parallelism partitions a model among multiple GPUs, where each GPU is responsible for the weight updates of the assigned layers of a model. Intermediate data such as outputs from a layer of a neural network for the forward pass and gradients for the backward pass are transferred among GPUs.
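A minimal sketch of naive model parallelism in plain PyTorch, assuming two GPUs are available, could look like the following; the two-layer model is a placeholder, and the explicit .to() calls show where the intermediate activations cross devices.

import torch
import torch.nn as nn

class TwoStageModel(nn.Module):
    """Naive model parallelism: stage 1 lives on cuda:0, stage 2 on cuda:1."""

    def __init__(self):
        super().__init__()
        self.stage1 = nn.Linear(32, 64).to("cuda:0")
        self.stage2 = nn.Linear(64, 2).to("cuda:1")

    def forward(self, x):
        h = torch.relu(self.stage1(x.to("cuda:0")))
        # Intermediate activations are copied from GPU 0 to GPU 1;
        # gradients flow back across the same link in the backward pass.
        return self.stage2(h.to("cuda:1"))

model = TwoStageModel()
out = model(torch.randn(64, 32))
out.sum().backward()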
Since these partitions have dependencies, in a naive implementation of model parallelism, only one GPU is active at a time, leading to low GPU utilization. To enable parallel execution, pipeline parallelism splits the input minibatch into multiple micro-batches and pipelines the execution of these micro-batches across multiple GPUs. This is outlined in the figure below:

The figure above represents a model with 4 layers placed on 4 different GPUs (vertical axis), while the horizontal axis represents time. It shows that pipelining keeps the GPUs much better utilized, although there is still a "bubble" (marked in the figure) during which some GPUs sit idle.
To have a complete picture of model parallelism and data parallelism, I would strongly suggest going through Distributed Training: Guide for Data Scientists.
Multi GPU training with PyTorch Lightning
In this section, we will focus on how we can train on multiple GPUs using PyTorch Lightning, given its surge in popularity over the last year. PyTorch Lightning is simple and convenient to use, and it helps us scale models without the boilerplate code, which is where most people make mistakes when scaling up.
May be useful
Check how you can keep track of your PyTorch Lightning model training.
There are some coding practices that can help you to migrate your code to GPUs without any problems. You should refer to the PyTorch Lightning documentation for more information about this.
Distributed modes
In this section, we will go through the different distributed modes provided by PyTorch Lightning.
Data Parallel
We can train a model on a single machine with multiple GPUs. With the DataParallel (DP) strategy, each batch is split equally across the selected GPUs of a node, after which the root node aggregates the results. However, this method is not recommended by the PyTorch Lightning developers, as it is not stable yet and you may see errors or misbehavior if you assign state to the module in the forward() or *_step() methods.
# train on 4 GPUs (using DP mode)
trainer = Trainer(devices=4, accelerator="gpu", strategy="dp")
Distributed Data Parallel
DistributedDataParallel (DDP) works as follows (a rough plain-PyTorch sketch of these steps is shown after the list):
- Each GPU across each node gets its own process.
- Each GPU gets visibility into a subset of the overall dataset. It will only ever see that subset.
- Each process initializes the model.
- Each process performs a full forward and backward pass in parallel.
- The gradients are synced and averaged across all processes.
- Each process updates its optimizer.
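For intuition, the sketch below shows roughly what Lightning automates here using plain PyTorch DDP; it assumes the script is launched with torchrun (which sets LOCAL_RANK and the process-group environment variables), and the dataset and model are placeholders.

# Sketch of what the 'ddp' strategy does under the hood; assumes launch via
# `torchrun --nproc_per_node=<num_gpus> train.py`, which sets the env vars below.
import os
import torch
import torch.distributed as dist
import torch.nn as nn
import torch.nn.functional as F
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

dist.init_process_group(backend="nccl")      # one process per GPU
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Placeholder dataset; each process sees only its own shard via the sampler.
dataset = TensorDataset(torch.randn(1024, 32), torch.randint(0, 2, (1024,)))
loader = DataLoader(dataset, batch_size=64, sampler=DistributedSampler(dataset))

# Each process builds the model; DDP syncs and averages gradients on backward().
model = DDP(nn.Linear(32, 2).cuda(local_rank), device_ids=[local_rank])
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for x, y in loader:
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x.cuda(local_rank)), y.cuda(local_rank))
    loss.backward()    # gradients are all-reduced and averaged across processes
    optimizer.step()   # every process applies the same update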
There are two ways to use the DDP strategy, namely ‘ddp’ and ‘ddp_spawn’. With the ‘ddp’ method, the script is launched under the hood multiple times with the correct environment variables.
# train on 8 GPUs (same machine, i.e. node)
trainer = Trainer(devices=8, accelerator="gpu", strategy="ddp")
# train on 32 GPUs (4 nodes)
trainer = Trainer(devices=8, accelerator="gpu", strategy="ddp", num_nodes=4)
Although it seems like a good choice in most cases, it has some limitations: it does not work in interactive environments such as Jupyter Notebook, Google Colab, and Kaggle, and it does not work with a nested script that has no root package. In these cases, the ‘ddp_spawn’ method is preferred.
‘ddp_spawn’ is exactly like ‘ddp’ except that it uses torch.multiprocessing.spawn() to start the training processes. One might therefore be tempted to always prefer ‘ddp_spawn’ over ‘ddp’, but ‘ddp_spawn’ has its own limitations:
- The spawn method trains the model in subprocesses and the model on the main process does not get updated.
- DataLoader(num_workers=N) with a large N bottlenecks training with DDP, i.e. it will be very slow or won’t work at all. This is a PyTorch limitation.
- This method forces everything to be picklable.
# train on 8 GPUs (same machine, i.e. node)
trainer = Trainer(devices=8, accelerator="gpu", strategy="ddp_spawn")
The ‘ddp’ method should always be preferred over ‘ddp_spawn’ for speed and performance.
Distributed Data Parallel 2
DDP2 behaves like DP on a single machine, but across multiple nodes it acts like DDP. Sometimes it may be useful to use all batches on the same machine instead of a subset, and the ddp2 method comes in handy in such cases. DDP2 does the following:
- Copies a subset of the data to each node.
- Inits a model on each node.
- Runs a forward and backward pass using DP.
- Syncs gradients across nodes.
- Applies the optimizer updates.
# train on 32 GPUs (4 nodes)
trainer = Trainer(devices=8, accelerator="gpu", strategy="ddp2", num_nodes=4)
This technique is currently not recommended: it is broken for all PyTorch versions >= 1.9, it is unclear how to make it work for PyTorch >= 1.9, and there are no functional tests for this strategy.
Horovod
Horovod is a distributed deep learning training framework for TensorFlow, Keras, PyTorch, and Apache MXNet that makes distributed deep learning fast and easy to use. With Horovod:
- Every process uses a single GPU to process a fixed subset of data.
- During the backward pass, gradients are averaged across all GPUs in parallel.
- These averaged gradients are then synchronously applied before the next step begins.
- In the training script, Horovod will detect the number of workers from the environment and automatically scale the learning rate to compensate for the increased total batch size.
Horovod supports single-GPU, multi-GPU, and multi-node training using the same training script. It can be configured in the training script to run with any number of GPUs / processes as follows:
# train Horovod on GPU (number of GPUs / machines provided on command-line)
trainer = Trainer(accelerator="gpu", strategy="horovod", devices=1)
# train Horovod on CPU (number of processes / machines provided on command-line)
trainer = Trainer(strategy="horovod")
When starting the training job, the driver application will then be used to specify the total number of worker processes:
# run training with 4 GPUs on a single machine
horovodrun -np 4 python train.py
# run training with 8 GPUs on two machines (4 GPUs each)
horovodrun -np 8 -H hostname1:4,hostname2:4 python train.py
Sharded training
You may run into memory issues when training large models or trying out larger batch sizes. One might think of using model parallelism in such situations, but currently, due to the complexity of implementing it, sharded training is used instead.
Under the hood, sharded training is similar to data-parallel training, except that optimizer states and gradients are sharded across GPUs. This method is highly recommended in multi-GPU setups where memory is constrained, or when training larger models (500M+ parameters).
To use Sharded Training, you need to first install FairScale using the command below.
pip install fairscale
# train using Sharded DDP
trainer = Trainer(accelerator="gpu", strategy="ddp_sharded")
There is a trade-off between memory and performance with the sharded training strategy: because of the additional communication between devices, training might get slower.
How to optimize training on multiple GPUs
When it comes to training huge models with large datasets on multiple GPUs, we might run across some problems with memory or performance bottlenecks. In this section, we will see how we can optimize the training process.
FairScale Activation Checkpointing
Activation checkpointing frees activations from memory as soon as they are not needed during the forward pass. They are then re-computed for the backwards pass as needed. Activation checkpointing is very useful when you have intermediate layers that produce large activations.
FairScale’s checkpointing wrapper also handles batch-norm layers correctly, unlike the PyTorch implementation, ensuring the running statistics are tracked correctly despite the multiple forward passes.
This saves memory when training larger models; however, it requires wrapping the modules you’d like to apply activation checkpointing to. You can read more about it here.
import torch.nn as nn
from fairscale.nn import checkpoint_wrapper
from pytorch_lightning import LightningModule

class Model(LightningModule):
    def __init__(self):
        super().__init__()
        # Recompute this block's activations during the backward pass instead of storing them
        self.block_1 = checkpoint_wrapper(nn.Sequential(nn.Linear(32, 32), nn.ReLU()))
        self.block_2 = nn.Linear(32, 2)
Mixed Precision (16-bit) Training
By default, PyTorch, like most deep learning frameworks, uses 32-bit floating-point (FP32) arithmetic. Many deep learning models, however, can reach full accuracy with lower-precision floating-point formats such as 16-bit. Because they require less memory, it is possible to train and deploy larger neural networks, and data transfers become faster thanks to the lower memory-bandwidth requirements.
But to use a 16-bit floating-point you must have:
- a GPU that supports 16-bit precision, such as the NVIDIA Pascal architecture or newer.
- your optimization algorithm, i.e. the training_step should be numerically stable.
Mixed precision combines 32-bit and 16-bit floating point, which improves performance and also reduces memory pressure. Lightning offers mixed-precision training for GPUs with either the native or the APEX amp backend.
Trainer(precision=16, amp_backend="native")
# To use NVIDIA APEX amp_backend
Trainer(amp_backend="apex", precision=16)
Unless you require more refined control, it is recommended to always use the native amp_backend.
Prefer Distributed Data Parallel (DDP) over Data Parallel (DP)
As already mentioned in the previous section, we should prefer the DDP strategy over DP. The reason is that DP uses three transfer steps per batch, whereas DDP only uses two, making it faster.
DP performs the following steps:
- Copy the model to the device.
- Copy the data to the device.
- Copy the outputs of each device back to the main device.

DDP performs the following steps:
- Move data to the device.
- Transfer and sync gradients.

Along with changing strategies, we can also make some changes in our code to optimize the training process. Here are a few of them.
Increase Batch Size
If you use a small batch size during training, more time is spent loading and unloading data than on computation, which slows training down. It is therefore advisable to use a bigger batch size to increase GPU utilization. But increasing the batch size may adversely affect the accuracy of the model, so we should experiment with different batch sizes to find the optimal one.
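If you are on a 1.x release of PyTorch Lightning, the built-in batch size finder can run this search for you. A minimal sketch, assuming your LightningModule exposes a batch_size attribute that its dataloaders use:

from pytorch_lightning import Trainer

# "binsearch" doubles the batch size until it runs out of memory,
# then binary-searches for the largest batch size that still fits.
trainer = Trainer(accelerator="gpu", devices=1, auto_scale_batch_size="binsearch")
trainer.tune(model)  # writes the found batch size back onto the model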
Load Data Faster with PyTorch’s DataLoader method
The DataLoader class in PyTorch is a quick and easy way to load and batch your data. We can set its num_workers parameter to a value greater than one to load data in parallel and speed up training. PyTorch Lightning will also warn you and suggest a value for num_workers if it detects that data loading may be a bottleneck.
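As a quick illustration (the dataset and batch size below are placeholders), a DataLoader with multiple worker processes looks like this:

import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset for illustration
train_dataset = TensorDataset(torch.randn(1024, 32), torch.randint(0, 2, (1024,)))

train_loader = DataLoader(
    train_dataset,
    batch_size=64,
    shuffle=True,
    num_workers=4,     # load and preprocess batches in 4 background processes
    pin_memory=True,   # page-locked memory speeds up host-to-GPU copies
)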
But you might run into memory issues if your dataset is very large, because the loader worker processes and the parent process will consume the same amount of CPU memory for all Python objects in the parent process that are accessed from the worker processes. One way to avoid this is to use Pandas, NumPy, or PyArrow objects in the DataLoader’s __getitem__ method instead of plain Python objects.
Set Grads to None
Instead of setting the gradients to zero, we can improve performance and speed by overriding the optimizer_zero_grad() method and setting the gradients to None, which generally has a lower memory footprint.
class Model(LightningModule):
    def optimizer_zero_grad(self, epoch, batch_idx, optimizer, optimizer_idx):
        optimizer.zero_grad(set_to_none=True)
Model Toggling
This method is specifically useful when performing gradient accumulation with multiple optimizers in a distributed setting. During gradient accumulation, there is no need to synchronize gradients on every backward call, so setting sync_grad to False blocks this synchronization and improves your training speed.
LightningOptimizer provides a toggle_model() function as a contextlib.contextmanager() for advanced users. Here is an example from the official PyTorch Lightning documentation.
# Scenario for a GAN with gradient accumulation every two batches, optimized for multiple GPUs.
class SimpleGAN(LightningModule):
    def __init__(self):
        super().__init__()
        self.automatic_optimization = False

    def training_step(self, batch, batch_idx):
        # Implementation follows the PyTorch tutorial:
        # https://pytorch.org/tutorials/beginner/dcgan_faces_tutorial.html
        g_opt, d_opt = self.optimizers()

        X, _ = batch
        X.requires_grad = True
        batch_size = X.shape[0]

        real_label = torch.ones((batch_size, 1), device=self.device)
        fake_label = torch.zeros((batch_size, 1), device=self.device)

        # Sync and clear gradients
        # at the end of accumulation or
        # at the end of an epoch.
        is_last_batch_to_accumulate = (batch_idx + 1) % 2 == 0 or self.trainer.is_last_batch

        g_X = self.sample_G(batch_size)

        ##########################
        # Optimize Discriminator #
        ##########################
        with d_opt.toggle_model(sync_grad=is_last_batch_to_accumulate):
            d_x = self.D(X)
            errD_real = self.criterion(d_x, real_label)

            d_z = self.D(g_X.detach())
            errD_fake = self.criterion(d_z, fake_label)

            errD = errD_real + errD_fake

            self.manual_backward(errD)
            if is_last_batch_to_accumulate:
                d_opt.step()
                d_opt.zero_grad()

        ######################
        # Optimize Generator #
        ######################
        with g_opt.toggle_model(sync_grad=is_last_batch_to_accumulate):
            d_z = self.D(g_X)
            errG = self.criterion(d_z, real_label)

            self.manual_backward(errG)
            if is_last_batch_to_accumulate:
                g_opt.step()
                g_opt.zero_grad()

        self.log_dict({"g_loss": errG, "d_loss": errD}, prog_bar=True)
As you can see in the code, sync_grad is set to True only after every two batches or at the end of an epoch. This way, the gradients are accumulated over two batches and only synchronized and applied at the end of the accumulation window or at the end of the epoch.
Avoid .item(), .numpy(), .cpu() calls
Avoid .item(), .numpy(), and .cpu() calls in your code. If you need to detach a tensor from the computation graph, use the .detach() method instead. Each of these calls triggers a data transfer from GPU to CPU, which can dramatically degrade performance.
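For example, inside a LightningModule it is cheaper to log a detached tensor and let Lightning decide when values need to move off the GPU; the module below is a hypothetical sketch used only to illustrate the pattern.

import torch
from pytorch_lightning import LightningModule

class LitClassifier(LightningModule):
    # A hypothetical module, shown only to illustrate the logging pattern.
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.net(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = torch.nn.functional.cross_entropy(self(x), y)
        # Avoid loss.item() / loss.cpu() here: each call forces a GPU -> CPU
        # transfer and a synchronization point. Detach and let Lightning move
        # values only when it actually needs them.
        self.log("train_loss", loss.detach())
        return loss

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)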
Clear Cache
Every time the torch.cuda.empty_cache() method is called, all the GPUs have to wait and synchronize, so avoid calling it unnecessarily.
How to monitor training on multiple GPUs
It is essential to monitor GPU usage while training a model, as it can provide helpful insights for improving training; if the GPUs are underutilized, we can act accordingly. There are various tools, such as neptune.ai and Weights & Biases (wandb), that can be used to monitor training over multiple GPUs.
In this section, we will use neptune.ai for monitoring the GPU and GPU memory while training over multiple GPUs.
Monitor the training with neptune.ai
neptune.ai is a metadata store that can be used in any MLOps workflow. It allows us to monitor hardware resources during training, so we can use it to track the usage of the different GPUs involved.
It is really simple to incorporate Neptune into PyTorch Lightning code: all you have to do is create a NeptuneLogger object and pass it to the Trainer object, as shown below.
from pytorch_lightning import Trainer
from pytorch_lightning.loggers import NeptuneLogger
# create NeptuneLogger
neptune_logger = NeptuneLogger(
    api_key="ANONYMOUS",  # replace with your own API token
    project="common/pytorch-lightning-integration",  # replace with your own project
    tags=["training", "resnet"],  # optional
)
# pass it to the Trainer
trainer = Trainer(max_epochs=10, logger=neptune_logger)
# run training
trainer.fit(my_model, my_dataloader)
If this is the first time you have come across Neptune, I would strongly recommend going through this step by step guide to install all the necessary libraries to make it work. After that, check the Neptune + PyTorch Lightning integration docs.
After running the file, you will get a link to the run in the Neptune app. There you can see the monitoring section (circled in the image below), which shows the usage of all the GPUs during training along with some other metrics.

Let’s see what kind of meaningful insights we can infer from GPU utilization graphs.

- As you can see above, GPU usage fluctuates and there are brief periods when the GPUs are not being used at all, and it is not always easy to interpret the reason for this.
- This may happen during validation, because we do not compute gradients in that phase, or it may be due to some other bottleneck; for example, data preprocessing on the CPU can be really slow.
- Also, in some frameworks such as Caffe, only one GPU is used during the validation phase by default, so you might see only one GPU with high usage in that case.
- So depending upon how you train the neural network, you might find a different graph indicating how the different GPUs are being utilized.
Conclusion
This article discussed why we train machine learning models with multiple GPUs. We also saw how easy it is to train on multiple GPUs with PyTorch Lightning and the best ways to optimize the training process. Finally, we saw how Neptune can be used to monitor GPU usage during training.
If you are looking for more in-depth knowledge of this subject, I would suggest going through the extensive resource Improving ML applications in shared computing environments or this research paper.
References
- PipeDream: Fast and Efficient Pipeline Parallel DNN Training
- Efficient and Robust Parallel DNN Training through Model Parallelism on Multi-GPU Platform
- Multi-GPU Training
- Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM