Distributed Training: Errors to Avoid
In this era of large language models (LLMs), monolithic foundation models, and increasingly enormous datasets, distributed training is a must, as both data and model weights very rarely fit on a single machine. However, distributed training in ML is complex and error-prone, with many hidden pitfalls that can cause huge issues in the model training process. Luckily, machine learning frameworks such as PyTorch Distributed Data Parallel and Ray have abstracted away many of the messy details, but it is still quite possible to have distributed training goes awry.
This article will touch on ten of the most common errors in distributed model training and will suggest solutions to each of them.
In a model parallel distributed training setup, each stage (excluding the first in the forward pass of training) is dependent upon the outputs of a layer located on a separate machine, and the same is true for the backward pass, except in the reverse order. This might not be the most efficient way of doing it. The following diagram from the PipeDream paper illustrates this point. All of the grey boxes in this schematic represent time units in which a machine is sitting idle and performing no useful work.
The idea behind pipelining is to reduce the amount of wasted work by having each machine start computation on a new minibatch immediately after it sends its outputs to the next machine. This means multiple mini-batches will be trained in parallel prior to updating weights, which can affect algorithm convergence, but as discussed in section one, techniques such as weight stashing can reduce this effect.
After some number of mini-batches enter the pipeline for training, the pipeline will reach what’s called a steady state in which all machines are working during each time unit. The below graphic shows how pipelining dramatically increases the utilization of computational resources. Clearly, not pipelining leads to underutilization and longer distributed training times.
Not balancing pipeline stages
Building on the discussion of pipelining, it’s important to ensure that your pipeline is balanced. The diagrams are shown before making the naive assumption that each machine executes the forward/backward passes of its model partition on a minibatch of data in exactly the same amount of time. In practice, this never happens, so there will be some brief periods where either a machine is idle and waiting on the next minibatch from the previous machine or takes longer than other machines to execute its computation, thus slowing down the pipeline.
You should ideally construct your pipeline such that each machine does as close to the same amount of computation as possible. This means timing how long it takes data to get through different layers in the model, timing how long forward and backward propagation takes for each model partition, and ensuring roughly equivalent data sizes across mini-batches. This is critical for optimizing pipeline efficiency.
Weight staleness is used to describe a phenomenon that occurs in pipelined training of deep learning models. When model training is pipelined across multiple machines, there is a delay that happens between when the forward computation on data occurs and when the gradients based on that computation are backpropagated to update the model weights. In effect, after some training steps, the model forward predictions are calculated using weights that are a certain number of cycles older than the gradients which are being passed backward through the pipeline. This delay is referred to as “weight staleness”.
There are a few ways to fix weight staleness. The first is to simply synchronize machines by mutually communicating gradients after every certain number of steps. This caps the degree to which weights can become stale. However, it reduces hardware efficiency because synchronizing machines causes certain machines to have to wait for others to arrive at the same state. Therefore, it is not ideal for reducing training time.
Alternatively, in a method titled PipeDream, researchers introduced a technique called weight stashing in which a system “maintains multiple versions of a model’s weights, one for each minibatch.” After the completion of each forward pass, the system can store a model’s weights as part of the state associated with that minibatch. When the time comes for backpropagation, the weights associated with that minibatch are retrieved from the stash and used for the gradient computation. This ensures that the same version of weights are used for the forward and backward pass over a single minibatch of data within a pipelined stage, and statistical convergence is improved.
Using model parallel when you need data parallel (and vice versa)
The two main paradigms of distributed training are model parallel and data parallel. In model parallel, a large model is distributed across multiple machines. For example, you might put layers 1-4 of the model on machine one, layers 5-8 on machine two, layers 9-12 on machine three, etc. This mode of training is becoming increasingly common as large language models such as GPT-3 with billions of model parameters begin to exceed the size at which the full model can be stored on commodity hardware. During training, data and gradients are communicated between machines.
Conversely, in the data parallel paradigm, the full model is stored on each machine, but each machine is assigned to train that model on only a subset of the full dataset. This is useful when your full dataset won’t fit on a single machine or if you simply have multiple machines that can all fit your single model and want to reduce your training time. Model weights and gradients are communicated during training. There also exists mixed-mode types of training in which the data and model parallel are used in tandem to get the best of both worlds.
Obviously, choosing either model or data parallel is important to fitting the right training approach to your circumstances. You will want to choose a model parallel if you are training an exceptionally large model that cannot fit on a single machine. Choose data parallel if you have a very large training set that can be easily partitioned across machines. Finally, you should consider using a mixed-mode training paradigm if both your model and dataset are too large to fit on a single machine.
You may also like
Can GPT-3 or BERT Ever Understand Language?—The Limits of Language Deep Learning Models
Driver and library inconsistencies between machines
In an ideal world, a distributed training script would work flawlessly across homogeneous sets of machines, but such is not the reality. Oftentimes, unexpected errors can crop up when machines are set up to use incompatible versions of hardware, drivers, and libraries. For example, if your distributed training machine learning script is implemented using PyTorch, then you should make sure that each machine has the same version of PyTorch installed. Similarly, you would ideally like each machine to use compatible versions of CUDA, although this is not as strict a requirement as most distributed training libraries is capable of interfacing with different CUDA versions.
One easy way to standardize your training environment across all machines is to use a containerization solution such as Docker. Docker packages your distributed training script with all the OS definitions, libraries, tools, and files needed to run it. Containers are also lightweight, meaning they don’t require much overhead to run once the container image has been built. Using containers or a similar virtualization solution allows you to ensure that your runtime environments are mostly uniform across machines and prevents many of the headaches and friction that may come from incompatibilities in software setups.
How to Keep Track of Machine Learning Experiments in PyTorch
Using the wrong type of SGD
Most machine learning models are trained using an optimization algorithm, often some variant of stochastic gradient descent. There are two primary ways to enact SGD in a distributed setting, synchronously or asynchronously.
In synchronous SGD, each machine performs its weight updates in tandem. This means that each machine computes the gradients on its portion of the data, transmits them to the other machines, and in turn, waits for all the other machines to send their gradients. Only after receiving gradients from all the other machines does each machine perform its backpropagation and update its weights. This way, the weights on each machine stay in sync and prevent weight staleness. However, doing this requires imposing a hard barrier at the end of each mini-batch training which leads to slowdowns, particularly if there are straggler machines that take much longer than the others to perform their computations.
In practice, asynchronous SGD can be a very good option for distributed training. Perhaps the most famous asynchronous SGD is HogWild which showed that SGD could be run in parallel, without locks, and without too much effect on algorithm convergence. Asynchronous SGD allows weight updates to proceed without each machine waiting for the other to send their gradients. In effect, this means that the weight updates occur using only partial information, as each machine only has access to gradients derived from its subset of the training data. However, in practice, simple strategies can be used to ensure that the algorithm still converges.
Network issues, firewalls, ports, and communication errors
The problem #1
Communication between computers, particularly communication over a network, is central to distributed training. Of course, network communication brings with it a host of potential issues, such as dropped packets, network attacks, etc.
At the heart of a distributed training script is an enumeration of machines referred to as rank. The rank of a machine is its number within the lineup of all machines involved in the model training process. Typically, one machine is assigned rank 0 and is tasked with handling the coordination of communication between other machines, aggregation of sent data and gradients, etc. The other machines communicate data and gradients over the network to all other machines of non-equal rank. This can easily go wrong if certain things are not taken care of.
When sending data to machines of non-equal rank, a process must specify the IP address and port number across which to transmit this information. Any mistakes in this specification will obviously lead to failures in the training process. Similarly, if a firewall or other network configuration setting prevents communication on the specified port, then the receiving machine will never get the necessary data. Thus, ensuring that the network is configured correctly and specified properly in the training script is a prerequisite for a functioning distributed training machine learning setup.
The problem #2
NCCL and Gloo are the two libraries used by the PyTorch Distributed package to communicate data and model parameters between machines during distributed training. Periodically sharing weights and data between machines is crucial to fully train a functional model. Unfortunately, when working in a distributed setting, single machine failures are common and often happen without any apparent reason. For example, a single (or multiple) machine(s) may run out of RAM, its hard disk may spontaneously fail, or it may be affected by a network or power outage.
In these situations, when another machine attempts to receive data from the failed machine, NCCL or GLOO may present cryptic error messages. These errors are so common and confusing that there are entire Github issues dedicated to resolving them.
There are certain parameter settings that can help make errors more readable (i.e. setting NCCL_DEBUG=WARN), but regardless, this doesn’t fully solve the issue.
Unfortunately, these sorts of issues are part and parcel of distributed systems programming. Distributed training best practices for resolving these sorts of errors include making frequent backups during training (periodically saving snapshots of the model weights), writing verbose logs that can be used to trace errors back to their source, and ensuring that hardware is well maintained (less of an issue in the present age of cloud computing).
Slow data transmission
RPCs, or Remote Procedure Calls, are the basic elements of communication across networks. In a distributed training ML setting, RPCs might be made when running a reinforcement learning agent across a network. Similarly, network transmissions are made when gradients and layer outputs are sent between machines. In general, you want to avoid making RPCs or sending them across a network whenever possible, as network communication can be slow and costly, especially if you have a lot of data to transfer. Many distributed training mistakes also arise from network issues, so reducing reliance on the network can also reduce the number of errors encountered. Of course, you can’t avoid using the network entirely, but care should be taken not to use the network frivolously.
In cases where you absolutely need to transfer data between machines, there are many things you can do to accelerate the process. First, you can use dedicated hardware to facilitate network transfers. This can include using worker machines connected by high-speed interconnects and using Nvlink for the transmission of data between Nvidia GPUs.
You can also reduce the floating point precision of the transmitted gradients in order to reduce the data size being transferred. Going from fp32 to fp16 or even fp8 can have a significant impact on speed. Beware, though, it can also affect algorithm convergence, so use wisely!
Finally, it’s possible to transmit a subset of gradients as soon as they are calculated (i.e. sending the gradients of a single layer) while at the same time, backpropagation is being performed on subsequent layers. Overlapping the data transmission with backpropagation is another form of pipelining and ensures that the network is being utilized more efficiently. This can speed up overall gradient transfer times and prevent the network from becoming saturated.
Not logging enough
Logging data is especially important for debugging when training in a distributed setting. Because there are so many points of potential failure in a distributed training setup, care should be taken to write detailed logs at multiple stages during the training process so that when errors do occur, they can be localized and readily traced back to the source. In particular, all major distributed actions should be logged, such as when data is communicated between machines, when weight updates are executed, etc. Finally, logs should be searchable and easily accessed. Doing all of this ensures that when a distributed training error occurs, it can be easily traced back to the source, and downtime of the system can be minimized.
MLOps tools, such as neptune.ai, let you automatically log all of the relevant metadata, like metrics, parameters, learning rates, and variables in a distributed training setup. You can log into four different objects: Run, Model, ModelVersion, and Project, and organize your metadata as you want, thanks to the flexible metadata structure.
You can later view everything you logged in the Neptune web app, for example, in dashboards, like the one below.
Check this tutorial to learn how to track distributed training jobs with Neptune. It includes:
- Tracking a single-node multi-GPU job
- Tracking a multi-node DDP job
You can also check the documentation on using Neptune with pipelining libraries.
Overspending on cloud computing
Distributed training often involves using cloud providers such as AWS or GCP. If you’re not careful, it’s easy to rack up bills of hundreds or even thousands of dollars per month using these cloud services.
The first thing you can do to reduce your costs is to use spot or preemptible instances. In short, these instances are often significantly cheaper by virtue of the fact that you allow for the possibility of your job being preempted or interrupted during training. If you do use spot instances, it’s important that you checkpoint your training regularly or use a fault-tolerant solution such as Elastic Horovod in case your instance does get shut down during the middle of training.
It’s also important that you don’t select compute resources that are more than you need. For example, if you assign each of your instances a 64GB GPU, this might be overkill, given that you are already working in a distributed setup. It might be cheaper to actually use more instances with smaller resources than to use two or three large instances with very expensive GPUs. Performing this sort of cost-benefit analysis beforehand can improve your training time as well as reduce the cost needed to complete training with your desired cloud provider.
Conclusion and takeaways
Distributed model training is challenging and involves managing a variety of bugs. While some errors are inevitable, it is best to avoid the traps and common machine learning pitfalls that arise when training in a distributed setting.
Issues related to pipelining, such as imbalanced stages or incorrect choice of model or data parallel, or simply not pipelining at all, can negate the advantages of distributed training. Similarly, network-related issues can arise, and training can dramatically slow down when making too many RPCs, and improperly configuring firewalls or IP addresses. Meanwhile, other issues, such as weight staleness or choosing the wrong type of distributed SGD can negatively affect algorithm convergence.
By avoiding these common distributed training errors, you can ensure that you start off on the right track toward training extremely large and complex machine learning models by utilizing the power of distributed training.