Recent developments in deep learning have led to some fascinating state-of-the-art results especially in the areas like natural language processing and computer vision. A couple of the reasons for the success usually comes from the availability of a huge amount of data and the increasing size of deep learning (DL) models. These algorithms are capable of extracting meaningful patterns and deriving correlations between the input and the output. But it is also true that developing and training these complex algorithms can take days and sometimes even weeks.
To manage this problem, a fast and efficient approach to designing and developing new models is needed. One cannot train these models on a single GPU because it will result in an information bottleneck. To solve the issue of information bottlenecks on a single core GPU we need to use multi-core GPUs. This is where the idea of distributed training comes into the picture.
In this article, we’ll look into some of the best frameworks and tools for distributed training. But before that, let’s have a quick overview of distributed training itself.
The DL training usually relies on scalability, which simply means the ability of the DL algorithm to learn or deal with any amount of data. Essentially the scalability of any DL algorithm depends on three factors:
- 1 Size and the complexity of the deep learning model
- 2 Amount of training data
- 3 Availability of infrastructure which includes hardware like GPUs and storage units, and smooth integration between these devices
Distributed training satisfies all three elements. It takes care of the model size and complexity, handles training data in batches, and it splits and distributes the training process among multiple processors called nodes. More importantly, it reduces the training time significantly making iteration time shorter and thus making experiments and deployment quicker.
Distributed training is of two types:
- 2 Model-parallel training
In data-parallel training, the data is divided into subsets based upon the number of nodes available for training. And the same model architecture is shared in all the available nodes. During the training process, all the nodes must communicate with each other to ensure that the training at each node is synced with each other. It is the most efficient way of training the model and the most common practice.
In model-parallel training, the DL model is split into segments based on the number of nodes available. Each node is fed with the same data. In model-parallel training, the DL model itself is split into different segments and each of the segments is then fed into different nodes. This type of training is possible if the DL model has independent components that can be trained individually. It is kept in mind that each of the nodes must be in sync with regard to the shared weights and biases of the different segments of the model.
Among the two types of training, data-parallelism is quite commonly used and as we discover the frameworks for distributed training you will find that data-parallelism is offered by all of them irrespective of model-parallelism.
Criteria for choosing the right framework for distributed training
Before we dive into the frameworks there are some points that one should consider while choosing the right framework and tools:
- Computational graph type: The whole deep learning community is majorly divided into two factions, one that uses PyTorch or dynamic computational graph and the other that uses TensorFlow or static computational graph. Hence, it is not news that most of the distributed frameworks are built on top of these two libraries. So if you prefer one over the other then half of your decision is already made.
- Cost of training: Affordability is a critical concern when you are dealing with distributed computing, e.g. a project involving the training of BigGAN can require a number of GPUs and the cost could scale up proportionally as this number increases. Hence, a tool with moderate pricing is always the right choice.
- Type of training: Depending upon your training requirement i.e. data-parallelism or model-parallelism, you can prefer one tool over the other.
- Efficiency: This basically refers to the number of lines you need to write to enable distributed training, the less the better.
- Flexibility: Can the framework of your choice be used across different platforms? Especially when you need to train on-premise or on cloud platforms.
Frameworks for distributed training
Now, let’s discuss some of the libraries that offer distributed training.
May be useful
In Neptune, you can track data of your run from many processes, in particular running on different machines.
PyTorch is one of the most popular deep learning frameworks developed by Facebook. It is one of the most flexible and easy-to-learn frameworks. PyTorch allows you to create and implement neural network modules very effectively and with its distributed training modules you can easily implement parallel training with a few lines of code.
PyTorch offers a number of ways in which you can perform distributed training:
- nn.DataParallel: This package allows you to perform parallel training in a single machine with multiple GPUs. One advantage is that it requires a minimum code.
- nn.DistributedDataParallel: This package allows you to perform parallel training across multiple GPUs within multiple machines. It requires a few more extra steps to configure the training process.
- torch.distributed.rpc: This package allows you to perform a model-parallelism strategy. It is very efficient if your model is large and does not fit in a single GPU.
- It is easy to implement.
- PyTorch is very user-friendly.
- Offers data-parallelism and model-parallelism methods out-of-the-box.
- The majority of the cloud computing platforms support PyTorch.
When to use PyTorch?
You should opt for PyTorch when:
- You have a huge amount of data because data parallelism is easy to implement.
PyTorch distributed training specializes in data parallelism. DeepSpeed, which is built on top of PyTorch, targets other aspects i.e. model-parallelism. DeepSpeed is developed by Microsoft that aims to offer distributed training for large-scale models.
DeepSpeed can efficiently tackle memory challenges when training models with trillions of parameters. It reduces memory footprint while maintaining compute and communication efficiency. Interestingly, DeepSpeed offers 3D parallelism through which you can distribute data, model, and pipeline, which basically means that now you can train a model which is large and consumes a huge amount of data, something like a GPT-3 or a Turing NLG.
- Model scaling up to trillions of parameters.
- Faster training up to 10X.
- Democratize AI which means users can run bigger models on a single GPU without running out of memory.
- Compressed training allows users to train attention models by reducing the memory required to compute attention operations.
- Easy to learn and use.
When to use DeepSpeed?
You should opt for DeepSpeed when:
- You want to do data and model parallelism.
- If your codebase is based on PyTorch.
TensorFlow is developed by Google and it supports distributed training. It uses data-parallel techniques for training. You can leverage the distributed training on TensorFlow by using the tf.distribute API. This API allows you to configure your training as per your requirements. By default, TensorFlow uses only one GPU but the tf.distribute allows you to use multiple GPUs.
TensorFlow provides three primary types of distributed training strategy:
- tf.distribute.MirroredStrategy(): This simple strategy allows you to distribute training across multiple GPUs on a single machine. This method is also called Synchronous Data-Parallelism. It is worth noting that each worker node will have its own set of gradients. These gradients are then averaged and used to update the model parameters.
- tf.distribute.MultiWorkerMirroredStrategy(): This strategy allows you to distribute training across multiple machines and multiple GPUs on a single machine. All the operations are similar to tf.distribute.MirroredStrategy(). It is also a Synchronous Data-Parallelism method.
- tf.distribute.experimental.ParameterServerStrategy(): This is an Asynchronous Data-Parallelism method, it is common practice to scale-up model training on multiple machines. In this strategy, the parameters are stored in a parameter server and workers are independent of each other. This strategy scales up well because the worker nodes are not waiting for the parameter update from each other.
- Huge community support.
- It is a static paradigm of programming.
- Very well integrated with Google Cloud and other cloud-based services.
When to use Distributed TensorFlow?
You should use Distributed TensorFlow:
- If you want to do data parallelism.
- If you like the static paradigm of programming compared to dynamic.
- If you are in the Google Cloud ecosystem since TensorFlow is very well optimized for TPUs.
- Lastly, if you have huge data and need high processing power.
Mesh Tensorflow is again an extension of Tensorflow distributed training but is specifically designed to train large DL models on Tensor Processing Unit (TPUs) which AI accelerates like the GPUs but faster. Although Mesh TensorFlow can execute data-parallelism, it aims to solve distributed training for large models whose parameters cannot fit on one device.
Mesh TensorFlow is inspired by a synchronous data-parallel method, i.e. every worker is involved in every operation. Apart from that, all the workers will have the same program and it uses collective communication like Allreduce.
- It can train large models with millions and billions of parameters like: GPT-3, GPT-2, BERT, et cetera.
- Potentially low latency across the workers.
- Good TensorFlow community support.
- Availability of TPU-pods from Google.
When to use Mesh Tensorflow?
You should use Mesh TensorFlow:
- If you want to do model parallelism.
- If you want to develop huge models and practice rapid-prototyping.
- If you are especially working in the area of Natural Language Processing with huge data.
Apache Spark is one of the most well-known, open-source big data processing platforms. It allows users to do all kinds of data-related work like data engineering, data science, and machine learning. We already know what TensorFlow is. But if you wanna use TensorFlow on Apache Spark then you have to use TensorFlowOnSpark.
TensorFlowOnSpark is a machine learning framework that allows you to perform distributed training on Apache Spark Clusters and Apache Hadoop. It was developed by Yahoo. The framework allows both distributed training and inference with minimum code changes to existing TensorFlow code on the shared grid.
- Allows easy migration to Spark Clusters with existing TensorFlow programs.
- Fewer changes in the code.
- All TensorFlow functionalities are available.
- Datasets can be efficiently pushed and pulled by Spark and TensorFlow respectively.
- Cloud development is easy and efficient on CPUs or GPUs.
- Training pipelines can be created easily.
When to use TensorFlowOnSpark?
You should use TensorflowOnSpark:
- If your workflow is based on Apache Spark or if you prefer Apache Spark.
- If your preferred framework is TensorFlow.
BigDL is also an open-source framework for distributed training for Apache Spark. It was developed by Intel to allow DL algorithms to run Hadoop and Spark clusters. One big advantage of BigDL is that it can help you to easily build and process production data in an end-to-end pipeline for both data analysis and deep learning applications.
BigDL provides two options:
- You can directly use BigDL as you would any other library that Apache Spark provides for data engineering, data analytics et cetera.
- You can scale out python libraries like PyTorch, TensorFlow, and Keras in the Spark ecosystem.
- End-to-end pipeline: If your big data is messy and complex which is usually in the case of live data streaming, then adopting BigDL is appropriate because it integrates data analytics and deep learning in an end-to-end pipeline.
- Efficiency: With an integrated approach across different components of Spark BigDL makes development, deployment, and operations direct, seamless, and efficient across all components.
- Communication and Computing: Since all the hardware and software are stitched together, they run smoothly without any interruption, making communication with different workflows clear and computing faster.
When to use BigDL?
You should use BigDL:
- If you want to develop an Apache Spark workflow,
- If your preferred framework is PyTorch.
- If you want to have continuous integration of all the components like data mining, data analytics, machine learning et cetera.
Horovod was introduced by Uber in 2017. It is an open-source project that is specifically made for distributed training. It is an internal component of Michelangelo, a deep learning toolkit that Uber uses for implementing its DL algorithms. Horovod leverages data-parallel distributed training, which makes scaling very easy and efficient. It can also scale up to hundreds of GPUs in a matter of 5 lines of python code. The idea is to write a training script for a single GPU and Horovod can scale it to train on multiple in parallel.
Horovod is built for frameworks like Tensorflow, Keras, Pytorch, and Apache MXNet. It is easy to use and fast.
- Easy to learn and implement if you are familiar with Tensorflow, Keras, Pytorch, and Apache MXNet.
- If you are using Apache Spark then you can unify all the processes on a single pipeline.
- Good community support.
- It is fast.
When to use Horovod?
You should use Horovod:
- If you want to scale a single GPU script quickly across multiple GPUs.
- If you are using Microsoft Azure as your cloud computing platform.
Ray is another open-source framework for distributed training built on top of Pytorch. It provides tools for launching GPU clusters on any cloud provider. Unlike any other libraries we have discussed so far, Ray is very flexible and can work anywhere like Azure, GCD, AWS Apache Spark, and Kubernetes.
Ray offers the following libraries in its bundle for hyperparameter tuning, reinforcement learning, deep learning, scaling et cetera:
- Tune: Scalable Hyperparameter Tuning.
- RLlib: Distributed Reinforcement Learning.
- Train: Distributed Deep Learning, currently in beta version.
- Datasets: Distributed Data Loading and Compute, currently in beta version.
- Serve: Scalable and Programmable Serving.
- Workflows: Fast, Durable Application Flows.
Apart from these libraries, Ray also has integration with third-party libraries and frameworks which allows you to develop, train and scale your workloads with minimal code changes. Given below is the list of integrated libraries:
- Hugging Face Transformers
- Intel Analytics Zoo
- John Snow Labs’ NLU
- Ludwig AI
- PyTorch Lightning
- Scikit Learn
- Seldon Alibi
- It supports Jupyter Notebook
- It makes your code run parallel in single and multiple machines
- It integrates multiple frameworks and libraries.
- It works with all the major cloud computing platform
When to use Ray?
You should use Ray:
- If you want to perform distributed reinforcement learning
- If you want to perform distributed hyperparameter tuning
- If you want to use distributed data loading and compute across different machines.
- If you want to serve your application.
Cloud platforms for distributed training
So far, we have discussed the frameworks and the libraries that can be used to enable distributed training. Now, let’s discuss and explore the cloud platforms where you can get into the hardware that will allow you to efficiently train your DL models. But before that, let’s lay out some criteria that will allow you to choose the best cloud platforms as per your requirements.
- Hardware and Software Support: It is important to learn and understand what hardware these platforms offer like GPUs, TPUs, storage units et cetera. Apart from that, one should also see the API that they offer so that (depending on your project) you can get access to hosting facilities, containers, tools for data analytics and so forth.
- Availability Zones: Availability zones are an important factor in cloud computing, it gives users the flexibility to set up and deploy their project anywhere in the world. Users can also shift their projects whenever they want to.
- Pricing: Whether the platform charges you based on your usage or do they offer a subscription-based model.
Now, let’s discuss cloud computing options. We will discuss the two extremely feasible ready-to-use experimental notebook platforms and the three most popular cloud computing services.
1. Google Colab
Google Colab is one most reliable and easy to use platforms for small scale to medium-scale projects. One good thing about Google Colab is that you can easily connect to Google Cloud with ease and you can work with any python library mentioned above. It offers three models:
- Google Colab is free of cost and it gives you access to GPUs and TPUs. But you will get access to limited storage and memory. Once any of them exceeds, the program stops.
- Google Colab Pro is a subscription version of Google Colab where you have extra memory and storage. You can fairly run a heavy model but again is it limited.
- Google Colab Pro + is the new service which is a subscription based model and which is also expensive. It offers faster GPUs and TPUs plus extra memory so that you can run fairly larger models on fairly large datasets.
Given below is the official comparison of all the three.
AWS SageMaker is one of the most popular and the oldest cloud computing platforms for distributed training. It is very well integrated with Apache MXNet, Pytorch, and TensorFlow and allows you to deploy deep learning algorithms with ease and less code modification. SageMaker API has 18+ machine learning algorithms, some of which are rewritten from scratch to make the whole process scalable and easy. These built-in algorithms are optimized to get the most out of the hardware.
SageMaker also has an integrated Jupyter Notebook that allows Data-scientist and machine learning engineers to build and develop pipeline algorithms on the go and allows you to directly deploy them in a hosted environment. You can configure hardware and environments based on your requirements and preferences from SageMaker Studio or SageMaker console. All the hosting and development are billed according to the usage per minute.
AWS SageMaker offers both data-parallelism as well as model-parallelism distributed training. In fact, SageMaker also offers a hybrid training strategy where you can use both model and data parallelism.
Google Cloud Computing was developed by Google in 2010 to strengthen their own platforms like Google search engine and Youtube. Gradually, they started open-sourcing it to the public. Google Cloud Computing offers the same infrastructure that all Google’s platforms use.
Google cloud computing offers in-built support for libraries like TensorFlow, Pytorch, Scikit-Learn, and many more. Furthermore, apart from configuring GPUs in your workflow, you can add TPUs as well to make the training process go much faster. Like I mentioned before you can connect your Google Colab to Google Cloud Platform and access all the features that it provides.
Some of the features that it provides are:
- Compute (Virtual Hardwares like GPUs and TPUs)
- Storage Bucket,
- Management tools
- API platform
- Hosting Services
It is worth noting that GCP has less availability zones but it is also less expensive compared to AWS.
Microsoft Azure is another very popular cloud computing platform. One of the popular language models GPT-3 from OpenAI was trained in Azure. It also offers both data parallelism and model parallelism methods and supports both TensorFlow and Pytorch. In fact, if you want to optimize computing speed then you can also leverage Horovod from Uber.
Azure machine learning service is for both coders and non-coders. It simply offers a drag and drop approach that can optimize your workflow. It also reduces manual work with automated machine learning that can help you to develop smarter working prototypes.
The Azure Python SDK also allows you to interact in any Python environment like Jupyter Notebooks, Visual Studio Code, and many more. It is quite similar to both AWS and GCP in terms of offering services. These are the services that Azure offers:
- AI, Machine Learning and Deep learning
- Computing powers (GPUs)
- Developer Tools
- Internet of Things
- Mixed Reality
- Networking et cetera
Let’s also compare the main 3 tools side-by-side to give you a better perspective about making the choice.
Comparison table for cloud platform
In this article, we saw different Libraries and tools that can help you implement distributed training for your own deep learning application. Bear in mind that all the libraries are good and very effective in what they do, eventually, it all boils down to your preferences and requirements.
You must have noticed that all the frameworks discussed have primarily Pytorch and TensorFlow integration in some way or another. This trait can easily help you isolate the framework of choice. Once your framework is decided you can then look at the advantages to decide which distributed training tool works the best for you.
I hope you enjoyed this article. If you wanna try out all the frameworks we discussed then follow the tutorial link.
Thanks for reading!