PyTorch is one of the most widely used deep learning libraries, right after Keras. It provides agility, speed, and good community support for anyone using deep learning methods in development and research.
PyTorch has certain advantages over TensorFlow. As an AI engineer, the two key features I like most are:
- PyTorch has dynamic graphs (TensorFlow's graph is static), which makes iteration faster and gives the code a Pythonic feel.
- PyTorch is easy to learn, whereas TensorFlow is a bit difficult, mostly because of its graph structure.
The only problem I had with PyTorch is that it lacks structure once models are scaled up. A model gets complicated as more and more functions are introduced into an algorithm, making it difficult to keep track of the details. Something like Keras, providing a high-level interface with a simple call function, would be beneficial.
Today, the PyTorch community is quite big, and different groups have created high-level libraries addressing exactly this issue. In this article, we'll explore two of them: PyTorch Lightning and PyTorch Ignite, which offer flexibility and structure for your deep learning code.
Comparison: PyTorch Lightning vs PyTorch Ignite

| | Lightning | Ignite |
| --- | --- | --- |
| Learning curve | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ |
| Interface | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| Reproducibility | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| Framework integrations | TensorBoard, Neptune, MLflow, Wandb, Comet | TensorBoard, Neptune, MLflow, Wandb, Polyaxon |
| Documentation | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| Hardware support | CPU, GPU, TPU | CPU, GPU, TPU |
| Distributed training | Yes (easy) | Yes (manual setup required) |
| Production | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| Metrics | Functional metrics and Module Metrics Interface | Out-of-the-box metrics (accuracy, precision, recall, confusion matrix, etc.) plus custom metrics via metric arithmetic |
What is PyTorch Lightning?
Lightning is a high-level Python framework built on top of PyTorch. It was created by William Falcon while he was doing his PhD, and it was designed for researchers: specifically, for trying new deep learning models that involve scaling research, multi-GPU training, 16-bit precision, and TPUs.
Why Lightning?
Lightning was created to scale and speed up the research process, by eliminating low-level code while keeping the code readable, logical, and easy to execute.
Lightning imposes structure on PyTorch code: functions are arranged in a way that prevents the errors that typically creep in during model training as the model scales up.
Key features
PyTorch Lightning comes with a lot of features that provide value for professionals as well as newcomers to the field of research.
- Train models on any hardware: CPU, GPU or TPU, without changing the source code
- 16-bit precision support: train models faster by cutting memory usage in half
- Readability: removes unwanted or boilerplate code so you can focus on the research aspect of the code
- Interface: concise, neat and easy to navigate
- Easier reproducibility
- Extensible: you can swap in different mathematical functions (optimizers, activation functions, loss functions, and more)
- Reusability
- Integration with visualization frameworks like Neptune.ai, TensorBoard, MLflow, Comet.ml, and Wandb
Benefits of using PyTorch Lightning
Lightning API
The Lightning API offers the same functionality as raw PyTorch, only much more structured. When defining the model you barely have to change any code: all you need to do is inherit from LightningModule instead of nn.Module. The LightningModule takes care of all the important aspects of modeling the deep learning network, like:
- Defining the architecture of the model (init)
- Defining the training, validation, and testing loop (training_step, validation_step, and test_step respectively)
- Defining the optimizer (configure_optimizers)
- Defining the data pipeline: Lightning also has its own LightningDataModule, where you create your training, validation, and testing datasets and then pass the module to the trainer (a minimal sketch follows below).
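As a taste of the data side, here is a minimal sketch of such a DataModule for MNIST; the split sizes, batch size, and data path are our own choices:

import pytorch_lightning as pl
from torch.utils.data import DataLoader, random_split
from torchvision import transforms
from torchvision.datasets import MNIST

class MNISTDataModule(pl.LightningDataModule):
    def setup(self, stage=None):
        # download once, then split into train/val
        full = MNIST(".", train=True, download=True, transform=transforms.ToTensor())
        self.train_ds, self.val_ds = random_split(full, [55000, 5000])
        self.test_ds = MNIST(".", train=False, download=True, transform=transforms.ToTensor())

    def train_dataloader(self):
        return DataLoader(self.train_ds, batch_size=32)

    def val_dataloader(self):
        return DataLoader(self.val_ds, batch_size=32)

    def test_dataloader(self):
        return DataLoader(self.test_ds, batch_size=32)

# later: trainer.fit(model, datamodule=MNISTDataModule())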
Let’s see what defining a model in Lightning looks like.
import torch
from torch.nn import functional as F
import pytorch_lightning as pl

class MNISTModel(pl.LightningModule):
    def __init__(self):
        super(MNISTModel, self).__init__()
        self.l1 = torch.nn.Linear(28 * 28, 10)

    def forward(self, x):
        return torch.relu(self.l1(x.view(x.size(0), -1)))

    def training_step(self, batch, batch_nb):
        x, y = batch
        loss = F.cross_entropy(self(x), y)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=0.02)
As you can see, the LightningModule is simple and similar to Pytorch. It takes care of all the important methods that need to be defined, like:
- __init__ takes care of the model and associated weights
- forward: same as PyTorch's forward; it connects the different components of the architecture and performs a forward pass
- training_step: defines the training loop and its functionality
- configure_optimizers: defines the optimizers
There are other methods, too:
- validation_step
- validation_end
- test_step
- test_end
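For instance, a validation_step mirrors training_step on the validation set. A minimal sketch, extending the MNISTModel above (the 'val_loss' key is our own choice):

def validation_step(self, batch, batch_idx):
    x, y = batch
    loss = F.cross_entropy(self(x), y)
    self.log('val_loss', loss)  # logged automatically by Lightning
    return loss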
The trainer method takes care of configuring the training criteria (number of epochs; training hardware: CPU, GPU, or TPU; number of GPUs; and so on). The main job of the trainer is to decouple the engineering code from the research code.
In the end, all you need to do is call the .fit method from the trainer instance, pass the defined model and data loader, and execute it.
import os
from torch.utils.data import DataLoader
from torchvision import transforms
from torchvision.datasets import MNIST

# Init our model
mnist_model = MNISTModel()

# Init DataLoader from the MNIST dataset
train_ds = MNIST(os.getcwd(), train=True, download=True, transform=transforms.ToTensor())
train_loader = DataLoader(train_ds, batch_size=32)

# Initialize a trainer
trainer = pl.Trainer(gpus=1, max_epochs=3, progress_bar_refresh_rate=20)

# Train the model ⚡
trainer.fit(mnist_model, train_loader)
Metrics
The purpose of metrics is to allow the user to monitor and measure the training process against some mathematical standard: accuracy, AUC, RMSE, and so on. A metric is different from the loss function: while the loss function measures the difference between predictions and the ground truth and drives the weight updates, metrics are used to monitor how the model performs on the training and validation sets. This gives useful insight into the model's performance.
Lightning comes with two types of metrics:
- Functional metrics
- Module Metrics Interface
Functional metrics
Functional metrics allow you to create your own metric as a function, as per your requirements. Lightning provides a tensor_metric decorator, which takes care of converting all inputs and outputs to tensors and synchronizes the metric's output across all DDP nodes (if DDP was initialized).
import torch
from pytorch_lightning.metrics import tensor_metric

@tensor_metric()
def rmse(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    return torch.sqrt(torch.mean(torch.pow(pred - target, 2.0)))
Lightning also provides:
- numpy_metric: a wrapper for metric functions implemented with numpy
- tensor_collection_metric: a wrapper for metrics whose outputs cannot be completely converted to torch.Tensors
Module Metrics Interface
The Module Metrics Interface provides a modular, class-based interface for metrics. It takes care of tensor conversion, along with handling the DDP sync and i/o conversions.
import torch
from pytorch_lightning.metrics import TensorMetric

class RMSE(TensorMetric):
    def forward(self, x, y):
        return torch.sqrt(torch.mean(torch.pow(x - y, 2.0)))
Another way to use the Module Metrics Interface is to write the metric as a plain PyTorch function, derive a class from the Lightning base class, and call the function inside forward:
import torch
from pytorch_lightning.metrics import TensorMetric

def rmse(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    return torch.sqrt(torch.mean(torch.pow(pred - target, 2.0)))

class RMSE(TensorMetric):
    def forward(self, pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        return rmse(pred, target)
Hooks
Lightning also has handlers known as hooks. Hooks help users interact with the trainer during training; basically, they let users take a certain action at a given point in the training loop.
For example:
on_epoch_start: this is called in the training loop at the very beginning of the epoch:
def on_epoch_start(self):
    # do something when the epoch starts
In order to enable a hook, make sure that you override the method in your LightningModule and define the operation or task that needs to be done; the trainer will then call it at the correct time.
on_epoch_end: this is called in the training loop at the very end of the epoch:
def on_epoch_end(self):
    # do something when the epoch ends
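Putting the two together, here is a small sketch that overrides both hooks to time each epoch; it subclasses the MNISTModel defined earlier, and the timing logic is just an illustration:

import time

class TimedMNISTModel(MNISTModel):
    def on_epoch_start(self):
        # record when the epoch begins
        self.epoch_start_time = time.time()

    def on_epoch_end(self):
        # report how long the epoch took
        print(f"Epoch took {time.time() - self.epoch_start_time:.1f} seconds")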
To know more about hooks check out this link.
Distributed training
Lightning provides multi-GPU training and 5 distributed backends:
- DataParallel ‘dp’
- DistributedDataParallel ‘ddp’
- DistributedDataParallel-2 ‘ddp2’
- DistributedDataParallel Sharded ‘ddp_sharded’
- DeepSpeed ‘deepspeed’
These settings can be configured from the trainer instance before calling the .fit method.
# default (when using a single GPU or no GPUs)
trainer = Trainer(distributed_backend=None)

# change to data parallel (gpus > 1)
trainer = Trainer(distributed_backend='dp')

# change to distributed data parallel (gpus > 1)
trainer = Trainer(distributed_backend='ddp')

# sharded DDP is enabled through the plugins parameter
trainer = Trainer(gpus=4, plugins='ddp_sharded')
Notice that in order to use the sharded distribution, you need to enable it through the plugins parameter.
Check out this article to get a more in-depth understanding of sharded distributions.
The same goes for DeepSpeed:
trainer = Trainer(gpus=4, plugins='deepspeed', precision=16)
To learn more about DeepSpeed, check out this article.
Reproducibility
With that, reproducibility becomes very easy. To reproproduce the same results run after run all you need to do is to set the seed value of the pseudo-random generator and make sure that the deterministic parameter in the trainer is True.
from pytorch_lightning import Trainer, seed_everything

seed_everything(23)
model = Model()
trainer = Trainer(deterministic=True)
With the above configuration you can now scale up the model without even worrying about the engineering aspect. Rest assured that everything is taken care of by the LightningModule.
It standardizes the code.
All you need to do is take care of the research aspect, which includes manipulating the mathematical functions, adding a layer of neurons, or even changing the training hardware.
It decouples engineering from research.
Integration with Neptune
Lightning provides seamless integration with Neptune. All you need to do is call the NeptuneLogger module:
from pytorch_lightning.loggers.neptune import NeptuneLogger

neptune_logger = NeptuneLogger(
    api_key="ANONYMOUS",
    project_name="shared/pytorch-lightning-integration",
    close_after_fit=False,
    experiment_name="train-on-MNIST",
    params=ALL_PARAMS,
    tags=['1.x', 'advanced'],
)
Set all the required parameters as shown above, then pass the logger as a parameter to the trainer, and you're all set to monitor your model through the Neptune dashboard.
trainer = pl.Trainer(logger=neptune_logger, checkpoint_callback=model_checkpoint, callbacks=[lr_logger], **Trainer_Params)
Production
Deploying Lightning models to production is also very straightforward: it's as simple as calling .to_torchscript or .to_onnx, and there are three ways in which you can save the model for production (a minimal sketch follows the list):
- Saving the model as a PyTorch checkpoint
- Converting the model to ONNX
- Exporting the model to TorchScript
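Here is that sketch, reusing the MNISTModel and trainer from earlier; the file names are our own:

# 1. Save a regular Lightning/PyTorch checkpoint
trainer.save_checkpoint("mnist_model.ckpt")

# 2. Export to ONNX (an example input is needed to trace the graph)
model = MNISTModel()
input_sample = torch.randn(1, 28 * 28)
model.to_onnx("mnist_model.onnx", input_sample, export_params=True)

# 3. Export to TorchScript
script = model.to_torchscript()
torch.jit.save(script, "mnist_model.pt")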
To get more in-depth knowledge about model deployment and production check out this article.
Community
The Lightning community is growing. There are almost 390 contributors, a core team of 11 research scientists and PhD students, and more than 17k active users. Because the community is growing at a rapid pace, good documentation is really important.
If you run into any problems, you can ask for help on Lightning's Slack or GitHub.
The documentation for Lightning is very neat, readable, and easy to understand. Video explanations are also included.
When to use PyTorch Lightning
- Researching and producing new architectures.
- Looking for distributed and parallel training.
- Looking for CPU, GPU, and TPU training. In Lightning, you can easily change the hardware from the trainer itself.
- Wanting to start from SOTA architectures and tweak their settings for your own use.
When not to use PyTorch Lightning
- If you don't know PyTorch, then learn PyTorch first and then use Lightning. (You can also check out Lightning Flash.)
Code comparison: PyTorch vs Lightning
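For reference, here is roughly what the Lightning MNIST example from earlier looks like in plain PyTorch, with the training loop written by hand; this is a sketch mirroring that example:

import os
import torch
from torch.nn import functional as F
from torch.utils.data import DataLoader
from torchvision import transforms
from torchvision.datasets import MNIST

model = torch.nn.Linear(28 * 28, 10)
optimizer = torch.optim.Adam(model.parameters(), lr=0.02)
train_ds = MNIST(os.getcwd(), train=True, download=True, transform=transforms.ToTensor())
train_loader = DataLoader(train_ds, batch_size=32)

# hand-written training loop: batching, forward, loss, backward, step
for epoch in range(3):
    for x, y in train_loader:
        optimizer.zero_grad()
        out = torch.relu(model(x.view(x.size(0), -1)))
        loss = F.cross_entropy(out, y)
        loss.backward()
        optimizer.step()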
From the two versions, you can see that Lightning provides dedicated functionality for each operation (building the model, loading data, configuring optimizers, and so on), and on top of that it takes care of boilerplate like the training loop. It is much more focused on the research aspect than the engineering aspect.
What is Ignite?
Ignite is another high-level library built on top of PyTorch. It helps with neural network training and, like Lightning, it was created for researchers. It requires less code than pure PyTorch, which adds flexibility and simplicity to the interface.
Why Ignite?
Ignite provides users with an interface that structures the architecture, criterion, and loss into one function for training and (optionally) evaluation. This keeps Ignite grounded in the basics of PyTorch, while still giving the user a high level of abstraction over the engineering details (which can be configured later, before training). This gives users a lot of flexibility.
Key features
Ignite provides three high-level features:
- Engine: this allows users to structure different configurations for training and evaluation;
- Out-of-the-box metrics: allows users to easily evaluate models;
- Built-in handlers: these let users create training pipelines, set up logging, or, in simple words, interact with the Engine.
Benefits of using PyTorch Ignite
Ignite API
Engine: the Engine lets users run a given process function over each batch of a dataset, emitting events as it works through the epochs, which you can hook into for logging and more.
Let’s see how an Engine works:
from ignite.engine import Engine

# model, optimizer, and criterion are assumed to be defined beforehand
def update_model(engine, batch):
    inputs, targets = batch
    optimizer.zero_grad()
    outputs = model(inputs)
    loss = criterion(outputs, targets)
    loss.backward()
    optimizer.step()
    return loss.item()

trainer = Engine(update_model)
As you can see, the fundamentals of training a deep learning model are enclosed in the update_model function and then passed into the Engine. This is only the training function, which incorporates backpropagation; no extra parameters or events are defined yet.
trainer.run(data_loader, max_epochs=5)
To start the training, all you need to do is call the .run method of the trainer and set max_epochs as required.
Events and handlers
Events and handlers help users interact with the Engine during training; basically, they let users monitor the model. We'll see two ways to interact with the model: with the help of a decorator, and with the help of .add_event_handler.
The functions below print the results of the evaluator run on a training dataset using a @trainer.on decorator.
@trainer.on(Events.EPOCH_COMPLETED)
def log_training_results(trainer):
    train_evaluator.run(train_loader)
    metrics = train_evaluator.state.metrics
    accuracy = metrics['accuracy'] * 100
    loss = metrics['nll']
    last_epoch.append(0)
    training_history['accuracy'].append(accuracy)
    training_history['loss'].append(loss)
    print("Training Results - Epoch: {}  Avg accuracy: {:.2f}  Avg loss: {:.2f}"
          .format(trainer.state.epoch, accuracy, loss))
As you can see, the main element inside the function is train_evaluator, which runs an evaluation over the training data and returns the metrics. The same metrics dictionary can be used to read off the accuracy, loss, and so on; all you have to do is add a print or return statement to get the values.
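For context, train_evaluator is itself an Engine. It is commonly created with create_supervised_evaluator; a minimal sketch, assuming model and criterion are already defined:

from ignite.engine import create_supervised_evaluator
from ignite.metrics import Accuracy, Loss

# the 'nll' key matches the metric name used in the handler above
train_evaluator = create_supervised_evaluator(
    model, metrics={'accuracy': Accuracy(), 'nll': Loss(criterion)})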
Another method is to use the .add_event_handler of the trainer.
def log_validation_results(trainer):
    val_evaluator.run(val_loader)
    metrics = val_evaluator.state.metrics
    accuracy = metrics['accuracy'] * 100
    loss = metrics['nll']
    validation_history['accuracy'].append(accuracy)
    validation_history['loss'].append(loss)
    print("Validation Results - Epoch: {}  Avg accuracy: {:.2f}  Avg loss: {:.2f}"
          .format(trainer.state.epoch, accuracy, loss))

trainer.add_event_handler(Events.EPOCH_COMPLETED, log_validation_results)
The above code computes the same metrics on the validation dataset. It works exactly like the previous example; the only difference is that we pass the function to the trainer's .add_event_handler method, rather than using a decorator.
There are quite a number of built-in events that let you interact with the trainer during training, or after training is complete.
For instance, Events.EPOCH_COMPLETED will execute a given function after each epoch is completed, while Events.COMPLETED will execute one after training is finished.
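For example, a small sketch of a handler that fires once, when training ends:

@trainer.on(Events.COMPLETED)
def on_training_complete(trainer):
    print("Training finished after {} iterations".format(trainer.state.iteration))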
Metrics
Ignite provides metrics like accuracy, precision, recall, and confusion matrix out of the box, in order to compute various quantities.
For example, below we compute accuracy on the validation dataset.
from ignite.engine import Engine
from ignite.metrics import Accuracy

def predict_on_batch(engine, batch):
    model.eval()
    with torch.no_grad():
        x, y = prepare_batch(batch, device=device, non_blocking=non_blocking)
        y_pred = model(x)
    return y_pred, y

evaluator = Engine(predict_on_batch)
Accuracy().attach(evaluator, "val_acc")
evaluator.run(val_dataloader)
From the code above, you can see that the user has to attach the metric instance to an engine. The metric value is then computed using the output of the engine’s process_function.
Ignite also lets users create their own metrics out of existing ones by using arithmetic operations, as in the sketch below.
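For example, an F1 score can be assembled from Precision and Recall with plain arithmetic; a minimal sketch, following Ignite's metric arithmetic support:

from ignite.metrics import Precision, Recall

precision = Precision(average=False)
recall = Recall(average=False)
# combine per-class precision and recall into a macro F1 score
F1 = (precision * recall * 2 / (precision + recall)).mean()
F1.attach(evaluator, "f1")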
Distributed training
Distributed training is supported in Ignite, but it has to be configured by the user, which can take a lot of effort. Users need to correctly set up the distributed process group, the distributed sampler, and more. If you're not familiar with this kind of setup, it can be very tedious.
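To give a sense of the setup involved, here is a bare-bones sketch of the pieces a user has to wire up manually, using standard torch.distributed primitives; Net and train_ds are placeholders, and we assume a single-node launch where the process rank equals the local GPU index:

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

# one process per GPU; rank and world size come from the launcher via env vars
dist.init_process_group("nccl", init_method="env://")
rank = dist.get_rank()
torch.cuda.set_device(rank)

model = DistributedDataParallel(Net().cuda(), device_ids=[rank])
sampler = DistributedSampler(train_ds)  # shards the dataset across processes
train_loader = DataLoader(train_ds, batch_size=32, sampler=sampler)
# the Engine is then built on top of these, as in the examples above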
Reproducibility
Things that make Ignite reproducible:
- Ignite can automatically handle random states, which makes sure that batches have the same distribution of data across runs (see the sketch after this list).
- Ignite allows integration not only with Neptune but also with MLflow, Polyaxon, TensorBoard, and more.
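A minimal sketch of a reproducible setup, assuming Ignite 0.4+ where manual_seed and DeterministicEngine are available:

from ignite.engine.deterministic import DeterministicEngine
from ignite.utils import manual_seed

manual_seed(23)  # seeds random, numpy, and torch in one call
trainer = DeterministicEngine(update_model)  # drop-in Engine with reproducible batch order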
Integration with Neptune
The Neptune integration is very easy. All you need to do is pip install the neptune-client library, then call NeptuneLogger from ignite.contrib.handlers.neptune_logger:
from ignite.contrib.handlers.neptune_logger import *

npt_logger = NeptuneLogger(api_token="ANONYMOUS",
                           project_name='shared/pytorch-ignite-integration',
                           name='ignite-mnist-example',
                           params={'train_batch_size': train_batch_size,
                                   'val_batch_size': val_batch_size,
                                   'epochs': epochs,
                                   'lr': lr,
                                   'momentum': momentum})
Conveniently, you can attach any number of event handlers, so that all the data is displayed in the Neptune dashboard, helping you monitor training.
Below are some examples of how you can attach event handlers for Neptune.
npt_logger.attach(trainer,
                  log_handler=OutputHandler(tag="training",
                                            output_transform=lambda loss: {'batchloss': loss},
                                            metric_names='all'),
                  event_name=Events.ITERATION_COMPLETED(every=100))

npt_logger.attach(train_evaluator,
                  log_handler=OutputHandler(tag="training",
                                            metric_names=["loss", "accuracy"],
                                            another_engine=trainer),
                  event_name=Events.EPOCH_COMPLETED)

npt_logger.attach(validation_evaluator,
                  log_handler=OutputHandler(tag="validation",
                                            metric_names=["loss", "accuracy"],
                                            another_engine=trainer),
                  event_name=Events.EPOCH_COMPLETED)
Community
The Ignite community is growing; there are almost 124 contributors and more than 391 active users at the time of writing.
When to use PyTorch Ignite
- You want a high-level library with a great interface, plus the ability to customise the Ignite API to your requirements.
- You want to factorize the code, but don't want to sacrifice flexibility to support complicated training strategies.
- You want a rich set of utilities, like metrics, handlers, and loggers, to evaluate and debug your model with ease, each of which can be configured in isolation.
When not to use PyTorch Ignite
- If you're not familiar with PyTorch.
- If you're not well-versed in distributed training, but just want it to work with ease.
- If you don’t want to spend a lot of time learning a new library.
Code comparison: PyTorch vs Ignite
Pure PyTorch
import torch

# Net and get_data_loaders are assumed to be defined elsewhere
model = Net()
train_loader, val_loader = get_data_loaders(train_batch_size, val_batch_size)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.8)
criterion = torch.nn.NLLLoss()

max_epochs = 10
validate_every = 100
checkpoint_every = 100

def validate(model, val_loader):
    model = model.eval()
    num_correct = 0
    num_examples = 0
    for batch in val_loader:
        input, target = batch
        output = model(input)
        correct = torch.eq(torch.round(output).type(target.type()), target).view(-1)
        num_correct += torch.sum(correct).item()
        num_examples += correct.shape[0]
    return num_correct / num_examples

def checkpoint(model, optimizer, checkpoint_dir):
    filepath = "{}/{}".format(checkpoint_dir, "checkpoint.pt")
    obj = {"model": model.state_dict(), "optimizer": optimizer.state_dict()}
    torch.save(obj, filepath)

iteration = 0
for epoch in range(max_epochs):
    for batch in train_loader:
        model = model.train()
        optimizer.zero_grad()
        input, target = batch
        output = model(input)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()

        if iteration % validate_every == 0:
            binary_accuracy = validate(model, val_loader)
            print("After {} iterations, binary accuracy = {:.2f}"
                  .format(iteration, binary_accuracy))

        if iteration % checkpoint_every == 0:
            checkpoint(model, optimizer, checkpoint_dir)
        iteration += 1
PyTorch-Ignite
from ignite.engine import Events, create_supervised_trainer, create_supervised_evaluator
from ignite.handlers import ModelCheckpoint
from ignite.metrics import Accuracy

model = Net()
train_loader, val_loader = get_data_loaders(train_batch_size, val_batch_size)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.8)
criterion = torch.nn.NLLLoss()

max_epochs = 10
validate_every = 100
checkpoint_every = 100

trainer = create_supervised_trainer(model, optimizer, criterion)
evaluator = create_supervised_evaluator(model, metrics={'accuracy': Accuracy()})

@trainer.on(Events.ITERATION_COMPLETED(every=validate_every))
def validate(trainer):
    evaluator.run(val_loader)
    metrics = evaluator.state.metrics
    print("After {} iterations, binary accuracy = {:.2f}"
          .format(trainer.state.iteration, metrics['accuracy']))

checkpointer = ModelCheckpoint(checkpoint_dir, n_saved=3, create_dir=True)
trainer.add_event_handler(Events.ITERATION_COMPLETED(every=checkpoint_every),
                          checkpointer, {'mymodel': model})

trainer.run(train_loader, max_epochs=max_epochs)
As you can see, Ignite compresses the PyTorch code and makes you more productive on the research side, where you can try different techniques while still keeping track of, and manipulating, the engineering aspect, i.e. the training of the model.
Conclusion
Both Lightning and Ignite are good in their own ways. If you're looking for flexibility, then Ignite is a good choice, because you can use conventional PyTorch to design your architecture, optimizers, and experiment as a whole; Ignite will help you assemble the different components into particular functions.
If you're looking for fast prototyping of a new design, or doing research on state-of-the-art ML methods, then go with Lightning. It will help you focus on the research aspect and scale up the model faster with fewer errors. It also provides TPU support and parallel, distributed training.
I hope you enjoyed this article. If you want to try out some practical examples, follow the notebook links for both Lightning and Ignite (in the comparison table at the beginning of the article).
Thanks for reading!