
Building MLOps Pipeline for Computer Vision: Image Classification Task [Tutorial]

18 min
4th August, 2023

The introduction of Transformers in 2017 by Vaswani and colleagues brought a significant transformation in the research and development of deep learning models for various tasks. The transformer leverages a self-attention mechanism that builds on the attention mechanism introduced by Bahdanau and colleagues. With this mechanism, each input can interact with the other inputs, enabling the model to focus on, or pay attention to, the important features of the data.

Because of this, transformers were able to achieve state-of-the-art results in various NLP tasks like machine translation, summary generation, text generation, et cetera. They have also replaced RNNs and their variants in almost all NLP tasks. As a matter of fact, with their success in NLP, transformers are now being adopted in computer vision tasks as well. In 2020, Dosovitskiy and his team developed vision transformers (ViT), where they argued that reliance on CNNs is not necessary. Based upon this premise, in this article, we will explore and learn how ViT can help in the task of image classification.

This article is a guide aimed at building an MLOps pipeline for a computer vision task using ViT, and it will focus on the following areas with respect to a typical data science project:

  1. Aim of the project
  2. Hardware specification
  3. Attention visualization 
  4. Building the model and experiment tracking
  5. Testing and inference
  6. Creating a Streamlit app for deployment
  7. Setting up CI/CD using GitHub actions
  8. Deployment and monitoring

The code for this article can be found on this Github Link so that you can follow along. Let’s get started. 

MLOps pipeline for image classification: understanding the project

Understanding the requirements of the project or the client is an important step as it can help us brainstorm ideas and research various components that the project might require, such as the latest papers, repositories, relevant work, datasets, and even cloud-based platforms for deployment. This section will focus on 2 topics: 

  • 1 Aim of the project.
  • 2 Hardware for accelerated training.

Aim of the project: bird image classifier 

The aim of the project is to build an image classifier to classify different species of birds. Since this model will be later deployed in the cloud, we must keep in mind that the model must be trained to achieve a good accuracy score on both the training and testing datasets. In order to do that, we should use metrics like precision, recall, the confusion matrix, F1, and AUROC score to see how the model is performing on both datasets. Once the model achieves good scores on the test dataset, we will then create a web app to deploy it on a cloud-based server.

In a nutshell, this is how the project will be executed:

  • 1 Building the deep learning model with Pytorch
  • 2 Testing the model
  • 3 Creating a Streamlit app
  • 4 Creating directories and their respective config files for deployment
  • 5 Finally, deploying it on the Google Cloud Platform

This project will include some of the additional practices that you will find in this article, such as: 

  • Live tracking to monitor metrics,
  • Attention visualization,
  • Directory structure,
  • Code formatting for all the python modules. 

Hardware for accelerated training

We will conduct our experiment with two sets of hardware:

  1. M1 Macbook: The efficiency of Apple’s M1 processors will allow us to quickly develop models and train them on a smaller dataset. Once the training is done, we can start building a web application on our local machine and create a small pipeline of data ingestion, data preprocessing, model prediction, and attention visualization before scaling up the model in the cloud. 

Note: if you have one of these M1 laptops, then make sure to check the installation process in my Github repo.

  2. Kaggle or Google Colab GPUs: Once our code is working properly on our local machine and the pipeline is created, we can scale it up and train the whole model for a longer period in Google Colab or Kaggle, which are free. Once the training is done, we can then download the new weights and metadata to our local computer and test whether the web application performs well on unseen data before deploying it to the cloud.

Now let’s start the implementation. 

MLOps pipeline for image classification: data preparation

The first step of implementing a deep learning project is to plan the different python modules that we are going to have. Although we will be using the Jupyter notebook for experimentation, it is always a good idea to have everything laid out before starting to code. Planning might include reference code repositories as well as research papers. 

It is always a good idea to create the directory structure for the project for efficiency and for ease of navigation.  

ViT Classification
├── notebooks
│   └── ViT.ipynb
└── source

In our case, the main directory is called the ViT Classification, which contains two folders: 

  1. Notebooks: This is where all the experimentation with jupyter notebook will reside.
  2. Source: This is where all the Python modules will reside. 

As we progress, we will keep adding Python modules to the source directory, and we will also create different sub-directories for storing metadata, docker files, files, et cetera. 

Building the image classification model

As mentioned before, research and planning are the key to implementing any machine learning project. What I usually do first is create a config class to store all the parameters with respect to data preprocessing, model training and inference, visualization, et cetera.

class Config:
   #Image configuration
   IMG_SIZE = 32
   PATCH_SIZE = 10
   CROP_SIZE = 100
   DATASET_SAMPLE = 'full'

   #Optimizer configuration
   LR = 0.003
   OPTIMIZER = 'Adam'

   #Model configuration
   NUM_CLASSES = 400
   HIDDEN_SIZE = 768
   LINEAR_DIM = 3072
   NUM_LAYERS = 12

   STD_NORM = 1e-6
   EPS = 1e-6
   MPL_DIM = 128
   OUTPUT = 'softmax'
   LOSS_FN = 'nll_loss'

   #Device configuration
   DEVICE = ["cpu","mps","cuda"]

   #Training configuration
   N_EPOCHS = 1

The above code block gives a vague idea of what the parameters should look like. As we make progress, we can keep adding more parameters. 

Note: In the device configuration section, I have given a list of three device types: CPU, MPS, and CUDA. MPS (Metal Performance Shaders) is the PyTorch backend that lets us train on the GPU of Apple silicon machines such as M1 Macbooks.
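
Rather than hard-coding an index into the DEVICE list, the device could also be picked at runtime. The following is only a minimal sketch and not part of the project code: pick_device is a hypothetical helper, and it reports the CPU when PyTorch is not installed or no accelerator is found.

```python
def pick_device(preferred=("cuda", "mps", "cpu")):
    """Return the first device name in `preferred` that is usable on this machine."""
    try:
        import torch
    except ImportError:
        # Without PyTorch there is nothing to accelerate, so report the CPU.
        return "cpu"

    def mps_available():
        # torch.backends.mps only exists in newer PyTorch releases.
        mps = getattr(torch.backends, "mps", None)
        return mps is not None and mps.is_available()

    checks = {
        "cuda": torch.cuda.is_available,
        "mps": mps_available,
        "cpu": lambda: True,
    }
    for name in preferred:
        if name in checks and checks[name]():
            return name
    return "cpu"


device = pick_device()
print(device)
```

The returned string can be passed wherever the article passes config.DEVICE[1], so the same script would run unchanged on an M1 laptop and on a Kaggle GPU.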


The dataset that we will use is the bird classification dataset, which can be downloaded from Kaggle. The dataset consists of 400 classes of birds with three subsets: training, validation, and testing, containing 58388, 2000, and 2000 images, respectively. Once the data has been downloaded, we can create a function to read and visualize the images.

The image above is a sample from the dataset along with the class that it belongs to | Source

Preparing the data

We can move ahead to create a data loader that transforms the images into image tensors. Along with that, we will also perform resizing, image cropping, and normalizing. Once the preprocessing is done, we can then use PyTorch's DataLoader class to automatically generate batches of data for training. The following pseudo function will give you an idea of what we are trying to achieve; you can find the full code in the link provided in the code heading:

#apply the desired transformations on dataset and split it into train, validation, and test set.

def Dataset(bs, crop_size, sample_size='full'):
      return train_data, valid_data, test_data

The above function has a sample size argument that allows the creation of a sub-set of the training dataset for testing purposes on your local machine. 
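
One way such a sample size argument could work is to draw a reproducible random subset of dataset indices. This is a sketch under assumptions, not the author's implementation: sample_indices and its seed parameter are hypothetical names, and the only number taken from the article is the training set size of 58388.

```python
import random


def sample_indices(n_items, sample_size="full", seed=0):
    """Return the dataset indices to keep: all of them, or a random subset."""
    if sample_size == "full":
        return list(range(n_items))
    if not 0 < sample_size <= n_items:
        raise ValueError("sample_size must be between 1 and n_items")
    rng = random.Random(seed)  # a fixed seed keeps the subset reproducible
    return sorted(rng.sample(range(n_items), sample_size))


subset = sample_indices(58388, sample_size=1000)
print(len(subset))
```

Indices like these could then be handed to torch.utils.data.Subset to wrap the full training dataset before it goes into the DataLoader.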

MLOps pipeline for image classification: building the vision transformer using Pytorch

I have created the full model as per the authors’ description of ViT in their paper. This code is inspired by the jeonsworld repo; I have added a few more details and edited some of the lines of code for the purpose of this task.

The model that I have created is divided into 9 modules, and each module can be executed independently for various tasks. We will explore each section in order for ease of understanding. 


Transformers, like all natural language models, have an important component called the embedding. Its function is usually to capture semantic information by grouping similar information together. Apart from that, embeddings can be learned and reused across models.

In ViT, embeddings serve the same purpose and additionally retain positional information, which can be fed into the encoder. Again, the following pseudo-code will help you understand what’s going on, and you can also find the full code in the link provided in the code heading.

class Embeddings(nn.Module):

#Construct the embeddings from patch, position embeddings.
   def __init__(self, img_size:int, hidden_size:int, in_channels:int):

#create a CONV2D object for creation of embeddings 
   def forward(self, x):

#calculate and return embeddings
       return embeddings

Note that the embedding patches for the image can be created using convolution layers. It is quite efficient and easy to modify as well. 
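
To make the patch arithmetic concrete, here is a dependency-free sketch of what such a convolution does geometrically. It assumes the convolution uses a kernel size and stride equal to the patch size, as in the original ViT; whether the project's Embeddings module does exactly this is an assumption, the helper names patch_grid and extract_patches are hypothetical, and the learned projection to the hidden size is left out.

```python
def patch_grid(img_size, patch_size):
    """Patches per side and in total for a conv with kernel_size == stride == patch_size."""
    if patch_size <= 0 or img_size < patch_size:
        raise ValueError("patch_size must be positive and no larger than img_size")
    per_side = (img_size - patch_size) // patch_size + 1
    return per_side, per_side * per_side


def extract_patches(image, patch_size):
    """Split a square 2D image (list of rows) into flattened, non-overlapping patches."""
    per_side, _ = patch_grid(len(image), patch_size)
    patches = []
    for row in range(per_side):
        for col in range(per_side):
            top, left = row * patch_size, col * patch_size
            patch = [
                image[top + r][left + c]
                for r in range(patch_size)
                for c in range(patch_size)
            ]
            patches.append(patch)
    return patches


image = [[4 * r + c for c in range(4)] for r in range(4)]  # 4x4 toy image
patches = extract_patches(image, patch_size=2)
print(patches[0])          # [0, 1, 4, 5]
print(patch_grid(32, 10))  # (3, 9)
```

Under this assumption, the IMG_SIZE of 32 and PATCH_SIZE of 10 from the config would give a 3x3 grid and leave the last two pixel rows and columns unused, whereas a patch size that divides the image size, such as 8, would cover the whole image with 16 patches.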


The encoder is made up of a number of attention blocks, each of which has two important modules:

  • 1 Self Attention Mechanism
  • 2 Multi-layer perceptron (MLP)

Self attention mechanism

Let’s start with the self-attention mechanism. 

The self-attention mechanism is the core of the whole system. It enables the model to focus on the important features of the data. It does so by relating the different positions of a single sequence of embeddings to one another in order to compute a new representation of that same sequence. You can find the link to the entire code below to get a deeper picture.

#Calculate the attention and return the attention output along with the weights

class Attention(nn.Module):
       return attention_output, weights

The output of the attention block will yield the attention output as well as the attention weights. The latter will be used to visualize the region of interest (ROI) that is calculated using the attention mechanism.
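
The core computation can be sketched without any framework. The following is a single-head, scaled dot-product attention written with plain Python lists; it leaves out the learned query, key and value projections, the multiple heads and the dropout that a full Attention module would contain, so treat it as an illustration of the idea rather than the project's code.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Scaled dot-product attention; q, k, v are lists of equally sized vectors."""
    dim = len(q[0])
    # Similarity of every query with every key, scaled by sqrt(dim).
    scores = [
        [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(dim) for kj in k]
        for qi in q
    ]
    weights = [softmax(row) for row in scores]
    # Each output vector is the attention-weighted average of the value vectors.
    output = [
        [sum(w * vj[c] for w, vj in zip(row, v)) for c in range(len(v[0]))]
        for row in weights
    ]
    return output, weights


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three toy token embeddings
out, weights = attention(tokens, tokens, tokens)
print(weights[0])
```

Each row of weights sums to one and says how strongly one token attends to every other token, which is exactly the quantity that the attention maps later in the article visualize.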

Multilayer perceptron

Once we receive the attention output, we can then feed it into the MLP, a small feed-forward network that transforms each token’s representation before it is passed on to the next block. You can get an idea of the entire process in the forward function. To see the full code click the link provided in the code heading below.

#Apply a linear transformation to the incoming attention output using the GELU activation function.

class Mlp(nn.Module):
   def __init__(self, hidden_size, linear_dim, dropout_rate, std_norm):
       return x

It is worth noting that we are using the GELU as our activation function. 

activation function
GELU as activation function | Source

One of the pros of using GELU is that it is smooth and keeps a small non-zero gradient for moderately negative inputs, which can help mitigate vanishing gradients and makes the model easier to scale.
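
For reference, the exact form of GELU is x multiplied by the standard normal CDF of x. The short sketch below evaluates the function and its derivative with the standard library only and compares the derivative with that of ReLU at a negative input; the function names are chosen for illustration.

```python
import math


def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_grad(x):
    """Derivative of GELU: Phi(x) + x * phi(x), with phi the standard normal PDF."""
    cdf = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
    return cdf + x * pdf


def relu_grad(x):
    """Derivative of ReLU, for comparison."""
    return 1.0 if x > 0 else 0.0


for x in (-1.0, 0.0, 1.0):
    print(x, round(gelu(x), 4), round(gelu_grad(x), 4), relu_grad(x))
```

At x = -1 the ReLU derivative is exactly zero, while the GELU derivative is small but non-zero, so a unit with a negative pre-activation still receives some gradient signal.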


The attention block is the module where we assemble both the modules: the self-attention module and the MLP modules.

#Returns the calculated sum of attention scores via MLP along with attention weights.

class Block(nn.Module):
       return x, weights

This module will also yield the attention weights directly from the attention mechanism along with the output of the MLP.

Now let’s briefly understand the encoder. The Encoder essentially enables us to create multiple attention blocks that give the transformer more control over the attention mechanism. The three components: Encoder, Transformer, and ViT are written in the same module i.e.,

#Creates multiple layers of attention blocks and returns encoded state and attention weights. 

class Encoder(nn.Module):
       return encoded, attn_weights


After assembling the attention block, we can then code our transformer. The transformer is an assembly of the embedding module and the encoder module.

class Transformer(nn.Module):
   def __init__(self, img_size, hidden_size, in_channels, num_layers,
                num_attention_heads, linear_dim, dropout_rate, attention_dropout_rate,
                eps, std_norm):
       super(Transformer, self).__init__()
       self.embeddings = Embeddings(img_size, hidden_size, in_channels)
       self.encoder = Encoder(num_layers, hidden_size, num_attention_heads,
                              linear_dim, dropout_rate, attention_dropout_rate,
                              eps, std_norm)

   def forward(self, input_ids):
       embedding_output = self.embeddings(input_ids)
       encoded, attn_weights = self.encoder(embedding_output)
       return encoded, attn_weights

Vision transformer

Finally, we can code our vision transformer, which involves two components: the transformer and the final linear layer. The final linear layer produces the logits from which we obtain the probability distribution over all the classes. It can be described as:

from torch.nn import CrossEntropyLoss, Linear

class VisionTransformer(nn.Module):
   def __init__(self, img_size, num_classes, hidden_size, in_channels, num_layers,
                num_attention_heads, linear_dim, dropout_rate, attention_dropout_rate,
                eps, std_norm):
       super(VisionTransformer, self).__init__()
       self.classifier = 'token'
       self.num_classes = num_classes

       self.transformer=Transformer(img_size, hidden_size, in_channels,
                                    num_layers, num_attention_heads, linear_dim,
                                    dropout_rate, attention_dropout_rate, eps,
                                    std_norm)
       self.head = Linear(hidden_size, num_classes)

   def forward(self, x, labels=None):
       x, attn_weights = self.transformer(x)
       # classify from the first (class) token of the encoded sequence
       logits = self.head(x[:, 0])

       if labels is not None:
           # training: return the loss
           loss_fct = CrossEntropyLoss()
           loss = loss_fct(logits.view(-1, self.num_classes), labels.view(-1))
           return loss
       else:
           # inference: return the logits and the attention weights
           return logits, attn_weights

Please notice that the network is going to consistently yield attention weights which will be useful for visualizing the attention maps. 

Here is a bonus tip. If you want to see the architecture of the model and how the inputs are being operated then use the following line of code. The code will generate a full operational architecture for you. 

from torchviz import make_dot
x = torch.randn(1,config.IN_CHANNELS*config.IMG_SIZE*config.IMG_SIZE)
x = x.reshape(1,config.IN_CHANNELS,config.IMG_SIZE,config.IMG_SIZE)
logits, attn_weights = model(x)
make_dot(logits, params=dict(list(model.named_parameters()))).render("../metadata/VIT", format="png")

You can find the image in the given link

In a nutshell, this is what the architecture looks like.

vision transformer
The architecture of vision transformer | Source

MLOps pipeline for image classification: training vision transformer using Pytorch

The training module is where we will assemble all the other modules like the config module, preprocessing module, and Transformer, and log the parameters, including the metadata, to Neptune through its API. One of the easiest ways to log parameters is to use Config.__dict__, which exposes the attributes of the class as a dictionary-like mapping.

You can later create a function that removes unnecessary attributes from the dictionary. 

def neptune_monitoring(config_cls):
   # collect the attributes of the config class, skipping Python's built-in entries
   PARAMS = {}
   for key, val in config_cls.__dict__.items():
       if key not in ['__module__', '__dict__', '__weakref__', '__doc__']:
           PARAMS[key] = val
   return PARAMS
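
A slightly more general variant of the function above filters every double-underscore entry instead of a fixed list of names, which also covers any further built-in entries that a given Python version may add to the class namespace. DemoConfig and class_params are hypothetical names used only for this sketch.

```python
class DemoConfig:
    """Hypothetical stand-in for the article's Config class."""
    IMG_SIZE = 32
    LR = 0.003
    DEVICE = ["cpu", "mps", "cuda"]


def class_params(config_cls):
    """Return the user-defined class attributes, skipping dunder entries."""
    return {
        key: val
        for key, val in vars(config_cls).items()
        if not (key.startswith("__") and key.endswith("__"))
    }


params = class_params(DemoConfig)
print(params)
```

The result is a plain dict, which is the form the parameters need to have before they are assigned to run['parameters'].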


The training function is quite straightforward and simple to write. I have included both training and evaluation in the pseudo-code. You can find the full training block here, or you can click the code heading below.

def train_Engine(n_epochs, train_data, val_data, model, optimizer, loss_fn, device,
                monitoring):
   #Initiates the training procedure while tracking accuracy and loss over each iteration.
   ...

Now that our training loop is complete, we can start the training and log the metadata into the dashboard, which we can use for monitoring the training on the go, saving charts and parameters, and sharing them with teammates.

if __name__ == '__main__':
   from preprocessing import Dataset
   from config import Config
   config = Config()
   params = neptune_monitoring(Config)

   run = neptune.init_run(project="nielspace/ViT-bird-classification",
   run['parameters'] = params

   model = VisionTransformer(img_size=config.IMG_SIZE,

   train_data, val_data, test_data = Dataset(config.BATCH_SIZE, config.IMG_SIZE,

   optimizer = optim.Adam(model.parameters(), lr=0.003)
   train_Engine(n_epochs=config.N_EPOCHS, train_data=train_data, val_data=val_data,
               model=model,optimizer=optimizer, loss_fn='nll_loss',
               device=config.DEVICE[1], monitoring=True)

Note: The prototyping of this model was done on a Macbook Air M1 with a smaller dataset of 10 classes. The prototyping stage is where I tried different configurations and played with the architecture of the model. Once I was satisfied, I used Kaggle to train the model. Since the dataset has 400 classes, the model needed to be larger and trained for a longer period of time.

Experiment tracking

In the prototyping stage, experiment tracking becomes a very handy and reliable source to make further changes to your model. You can keep an eye on your model’s performance during training and subsequently make necessary tweaks to it until you get a high-performing model.

The Neptune API enables you to:

If you want to log your metadata in the system, then import the Neptune API and call the init function. Following that, enter the API key provided for the project, and you are good to go. Get to know more about how to get started with Neptune here. Also, here is the Neptune dashboard, which has the metadata related to this project.

run = neptune.init_run(project="nielspace/ViT-bird-classification",

Once you are done with the initialization, you can start logging. For instance, if you want to:

  1. Upload the parameters, use: run['parameters'] = params.
    Note: make sure that params is a dictionary.
  2. Upload metrics, use: run['Training_loss'].log(loss.item()), and log the validation loss in the same way under its own key.
  3. Upload model weights, use: run["model_checkpoints/ViT"].upload("")
  4. Upload images, use: run["val/conf_matrix"].upload("confusion_matrix.png")

Depending upon what you are optimizing your model for, there are plenty of things that you can log and track. In our case, we put an emphasis on training and validation loss and accuracy.

Logging metadata and dashboard

In the ongoing training process, you can then monitor the model’s performance. With each iteration, the graph will update. 

Along with the model’s performance, you will also find CPU and GPU performance. See the image below.

You can also find all the model metadata.

model metadata
The model metadata

Scaling using Kaggle

Now, let’s scale the model. We will use Kaggle for this project because it is free and also because the dataset was downloaded from Kaggle so it will be easy to scale and train the model on the platform itself. 

  1. The first thing we need to do is upload the model, change the directory paths to Kaggle-specific paths, and enable the GPUs.
  2. Note that the model must be complex enough to capture the relevant information for prediction. You can start scaling the model by gradually increasing the number of hidden layers and seeing how the model behaves. You may not want to touch other parameters like the number of attention heads and the hidden size, because mismatched values may throw up arithmetic errors (for example, the hidden size must be divisible by the number of attention heads).
  3. For each change you make, run the model for at least two epochs in small data batches with all 400 classes and observe whether the accuracy is increasing. Typically, it will increase.
  4. Once satisfied, run the model for 10 to 15 epochs, which would take around 5 hours for the subset of 30000 samples.
  5. After the training, check its performance on the test dataset, and if it performs well, then download the model weights. At this point, the size of the model should be around 650 MB for 400 classes.

Attention visualization

As mentioned before, self-attention is the crux of the whole Vision Transformer architecture, and interestingly there is a way to visualize it as well. The source code of the attention map can be found here. I have modified it a bit and created it as a separate independent module that can use the output of the transformer to yield the attention maps. The idea here is to store the input image and its corresponding attention map image and display it in the file. (Source)

def attention_viz(model, test_data, img_path=PATH, device='mps'):

#Visualizes the attention mask of a given input (image) by comparing it with the original image. 

We can run this code by simply calling the attention_viz function and passing the corresponding arguments. 

if __name__ == '__main__':
   train_data, val_data, test_data = Dataset(config.BATCH_SIZE,config.IMG_SIZE, config.DATASET_SAMPLE)
   model = torch.load('metadata/models/model.pth', map_location=torch.device('cpu'))
   attention_viz(model, test_data, PATH)
The image above is an example of attention visualization. The image on the left is the original image, whereas the image on the right is overlaid with the attention map. The region around the face of the bird is quite bright, as that area contains the features to which the model is paying attention.
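
For intuition about how the attention weights of several layers can be merged into one map, here is a sketch of attention rollout, one common technique for this. Whether attention_viz combines the layers in exactly this way is an assumption; the sketch expects one head-averaged attention matrix per layer and uses plain Python lists.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def attention_rollout(layers):
    """Combine per-layer attention matrices (tokens x tokens), first layer first."""
    n = len(layers[0])
    joint = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for attn in layers:
        # Add the identity to account for the residual connection, then renormalise rows.
        aug = [[attn[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
        aug = [[v / sum(row) for v in row] for row in aug]
        joint = matmul(aug, joint)
    return joint


uniform = [[0.5, 0.5], [0.5, 0.5]]  # toy attention over two tokens
rolled = attention_rollout([uniform, uniform])
print(rolled)
```

In a ViT, the row of the class token in the rolled-out matrix, without its first entry, can be reshaped to the patch grid and upsampled to the image size to obtain an overlay like the one shown above.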

Testing and inference

We can also use the attention_viz function in the test module, where we will test the model on the test data and measure the model’s performance on various metrics like confusion matrix, accuracy score, f1 score, recall score, and precision score.

def test(model, test_data):
   return logits_, ground, confusion_matrix

#Evaluates the model’s performance on the test dataset and returns the confusion matrix, logits and ground truth for further performance evaluation. 

We can easily generate a confusion matrix, visualize it using the heatmap from seaborn, and save it in the results folder, which we can also use to display it on the file.

confusion matrix
Above is the image of a confusion matrix of shape 100x100 for a model trained for 50 epochs. As you can see, the model predicts most samples correctly, which shows up as the white diagonal. But there are a few off-diagonal entries across the graph, which means that the model still makes some wrong predictions.
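
To show how such a matrix is built and read, here is a small dependency-free sketch with three classes and made-up labels. In the project itself the values would come from the test function and scikit-learn; the function names and the toy data below are purely illustrative.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for truth, pred in zip(y_true, y_pred):
        matrix[truth][pred] += 1
    return matrix


def per_class_precision_recall(matrix):
    """Return a (precision, recall) pair for every class."""
    n = len(matrix)
    stats = []
    for c in range(n):
        true_pos = matrix[c][c]
        predicted = sum(matrix[r][c] for r in range(n))  # column sum
        actual = sum(matrix[c])                          # row sum
        precision = true_pos / predicted if predicted else 0.0
        recall = true_pos / actual if actual else 0.0
        stats.append((precision, recall))
    return stats


y_true = [0, 0, 1, 1, 2, 2]  # toy ground-truth labels
y_pred = [0, 1, 1, 1, 2, 0]  # toy predictions
cm = confusion_matrix(y_true, y_pred, num_classes=3)
stats = per_class_precision_recall(cm)
accuracy = sum(cm[c][c] for c in range(3)) / len(y_true)
print(cm)  # [[1, 1, 0], [0, 2, 0], [1, 0, 1]]
```

The diagonal holds the correct predictions, so the accuracy is the diagonal sum divided by the number of samples, while every off-diagonal entry counts as a false negative for its row class and a false positive for its column class.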

We can also generate the accuracy and loss graph and store it in the results folder as well. Next, we can use Sklearn to find other metrics, but before that, we must convert the tensors into NumPy arrays.

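from sklearn.metrics import accuracy_score, classification_report, cohen_kappa_score

# collect the predicted class and the ground-truth label of every sample
# (assumes each element of logits_ holds the logits of a single image)
probs = torch.zeros(len(logits_))
y_ = torch.zeros(len(ground))
idx = 0
for l, o in zip(logits_, ground):
   _, l = torch.max(l, dim=1)
   probs[idx] = l
   y_[idx] = o.item()
   idx += 1

# convert the tensors into NumPy arrays of integer class ids for scikit-learn
prob = probs.numpy().astype(int)
y_ = y_.numpy().astype(int)

print(accuracy_score(y_, prob))
print(cohen_kappa_score(y_, prob))
print(classification_report(y_, prob))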

Once we are satisfied with the model’s performance, we can then do inference by simultaneously creating a Streamlit app. 

MLOps pipeline for image classification: creating the app using Streamlit

The Streamlit app will be a web app that we will deploy on the cloud. In order to build the app, we must first pip install streamlit and then import the library in a new module.

This module contains the same code as the inference module: we just need to copy and paste the evaluation function as it is and then build the app around it using the Streamlit library. Below is the code of the app.

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

from PIL import Image
import torch
from torchvision import transforms
import streamlit as st

# the model classes must be importable so that torch.load can rebuild the saved model
from embeddings import Embeddings
from attention_block import Block
from linear import Mlp
from attention import Attention
from transformer import VisionTransformer, Transformer, Encoder

from config import Config
config = Config()

st.set_option('deprecation.showfileUploaderEncoding', False)
st.title("Bird Image Classifier")

# enable users to upload images for the model to make predictions
file_up = st.file_uploader("Upload an image", type = "jpg")

def predict(image):
   """Return top 5 predictions ranked by highest probability.
   :param image: uploaded image
   :type image: jpg
   :rtype: list
   :return: top 5 predictions ranked by highest probability
   """
   model = torch.load('model.pth')
   model.eval()  # switch off dropout for inference

   # transform the input image through resizing, normalization
   # (the Resize target is an assumption; it must match the size used in training)
   transform = transforms.Compose([
       transforms.Resize((config.IMG_SIZE, config.IMG_SIZE)),
       transforms.ToTensor(),
       transforms.Normalize(
           mean = [0.485, 0.456, 0.406],
           std = [0.229, 0.224, 0.225])])

   # load the image, pre-process it, and make predictions
   img = Image.open(image)
   x = transform(img)
   x = torch.unsqueeze(x, 0)
   logits, attn_w = model(x)

   # one class name per line
   with open('../metadata/classes.txt', 'r') as f:
       classes = f.read().split('\n')

   # return the top 5 predictions ranked by highest probabilities
   prob = torch.nn.functional.softmax(logits, dim = 1)[0] * 100
   _, indices = torch.sort(logits, descending = True)
   return [(classes[idx], prob[idx].item()) for idx in indices[0][:5]]

if file_up is not None:
   # display image that user uploaded
   image = Image.open(file_up)
   st.image(image, caption = 'Uploaded Image.', use_column_width = True)
   labels = predict(file_up)

   # print out the top 5 prediction labels with scores
   for i in labels:
       st.write(f"Prediction {i[0]} score {i[1]:.2f}")
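
The ranking step at the end of predict can be tried in isolation. The sketch below reproduces the softmax and the top-k selection with the standard library only; top_k, the class names and the logits are made up for illustration.

```python
import math


def top_k(logits, classes, k=5):
    """Return the k (class name, probability in percent) pairs with the largest logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]  # subtract the max for numerical stability
    total = sum(exps)
    probs = [100.0 * e / total for e in exps]
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return [(classes[i], probs[i]) for i in ranked[:k]]


classes = ["robin", "sparrow", "eagle"]  # hypothetical class names
result = top_k([2.0, 0.5, 1.0], classes, k=2)
for name, score in result:
    print(f"Prediction {name} score {score:.2f}")
```

Because softmax preserves the order of the logits, sorting by logits and sorting by probabilities give the same ranking, which is why the app can sort the raw logits and still report percentages.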

But before we deploy, we must test it locally. In order to test the app, we will run the following command:

streamlit run

Once the above command is executed, you will get the following prompt:

You can now view your Streamlit app in your browser.

  Local URL: http://localhost:8501
  Network URL:

Copy the URL and paste it into your browser, and the app is online (locally). 

Bird image classifier
Copied URL

Upload the image for classification. 

Uploaded image

With the ViT model trained and the app ready, our directory structure should now look something like this:

├── metadata
│   ├── Abbott's_babbler_(Malacocincla_abbotti).jpg
│   ├── classes.txt
│   ├── models
│   │   └── model.pth
│   └── results
│       ├── accuracy_loss.png
│       ├── attn.png
│       └── confusion_matrix.png
├── notebooks
│   ├── ViT.ipynb
│   └──
└── source

Now we proceed toward deploying the app. 

MLOps pipeline for image classification: code formatting

First, let’s format our Python scripts. For that, we will use Black, a Python code formatter. All you need to do is pip install black and then run `black` followed by the name of a Python module or even a whole directory. For this project, I ran black followed by the source directory, which contains all the Python modules.

ViT-Pytorch git:(main) black source
Skipping .ipynb files as Jupyter dependencies are not installed.
You can fix this by running ``pip install black[jupyter]``
reformatted source/
reformatted source/
reformatted source/
reformatted source/
reformatted source/
reformatted source/
reformatted source/
reformatted source/
reformatted source/
reformatted source/
reformatted source/
reformatted source/

The advantage of using black is that it removes unnecessary spaces, adds double quotes instead of single quotes, and makes reviewing code faster and more efficient. 

Given below are the images of before and after using black to format the code. 

Examples before and after using black to format the code

As you can see, unnecessary spaces have been removed.

MLOps pipeline for image classification: setting up CI/CD 

For our CI/CD process, we will be using Github Actions, and Google Cloud Build to integrate and deploy our Streamlit app. The following are the steps that will help you to create a full MLOps pipeline. 

Creating the Github Repository

The first step is to create the Github repository. But before that we must create three important files:

  • 1 requirements.txt
  • 2 makefile
  • 3 main.yml


The requirements.txt file must contain all the libraries that the model is using. There are two ways in which you can create a requirements.txt file. 

  1. If you have a dedicated working environment created specifically for this project, then you can run pip freeze>requirements.txt and it will create a requirements.txt file for you. 
  2. If you have a general working environment, then you can run pip freeze and copy-paste the libraries that you have been working on.

The requirements.txt file for this project looks like this:


Note: Always make sure that you mention the version so that in the future, the app remains stable and performs optimally. 
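
If you only want to pin a handful of libraries from a general working environment, the installed versions can be looked up programmatically. This is a minimal sketch using the standard library's importlib.metadata; pin_requirements is a hypothetical helper, and the package names passed to it are examples rather than the project's actual requirements.

```python
from importlib import metadata


def pin_requirements(packages):
    """Return 'name==version' lines for the given packages, flagging missing ones."""
    lines = []
    for name in packages:
        try:
            lines.append(f"{name}=={metadata.version(name)}")
        except metadata.PackageNotFoundError:
            lines.append(f"# {name} is not installed in this environment")
    return lines


for line in pin_requirements(["torch", "torchvision", "streamlit"]):
    print(line)
```

The printed lines can be copied into requirements.txt, which gives the same pinned format as pip freeze but limited to the libraries you name.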


In a nutshell, a Makefile is a file of named targets and shell commands that automates the whole process of installing libraries and dependencies, running a Python script, et cetera. A typical Makefile looks something like this:

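# the target name "setup" is an assumption; "install" and "run" are the targets used by "all"
setup:
	python3 -m venv ~/.visiontransformer
	source ~/.visiontransformer/bin/activate
	cd .visiontransformer
install:
	pip install --upgrade pip &&\
		pip install -r requirements.txt
run:
	python source/
all: install run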

For this project, our Makefile will have three processes:

  • 1 Setup virtual environment and activate it.
  • 2 Install all the Python libraries.
  • 3 Run a test file.

Essentially every time we make a new commit, the makefile will be executed, which will automatically run the module generating the latest performance metrics and updating the file.

But Makefile will only work if we create an action trigger. So let’s create that.

Action trigger: .github/workflows/main.yml

To create an action trigger, we need to create the following directory: .github/workflows, followed by creating a main.yml file in it. The main.yml will essentially create an action trigger whenever the repo is updated.

Our aim is to continuously integrate any changes made in the existing build, like updating parameters, model architecture, or even the UI/UX. Once the change is detected, it will automatically update the file. The main.yml for this project is designed to trigger the workflow on any push or pull request but only for the main branch.

At each new commit, the file will activate the ubuntu-latest environment, install the specific python version and then execute a specific command from the Makefile. 


name: Continuous Integration with Github Actions

on:
  push:
    branches: [ main ]
  pull_request:
    branches: [ main ]

jobs:
  # the job id "build" is an assumption; any valid id works
  build:
    runs-on: ubuntu-latest
    # Steps represent a sequence of tasks that will be executed as part of the job
    steps:
      - uses: actions/checkout@v2
      - name: Set up Python 3.8
        uses: actions/setup-python@v1
        with:
          python-version: 3.8
      - name: Install dependencies
        run: |
          make install
          make run


After the files are created, you can push the entire codebase to Github. Once uploaded, you can click on the Actions tab and see the build in progress for yourself.

Build in progress in the Actions tab

Deployment: Google Cloud Build

After the testing is done and all the logs and results are updated in the Github file, we can move to the next step, which is to integrate the app into the cloud. 

  1. First, we will visit:, and then we will create a new project in the dashboard and name it Vision Transformer Pytorch. 
Creating a new project

Once the project is created, you can navigate into the project, and it will look something like this:

The project

As you can see, Google Cloud offers us various services right out of the box on the project home page, like virtual machines, BigQuery, and GKE (Kubernetes clusters). But before we create anything in Cloud Build, we must set up the Kubernetes cluster and create certain directories and their respective files in the project directory.

  1. Kubernetes

Let's set up our Kubernetes cluster before we create any files. To do that, we can search for GKE in the Google Cloud console search bar and enable the API. 

Setting up Kubernetes cluster

Once the API is enabled, we will be taken to the following page. 

Kubernetes cluster

But instead of creating the clusters manually, we will create them using the built-in Cloud Shell. To do that, click on the terminal button at the top right, as shown in the image below. 

Activating Cloud Shell
Creating a cluster using the built-in Cloud Shell

After activating the cloud shell, we can type the following command to create Kubernetes clusters:

gcloud container clusters create project-kube --zone "us-west1-b" --machine-type "n1-standard-1" --num-nodes "1"

This usually takes up to 5 minutes. 

Creating Kubernetes clusters

After it is completed, it will look something like this: 

Kubernetes clustering completed

Now let's set up the two files that will configure the Kubernetes clusters: deployment.yml and service.yml. 

The deployment.yml file allows us to deploy the model in the cloud. The deployment strategy can be canary, recreate, blue-green, or any other, depending on the requirement. In this example, we will overwrite the existing deployment. This file also helps in scaling the model efficiently using the replicas argument. Here is an example of a deployment.yml file.



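apiVersion: apps/v1
kind: Deployment
metadata:
  name: imgclass
spec:
  replicas: 1
  selector:
    matchLabels:
      app: imageclassifier
  template:
    metadata:
      labels:
        app: imageclassifier
    spec:
      containers:
      - name: cv-app
        # image: set this to the same image name that cloudbuild.yaml builds and pushes
        ports:
        - containerPort: 8501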

The next file is the service.yml file. It exposes the app running in the container to external traffic. Notice that the containerPort argument in deployment.yml is specified as 8501; we will use the same number for the targetPort argument in service.yml. This is the port on which Streamlit serves the application by default. Apart from that, the app argument is the same in both files. 



apiVersion: v1
kind: Service
metadata:
  name: imageclassifier
spec:
  type: LoadBalancer
  selector:
    app: imageclassifier
  ports:
  - port: 80
    targetPort: 8501

Note: Always make sure that the name of the app and the version are in lowercase. 
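The cross-checks described above (matching app labels, matching ports, lowercase names) can also be automated. Below is a minimal sketch in Python, not part of the original project: the function name check_k8s_configs is hypothetical, and the two dictionaries are hand-written stand-ins that carry the values from deployment.yml and service.yml, assuming the standard Kubernetes layout of a Deployment and a Service.

```python
def check_k8s_configs(deployment: dict, service: dict) -> list:
    """Return a list of problems; an empty list means the two configs agree."""
    problems = []

    pod_labels = deployment["spec"]["template"]["metadata"]["labels"]
    selector = service["spec"]["selector"]
    # The Service routes traffic only to pods whose labels match its selector.
    for key, value in selector.items():
        if pod_labels.get(key) != value:
            problems.append(f"selector {key}={value} does not match pod labels {pod_labels}")

    # Collect every port exposed by the containers in the pod template.
    container_ports = {
        port["containerPort"]
        for container in deployment["spec"]["template"]["spec"]["containers"]
        for port in container.get("ports", [])
    }
    for port in service["spec"]["ports"]:
        if port["targetPort"] not in container_ports:
            problems.append(f"targetPort {port['targetPort']} is not exposed by any container")

    # Names in both files are expected to be lowercase.
    for name in (deployment["metadata"]["name"], service["metadata"]["name"]):
        if name != name.lower():
            problems.append(f"name {name!r} must be lowercase")

    return problems


# Hand-written stand-ins for the parsed deployment.yml and service.yml.
deployment = {
    "metadata": {"name": "imgclass"},
    "spec": {
        "replicas": 1,
        "selector": {"matchLabels": {"app": "imageclassifier"}},
        "template": {
            "metadata": {"labels": {"app": "imageclassifier"}},
            "spec": {"containers": [{"name": "cv-app", "ports": [{"containerPort": 8501}]}]},
        },
    },
}
service = {
    "metadata": {"name": "imageclassifier"},
    "spec": {
        "type": "LoadBalancer",
        "selector": {"app": "imageclassifier"},
        "ports": [{"port": 80, "targetPort": 8501}],
    },
}

print(check_k8s_configs(deployment, service))
```

In a real project the dictionaries would come from parsing the YAML files rather than being typed by hand; an empty list means the port and label settings are consistent.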

  2. Dockerfile

Now let's configure the Dockerfile. This file will create a Docker container that will host our Streamlit app. Docker is needed here because it wraps the app in an environment that is reproducible and easy to scale. A typical Dockerfile looks like this:


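FROM python:3.8.2-slim-buster

RUN apt-get update

# Assumed working directory: $APP_HOME is referenced below but not defined elsewhere in this file
ENV APP_HOME=/app
WORKDIR $APP_HOME

# Copy the local code to the container image
COPY . ./

RUN ls -la $APP_HOME/

# Install dependencies
RUN pip install -r requirements.txt

# Run the streamlit on container startup
CMD [ "streamlit", "run","" ]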

The Dockerfile contains a series of commands that:

  • Pull a base image with the required Python version. 
  • Copy the local code to the container image.
  • Install all the libraries.
  • Execute the Streamlit app. 

Note that we are using Python 3.8 because some of the dependencies require a recent Python version.

  3. cloudbuild.yaml

In Google Cloud Build, the cloudbuild.yaml file stitches all the artefacts together to create a seamless pipeline. It has three primary steps:

  • Build a Docker image using the Dockerfile from the current directory. 
  • Push the image to the Google Container Registry.
  • Deploy the container to the Kubernetes Engine. 


steps:
- name: ''
  args: ['build', '-t', '', '.']
  timeout: 180s
- name: ''
  args: ['push', '']
- name: ""
  args:
  - run
  - --filename=kubernetes/ # this argument points to the config files in the kubernetes directory
  - --location=us-west1-b
  - --cluster=project-kube
Note: Please cross-check arguments such as the container name in the deployment.yml and cloudbuild.yaml files. Also cross-check that the cluster name you created earlier matches the cluster name in the cloudbuild.yaml file. Lastly, make sure that the filename argument matches the kubernetes directory where deployment.yml and service.yml are located.  
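To make the three pipeline steps concrete, here is a minimal sketch in Python that assembles the equivalent command lines without executing them. It is illustrative only: the function name cloud_build_commands and the image name example-image are hypothetical, and the third command assumes the gke-deploy tool, whose run subcommand accepts the --filename, --location, and --cluster flags shown in the config.

```python
def cloud_build_commands(image: str, cluster: str, location: str,
                         k8s_dir: str = "kubernetes/") -> list:
    """Return the three pipeline steps as argument lists (nothing is executed)."""
    return [
        # 1. Build the image from the Dockerfile in the current directory.
        ["docker", "build", "-t", image, "."],
        # 2. Push the image to the container registry.
        ["docker", "push", image],
        # 3. Deploy to the cluster using the files in the kubernetes directory.
        [
            "gke-deploy", "run",
            f"--filename={k8s_dir}",
            f"--location={location}",
            f"--cluster={cluster}",
        ],
    ]


# "example-image" is a placeholder; the cluster and zone are the ones created earlier.
for command in cloud_build_commands("example-image", "project-kube", "us-west1-b"):
    print(" ".join(command))
```

Passing the image and cluster names in as parameters keeps the same value in every step, which is exactly the consistency the note above asks you to cross-check by hand.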

After creating the files, the file structure of the entire project should look like this:

├── Dockerfile
├── .github/workflows/main.yml
├── Makefile
├── cloudbuild.yaml
├── kubernetes
│   ├── deployment.yml
│   └── service.yml
├── metadata
│   ├── Abbott's_babbler_(Malacocincla_abbotti).jpg
│   ├── classes.txt
│   ├── models
│   │   └── model.pth
│   └── results
│       ├── accuracy_loss.png
│       ├── attn.png
│       └── confusion_matrix.png
├── notebooks
│   ├── ViT.ipynb
│   └──
├── requirements.txt
└── source
    └── vit-pytorch.ipynb

  4. Cloning and testing

Now let's clone the GitHub repo into our Google Cloud project, cd into it, and run the build defined in the cloudbuild.yaml file. Use the following commands:

Cloning the GitHub repo
  • gcloud builds submit --config cloudbuild.yaml

The deployment process will look something like this:

The deployment process
  1. The deployment takes around 10 minutes, depending on various factors. If everything is executed properly, you will see the steps marked with green ticks. 
Successful deployment
  2. Once the deployment is successful, you can find the endpoints of the app in the Services & Ingress tab in the Kubernetes Engine. Click on an endpoint, and it will take you to the Streamlit app. 
The endpoints
The Streamlit app
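
Besides clicking the endpoint in the console, you can check from a script that the app answers. The following is a minimal sketch using only the Python standard library; the function name app_is_up is hypothetical, and the URL you pass in is whatever endpoint the Services & Ingress tab shows for your deployment.

```python
import urllib.error
import urllib.request


def app_is_up(url: str, timeout: float = 5.0) -> bool:
    """Return True if an HTTP GET to `url` answers with status 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Covers refused connections, timeouts, DNS failures, and HTTP error codes.
        return False
```

Calling app_is_up with the external endpoint of the service returns True once the load balancer is serving the Streamlit page, and False while the deployment is still starting or if it has failed.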

Additional tips:

  1. Make sure that you use lowercase for the app name and project ID in all your *.yml config files.
  2. Cross-check the arguments in all *.yml config files. 
  3. Since you are copying your repo into a virtual environment, cross-check all the directory and file paths. 
  4. If the cloud build process fails, look in the error message for a suggested command that resolves the error. In the image below, I have highlighted the command that needed to be executed before re-running the cloud build command. 
An error in the cloud build process

Cloud Build integration

Now we will integrate Google Cloud Build with the GitHub repo. This will create a trigger that updates the build whenever a change is made in the repo. 

  1. Search for Google Cloud Build in the Marketplace.
Searching for Google Cloud Build
  2. Select the repo that you want to connect. In this case, it will be ViT-Pytorch. Then save it. 
Selecting the repo
  3. In Google Cloud Build, we will go to the Cloud Build page and click on the Triggers tab to create triggers. 
Creating triggers
  4. After clicking on create trigger, we will be taken to the page below. There we will enter the trigger name, select the event which will trigger the cloudbuild.yaml file, and select the project repository. 
Trigger settings
  5. Follow the authentication process. 
Authentication process
  6. Connect the repository. 
Connecting the repository
  7. Finally, create the trigger. 
Creating the trigger

Now that the trigger is created, all the changes that you make in the Github repo will be automatically detected, and the deployment will be updated. 

Created trigger

Monitoring model decay

Over time, the model's performance will decay, which will affect its prediction capabilities, so we need to monitor it on a regular basis. One way to do that is to occasionally test the model on a new dataset and evaluate it on the metrics mentioned earlier, such as the F1 score, accuracy, and precision. 
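
As an illustration of such a periodic check, the sketch below computes a macro-averaged F1 score on freshly labelled samples and compares it with the score recorded at deployment time. It is a minimal example under stated assumptions: the function names macro_f1 and has_decayed and the 0.05 tolerance are hypothetical choices, and the labels and scores are made-up toy values, not results from this project.

```python
def macro_f1(y_true: list, y_pred: list) -> float:
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for cls in sorted(set(y_true) | set(y_pred)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
        denominator = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(scores) / len(scores)


def has_decayed(baseline: float, current: float, tolerance: float = 0.05) -> bool:
    """Flag decay when the score drops more than `tolerance` below the baseline."""
    return (baseline - current) > tolerance


# Toy labels for six new samples across three classes.
new_labels = [0, 0, 1, 1, 2, 2]
new_predictions = [0, 0, 1, 2, 2, 2]

current_score = macro_f1(new_labels, new_predictions)
print(round(current_score, 4), has_decayed(baseline=0.90, current=current_score))
```

The tolerance decides how large a drop counts as decay; a suitable value depends on how noisy the metric is on your evaluation set.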

Another interesting way to monitor the model's performance is to use the AUROC metric, which measures the discriminative performance of the model. Because this is a multiclass classification project, you can convert it into a set of binary classification problems (for example, one class versus the rest) and then check the model's performance. If the performance of the model has decayed, the model must be trained again with new and more samples. And if it is really required, modify the architecture as well. 

Here is the link to the code, which will allow you to measure the AUROC score.
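
To show the idea without any extra libraries, here is a minimal sketch of a one-vs-rest AUROC in plain Python. It is not the linked code: the function names binary_auroc and one_vs_rest_auroc are hypothetical, the probabilities are toy values, and the pairwise comparison is quadratic in the number of samples, so it is only suitable for small evaluation sets.

```python
def binary_auroc(labels: list, scores: list) -> float:
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def one_vs_rest_auroc(y_true: list, probs: list, num_classes: int) -> float:
    """Macro-average of per-class AUROC; probs[i][c] is the probability of class c for sample i."""
    per_class = []
    for c in range(num_classes):
        labels = [1 if t == c else 0 for t in y_true]   # class c versus the rest
        scores = [row[c] for row in probs]
        per_class.append(binary_auroc(labels, scores))
    return sum(per_class) / num_classes


# Toy example: four samples, three classes.
y_true = [0, 1, 2, 1]
probs = [
    [0.7, 0.2, 0.1],
    [0.2, 0.6, 0.2],
    [0.1, 0.3, 0.6],
    [0.5, 0.2, 0.3],
]
print(one_vs_rest_auroc(y_true, probs, num_classes=3))
```

Each class must appear at least once in the evaluation set, otherwise its binary AUROC is undefined and the sketch raises an error. Tracking this single number over time gives a simple way to compare the model's previous and current performance.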


In this article, we learned to build an image classifier app with Vision Transformer using PyTorch and Streamlit. We also saw how to deploy the app on the Google Cloud Platform using GitHub Actions and technologies like Kubernetes, a Dockerfile, and a Makefile. 

Important takeaways from this project:

  1. Larger datasets generally call for a larger model, which in turn usually needs to be trained for more epochs. 
  2. When prototyping, reduce the number of classes and test whether the accuracy increases with each epoch. Try different configurations until you are confident that the model's performance is improving before using GPUs on cloud services like Kaggle or Colab. 
  3. Use various performance metrics like the confusion matrix, precision, recall, F1, and AUROC. 
  4. Once the model is deployed, it can be monitored occasionally rather than frequently. 
  5. For monitoring, a metric like the AUROC score is a good choice, since it evaluates the model across all classification thresholds by plotting its true positive rate against its false positive rate. With the AUROC score, the model's previous and current performance can be easily compared. 
  6. Re-training the model should be done only when the model has drifted significantly. Since a model like this requires a lot of computational resources, frequent retraining can be expensive.

I hope you found this article informative and practical. You can find the entire code in this Github repo. Feel free to share it with others as well. 


  1. An Image Is Worth 16×16 Words: Transformers For Image Recognition At Scale
  2. Are Transformers More Robust Than CNNs?
  7. Building MLOps Pipeline for NLP: Machine Translation Task
