4 Ways Machine Learning Teams Use CI/CD in Production

One of the core concepts in DevOps that is now making its way to machine learning operations (MLOps) is CI/CD—Continuous Integration and Continuous Delivery or Continuous Deployment. CI/CD as a core DevOps practice embraces tools and methods to deliver software applications reliably by streamlining the building, testing, and deployment of your applications to production. Let’s define these concepts below:

  • Continuous integration (CI) is the practice of frequently merging code changes into a shared repository, where each change is automatically built and tested.
  • Continuous delivery (CD) is the practice of deploying every build to a production-like environment and performing automated integration and testing of the application before it is deployed.
  • Continuous deployment (CD) complements continuous integration with additional steps by automating the configuration and deployment of the application to a production environment.
Continuous Integration vs Continuous Delivery vs Continuous Deployment | Source

Most CI/CD tools developed over the past years have been purpose-built for traditional software applications. As you (probably) know, developing and deploying traditional software applications is quite different from building and deploying machine learning (ML) applications in a number of ways. The questions then become:

  • How would ML teams adopt existing CI/CD tools to suit their machine learning use cases?
  • Are there better options out there that are specially purpose-built for ML applications?

In this article, you will learn about how 4 different teams are using—or have used—CI/CD concepts, tools, and techniques to build and deploy their machine learning applications. The purpose of this article is to give you a broad perspective of CI/CD usage from the different solutions implemented by these teams, including their use cases.

Continuous Integration and Delivery (CI/CD) for Machine Learning (ML) with Azure DevOps 

In this section, we walk through the workflow of a team that orchestrates CI/CD processes for their machine learning workloads in Azure DevOps. Their machine learning workloads mostly run on Azure Cloud.

Thanks to Emmanuel Raj for granting me an interview on how his team does CI/CD for their ML workloads. This section draws on both Emmanuel's responses during the interview and his very practical book on MLOps, Engineering Machine Learning Operations (MLOps).

Industry

Retail and consumer goods.

Use case

This team helps a retail client resolve tickets in an automated way using machine learning. When users raise tickets, or when maintenance problems generate them, a machine learning model classifies the tickets into categories, which speeds up their resolution.

Core CI/CD tools

Overview

To orchestrate their CI/CD workflow, the team used the Azure DevOps suite of products. They also configured development and production environments for their ML workloads. The workflow covers the CI/CD processes that run both before and after the model is deployed to production.

Azure DevOps logo | Source

CI/CD workflow before deploying the model to production

  1. To automate the dev-to-production cycle, the team set up build and release tasks with Azure DevOps Pipelines. The build pipeline generates the model artifacts from the candidate source code after the model has been serialized (mostly using ONNX).
  2. The release pipelines deploy the artifacts to infrastructure targets. Once the artifacts have been tested in the development environment, the release pipelines move them to the quality assurance (QA) stage.
  3. Model testing happens in the QA stage, where the team runs A/B tests and stress tests on the model service to make sure the model is ready for the production environment.
  4. A human validator, usually the product owner, confirms that the model has passed the tests and been validated, then approves its deployment to the production environment through the release pipelines.
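
The promotion flow in the steps above can be pictured as a small state machine with a gate at each stage. The sketch below is only an illustration of that idea in plain Python, not the team's Azure DevOps configuration; the stage names, the `Candidate` class, and the `promote` function are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical stage order; the real stages are defined in the release pipelines.
STAGES = ["dev", "qa", "production"]


@dataclass
class Candidate:
    """A model artifact moving through the release stages."""
    name: str
    stage: str = "dev"
    tests_passed: dict = field(default_factory=dict)  # stage name -> bool
    approved_by: Optional[str] = None  # human validator, e.g. the product owner


def promote(candidate: Candidate) -> Candidate:
    """Move the candidate one stage forward if the gate for its current stage is met."""
    if candidate.stage == "production":
        raise ValueError("already in production")
    if not candidate.tests_passed.get(candidate.stage, False):
        raise RuntimeError(f"tests have not passed in {candidate.stage}")
    # Leaving QA for production additionally needs a human sign-off.
    if candidate.stage == "qa" and candidate.approved_by is None:
        raise RuntimeError("manual approval is required before production")
    candidate.stage = STAGES[STAGES.index(candidate.stage) + 1]
    return candidate
```

The point of modelling the approval as a hard gate is that automated tests alone can never push a model into production; a named person has to sign off first.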

CI/CD workflow after deploying the model to production

  1. After deploying the model to production, the team sets up cron jobs (as part of their CI/CD pipeline) that check model metrics for data drift and concept drift every week. When drift exceeds an acceptable level and the model needs retraining, the pipeline is triggered.
  2. They also monitor the performance of their CI/CD pipeline in production by inspecting the pipeline releases in Azure DevOps Pipelines. The guidelines they follow to keep the pipeline healthy and robust include:
  • Auditing system logs and events periodically.
  • Integrating automated acceptance tests.
  • Requiring pull requests to make changes to the pipeline.
  • Peer code reviews for each story or feature before they are added to the pipeline.
  • Regularly reporting metrics that are visible to all members of the team.
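
To make the weekly drift check more concrete, here is a minimal sketch of one common way to score data drift on a single numeric feature: the population stability index (PSI). This is an assumption for illustration; the article does not say which drift metric the team uses, the 0.2 threshold is only a widely quoted rule of thumb, and the function names are hypothetical. It covers data drift only, not concept drift.

```python
import math
from typing import Sequence


def population_stability_index(
    reference: Sequence[float], current: Sequence[float], bins: int = 10
) -> float:
    """PSI between two samples of one numeric feature, binned on the reference range."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 for a constant feature

    def proportions(values):
        counts = [0] * bins
        for v in values:
            # Clamp out-of-range values into the first or last bin.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # A small floor keeps the logarithm defined for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    ref_p, cur_p = proportions(reference), proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_p, cur_p))


def needs_retraining(reference, current, threshold: float = 0.2) -> bool:
    """True when drift exceeds the assumed threshold; a scheduled job could then trigger the pipeline."""
    return population_stability_index(reference, current) > threshold
```

A scheduled job would call `needs_retraining` with the training-time feature values as `reference` and the past week's production values as `current`, then start the retraining pipeline when it returns `True`.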

To summarize, Azure DevOps provides the team with a set of useful tools that enable the development (machine learning and model building) and operations teams on this project to work in harmony.

Continuous Integration and Delivery (CI/CD) for ML with GitOps using Jenkins and Argo workflows

In this section, you will learn how a team built a framework to orchestrate their machine learning workflows and run CI/CD pipelines with GitOps.

Thanks to Tymoteusz Wolodzko, a former ML Engineer at GreenSteam, for granting me an interview. This section draws on both Tymoteusz's responses during the interview and his case study blog post on the Neptune.ai blog.

Industry

Computer software

Use case

GreenSteam – An i4 Insight Company provides software solutions for the marine industry that help reduce fuel usage. Excess fuel usage is both costly and bad for the environment, and vessel operators are required by the International Maritime Organization to become greener and cut CO2 emissions by 50 percent by 2050.

Core CI/CD tools

Overview

To implement CI/CD, the team adopted GitOps, with Jenkins running code quality checks and smoke tests as production-like runs in the test environment. The team had a single pipeline for model code, where every pull request went through code reviews and automated unit tests.

The pull requests also went through automated smoke tests that trained models and made predictions, running the entire end-to-end pipeline on a small chunk of real data to ensure each stage of the pipeline does what is expected and nothing breaks.
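
A smoke test of this kind does not judge model quality; it only checks that every stage runs and produces well-formed output. The sketch below shows the shape of such a test with a stand-in model that predicts the training mean. All function names, the sampling fraction, and the model itself are hypothetical and are not taken from GreenSteam's code.

```python
import random


def load_sample(rows, fraction=0.05, seed=0):
    """Take a small, reproducible chunk of the data for a fast run."""
    rng = random.Random(seed)
    k = min(len(rows), max(2, int(len(rows) * fraction)))
    return rng.sample(rows, k)


def train(rows):
    """Stand-in model: remembers the mean target seen in training."""
    targets = [y for _, y in rows]
    return {"mean": sum(targets) / len(targets)}


def predict(model, features):
    """Return one prediction per input row."""
    return [model["mean"] for _ in features]


def smoke_test(rows):
    """Run every stage once and check only that the outputs are well-formed."""
    sample = load_sample(rows)
    assert sample, "sampling returned no rows"
    model = train(sample)
    preds = predict(model, [x for x, _ in sample])
    assert len(preds) == len(sample), "expected one prediction per input row"
    assert all(p == p for p in preds), "predictions must not be NaN"
    return True
```

Because the run uses only a few percent of the data, it is cheap enough to execute on every pull request.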

For the continuous delivery of models, a model quality report was generated after each training run. A domain expert reviewed the report manually, and once the model had been validated and had passed all prior checks, it was deployed manually.

Understanding GitOps

GitOps applies a Git-centric approach on top of some common DevOps principles, practices, and tools. In GitOps, the code and configuration stored in a repository are treated as the source of truth, and the infrastructure adapts to changes in the code. GitOps helped deliver their pipelines on Amazon EKS at the pace the team required without operational issues.

Code quality checks and using Jenkins to manage the CI pipeline

Jenkins is one of the most popular tools used for continuous integration among developers. The team adopted Jenkins for continuous integration to make their suite of tests, checks, and reviews more efficient. 

  1. To maintain consistency in code quality, they moved all the code checks into the Docker container containing the model code, so the versions and configs of the code quality tools (flake8, black, mypy, and pytest) were all unified. This also helped them align the local development setup with what they used on Jenkins.
  2. Docker ensured they had no more problems with different versions of dependencies that could lead to different results locally, in Jenkins, or in production.
  3. For local development, they had a Makefile to build the Docker image and run all the checks and tests on the code.
  4. For code reviews, they set up Jenkins to run the same checks as part of the CI pipeline.
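
One way to keep local and CI checks identical is to drive both from a single entry point. The sketch below is a hypothetical runner, not GreenSteam's Makefile or Jenkins configuration; the command arguments are assumptions, and the tools would need to be installed in the image for the default runner to work.

```python
import subprocess

# The four tools named above; the arguments are assumptions, not the team's configuration.
CHECKS = [
    ["flake8", "."],
    ["black", "--check", "."],
    ["mypy", "."],
    ["pytest"],
]


def run_checks(checks=CHECKS, runner=subprocess.run):
    """Run each check in order and return the names of the tools that failed."""
    failed = []
    for cmd in checks:
        result = runner(cmd)
        if result.returncode != 0:
            failed.append(cmd[0])
    return failed
```

If a Makefile target and a Jenkins stage both call `run_checks()` inside the same Docker image, a pull request cannot pass locally and fail in CI because of a tool version mismatch. The `runner` parameter exists so the function can be exercised without the tools installed.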

Using Argo to manage CI/CD pipelines

The team needed to test their model for multiple datasets of different clients in different scenarios. As Tymoteusz Wolodzko admitted in his explainer blog post, that was not something they wanted to set up and run manually. 

They needed orchestration and automated pipelines that they could plug into the production environment easily. Dockerizing their ML code made it easy to move the application across different environments, including the production environment.

For orchestration, the team switched from Airflow to Argo Workflows, so plugging in their container was just a matter of writing a few lines of YAML code. 

Argo Workflows & Pipelines is an open-source container-native workflow engine for orchestrating parallel jobs on Kubernetes. It is a cloud-native solution designed from the ground up for Kubernetes. You can define pipeline workflows in which each step runs as a container.
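
To show what "each step is a container" looks like in practice, the sketch below builds a one-step Workflow manifest as a Python dictionary and prints it as JSON. The manifest fields follow the Argo Workflows resource format; the helper name, image, and command are hypothetical and are not taken from GreenSteam's setup.

```python
import json


def single_step_workflow(name: str, image: str, command: list) -> dict:
    """Build a one-step Argo Workflow manifest that runs `command` inside `image`."""
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Workflow",
        "metadata": {"generateName": f"{name}-"},
        "spec": {
            "entrypoint": name,  # which template to start from
            "templates": [
                {"name": name, "container": {"image": image, "command": command}}
            ],
        },
    }


manifest = single_step_workflow(
    name="train-model",
    image="registry.example.com/ml-model:latest",  # hypothetical image
    command=["python", "-m", "train"],  # hypothetical entry point
)
print(json.dumps(manifest, indent=2))
```

Once the model code is packaged as a Docker image, adding it to a workflow amounts to naming the image and the command to run, which is why the team described it as a few lines of YAML.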

Argo Workflows allowed the team to easily run compute-intensive jobs for machine learning or data processing on their Amazon EKS clusters. The models in the pipeline would retrain periodically, based on scheduled jobs, and also undergo the necessary tests and checks. But before the models were deployed, they were reviewed and audited by a domain expert. Once the expert validated that the model is good to be deployed, the models would then be deployed manually.

Below is an illustration showing the team’s entire stack for ML workloads:

MLOps technological stack at GreenSteam | Source

Continuous Integration and Delivery (CI/CD) for ML with AWS CodePipeline and Step Functions

To orchestrate their CI/CD workflow, the team in this section used a combination of AWS CodePipeline and AWS Step Functions to ensure they are building an automated MLOps pipeline.

Thanks to Phil Basford for granting me an interview on how his team did CI/CD for a public ML use case.

Industry

Transportation and logistics.

Use case

For this use case, the team is from a consulting and professional services company that worked on a public project. Specifically, they built machine learning applications that solved problems like:

  • Predicting how long it will take to deliver a parcel.
  • Predicting a location from unstructured address data and resolving it to a coordinate system (latitude/longitude).

Core CI/CD tools

  • AWS CodeBuild – A fully managed continuous integration service that compiles source code, runs tests, and produces software packages that are ready to deploy.
  • AWS CodePipeline – A fully managed continuous delivery service that helps you automate your release pipelines.
  • AWS Step Functions – A serverless function orchestrator that makes it easy to sequence AWS Lambda functions and multiple AWS services.

Overview

AWS Cloud provides managed CI/CD workflow tools like AWS CodePipeline and AWS Step Functions, which the team used to carry out continuous integration and continuous delivery for their machine learning projects. For continuous integration, the team used Git to make commits to AWS CodeCommit. Each commit triggers a build step in CodePipeline (through an AWS CodeBuild job), with AWS Step Functions handling the orchestration of the workflows for every action from CodePipeline.

Understanding the architecture

The workflow orchestration provided by AWS Step Functions made it easy for the team to manage the complexities that arise from running multiple models and pipelines with CodePipeline. The team's multi-model deployments are easier to manage and update because each pipeline job in CodePipeline focuses on one process. Builds are also simpler to deliver and troubleshoot.

Below is an example of a project that uses AWS CodePipeline along with Step Functions for orchestrating ML pipelines that require custom containers. Here, CodePipeline invokes Step Functions and passes the container image URI and the unique container image tag as parameters to Step Functions:

Architecture to build a CI/CD pipeline for deploying custom machine learning models using AWS services | Source
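
The handoff described above, where the container image URI and tag are passed to Step Functions as parameters, can be sketched as follows. CodePipeline performs this handoff through its own integration; the sketch instead calls the Step Functions API directly with boto3, purely to show what such an input payload might look like. The key names and function names are hypothetical, and the state machine definition decides which keys it actually reads.

```python
import json


def build_execution_input(image_uri: str, image_tag: str) -> str:
    """Serialize the parameters handed to the state machine as a JSON string."""
    # Hypothetical key names chosen for illustration.
    return json.dumps(
        {"container_image_uri": image_uri, "container_image_tag": image_tag}
    )


def start_pipeline(state_machine_arn: str, image_uri: str, image_tag: str):
    """Start one execution with boto3 (needs AWS credentials and the boto3 package)."""
    import boto3  # imported lazily so the payload helper works without it

    client = boto3.client("stepfunctions")
    return client.start_execution(
        stateMachineArn=state_machine_arn,
        input=build_execution_input(image_uri, image_tag),
    )
```

Passing a unique image tag with every execution ties each run of the ML workflow to the exact container build that produced it, which makes a failed run easier to trace back to a commit.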

You can learn more about the architecture above in this blog post. While this team opted to use those tools to manage and orchestrate, it is worth noting that for continuous integration and continuous delivery (CI/CD) pipelines, AWS released Amazon SageMaker Pipelines, a CI/CD service designed specifically for ML.

Pipelines is a native workflow orchestration tool for building ML pipelines that takes advantage of direct SageMaker integration. You can learn more about building, automating, managing, and scaling ML workflows using Amazon SageMaker Pipelines in this blog post.

Continuous Integration and Delivery (CI/CD) for ML with Vertex AI and TFX on Google Cloud

In this section, we look at a team that chose workflow orchestration and management tools that are more native to machine learning projects than to traditional software engineering projects.

This section leverages Hannes Hapke’s (ML Engineer at Digits Financial, Inc.) workshop on “Rapid Iteration with Limited DevOps Resources” during Google Cloud’s Applied ML online summit.

Industry

Business intelligence and financial technology services.

Use case

Digits Financial, Inc. is a fin-tech company offering a visual, machine learning-powered expense monitoring dashboard for startups and small businesses. Their use cases are focused on: 

  1. Creating the most powerful finance engine for modern businesses that is able to ingest and convert a company’s financial information into a live model of business.
  2. Extracting information from unstructured documents to predict future events for customers.
  3. Clustering information to surface what’s most important for the customers’ businesses.

Core CI/CD tools

Overview

The team at Digits was able to orchestrate and manage the continuous integration, delivery, and deployment of their machine learning pipelines through the managed Vertex AI Pipelines product and TensorFlow Extended, all running on Google Cloud infrastructure.

Using an ML-native pipeline tool over traditional CI/CD tools helped the team to ensure consistency in the quality of models, and make sure the models are going through the standard workflows of feature engineering, model scoring, model analysis, model validation, and model monitoring in one unified pipeline.

Machine learning pipelines with TFX

With TensorFlow Extended, the team was able to treat each component of their machine learning stack as an individual step that can be orchestrated by third-party tools such as Apache Beam, Apache Airflow, or Kubeflow Pipelines when the pipeline is deployed to a testing environment or to their production environment. They were also able to create custom components and add them to their pipeline, which would have been very difficult with traditional CI/CD tools.

Machine learning pipelines with TFX | Adapted and modified from this source

Along with this, they moved their ML pipelines from Kubeflow to Vertex AI Pipelines on Google Cloud, which helped them tie together model development (ML) and operations (Ops) into high-performance, reproducible steps.

According to the team, one of the core advantages of Vertex AI Pipelines was that it let them move from managing their own pipeline infrastructure (self-hosted Kubeflow Pipelines) to a managed service for workflow orchestration. They no longer needed to maintain the databases that store metadata or to launch clusters to host and operate the build servers and pipelines.

Orchestrating with Vertex AI Pipelines

Vertex AI is a managed ML platform for every practitioner to speed up the rate of experimentation and accelerate the deployment of machine learning models. It helped the team to automate, monitor, and govern their ML systems by orchestrating their ML workflow in a serverless manner and storing their workflow’s artifacts using Vertex ML Metadata. 

By storing the artifacts of their ML workflow in Vertex ML Metadata, they could analyze the lineage of their workflow’s artifacts — for example, an ML model’s lineage may include the training data, hyperparameters, and code used by the team to create the model.
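
To make the idea of lineage concrete, the sketch below records the three ingredients named above (training data, hyperparameters, and code) for one model. It is a plain-Python illustration and does not use the Vertex ML Metadata API; the class, field names, and sample values are all hypothetical.

```python
import hashlib
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ModelLineage:
    """What went into one trained model (field names are illustrative)."""
    model_name: str
    training_data_sha256: str
    hyperparameters: dict
    code_version: str  # for example, a Git commit hash


def fingerprint(data: bytes) -> str:
    """Content hash, so models trained on identical data share the same value."""
    return hashlib.sha256(data).hexdigest()


record = ModelLineage(
    model_name="expense-classifier",  # hypothetical model
    training_data_sha256=fingerprint(b"date,amount,label\n2021-01-01,9.99,software\n"),
    hyperparameters={"learning_rate": 0.001, "epochs": 5},
    code_version="<git commit hash>",  # placeholder, filled in by the pipeline
)
print(asdict(record))
```

With a record like this stored for every model, answering "which data and code produced the model now in production?" becomes a lookup rather than an investigation.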

A screenshot of Vertex AI ML pipeline orchestration from the Digits team | Source

The workflow for the team involved preparing and executing their machine learning pipelines with TensorFlow Extended and shipping them to Vertex AI. They could then manage and orchestrate their pipelines from Vertex AI Pipelines without having to operate their own clusters.

Benefits from using machine learning pipelines

The team benefited from using ML pipelines to orchestrate and manage their ML workloads in several ways. As described in this video by Hannes Hapke, the startup gained the following benefits:

  • Using ML pipelines reduced the DevOps requirements for the team.
  • Migrating to managed ML pipelines removed the expense of running clusters 24/7, which they had incurred when they hosted the pipelines on their own infrastructure.
  • Since ML pipelines are native to ML workflows, model updates were easy to integrate and were automated, freeing up the team to focus on other projects.
  • Model updates were consistent across all ML projects because the team could run the same tests and reuse the entire pipeline or components of the pipeline.
  • One-stop place for all machine learning-related metadata and information.
  • Models could now automatically be tracked and audited.

Conclusion

One interesting point that this article sheds light on is that using CI/CD tools is not enough to successfully operationalize your machine learning workloads. While the majority of teams in this article still use traditional CI/CD tools, we are beginning to see the emergence of ML-native pipeline tools that could help teams (regardless of their size) deliver better machine learning products faster and more reliably.

If you and your team are considering adopting a CI/CD solution for your machine learning workloads, any one of the ML-native pipeline tools may be a better starting point than traditional software engineering-based CI/CD tools, depending on your team’s circumstances and on which tool or vendor suits you best.

Till next time, happy Ops-ing!
