In software development, Continuous Integration (CI) is the practice of frequently merging code changes from the entire team into a shared codebase. Before any new code is merged, it is automatically tested and checked for quality.
CI keeps the codebase up-to-date, clean, and tested by design, and helps you find problems with it quickly.
But what does Continuous Integration mean for machine learning?
The way I see it:
Continuous Integration in machine learning extends the concept to running model training or evaluation jobs for each trigger event (like merge request or commit).
This should be done in a way that is versioned and reproducible to ensure that when things are added to the shared codebase they are properly tested and available for audit when needed.
Some examples of CI workflows in machine learning could be:
- running and versioning the training and evaluation for every commit to the repository,
- running and comparing experiment runs for each Pull Request to a certain branch,
- creating model predictions on a test set and saving them somewhere on every PR to the feature branch,
- about a million other model training and testing scenarios that could be automated.
The good news is that today there are tools for that, and in this article, I will show you how to set up a Continuous Integration workflow with two of them:
- Github Actions: which lets you run CI workflows directly from Github
- Neptune: which makes experiment tracking and model versioning easy
You will learn
How to set up a CI pipeline that automates the following scenario.
On every Pull Request from branch develop to master:
- Run model training and log all the experiment information to Neptune for both branches
- Create a comment that contains a table showing diffs in parameters, properties, and metrics, as well as links to the experiments and the experiment comparison in Neptune
See this Pull Request on Github

CI for machine learning: Step-by-step guide
Before you start
Make sure you meet the following prerequisites before starting the how-to steps:
- Create a Github repository
- Create a Neptune project: this is optional. I will be using an open project and logging information as an anonymous user.
Note:
You can see this example project with the markdown table in the Pull Request on Github. Workflow config, environment file, and the training script are all there.
Step 1: Add Neptune logging to your training scripts
In this example project, we will be training a lightGBM multiclass classification model.
Since we want to properly keep track of our models, we will also save the learning curves, evaluation metrics on the test set, and performance charts like the ROC curve.
1. Add Neptune tracking to your training script
Let me show you first and explain later.
import os
import lightgbm as lgb
import matplotlib.pyplot as plt
import neptune
from neptunecontrib.monitoring.lightgbm import neptune_monitor
from scikitplot.metrics import plot_roc, plot_confusion_matrix, plot_precision_recall
from sklearn.datasets import load_wine
from sklearn.metrics import f1_score, accuracy_score
from sklearn.model_selection import train_test_split
PARAMS = {'boosting_type': 'gbdt',
          'objective': 'multiclass',
          'num_class': 3,
          'num_leaves': 8,
          'learning_rate': 0.01,
          'feature_fraction': 0.9,
          'seed': 1234
          }
NUM_BOOSTING_ROUNDS = 10
data = load_wine()
X_train, X_test, y_train, y_test = train_test_split(data.data,
                                                     data.target,
                                                     test_size=0.25,
                                                     random_state=1234)
lgb_train = lgb.Dataset(X_train, y_train)
lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)
# Connect your script to Neptune
neptune.init(api_token=os.getenv('NEPTUNE_API_TOKEN'),
             project_qualified_name=os.getenv('NEPTUNE_PROJECT_NAME'))
# Create an experiment and log hyperparameters
neptune.create_experiment('lightGBM-on-wine',
                          params={**PARAMS,
                                  'num_boosting_round': NUM_BOOSTING_ROUNDS})
gbm = lgb.train(PARAMS,
                lgb_train,
                num_boost_round=NUM_BOOSTING_ROUNDS,
                valid_sets=[lgb_train, lgb_eval],
                valid_names=['train', 'valid'],
                callbacks=[neptune_monitor()],  # monitor learning curves
                )
y_test_pred = gbm.predict(X_test)
f1 = f1_score(y_test, y_test_pred.argmax(axis=1), average='macro')
accuracy = accuracy_score(y_test, y_test_pred.argmax(axis=1))
# Log metrics to Neptune
neptune.log_metric('accuracy', accuracy)
neptune.log_metric('f1_score', f1)
fig_roc, ax = plt.subplots(figsize=(12, 10))
plot_roc(y_test, y_test_pred, ax=ax)
fig_cm, ax = plt.subplots(figsize=(12, 10))
plot_confusion_matrix(y_test, y_test_pred.argmax(axis=1), ax=ax)
fig_pr, ax = plt.subplots(figsize=(12, 10))
plot_precision_recall(y_test, y_test_pred, ax=ax)
# Log performance charts to Neptune
neptune.log_image('performance charts', fig_roc)
neptune.log_image('performance charts', fig_cm)
neptune.log_image('performance charts', fig_pr)
It is a typical model training script with a few additions:
- We connected Neptune to the script with neptune.init() and passed our API token and the project name
- We created an experiment and saved parameters with neptune.create_experiment(params=PARAMS)
- We added a learning curves callback with callbacks=[neptune_monitor()]
- We logged test evaluation metrics with neptune.log_metric()
- We logged performance charts with neptune.log_image()
Now when you run your script:
python train.py
You should get something like this:
See this experiment in Neptune
2. Add a snippet that is run only in the CI environment.
Add the following snippet at the bottom of your training script.
if os.getenv('CI') == "true":
    neptune.append_tag('ci-pipeline', os.getenv('NEPTUNE_EXPERIMENT_TAG_ID'))
What this does is:
- fetch the CI environment variable to see whether the code is running inside the Github Actions workflow
- add a ci-pipeline tag to the experiment so that it is easier to filter out CI runs in the Neptune UI (see the sketch after this list)
- get the NEPTUNE_EXPERIMENT_TAG_ID environment variable used to identify the experiment in the CI workflow and log it to Neptune as a tag (this will become clear later).
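Because every CI run gets the ci-pipeline tag, you can later pull all of those runs programmatically. Below is a minimal sketch that assumes the same legacy neptune-client API used in the training script (an optional illustration, not something the workflow itself needs):

import os

import neptune

# Connect to the same project the training script logs to
project = neptune.init(api_token=os.getenv('NEPTUNE_API_TOKEN'),
                       project_qualified_name=os.getenv('NEPTUNE_PROJECT_NAME'))

# get_leaderboard returns a pandas DataFrame with one row per experiment;
# filtering by tag keeps only the runs created by the CI workflow
ci_runs = project.get_leaderboard(tag='ci-pipeline')
print(ci_runs.head())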
Step 2: Create an environment file
Having an environment setup file that makes it easy to create your training or evaluation environment from scratch is generally good practice.
But when you are training models in a CI workflow (like Github Actions), it is a must: the environment where the workflow is executed (and the models are trained) is created from scratch every time your workflow is triggered.
There are a few choices when it comes to environment setup files. You can use:
- Pip and the requirements.txt,
- Conda and the environment.yaml,
- Docker and the Dockerfile (this is often the best option)
Let’s go with the simplest solution and create a requirements.txt file with all the packages we need:
requirements.txt
lightgbm==2.3.1
neptune-client==0.4.125
neptune-contrib==0.24.8
numpy==1.19.0
scikit-learn==0.23.1
scikit-plot==0.3.7
Now, whenever you need to run your training, install all the packages with:
pip install -r requirements.txt
Step 3: Set up Github Secrets
GitHub Secrets allow you to pass sensitive information like keys or passwords to the Github CI workflow runners so that your automated tests can be executed.
In our case, two sensitive things are needed:
- NEPTUNE_API_TOKEN: I’ll set it to the key of the anonymous Neptune user, ANONYMOUS
- NEPTUNE_PROJECT_NAME: I’ll set it to the open project shared/github-actions
Note:
You can set those to your API token and the Neptune project you created.
Without those, Github wouldn’t know where to send the experiments and Neptune wouldn’t know who is sending them (and whether this should be allowed).
To set up GitHub Secrets:
- Go to your Github project
- Go to the Settings tab
- Go to the Secrets section
- Click on New secret
- Specify the name and value of the secret (similarly to environment variables)
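Inside the workflow runner, these secrets surface as plain environment variables, which is why train.py reads them with os.getenv. If you want a clearer error message when a secret is missing, you could add a small guard like the one below (a hypothetical helper, not part of the tutorial's train.py):

import os


def require_env(name):
    # Fail fast with a clear message if a required environment variable is missing
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f'{name} is not set - did you configure the matching GitHub Secret?')
    return value


api_token = require_env('NEPTUNE_API_TOKEN')
project_name = require_env('NEPTUNE_PROJECT_NAME')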
Step 4: Create .github/workflows directory and a .yml action file
Github will run all workflows that you define in the .github/workflows directory with .yml configuration files, which means you need to:
1. Create the .github/workflows directory
Go to your project repository and create both the .github and .github/workflows directories.
mkdir -p .github/workflows
2. Create neptune_action.yml
Workflow configs that define actions are .yml files of a certain structure that you put in the .github/workflows directory. You can have multiple .yml files to fire multiple workflows.
I will create just one, neptune_action.yml:
touch .github/workflows/neptune_action.yml
As a result, you should see:
your-repository
├── .github
│   └── workflows
│       └── neptune_action.yml
├── .gitignore
├── README.md
├── requirements.txt
└── train.py
Step 5: Define your workflow .yml config
Workflow configs are .yml files where you specify what you want to happen and when.
In a nutshell, you define:
- on which Github event you would like to trigger the workflow. For example, on a commit to the master branch,
- what jobs you would like to perform. This is mostly to organize the config,
- where those jobs should be performed. For example, runs-on: ubuntu-latest will run your workflow on the latest version of ubuntu,
- what steps within each job you would like to run sequentially. For example, create an environment, run training, and run evaluation of the model.
In our machine learning CI workflow we need to run the following sequence of steps:
- Check out branch develop
- Set up the environment and run model training on branch develop
- Check out branch master
- Set up the environment and run model training on branch master
- Fetch data from Neptune and create an experiment comparison markdown table
- Comment on the PR with that markdown table
Here is the neptune_action.yml that does all that. I know it seems complex, but in reality, it’s just a bunch of steps that run terminal commands with some boilerplate around it.
neptune_action.yml
name: Neptune actions
on:
  pull_request:
    branches: [master]
jobs:
  compare-experiments:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: [3.7]
    env:
      NEPTUNE_API_TOKEN: ${{ secrets.NEPTUNE_API_TOKEN }}
      NEPTUNE_PROJECT_NAME: ${{ secrets.NEPTUNE_PROJECT_NAME }}
    steps:
      - name: Set up Python
        uses: actions/setup-python@v2
        with:
          python-version: ${{ matrix.python-version }}
      - name: Checkout pull request branch
        uses: actions/checkout@v2
        with:
          ref: develop
      - name: Setup pull request branch environment and run experiment
        id: experiment_pr
        run: |
          pip install -r requirements.txt
          export NEPTUNE_EXPERIMENT_TAG_ID=$(uuidgen)
          python train.py
          echo ::set-output name=experiment_tag_id::$NEPTUNE_EXPERIMENT_TAG_ID
      - name: Checkout main branch
        uses: actions/checkout@v2
        with:
          ref: master
      - name: Setup main branch environment and run experiment
        id: experiment_main
        run: |
          pip install -r requirements.txt
          export NEPTUNE_EXPERIMENT_TAG_ID=$(uuidgen)
          python train.py
          echo ::set-output name=experiment_tag_id::$NEPTUNE_EXPERIMENT_TAG_ID
      - name: Get Neptune experiments
        env:
          MAIN_BRANCH_EXPERIMENT_TAG_ID: ${{ steps.experiment_main.outputs.experiment_tag_id }}
          PR_BRANCH_EXPERIMENT_TAG_ID: ${{ steps.experiment_pr.outputs.experiment_tag_id }}
        id: compare
        run: |
          pip install -r requirements.txt
          python -m neptunecontrib.create_experiment_comparison_comment \
            --api_token $NEPTUNE_API_TOKEN \
            --project_name $NEPTUNE_PROJECT_NAME \
            --tag_names $MAIN_BRANCH_EXPERIMENT_TAG_ID $PR_BRANCH_EXPERIMENT_TAG_ID \
            --filepath comment_body.md
          result=$(cat comment_body.md)
          echo ::set-output name=result::$result
      - name: Create a comment
        uses: peter-evans/commit-comment@v1
        with:
          body: |
            ${{ steps.compare.outputs.result }}
You can just copy this file and paste it into your .github/workflows directory and it will work out of the box.
That said, there are some things that you may need to adjust to your setup:
- Branch names, if you want to trigger your workflow on a PR from a branch other than develop or to a branch other than master.
- Environment setup steps, if you are using anything other than pip and requirements.txt.
- The command that runs your training scripts.
Note:
Explaining this config in detail would make this post really long, so I decided not to :). If you’d like to understand everything about it, check out the Github Actions documentation (which is great, by the way).
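Still, if you are curious what the comparison step does conceptually, here is a rough Python sketch (my own illustration, not the actual neptunecontrib.create_experiment_comparison_comment implementation): it fetches the two tagged experiments with the legacy neptune-client API and writes the columns that differ into a markdown table.

import os

import neptune

# Connect to the project that both CI experiments were logged to
project = neptune.init(api_token=os.getenv('NEPTUNE_API_TOKEN'),
                       project_qualified_name=os.getenv('NEPTUNE_PROJECT_NAME'))

# Each CI run was tagged with a unique id, so filtering by tag returns a single row
main_row = project.get_leaderboard(tag=os.getenv('MAIN_BRANCH_EXPERIMENT_TAG_ID')).iloc[0]
pr_row = project.get_leaderboard(tag=os.getenv('PR_BRANCH_EXPERIMENT_TAG_ID')).iloc[0]

# Keep only the columns where the two runs differ and render a markdown table
lines = ['| column | master | PR |', '| --- | --- | --- |']
for col in sorted(set(main_row.index) & set(pr_row.index)):
    if str(main_row[col]) != str(pr_row[col]):
        lines.append(f'| {col} | {main_row[col]} | {pr_row[col]} |')

with open('comment_body.md', 'w') as f:
    f.write('\n'.join(lines))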
Step 6: Push it to Github
Now you need to push this workflow to GitHub.
git add .github/workflows train.py requirements.txt
git commit -m "added continuous integration"
git push
Since our workflow will be triggered on every Pull Request to master, nothing will happen just yet.
Step 7: Create a Pull Request
Now everything is ready and you just need to create a PR from branch develop to master.
1. Check out a new branch develop
git checkout -b develop
2. Change some parameters in train.py
train.py
PARAMS = {'boosting_type': 'gbdt',
          'objective': 'multiclass',
          'num_class': 3,
          'num_leaves': 15,  # previously 8
          'learning_rate': 0.01,
          'feature_fraction': 0.85,  # previously 0.9
          'seed': 1234
          }
3. Add, commit and push your changes to the previously created branch develop
git add train.py
git commit -m "tweaked parameters"
git push origin develop
4. Go to Github and create a Pull Request from branch develop to master.
The workflow is triggered and it goes through all the steps one by one.
Explore the result
If everything worked correctly you should see a Pull Request comment that shows:
- Diffs in parameters, properties, and evaluation metrics.
- Experiment IDs and links to both the main and PR branch runs in Neptune. You can go and see all the details of those experiments, including the learning curves and performance charts that were logged.
- A link to a full comparison between those runs in Neptune.
See this Pull Request on Github

Final thoughts
Ok, so in this how-to guide, you learned how to set up a Continuous Integration workflow that creates a comparison table for every Pull Request to master.
Hopefully, with this information, you will be able to create the CI workflow that works for your machine learning project!