MLOps Blog

How to Automate ML Experiment Management With CI/CD

8 min
5th June, 2024

TL;DR

Using CI/CD workflows to run ML experiments ensures their reproducibility, as all the required information has to be contained under version control.

GitHub’s CI/CD solution, GitHub Actions, is popular because it’s directly integrated into the platform and easy to use. GitHub Actions and Neptune are an ideal combination for automating machine-learning model training and experimentation.

Getting started with CI/CD for experiment management requires just a few changes to the training code, ensuring that it can run standalone on a remote machine.

The compute resources GitHub Actions provides out of the box are not suitable for larger-scale ML workloads, but you can register your own compute resources to run GitHub Actions workflows.

ML experiments are, by nature, full of uncertainty and surprises. Small changes can lead to huge improvements, but sometimes, even the most clever tricks don’t yield results.

Either way, systematic iteration and exploration are the way to go. This is where things often start getting messy. With the many directions we could take, it’s easy to lose sight of what we’ve tried and its effect on our model’s performance. Furthermore, ML experiments can be time-consuming, and we risk wasting money by re-running experiments with already-known results.

Using an experiment tracker like neptune.ai, we can meticulously log information about our experiments and compare the outcomes of different attempts. This allows us to identify which hyperparameter settings and data samples contribute positively to our model’s performance.

But, recording metadata is only half the secret to ML modeling success. We also need to be able to launch experiments to make progress quickly. Many data science teams with a Git-centered workflow find CI/CD platforms the ideal solution.

In this article, we’ll explore this approach to managing machine-learning experiments and discuss when this approach is right for you. We’ll focus on GitHub Actions, the CI/CD platform integrated into GitHub, but the insights also apply to other CI/CD frameworks.

Why should you adopt CI/CD for machine learning experiments?

A machine-learning experiment typically involves training a model and evaluating its performance. Initially, we set up the model’s configuration and the training algorithm. Then, we launch the training on a well-defined dataset. Finally, we evaluate the model’s performance on a test dataset.

Many data scientists prefer working in notebooks. While this works well during the exploratory phase of a project, it quickly becomes difficult to keep track of the configurations we’ve tried. 

Even when we log all relevant information with an experiment tracker and store snapshots of our notebooks and code, returning to a previous configuration is often tedious.

With a version control system like Git, we can easily store a specific code state, return to it, or branch off in different directions. We can also compare two versions of our model training setup to uncover what changed between them.

However, there are several problems:

  • An experiment is only replicable if the environment, dataset, and dependencies are well-defined. Just because model training runs fine on your laptop, it’s not a given that your colleague can also run it on theirs – or that you’ll be able to re-run it in a couple of months – based on the information contained in the Git repository.
  • Setting up the training environment is often cumbersome. You have to install the necessary runtimes and dependencies, configure access to datasets, and set up credentials for the experiment tracker. If model training takes a long time or requires specialized hardware like GPUs, you’ll sometimes find yourself spending more time setting up remote servers than solving your modeling problem.
  • It’s easy to forget to commit all relevant files to source control each time you run an experiment, especially when launching a series of experiments in quick succession without committing the changes between runs.

The good news is that you can solve all these problems by running your machine-learning experiments using a CI/CD approach. Instead of treating running the experiments and committing the code as separate activities, you link them directly.

Here’s what this looks like:

  1. You configure the experiment and commit the code to your Git repository.
  2. You push the changes to the remote repository (in our case, GitHub).
  3. Then, there are two alternatives that teams typically use:
  • The CI/CD system (in our case, GitHub Actions) detects that a new commit has been pushed and launches a training run based on the code.
  • You manually trigger a CI/CD workflow run with the latest code in the repository, passing the model and training parameters as input values.

Since this will only work if the experiment is fully defined within the repository and there is no room for manual intervention, you’re forced to include all relevant information in the code.

Comparison of a machine-learning experimentation setup without (left) and with (right) CI/CD. Without CI/CD, the training is conducted on a local machine. There is no guarantee that the environment is well-defined or that the exact version of the code used is stored in the remote GitHub repository. In the setup with CI/CD, the model training runs on a server provisioned based on the code and information in the GitHub repository.

Tutorial: Automating your machine learning experiments with GitHub Actions

In the following sections, we’ll walk through the process of setting up a GitHub Actions workflow to train a machine-learning model and log metadata to Neptune.

To follow along, you need a GitHub account. We’ll assume that you’re familiar with Python and the basics of machine learning, Git, and GitHub. 

You can either add the CI/CD workflow to an existing GitHub repository that contains model training scripts or create a new one. If you’re just curious about what a solution looks like, we’ve published a complete version of the GitHub Actions workflow and an example training script. You can also explore the full example Neptune project.


Step 1: Structure your training script

When you’re looking to automate model training and experiments via CI/CD, it’s likely that you already have a script for training your model on your local machine. (If not, we’ll provide an example at the end of this section.)

To run your training on a GitHub Actions runner, you must be able to set up the Python environment and launch the script without manual intervention.

There are several best practices we recommend you follow:

  • Create separate functions for loading data and training the model. This splits your training script into two reusable parts that you can develop and test independently. It also allows you to load the data just once but train multiple models on it.
  • Pass all model and training parameters that you want to change between experiments via the command line. Instead of relying on a mix of hard-coded default values, environment variables, and command-line arguments, define all parameters through a single method. This will make it easier to trace how values pass through your code and provide transparency to the user. Python’s built-in argparse module offers all that’s typically required, but more advanced options like typer and click are available.
  • Use keyword arguments everywhere and pass them via dictionaries. This prevents you from getting lost among the tens of parameters that are typically required. Passing dictionaries allows you to log and print the precise arguments used when instantiating your model or launching the training.
  • Print out what your script is doing and the values it’s using. It will be tremendously helpful if you can see what’s happening by observing your training script’s output, particularly if something doesn’t go as expected.
  • Do not include API tokens, passwords, or access keys in your code. Even though your repository might not be publicly available, it’s a major security risk to commit access credentials to version control or to share them. Instead, they should be passed via environment variables at runtime. (If this is not yet familiar to you but you need to fetch your training data from remote storage or a database server, you can skip ahead to steps 3 and 4 of this tutorial to learn about one convenient and safe way to handle credentials.)
  • Define and pin your dependencies. Since GitHub Actions will prepare a new Python environment for every training run, all dependencies must be defined. Their versions should be fixed to create reproducible results. In this tutorial, we’ll use a requirements.txt file, but you can also rely on more advanced tools like Poetry, Hatch, or Conda.

Here’s a full example of a training script for a scikit-learn DecisionTreeClassifier on the well-known iris toy dataset that we’ll use throughout the remainder of this tutorial:

import argparse

from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def load_data():
    print("Loading dataset")
    iris_dataset = load_iris()
    X, y = iris_dataset.data, iris_dataset.target
    print(f"Loaded dataset with {len(X)} samples.")
    return train_test_split(X, y, test_size=1 / 3)


def train(data, criterion, max_depth):
    print("Training a DecisionTreeClassifier")

    print("Unpacking training and evaluation data...")
    X_train, X_test, y_train, y_test = data

    print("Instantiating model...")
    model_parameters = {
        "criterion": criterion,
        "splitter": "best",
        "max_depth": max_depth,
    }
    print(model_parameters)
    model = DecisionTreeClassifier(**model_parameters)

    print("Fitting model...")
    model.fit(X_train, y_train)

    print("Evaluating model...")
    y_pred = model.predict(X_test)
    evaluation = {
        "f1_score": f1_score(y_test, y_pred, average="macro"),
        "accuracy": accuracy_score(y_test, y_pred),
    }
    print(evaluation)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--criterion", type=str)
    parser.add_argument("--max-depth", type=int)
    args = parser.parse_args()

    data = load_data()
    train(
        data,
        criterion=args.criterion,
        max_depth=args.max_depth,
    )

The only dependency of this script is scikit-learn, so our requirements.txt looks as follows:

scikit-learn==1.4.2

The training script can be launched from the terminal like this:

python train.py --criterion gini --max-depth 10

Step 2: Set up a GitHub Actions workflow

GitHub Actions workflows are defined as YAML files and have to be placed in the .github/workflows directory of our GitHub repository.
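For orientation, here’s the repository layout this tutorial works toward (a sketch; only the .github/workflows path is mandated by GitHub, the other file names are simply the ones we use in this example):

.
├── .github/
│   └── workflows/
│       └── train.yaml
├── requirements.txt
└── train.py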

In that directory, we’ll create a train.yaml workflow definition file that initially just contains the name of the workflow:

name: Train Model

We use the workflow_dispatch trigger, which allows us to manually launch the workflow from the GitHub repository. With the inputs block, we specify the input parameters we want to be able to set for each run:

on:
 workflow_dispatch:
   inputs:
     criterion:
       description: "The function to measure the quality of a split:"
       default: gini
       type: choice
       options:
         - gini
         - entropy
         - log_loss
     max-depth:
       description: "The maximum depth of the tree:"
       type: number
       default: 5

 

Here, we’ve defined the input parameter “criterion” as a selection of one of three possible values. The “max-depth” parameter is a number that we can enter freely (see the GitHub documentation for all supported types).

Our workflow contains a single job for training the model:

jobs:
 train:
   runs-on: ubuntu-latest
   steps:
     - name: Check out repository
       uses: actions/checkout@v4


     - name: Set up Python
       uses: actions/setup-python@v5
       with:
         python-version: '3.12'
         cache: 'pip'


     - name: Install dependencies
       run: |
          pip install --upgrade pip
          pip install -r requirements.txt


     - name: Train model
       run: |
         python train.py \
           --criterion ${{ github.event.inputs.criterion }} \
           --max-depth ${{ github.event.inputs.max-depth }}

This workflow checks out the code, sets up Python, and installs the dependencies from our requirements.txt file. Then, it launches the model training using our train.py script.

Once we’ve committed the workflow definition to our repository and pushed it to GitHub, we’ll see our new workflow in the “Actions” tab. From there, we can launch it as shown in the following screenshot:

Manually launching the GitHub Actions workflow from the GitHub UI | Source: Author

Navigate to the “Actions” tab, select the “Train Model” workflow in the sidebar on the left-hand side, and click the “Run workflow” dropdown in the upper right-hand corner of the run list. Then, set the input parameters, and finally click “Run workflow” to launch the workflow. (For more details, see Manually running a workflow in the GitHub documentation.)
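If you prefer the command line, you can trigger the same workflow_dispatch run with the GitHub CLI instead of the web UI (a sketch, assuming gh is installed and authenticated for your repository):

gh workflow run train.yaml -f criterion=gini -f max-depth=5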

If everything is set up correctly, you’ll see a new workflow run appear in the list. (You might have to refresh your browser if it does not appear after a few seconds.) If you click on the run, you can see the console logs and follow along as the GitHub runner executes the workflow and training steps.

Step 3: Add Neptune logging to the script

Now that we’ve automated the model training, it’s time to start tracking the training runs with Neptune. For this, we’ll have to install additional dependencies and adapt our training script.

For Neptune’s client to send the data we collect to Neptune, it needs to know the project name and an API token that grants access to the project. Since we don’t want to store this sensitive information in our Git repository, we’ll pass it to our training script through environment variables.

Pydantic’s BaseSettings class is a convenient way to parse configuration values from environment variables. To make it available in our Python environment, we have to install it via pip install pydantic-settings.

At the top of our training script, right below the imports, we add a settings class with two entries of type “str”:

from pydantic_settings import BaseSettings




class Settings(BaseSettings):
   NEPTUNE_PROJECT: str
   NEPTUNE_API_TOKEN: str




settings = Settings()

When the class is initialized, it reads the environment variables of the same name. (You can also define default values or use any of the many other features of Pydantic models).
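With this in place, you can still run the script locally for a quick sanity check by exporting the two variables in your shell before launching it (placeholder values shown; see step 4 for where to find your project name and API token):

export NEPTUNE_PROJECT="your-workspace/your-project"
export NEPTUNE_API_TOKEN="<your-api-token>"
python train.py --criterion gini --max-depth 10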

Next, we’ll define the data we track for each training run. First, we install the Neptune client by running pip install neptune. If you’re following along with the example or are training a different scikit-learn model, also install Neptune’s scikit-learn integration via pip install neptune-sklearn.

Once the installation has been completed, add the import(s) to the top of your train.py script:

import neptune

import neptune.integrations.sklearn as npt_utils  # for scikit-learn models

Then, at the end of our train() function, after the model has been trained and evaluated, initialize a new Neptune run using the configuration variables in the settings object we defined above:

run = neptune.init_run(
   project=settings.NEPTUNE_PROJECT,
   api_token=settings.NEPTUNE_API_TOKEN,
)

A Run is the central object for logging experiment metadata with Neptune. We can treat it like a dictionary to add data. For example, we can add the dictionaries with the model’s parameters and the evaluation results:

run["model/parameters"] = model_parameters
run["evaluation"] = evaluation

We can add structured data like numbers and strings, as well as series of metrics, images, and files. To learn about the various options, have a look at the overview of essential logging methods in the documentation.
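For instance, if your model were trained over multiple epochs, you could record a metric series by appending one value per epoch under a single key (a hypothetical snippet; our decision tree example has no training loop):

for epoch_loss in [0.42, 0.31, 0.27]:
    run["train/loss"].append(epoch_loss)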

For our example, we’ll use Neptune’s scikit-learn integration, which provides utility functions for typical use cases. For example, we can generate and log a confusion matrix and upload the trained model:

run["visuals/confusion_matrix"] = npt_utils.create_confusion_matrix_chart(
   model, X_train, X_test, y_train, y_test
)
run["estimator/pickled-model"] = npt_utils.get_pickled_model(model)

We conclude the Neptune tracking block by stopping the run, which is now the last line in our train() function:

run.stop()

To see a complete version of the training script, head to the GitHub repository for this tutorial.

Before you commit and push your changes, don’t forget to add pydantic-settings, neptune, and neptune-sklearn to your requirements.txt.
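The updated requirements.txt could look like this (the pinned versions are illustrative; pin whichever versions pip resolves in your environment):

scikit-learn==1.4.2
pydantic-settings==2.2.1
neptune==1.10.4
neptune-sklearn==2.1.0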

Step 4: Set up a Neptune project and pass credentials to the workflow

The last ingredients we need before launching our first tracked experiment are a Neptune project and a corresponding API access token.

If you don’t yet have a Neptune account, head to the registration page to sign up for a free personal account.

Log in to your Neptune workspace and either create a new project or select an existing one. In the bottom-left corner of your screen, click on your user name and then on “Get your API token”:

The “Get your API token” view | Source: Author

Copy the API token from the widget that pops up.

Now, you can head over to your GitHub repository and navigate to the “Settings” tab. There, select “Environments” in the left-hand sidebar and click on the “New environment” button in the upper right-hand corner. Environments are how GitHub Actions organizes and manages access to configuration variables and credentials.

We’ll call this environment “Neptune” (you can also pick a project-specific name if you plan to log data to different Neptune accounts from the same repository) and add a secret and a variable to it.

Setting up the Neptune environment in the GitHub repository | Source: Author

The NEPTUNE_API_TOKEN secret contains the API token we just copied, and the NEPTUNE_PROJECT variable is the full name of our project, including the workspace name. While variables are visible in plain text, secrets are stored encrypted and are only accessible from GitHub Actions workflows.

To learn the project name, navigate to the project overview page in Neptune’s UI, find your project, and click on “Edit project information”:

Finding the project name in Neptune | Source: Author

This opens a widget where you can change and copy the full name of your project.

Once we’ve configured the GitHub environment, we can modify our workflow to pass the information to our extended training script. We need to make two changes:

  • In our job definition, we’ll have to specify the name of the environment to retrieve the secrets and variables from:
jobs:
 train:
   runs-on: ubuntu-latest
   environment: Neptune
   steps:
      # …
  • In our training step, we pass the secret and the variable as environment variables:
- name: Train model
  env:
   NEPTUNE_API_TOKEN: ${{ secrets.NEPTUNE_API_TOKEN }}
   NEPTUNE_PROJECT: ${{ vars.NEPTUNE_PROJECT }}
  run: |
   python train.py \
     --criterion ${{ github.event.inputs.criterion }} \
     --max-depth ${{ github.event.inputs.max-depth }}

Step 5: Run training and inspect results

Now, it’s finally time to see everything in action!

Head to the “Actions” tab, select our workflow and launch it. Once the training is completed, you’ll see from the workflow logs how the Neptune client collects and uploads the data.

In Neptune’s UI, you’ll find the experiment run in your project’s “Runs” view. You’ll see that Neptune not only tracked the information you defined in your training script but also automatically collected a lot of other data.

For example, you’ll find your training script and information about the Git commit it belongs to under “source_code.”

If you used the scikit-learn integration and logged a full summary, you can access various diagnostic plots under “summary” in the “All metadata” tab or the “Images” tab.

You can explore the full Neptune example project.

Running GitHub Actions jobs on your own servers

By default, GitHub Actions executes workflows on servers hosted by GitHub, which are called “runners”. These virtual machines are designed to run software tests and compile source code, but not for processing large amounts of data or training machine-learning models.

GitHub also provides an option to self-host runners for GitHub Actions. Simply put, we provision a server that registers with GitHub and executes the jobs dispatched to it. This allows us to configure virtual machines (or set up our own hardware) with the required specifications, e.g., large amounts of memory and GPU support.

To set up a self-hosted runner, head to the “Settings” tab, click on “Actions” in the left-hand sidebar, and select “Runners” in the sub-menu. In the “Runners” dialogue, click the “New self-hosted runner” button in the upper right-hand corner.

This will open a page with instructions on how to provision a machine that registers as a runner with GitHub Actions. Once you’ve set up a self-hosted runner, you only need to change the runs-on parameter in your workflow file from ubuntu-latest to self-hosted.
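The corresponding change in train.yaml is a one-liner:

jobs:
 train:
   runs-on: self-hosted  # previously: ubuntu-latest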

For more details, options, and security considerations, see the GitHub Actions documentation.

Conclusion

We’ve seen that it’s straightforward to get started with CI/CD for machine-learning experimentation. Using GitHub Actions and Neptune, we’ve walked through the entire process from a script that works on a local machine to an end-to-end training workflow with metadata tracking.

Developing and scaling a CI/CD-based ML setup takes some time as you discover your team’s preferred way of interacting with the repository and the workflows. However, the key benefits – full reproducibility and transparency about each run – will be there from day one.

Beyond experimentation, you can consider running hyperparameter optimization and model packaging through CI/CD as well. Some data science and ML platform teams structure their entire workflow around Git repositories, a practice known as “GitOps.”

But even if you just train a small model now and then, GitHub Actions is a great way to make sure you can reliably re-train and update your models.
