How to Build MLOps Pipelines with GitHub Actions [Step by Step Guide]

For a while now, MLOps has been a crucial practice for all companies that use Machine Learning and Deep Learning on a daily basis. In fact, there are already many solutions built specifically to solve problems related to MLOps pipelines.

However, for many companies that have just a few models in production, the cost of learning and using new tools (in this case, MLOps pipeline and automation tools) exceeds the benefits they bring.

The truth is that to build simple MLOps pipelines there is no need to learn new complex tools such as Kubeflow or Airflow. In fact, you can build simple (but effective) MLOps pipelines with one of the most used tools in software development: GitHub.

In this post, I am going to explain how to create MLOps pipelines in a very simple way with GitHub, GitHub Actions, and a Cloud service provider. Sounds interesting? Let’s get to it!

What is GitHub Actions?

GitHub Actions is a tool offered by GitHub to automate software workflows. For instance, software developers use GitHub Actions to automate branch merges, handle issues in GitHub, run application tests, etc.

However, we as Data Scientists can also use GitHub Actions for many things. In our case, we will use it to automate several steps of the MLOps workflow, such as:

  • Automating the ETL process.
  • Checking whether the model should be retrained or not.
  • Uploading the new model to your cloud provider and deploying it.

As you can see, GitHub Actions will be used a lot throughout our MLOps pipeline. The good thing is that if you don’t know how GitHub Actions works, it is super easy to learn: you just have to add a .yaml file in the .github/workflows folder. In this .yaml file, you will specify several things such as:

  • The name of the workflow.
  • When the workflow should trigger: on a cron schedule, on an HTTP request, manually, etc.
  • The OS where the workflow will run (Ubuntu, Windows or macOS).
  • Each of the steps that the workflow should execute.

Note: as this is not a GitHub Actions tutorial, I will not go deeper into the topic. However, here you have a tutorial that will help you learn how to use GitHub Actions.

Pros and cons of using GitHub Actions for MLOps workflows

The good thing about using GitHub Actions for MLOps is not just that we don’t have to learn a new tool. It has many other advantages too:

  • GitHub Actions works with the main programming languages used for Data Science: Python, R, Julia, etc.
  • You can use the experience of learning GitHub Actions to automate other processes.
  • GitHub Actions is free for public repositories. Besides, regarding private repositories, GitHub Actions offers the following:
    • Free accounts: 2,000 minutes/month of GitHub Actions.
    • Team accounts: 3,000 minutes/month of GitHub Actions.
    • Enterprise accounts: 50,000 minutes/month of GitHub Actions.
  • This automation works perfectly with main Cloud providers such as AWS, Azure and Google Cloud, so you’re not tied to a single Cloud provider.
  • Every time a workflow fails you will automatically receive an email.

As you can see, using GitHub Actions for MLOps has several pros. However, as you might imagine, it is not a perfect solution in all scenarios. I would not advise using GitHub Actions as an MLOps workflow tool in the following scenarios:

  • You have many models that you must put into production, or a few but complex models (such as Deep Learning models). In this case, you will need a lot of computing power to train your models, and GitHub Actions’ machines are not suited for that.
  • You are already using a tool that could be used for MLOps. For example, imagine that you are using Apache Airflow for ETL processes. In that case, it probably would be a better idea to use this tool for MLOps rather than building an MLOps pipeline with GitHub.

Now that you know what GitHub Actions is and when you should or should not use it as an MLOps pipeline tool, let’s learn how you can build MLOps pipelines with GitHub Actions if you decide to use it.

How to build MLOps Pipelines with GitHub and Google Cloud [step by step guide]

In the following sections, we will discuss how we can build an MLOps pipeline and put it into production with GitHub, GitHub Actions, and Google Cloud. To do so, we will build a model that predicts the number of Bitcoin transactions per hour.

More specifically you will learn:

  1. How to set up data extraction pipelines with GitHub Actions.
  2. How to build a model-train and selection pipeline with GitHub Actions.
  3. How to wrap your model as an API.
  4. How to Dockerize your API so that your code is portable and deployable in any Docker-friendly cloud service.
  5. How to set up a continuous deployment pipeline with Cloud and GitHub Actions.
  6. How to automate model retrain with GitHub Actions.

If you already have experience deploying machine learning pipelines, some of the steps may already look familiar to you. I would still encourage you to read them so that you can learn how it is done with GitHub Actions.

Let’s start with the MLOps tutorial with GitHub Actions!

Data extraction pipeline with GitHub Actions

Data extraction pipeline requirements

In order to create a model, we first need to get data. Besides, as we want to create an MLOps pipeline, we will need to create an ETL process that extracts the data, transforms it, and loads it somewhere, like a data lake, data warehouse, or database. By doing so, we will be able to retrain the model with fresh new data anytime and put it into production.

If we want to do this with GitHub Actions, we will need to create one or more scripts that undertake the ETL process. These scripts should run without manual intervention and, ideally, should also handle exceptions and send warnings when errors arise.

For example, if you are required to extract data from an external source, it is generally good practice to send a warning to someone when an external source is not working. This will help debug the process and will definitely make things much easier.
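
As a minimal sketch of that idea, the wrapper below runs one ETL step and sends a warning if it fails. The names `run_with_alert` and `notify` are hypothetical and not part of the original pipeline; `notify` stands for whatever channel you use (email, chat webhook, etc.).

```python
import logging
from typing import Callable, TypeVar

T = TypeVar("T")
logger = logging.getLogger("etl")


def run_with_alert(step_name: str, step: Callable[[], T], notify: Callable[[str], None]) -> T:
    """Run one ETL step; on failure, send a warning and re-raise the error."""
    try:
        return step()
    except Exception as exc:
        message = f"ETL step '{step_name}' failed: {exc!r}"
        logger.error(message)
        notify(message)  # hypothetical alert channel supplied by the caller
        raise  # re-raising keeps the script's exit code non-zero


def broken_source() -> dict:
    # Stand-in for a request to an external API that is currently down.
    raise ConnectionError("source unavailable")


sent_warnings = []
try:
    run_with_alert("extract", broken_source, sent_warnings.append)
except ConnectionError:
    pass
print(sent_warnings)
```

Because the exception is re-raised, the script still exits with an error, so the workflow run is marked as failed in addition to the custom warning.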

Besides, it’s important to note that not all processes need to be done in a single script. The only limitation is that the steps of a GitHub Actions job run one after another, so scripts placed in the same job cannot run in parallel.

That being said, let’s see how data extraction pipelines with GitHub Actions look in practice:

Example of data extraction pipeline with GitHub Actions

In our example, Blockchain offers this API that shows the number of transactions added to the pool per minute.

Considering this, as we want to predict the number of transactions per hour, our ETL pipeline will consist of the following:

  • Extraction: read the information from the API.
  • Transformation: get the number of hourly transactions by grouping and summing.
  • Load: upload the information to a database to store all the historical information.
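
As a rough illustration of the transformation step alone, the sketch below groups data points by hour and sums them using only the standard library. It assumes each point has a Unix timestamp `x` and a value `y`, as in the API response, and it truncates each timestamp to the start of its hour, which is a simplification of the rounding used in the full script. The function name `hourly_totals` and the sample values are made up for the example.

```python
from collections import defaultdict
from datetime import datetime, timezone


def hourly_totals(points):
    """Sum 'y' per hour for points shaped like {'x': unix_seconds, 'y': value}."""
    totals = defaultdict(float)
    for point in points:
        ts = datetime.fromtimestamp(point["x"], tz=timezone.utc)
        # Truncate to the start of the hour (assumption: floor, not round)
        hour = ts.replace(minute=0, second=0, microsecond=0)
        totals[hour] += point["y"]
    return dict(sorted(totals.items()))


sample = [
    {"x": 1_640_995_200, "y": 2.5},  # 2022-01-01 00:00:00 UTC
    {"x": 1_640_995_260, "y": 3.0},  # 2022-01-01 00:01:00 UTC
    {"x": 1_640_998_800, "y": 4.0},  # 2022-01-01 01:00:00 UTC
]
for hour, total in hourly_totals(sample).items():
    print(hour.isoformat(), total)
```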

That being said, I have summed up all these processes in the following Python script that undertakes the aforementioned steps.

# import statements
import pandas as pd
import requests
from datetime import datetime
from sqlalchemy import create_engine
import os
uri = os.environ.get('URI')

# obtaining the data
url = 'https://api.blockchain.info/charts/transactions-per-second?timespan=all&sampled=false&metadata=false&cors=true&format=json'
resp = requests.get(url)
data = pd.DataFrame(resp.json()['values'])

# parsing the date
data['x'] = [datetime.utcfromtimestamp(x).strftime('%Y-%m-%d %H:%M:%S') for x in data['x']]
data['x'] = pd.to_datetime(data['x'])

# reading the last real date from the database
engine = create_engine(uri)
query = engine.execute('SELECT MAX(reality_date) FROM reality;')
last_reality_date = query.fetchall()[0][0]
query.close()

# reading the first prediction date from the database (reusing the same engine)
query = engine.execute('SELECT MIN(prediction_date) FROM predictions;')
first_prediction_date = query.fetchall()[0][0]
query.close()

# if there is no real data yet, start from the first prediction;
# otherwise resume from the last real date already stored
if last_reality_date is None:
    date_extract = first_prediction_date
else:
    date_extract = last_reality_date

# rounding hours to get hourly data
data['x'] = data['x'].dt.round('H')

# getting the number of transactions per hour
data_grouped = data.groupby('x').sum().reset_index()

# getting the data from the last data available in the database
data_grouped = data_grouped.loc[data_grouped['x'] >= date_extract,:]

# preparing the data to upload it to the database
upload_data = list(zip(data_grouped['x'], round(data_grouped['y'],4)))

# inserting the data in the database
for upload_day in upload_data:
    timestamp, reality= upload_day
    result = engine.execute(f"INSERT INTO reality(reality_date, reality) VALUES('{timestamp}', '{reality}') ON CONFLICT (reality_date) DO UPDATE SET reality_date = '{timestamp}', reality= '{reality}';")
    result.close()

Finally, I automated this workflow using GitHub Actions, so that every six hours the script is executed and new data is inserted into the database. This automation has been done with the following YAML file:

name: update-ddbb

on:
  schedule:
    - cron: '0 0/6 * * *' #Execute every 6 hours
  workflow_dispatch: 

jobs:
  build:
    runs-on: ubuntu-latest
    steps:

      - name: Access the repo
        uses: actions/checkout@v2 
    
      - name: Configure Python
        uses: actions/setup-python@v2
        with:
          python-version: '3.9.7' 
      
      - name: Install necessary libraries
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements.txt
          
      - name: Execute Python Script
        env: 
          URI: ${{ secrets.URI }}
        run: python update_real_data.py

As you can see, with just two simple files we have built and automated an ETL process. As I said before, you could also do this using other tools like Apache Airflow, but if you don’t have it you can use GitHub Actions.

Now that we have learned how to get the data, let’s see how to build model training and selection pipelines with GitHub Actions.

Model train and selection pipeline with GitHub Actions

Model train and selection pipeline requirements

Building a model selection pipeline with GitHub Actions is pretty simple. You just need to create a script that reads the data that you have previously cleaned at the data extraction pipeline and builds several models with it.

After building several models you will need to evaluate them using adequate performance metrics (you can learn more about performance metrics in Machine Learning in this post).

Once you’ve done this you will have found the best performing model that you have trained. This model is the one that we will use to make predictions and that will be served by our API.
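
The selection step itself can be sketched in a few lines. The example below is only an illustration, assuming each candidate model has already produced predictions for the same hold-out period; the model names and numbers are invented, and mean squared error is used because it is the metric of the grid search later in this post.

```python
def mean_squared_error(actual, predicted):
    """Average of squared differences between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def select_best(actual, candidates):
    """Return (name, score) of the candidate with the lowest MSE.

    `candidates` maps a model name to that model's predictions on the
    same hold-out period as `actual`.
    """
    scores = {name: mean_squared_error(actual, preds) for name, preds in candidates.items()}
    best_name = min(scores, key=scores.get)
    return best_name, scores[best_name]


actual = [10.0, 12.0, 11.0]
candidates = {
    "model_a": [9.0, 12.0, 12.0],   # hypothetical predictions
    "model_b": [10.0, 11.5, 11.0],  # hypothetical predictions
}
print(select_best(actual, candidates))
```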

Note: In classification and regression models it is generally a good practice to compare the prediction ability of the new model with the one in production. However, in forecasting models this is not generally needed, as a model with more recent data will generally work better than past models.

Example of model train & selection pipeline

First of all, in order to put the model into production, we must create a script that creates and tunes several models. In this case, I’ve done a grid search using Random Forest as an autoregressive model.

For this I’ve used the skforecast library, which uses sklearn’s models as autoregressive models for time series forecasting (you can learn more about it here).

In a real-world scenario, we should not use one single model with some hyperparameter tuning, but rather we should train several models, and tune the hyperparameters of each model. However, to make the tutorial more dynamic, I will just train one model.

Finally, after building the models I will get and save the best performing model and the last training data into a file. These two files will be necessary to make predictions.

At this point, I would recommend saving the information about all the models that have been built in a metadata store such as Neptune. By doing so, you will have information about all the models that have been built and why the one in production has been chosen.

In our example of Bitcoin hourly transactions prediction, this process has been undertaken with the following Python script:

# import general libraries
import pandas as pd
import pickle

# import model training libraries
from utils import create_predictors
from skforecast.model_selection import grid_search_forecaster
from skforecast.ForecasterAutoregCustom import ForecasterAutoregCustom
from sklearn.ensemble import RandomForestRegressor

# imports for data reading
import requests
from datetime import datetime

# imports for Neptune
import os
from dotenv import load_dotenv
import neptune.new as neptune
load_dotenv()

# To get started with Neptune and obtain required credentials refer to this page.
NEPTUNE_API_KEY = os.environ.get('NEPTUNE_API_KEY')
NEPTUNE_PROJECT = os.environ.get('NEPTUNE_PROJECT')


# Settings
steps = 36                    # number of most recent hours held out as test data
n_datos_entrenar = 200        # number of most recent hours kept (train + test)
path_modelo = 'model.pickle'  # file where the fitted forecaster is saved

# Extract info from Bitcoin
url = 'https://api.blockchain.info/charts/transactions-per-second?timespan=all&sampled=false&metadata=false&cors=true&format=json'
resp = requests.get(url)

data = pd.DataFrame(resp.json()['values'])

# Coerce dates
data['x'] = [datetime.utcfromtimestamp(x).strftime('%Y-%m-%d %H:%M:%S') for x in data['x']]
data['x'] = pd.to_datetime(data['x'])

# renaming columns
data.columns = ['date', 'transactions']

# Get hourly data
data['date'] = data['date'].dt.round('H')
grouped_data = data.groupby('date').sum().reset_index()

# I put data as needed for prediction
grouped_data = grouped_data.set_index('date')
grouped_data = grouped_data['transactions']

# Train test split
train_data = grouped_data[ -n_datos_entrenar:-steps]
test_data  = grouped_data[-steps:]

# Define forecaster
forecaster_rf = ForecasterAutoregCustom(
                    regressor      = RandomForestRegressor(random_state=123),
                    fun_predictors = create_predictors,
                    window_size    = 20
                )

# Define grid search
param_grid = { 'n_estimators': [100, 500], 'max_depth': [3, 5, 10] }

grid_results = grid_search_forecaster(
                        forecaster  = forecaster_rf,
                        y           = train_data,
                        param_grid  = param_grid,
                        steps       = 10,
                        method      = 'cv',
                        metric      = 'mean_squared_error',
                        initial_train_size    = int(len(train_data)*0.5),
                        allow_incomplete_fold = True,
                        return_best = True,
                        verbose     = False
                    )

# Upload metadata to Neptune
for i in range(grid_results.shape[0]):

  run = neptune.init(
      project= NEPTUNE_PROJECT,
      api_token=NEPTUNE_API_KEY,
  ) 
  
  params = grid_results['params'][i]
  run["parameters"] = params
  run["mean_squared_error"] = grid_results['metric'][i]
  
  run.stop()

# Save the last training date and the best model locally
last_training_date = test_data.index[-1].strftime('%Y-%m-%d %H:%M:%S')
with open('last_training_date.pickle', 'wb') as f:
    pickle.dump(last_training_date, f)
with open(path_modelo, 'wb') as f:
    pickle.dump(forecaster_rf, f)

Tracked runs visible in the Neptune UI | Source

Now that we have built the model, let’s see how to wrap it as an API in order to serve it.

Wrapping the model as an API

Once the model has been created and in order to put it into production, we will create an API that receives the model parameters as input and returns the predictions.

Also, in addition to returning the predictions, it is important that the API saves both the input data and the predictions in a database. In this way, we can later compare the prediction with reality and thus see how our model is behaving and, in case it does not behave properly, we can retrain it.

How you wrap the model in an API will depend on the language you are using, as well as any preferences you have. If you work with R, the most common choice is the plumber library, while in Python you have several libraries such as FastAPI or Flask.

Having said that, let’s see how to approach this point in our example.

Example of wrapping the model as an API

In order to create an API with Python, I have chosen to use FastAPI as the API generation framework, since it allows the API to be created easily and also checks the data types of the inputs.

Also, in order to save the predictions of the model, I have created a table in a Postgres database. Since real data is in one table and predictions are in another, I’ve also created a view that contains:

  1. The time for which I am making the prediction.
  2. The prediction I made for that hour.
  3. The real number of transactions in that hour.
  4. The Absolute Error of that prediction.

By doing so, creating a visualization to show the performance of the model over time will be as simple as connecting to this view and showing some charts and KPIs.
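
A view like this can be sketched as follows. This is only an illustration: it uses an in-memory SQLite database as a stand-in for Postgres, the view name `prediction_errors` is hypothetical, and the tables are reduced to the columns needed here. The column names `fecha` and `mae` are chosen to match the ones queried later in the retraining check.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the Postgres database
conn.executescript("""
CREATE TABLE reality (reality_date TEXT PRIMARY KEY, reality REAL);
CREATE TABLE predictions (prediction_date TEXT PRIMARY KEY, prediccion REAL);

-- Hypothetical view: one row per predicted hour that already has real data
CREATE VIEW prediction_errors AS
SELECT p.prediction_date             AS fecha,
       p.prediccion                  AS prediction,
       r.reality                     AS reality,
       ABS(p.prediccion - r.reality) AS mae
FROM predictions AS p
JOIN reality AS r ON r.reality_date = p.prediction_date;
""")

conn.execute("INSERT INTO reality VALUES ('2022-01-01 00:00:00', 10.0)")
conn.execute("INSERT INTO predictions VALUES ('2022-01-01 00:00:00', 12.5)")
conn.execute("INSERT INTO predictions VALUES ('2022-01-01 01:00:00', 11.0)")  # no real value yet

rows = conn.execute("SELECT fecha, prediction, reality, mae FROM prediction_errors").fetchall()
print(rows)
```

Hours that have a prediction but no real value yet are left out by the inner join, so the error is only computed where a comparison is possible.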

Regarding the API, what happens inside is quite simple: you load the model, make the prediction, and insert it into the database.

However, in our case, as it is a time series model, there is an extra layer of complexity, since we may make a prediction for a date for which we already had a previous prediction. In this case, a plain insert into the database is not enough: the previous value has to be replaced.
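
This "insert or replace" behaviour is usually called an upsert. The sketch below shows the idea with an in-memory SQLite database as a stand-in for Postgres (it assumes SQLite 3.24 or newer, which supports `ON CONFLICT ... DO UPDATE`). The helper `save_prediction` is hypothetical, and the values are passed as query parameters rather than formatted into the SQL string.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the Postgres database
conn.execute(
    "CREATE TABLE predictions ("
    "timestamp TEXT, prediction_date TEXT PRIMARY KEY, prediccion REAL)"
)

UPSERT = """
INSERT INTO predictions (timestamp, prediction_date, prediccion)
VALUES (?, ?, ?)
ON CONFLICT (prediction_date) DO UPDATE
SET timestamp = excluded.timestamp,
    prediccion = excluded.prediccion
"""


def save_prediction(created_at, prediction_date, value):
    """Insert a prediction, or overwrite the one stored for the same hour."""
    with conn:  # commits on success, rolls back on error
        conn.execute(UPSERT, (created_at, prediction_date, value))


save_prediction("2022-01-01 08:00:00", "2022-01-02 00:00:00", 10.0)
save_prediction("2022-01-01 20:00:00", "2022-01-02 00:00:00", 12.0)  # newer forecast, same hour

print(conn.execute("SELECT * FROM predictions").fetchall())
```

After both calls the table still holds a single row for that hour, carrying the most recent forecast.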

So, in the following script you can find how I have approached the creation of the API:

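import os
import pickle
from datetime import datetime

import pandas as pd
import requests
from fastapi import FastAPI
from sqlalchemy import create_engine

app = FastAPI()

# Type hints let FastAPI validate and convert the query parameters
@app.post("/forecast")
def forecast(num_predictions: int = 168, return_predictions: bool = True):

    # Database connection string, provided as an environment variable
    uri = os.environ.get('URI')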
    
    # Load Files
    forecaster_rf = pickle.load(open('model.pickle', 'rb'))
    last_training_date = pickle.load(open('last_training_date.pickle', 'rb'))
    last_training_date = datetime.strptime(last_training_date, '%Y-%m-%d %H:%M:%S') 

    # I obtain the data 
    url = 'https://api.blockchain.info/charts/transactions-per-second?timespan=all&sampled=false&metadata=false&cors=true&format=json'
    resp = requests.get(url)
    data = pd.DataFrame(resp.json()['values'])

    # I correct the date
    data['x'] = [datetime.utcfromtimestamp(x).strftime('%Y-%m-%d %H:%M:%S') for x in data['x']]
    data['x'] = pd.to_datetime(data['x'])
    
    # I read the last prediction
    engine = create_engine(uri)
    query = engine.execute('SELECT MAX(prediction_date) FROM predictions;')
    last_prediction_date= query.fetchall()[0][0]
    query.close()        

    # If there are no predictions in the database, or the model was trained after
    # the last stored prediction, I forecast starting from the last training date
    if (last_prediction_date is None) or (last_prediction_date < last_training_date):

        predictions = forecaster_rf.predict(num_predictions)

        fechas = pd.date_range(
            start = last_training_date.strftime('%Y-%m-%d %H:%M:%S'),
            periods = num_predictions,
            freq = '1H'
            )

    elif last_prediction_date > last_training_date:
        # Predictions already exist beyond the training date, so I also forecast the
        # hours between both dates and keep only the last num_predictions values
        dif_seg = last_prediction_date - last_training_date
        hours_extract = num_predictions + int(dif_seg.total_seconds() // 3600)
        predictions = forecaster_rf.predict(hours_extract)
        predictions = predictions[-num_predictions:]

        fechas = pd.date_range(
            start = last_prediction_date.strftime('%Y-%m-%d %H:%M:%S'),
            periods = num_predictions,
            freq = '1H'
            )
    else:
        # The last prediction date equals the last training date
        predictions = forecaster_rf.predict(num_predictions)

        fechas = pd.date_range(
            start = last_training_date.strftime('%Y-%m-%d %H:%M:%S'),
            periods = num_predictions,
            freq = '1H'
            )
    
    upload_data = list(zip([
    datetime.now().strftime('%Y-%m-%d %H:%M:%S')] * num_predictions,
    [fecha.strftime('%Y-%m-%d %H:%M:%S') for fecha in fechas ],
        predictions
    ))

    # I insert the data
    for upload_day in upload_data:
        timestamp, fecha_pred, pred = upload_day
        pred = round(pred, 4)

        result = engine.execute(f"INSERT INTO predictions (timestamp, prediction_date,  prediccion)\
            VALUES('{timestamp}', '{fecha_pred}', '{pred}') \
            ON CONFLICT (prediction_date) DO UPDATE \
            SET timestamp = '{timestamp}', \
                prediccion = '{pred}'\
            ;")
        result.close()
    if return_predictions:
        # Plain floats so that the response can be serialized as JSON
        return [float(pred) for pred in predictions]
    else:
        return 'New data inserted'

Now that we have the API created, we enter the MLOps part, seeing how to dockerize the model to put it into production. Let’s go for it!

Dockerizing the API

Once we have created the API that returns predictions, we are going to include it in a Docker container. By doing so, we will be able to put our code into production in any environment with Docker or Kubernetes, making our code much more portable and independent.

Note: If you are new to Docker, I recommend you read this tutorial on how to use Docker for Data Science. On the other hand, if you already have some experience, surely this post on Docker best practices will interest you.

Thus, to Dockerize our API we have to create a Dockerfile since it is the file that tells Docker how to build the image. In this sense, it is important that the Dockerfile contains the following points:

  1. Installation of the programming language and API framework that we use.
  2. Installation of the necessary libraries to execute our code correctly.
  3. Copy the API, the model and all the necessary files for the API to run correctly.
  4. Execute the API in the port that we want.

Once we have created our Dockerfile, it is important that we verify that it works correctly. To do this, we must execute the following commands:

cd <folder_where_Dockerfile_is_located>
docker build -t <image_name> .
docker run -p <port:port> <image_name>

After that, we can access our port and check if the API works correctly or not. If it does, we can move on to the next step: putting our model into production. 
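
That check can also be scripted. The sketch below is a minimal smoke test using only the standard library. It assumes the container exposes the /forecast endpoint of the API above and that the endpoint answers with a JSON list containing one value per requested hour; the function names and the local port in the usage comment are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_forecast_request(base_url, num_predictions=24):
    """Build a POST request for the /forecast endpoint with query parameters."""
    query = urlencode({"num_predictions": num_predictions, "return_predictions": "true"})
    return Request(f"{base_url.rstrip('/')}/forecast?{query}", data=b"", method="POST")


def smoke_test(base_url, num_predictions=24, timeout=60):
    """Return True if the API answers 200 with one value per requested hour."""
    request = build_forecast_request(base_url, num_predictions)
    with urlopen(request, timeout=timeout) as resp:
        body = json.loads(resp.read().decode("utf-8"))
        return resp.status == 200 and len(body) == num_predictions


# Example, with the container listening on a hypothetical local port:
# smoke_test("http://localhost:8080")
```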

Setting up a continuous deployment pipeline with Cloud and GitHub

This is where we start the MLOps process with GitHub Actions. The idea is to connect our GitHub repository with our Cloud provider so that, every time we push to our repo in GitHub, the Docker image gets built and deployed in the cloud service that we want. And, as you might imagine, everything will happen automatically.

In other words, we will no longer have to deploy our code manually in our Cloud environment, but rather this will be done automatically every time we push to our repository. Isn’t it cool?

The good thing is that for this, the process is always the same:

  1. Connect our Cloud service with GitHub, in such a way that, with each push, the Docker image is uploaded to the Container Registry of our Cloud service.
  2. Create a file that tells our Cloud service what steps to follow with that Docker image.

In the case of Google Cloud, we will use the following tools:

  • Google Cloud Build: service used to build docker images.
  • Container Registry: service to store containers.
  • Cloud Run: service that deploys containerized APIs to a service that scales down to 0 automatically, i.e. if the service does not receive any requests, no virtual machine will be running. This means you will only pay when you make requests to services deployed in Cloud Run.

The main idea is to connect Cloud Build with GitHub so that every time there is a push to the repo, Cloud Build builds the image. Then, run a script that pushes that image to the container registry and deploys it from the container registry to Cloud Run.

MLOps with GitHub Actions: deployment workflow | Source: Author

In order to do this, we must follow these steps:

  1. Access the Triggers section of the Cloud Build service.
  2. At the bottom of the page, click the Create trigger button.
  3. In the Event section, choose what should fire the trigger. In our case we leave the default value: Push to a branch.
  4. In Source, under Repository, click the “Connect new Repository” button and select GitHub as your repository.
  5. A window will prompt you to authorize the Cloud Build application on your GitHub account. Authorize it and install the application in a specific repo or in all repositories.
  6. Select the repository for the MLOps process. Build to the repository from the Cloud Build triggers menu.
  7. Select the Cloud Build configuration mode. In this case, we must choose the Cloud Build configuration file option.
  8. Finally, choose a service account and click the Create button.

Perfect, we have now connected our GitHub repository with Google Cloud Build. The trigger will fire every time we push to the repository.

However, we still have to define one important and often overlooked thing: the Cloud Build configuration file. Let’s see how to do it.

Specifying Cloud Build configuration file

The Cloud Build configuration file is the file that tells Cloud Build what cloud commands to execute each time the trigger we have indicated is fired.

In our case we want only three commands to be executed:

  1. Creation of the Docker image using the files from our GitHub repository.
  2. Uploading the Docker image to the Google Cloud Container Registry.
  3. Deployment of the docker image in a Google Cloud service, either Kubernetes or Cloud Run (in our case we will use the latter).

We will define all this in a .yaml file, indicating each of these points as a step to execute. In the following code we see the .yaml file that I used in the example of Bitcoin transactions:

steps:
- name: 'gcr.io/cloud-builders/docker'
  args: ['build', '-t', 'gcr.io/mlops-example/github.com/anderdecidata/mlops-example:$SHORT_SHA', '.']
- name: 'gcr.io/cloud-builders/docker'
  args: ['push', 'gcr.io/mlops-example/github.com/anderdecidata/mlops-example:$SHORT_SHA']
- name: 'gcr.io/cloud-builders/gcloud'
  args: ['beta', 'run', 'deploy', 'mlops-example', '--image=gcr.io/mlops-example/github.com/anderdecidata/mlops-example:$SHORT_SHA', '--region=europe-west1', '--platform=managed']

Also, the good thing is that this process works for all of the three main Cloud environments: AWS, Azure and Google Cloud. Although we have only discussed Google Cloud, here is a brief explanation of how it would be done with other services:

  • MLOps workflow with GitHub and Azure: you will need to create a GitHub Action that runs with every push, logs in with the Azure CLI, builds and pushes the image, and then deploys it. You can find an example here.
  • MLOps workflow with GitHub and AWS: similar to Azure, you will need to create a GitHub Action that runs with every push, logs in to AWS ECR and pushes the image. You can find an example here.

With this, we have created our MLOps pipeline with GitHub Actions. In this way, every time we push a new model to our GitHub, it will be put into production automatically.

However, this is not exactly ideal, as we will need to manually retrain the model periodically. So, let’s see how to take our MLOps GitHub pipeline to the next level. 

Automating model retrain with GitHub Actions

With everything we have seen so far, to put a retrained model into production we will simply have to create a script that:

  1. Runs the model training file.
  2. Pushes the new model to our GitHub repository.

This way, when the push is made, Google Cloud Build will detect it automatically and use the new model to build the image, upload it to the Cloud Container Registry, and finally deploy it to Cloud Run.

As you might have guessed, we can automate the execution of the model retraining using GitHub Actions. We will simply have to create a GitHub Action that periodically executes the training file and pushes the new files to GitHub.

However, this is not ideal at all. Why should we retrain a model if the predictions are good enough? And why should we wait until the next scheduled retrain if the current model is returning bad predictions?

Thus, if we want to go one step further we can retrain the model when the predictive capacity of the model is not good. To do so we will only have to:

  1. Make the model retrain workflow trigger with an http request.
  2. Create a script that checks the predictive capacity of the model. In the event that the predictive capacity of the model is less than the threshold we have set, the script will execute the retraining workflow through an http call to it.

The latter is how I have it set up in the MLOps workflow with GitHub for predicting the number of Bitcoin transactions. With the following code we can check whether the MAE exceeds a specific threshold and, if it does, launch the retraining workflow:

# Objective: check if the model should be retrained or not
# imports
import pandas as pd
from sqlalchemy import create_engine
from datetime import datetime, timedelta
import requests
import json
import os

# setting variables
user = 'anderDecidata'
repo = 'Ejemplo-MLOps'
event_type = 'execute-retrain'
GITHUB_TOKEN = os.environ.get('TOKEN')
uri = os.environ.get('URI')
max_mae = 6
n_observations_analyze = 48

# N days to substract
days_substract = round(n_observations_analyze/24)

# creating the engine
engine = create_engine(uri)

# getting the largest date
resp = engine.execute('SELECT MAX(fecha) FROM tablon;')
largest_date = resp.fetchall()
resp.close()

# calculating the initial date
initial_date = largest_date[0][0] - timedelta(days = days_substract)

# getting the data from the initial date
resp = engine.execute(f"SELECT * FROM tablon WHERE fecha >'{initial_date}';")
data = resp.fetchall()
colnames = resp.keys()
resp.close()

# converting it to Data Frame
data = pd.DataFrame(data, columns=colnames)

# getting mean MAE
print(data['mae'].mean())

# If mean mae exceeds the threshold, call Github Actions to retrain the model
if data['mae'].mean() > max_mae:
    url = f'https://api.github.com/repos/{user}/{repo}/dispatches'
    resp = requests.post(url, headers={'Authorization': f'token {GITHUB_TOKEN}'}, data = json.dumps({'event_type': event_type}))

Besides, we can automate the execution of this script with the following GitHub Action:

name: Check retrain

on:
  schedule:
    - cron: '0 0/2 * * *' #Execute every 2 hours 
  workflow_dispatch: 

jobs:
  build:
    runs-on: ubuntu-latest
    steps:

      - name: Checkout repo
        uses: actions/checkout@v2 
    
      - name: Configure Python
        uses: actions/setup-python@v2
        with:
          python-version: '3.9.7' 
      
      - name: Install libraries
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements.txt
          
      - name: Execute python script 
        env:
          URI: ${{ secrets.URI }}
          TOKEN: ${{secrets.TOKEN}}
        run: python check_retrain.py

Conclusion 

As you may have seen during this tutorial, an MLOps workflow can be created with GitHub Actions in a relatively simple way, just by adding a few tools to our current work tools.

In my opinion, performing MLOps with GitHub is a very good way to do the continuous deployment of models, especially in those organizations where Machine Learning and advanced analytics are not emphasized enough or do not have many data-specific tools.

I hope that the tutorial has helped you to gain knowledge about how you can build an MLOps pipeline with GitHub. If you have never done one, I personally recommend you create it, as it is a very good way to learn.

Besides, if you would like to learn tools other than GitHub Actions to build MLOps pipelines, I would encourage you to learn about CircleCI or GitLab CI. These two tools are alternatives that companies use instead of GitHub Actions, but the way of building MLOps pipelines is the same as the one explained in this post.
