Serving Machine Learning Models With Docker: 5 Mistakes You Should Avoid

Akinwande Komolafe

11 min

27th July, 2023

As you may already know, Docker is a tool that lets you create and deploy isolated environments, called containers, for running your applications along with their dependencies. Before we get to the main topic, let us briefly brush up on some basic Docker concepts.

Why should data scientists containerize ML models?

Have you ever trained a machine learning model and shared your code with a colleague, only to find out that it keeps breaking on their machine even though it works just fine on your laptop? Most of the time, the cause is a package compatibility issue or an environment issue. A good solution to this problem is to use containers.

Containers offer:

Reproducibility – by containerizing your machine learning models, you can ship your code to any other system where Docker is installed and expect your application to give you the same results as it did when you tested it locally.

Collaborative development – containerized machine learning models let teammates work in the same environment, which also makes version control easier.

Using Docker to serve your machine learning models

Now that you know why you need to containerize your machine learning models, the next thing is to understand how you can containerize your models.

Some Docker-related terms that you may already know and will come across in this article:

Dockerfile: you can think of a Dockerfile as a description of how you would like to set up the system you want to run. It contains all the instructions needed to build a Docker image, from pulling the base image to setting up the environment.

Docker image: it is a read-only template containing a list of instructions for creating a Docker container.

Docker container: a container is a runnable instance of a Docker image.

*Basic Docker commands | Source: Author*

When creating your Dockerfile, there are some best practices to consider, such as avoiding unnecessary libraries or packages when building your Docker image and reducing the number of layers in your Dockerfile. Check out the article below for best practices while using Docker.

How to serve machine learning models?

The core idea of model serving is to host machine learning models (on-premises or in the cloud) and make their functionality available through an API so that companies can integrate AI into their systems.

There are two sorts of model serving in general: batch and online.

Batch prediction means that the input to your model is a large quantity of data, often processed as a scheduled operation, and the predictions can be published as a table.

Online deployment entails deploying the model with an endpoint so that applications can submit requests to the model and receive a quick response with minimal latency.

Important requirements to consider when serving ML models

Traffic management

Depending on the destination service, requests at an endpoint take different paths. To process requests concurrently, traffic management may also deploy a load-balancing feature.

Monitoring

It is important to monitor machine learning models deployed in production. By monitoring ML models, we can detect when the performance of a model deteriorates and when it needs to be retrained. A machine learning life cycle is incomplete without model monitoring.
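As a minimal sketch of the idea, the snippet below tracks accuracy over a rolling window of labelled predictions and flags when it drops below a threshold. The class name, window size, and threshold are illustrative assumptions, not part of any monitoring tool mentioned in this article.

```python
from collections import deque


class RollingAccuracyMonitor:
    """Track accuracy over the most recent labelled predictions (hypothetical helper)."""

    def __init__(self, window_size=100, threshold=0.9):
        self.threshold = threshold
        # Each entry is True when the prediction matched the actual label.
        self.outcomes = deque(maxlen=window_size)

    def record(self, predicted, actual):
        self.outcomes.append(predicted == actual)

    def accuracy(self):
        if not self.outcomes:
            return None  # nothing observed yet
        return sum(self.outcomes) / len(self.outcomes)

    def needs_retraining(self):
        acc = self.accuracy()
        return acc is not None and acc < self.threshold


monitor = RollingAccuracyMonitor(window_size=4, threshold=0.75)
for predicted, actual in [(1, 1), (0, 0), (1, 0), (0, 1)]:
    monitor.record(predicted, actual)

print(monitor.accuracy())          # 0.5
print(monitor.needs_retraining())  # True
```

In a real deployment the labels usually arrive later than the predictions, so a check like this would run on whatever ground truth becomes available.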

Data preprocessing

For real-time serving, machine learning models require their inputs to be in a suitable format. There should be a dedicated transformation service for data preprocessing.

There are different tools you can use to serve your machine learning models in production. You can check out this article for a comprehensive guide on the different machine learning tools/platforms you can use for model serving.

Best Tools to Do ML Model Serving

Mistakes you should avoid when serving your machine learning models with Docker

Now you understand what model serving means and how you can serve your models using Docker. It is important to know what to do and what not to do when serving your machine learning models with Docker.

Operational mistakes are the most common mistakes data scientists make when deploying their machine learning models with Docker. These mistakes often lead to poor service performance. An ML application is measured by its overall service performance: it should have low inference latency, low service latency, and a good monitoring architecture.

Mistake 1: Using REST API instead of gRPC when serving machine learning models with TensorFlow Serving and Docker

TensorFlow Serving was developed by Google and provides an easier way of deploying your models and running experiments.

To learn more about how to serve your ML models using TensorFlow Serving with Docker, check out this post.

When serving your machine learning models with TensorFlow Serving, you need to understand the different types of endpoints TensorFlow Serving offers and when to use them.

gRPC and REST API endpoints

gRPC

gRPC is a remote procedure call (RPC) framework created by Google. It uses Protocol Buffers as its messaging format, a compact and highly efficient way of serializing structured data. With pluggable support for load balancing, tracing, health checking, and authentication, it can efficiently connect services within and across data centers.

REST

Most web applications use REST to communicate. It describes how clients communicate with web services. Although REST remains a great way to exchange data between clients and servers, it has drawbacks in speed and scalability.

Differences between gRPC and REST API

gRPC and REST APIs differ in how they operate. This table compares the characteristics of both:

Characteristic	gRPC	REST
HTTP protocol	HTTP/2	HTTP/1.1 (typically)
Messaging format	Protobuf (protocol buffers)	JSON
Communication	Bi-directional streaming	Request-response

As illustrated in the diagram below, most serving API requests arrive using REST. The preprocessing and postprocessing steps occur inside the API, which then sends the preprocessed data to TensorFlow Serving for predictions using either the RESTful API or the gRPC API.

*How to use gRPC for model serving | Source: Author*

Most data scientists use the REST API for model serving. However, it has its shortcomings, the major ones being speed and scalability. The time it takes for your model to make a prediction after being fed input is referred to as ML inference latency. To improve the user experience of your application, it is essential that your ML service returns predictions quickly.
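One way to make latency concrete is to time repeated calls to your prediction endpoint. The sketch below is only an illustration: `measure_latency` and `fake_predict` are hypothetical names, and `fake_predict` stands in for a real REST or gRPC call to the model server.

```python
import statistics
import time


def measure_latency(predict_fn, payload, n_requests=50):
    """Call predict_fn repeatedly and return latency statistics in milliseconds."""
    timings_ms = []
    for _ in range(n_requests):
        start = time.perf_counter()
        predict_fn(payload)
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    timings_ms.sort()
    # Index of the 95th percentile in the sorted timings.
    p95_index = min(len(timings_ms) - 1, int(0.95 * len(timings_ms)))
    return {
        "median_ms": statistics.median(timings_ms),
        "p95_ms": timings_ms[p95_index],
        "max_ms": timings_ms[-1],
    }


def fake_predict(payload):
    # Hypothetical stand-in for a request to the model server.
    return sum(payload)


stats = measure_latency(fake_predict, list(range(1000)), n_requests=20)
print(stats)
```

Running the same measurement against the REST and gRPC endpoints of your own service, with the same payload, gives a like-for-like comparison.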

For small payloads, either API yields similar performance. For computer vision tasks like image classification and object detection, however, AWS SageMaker demonstrated that using gRPC inside a Docker endpoint reduces overall latency by 75% or more.
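Part of the difference comes from payload size. The rough sketch below compares a JSON-encoded image with the same values packed as 4-byte floats. Note the assumptions: Python's `struct` module is used here as a stand-in for a binary encoding, it is not Protocol Buffers itself, and the pixel values are synthetic.

```python
import json
import struct

# A 28x28 grayscale image with float pixel values, similar in size to one MNIST input.
pixels = [((i * 37) % 256) / 255.0 for i in range(28 * 28)]

# Text encoding, as in a JSON request body.
json_payload = json.dumps({"instances": [pixels]}).encode("utf-8")

# Binary encoding: 4 bytes per value (little-endian float32).
binary_payload = struct.pack(f"<{len(pixels)}f", *pixels)

print(len(json_payload), "bytes as JSON")
print(len(binary_payload), "bytes as packed float32")
```

The gap grows with the size of the input, which is consistent with the observation that small payloads behave similarly over either API.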

Deploying your machine learning model using gRPC API with Docker

Step 1: Ensure Docker is installed on your PC

Step 2: To use TensorFlow Serving, you need to pull the TensorFlow Serving image from the container repository.

docker pull tensorflow/serving

Step 3: Build and train a simple model

import matplotlib.pyplot as plt
import time
from numpy import asarray
from numpy import unique
from numpy import argmax
from tensorflow.keras.datasets.mnist import load_data
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Conv2D
from tensorflow.keras.layers import MaxPool2D
from tensorflow.keras.layers import Flatten
from tensorflow.keras.layers import Dropout

#load MNIST dataset
(x_train, y_train), (x_test, y_test) = load_data()
print(f'Train: X={x_train.shape}, y={y_train.shape}')
print(f'Test: X={x_test.shape}, y={y_test.shape}')

# reshape data to have a single channel
x_train = x_train.reshape((x_train.shape[0], x_train.shape[1], x_train.shape[2], 1))
x_test = x_test.reshape((x_test.shape[0], x_test.shape[1], x_test.shape[2], 1))

# normalize pixel values
x_train = x_train.astype('float32') / 255.0
x_test = x_test.astype('float32') / 255.0

# set input image shape
input_shape = x_train.shape[1:]

# set number of classes
n_classes = len(unique(y_train))

# define model
model = Sequential()
model.add(Conv2D(64, (3,3), activation='relu', input_shape=input_shape))
model.add(MaxPool2D((2, 2)))
model.add(Conv2D(32, (3,3), activation='relu'))
model.add(MaxPool2D((2, 2)))
model.add(Flatten())
model.add(Dense(50, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(n_classes, activation='softmax'))

# define loss and optimizer
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# fit the model
model.fit(x_train, y_train, epochs=10, batch_size=128, verbose=1)

# evaluate the model
loss, acc = model.evaluate(x_test, y_test, verbose=0)
print('Accuracy: %.3f' % acc)

Step 4: Save the Model

When saving your TensorFlow model, you can save it as a protocol buffer file by passing “tf” in the save_format argument.

# assumption: ts is a timestamp used as the model version directory
ts = int(time.time())
file_path = f"./img_classifier/{ts}/"
model.save(filepath=file_path, save_format='tf')

Saved models can be inspected using the saved_model_cli command.

!saved_model_cli show --dir {file_path} --all

Step 5: Serving the model using gRPC

You need to install the gRPC client libraries (the grpcio and tensorflow-serving-api packages).

import grpc
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2
from tensorflow_serving.apis import prediction_service_pb2_grpc
from tensorboard.compat.proto import types_pb2

You need to establish a channel between the client and server using port 8500.

channel = grpc.insecure_channel('127.0.0.1:8500')
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

The request payload for the server needs to be set as a protocol buffer by specifying the name of the model, the name of the input tensor, the data type to expect, and the shape of the data.

request = predict_pb2.PredictRequest()
request.model_spec.name = 'mnist-model'
# the input key must match the name shown by saved_model_cli for your model;
# 'conv2d_input' is assumed here because the first layer is a Conv2D
# the shape includes a leading batch dimension of 1
request.inputs['conv2d_input'].CopyFrom(tf.make_tensor_proto(x_test[0], dtype=types_pb2.DT_FLOAT, shape=[1, 28, 28, 1]))

Finally, to deploy your model with Docker, you need to run the Docker container.

docker run -p 8500:8500 --mount type=bind,source=<absolute_path>,target=/models/mnist-model/ -e MODEL_NAME=mnist-model -t tensorflow/serving

The server can now accept client requests. Call the Predict method from the stub to predict the outcome of the request. The second argument is the timeout in seconds.

stub.Predict(request, 10.0)

Following the steps above, you will be able to serve your TensorFlow model with TensorFlow Serving over the gRPC API.

Mistake 2: Preprocessing your data when serving your machine learning models with Docker

Another mistake developers make when serving their machine learning models using Docker is preprocessing their data in real-time before making predictions. Before an ML model provides a prediction, it expects that the data points must include all the input features that were used while training the algorithm.

For example, if you train a linear regression algorithm to estimate the price of a house based on its size, location, age, number of rooms, and orientation, the trained model will require those features’ values as inputs during inference in order to provide an estimated price.

In most cases, the input data needs to be preprocessed and cleaned, and some features even have to be engineered. Now imagine doing this in real-time every time your model endpoint is triggered. The result is repetitive preprocessing for some features, especially static ones, and high model latency. In this case, a feature store proves to be an invaluable resource.

What is a feature store?

A feature store is a storage layer for storing and serving features across several pipeline branches, which enables shared computation and optimizations.

Importance of using feature stores when serving ML models in Docker

Data scientists can use feature stores to streamline the way features are maintained, paving the way for more efficient processes while ensuring that features are properly stored, documented, and tested.

The same features are used in many projects and research assignments across a company. Data scientists can use a feature store to quickly access the features they require and avoid doing repetitive work.

When serving your machine learning model, in order to invoke the model for prediction, two types of input features are fetched in real-time:

Static reference: These feature values are either static or gradually changing attributes of the entity for which a prediction is required. This includes descriptive attributes such as customer demographic information. It also includes customer purchase behavior like how much they spend, how often they spend, etc.

Real-time dynamic features: These feature values are captured and computed dynamically based on real-time events. These features are calculated in real-time, usually in an event-stream processing pipeline.

A feature serving API makes feature data available to models in production. The serving API is designed for low-latency access to the most recent feature values. To understand feature stores better and to learn about the different feature stores available, check out this article: Feature Stores: Components of a Data Science Factory.
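The split between static and real-time features can be sketched with a toy in-memory store. This is only an illustration under stated assumptions: `InMemoryFeatureStore`, `build_feature_vector`, and the feature names are hypothetical, and a production feature store would sit behind a low-latency serving API rather than a Python dict.

```python
class InMemoryFeatureStore:
    """Toy feature store: precomputed static features keyed by entity id."""

    def __init__(self):
        self._static = {}

    def put_static(self, entity_id, features):
        self._static[entity_id] = dict(features)

    def get_static(self, entity_id):
        # Return a copy so callers cannot mutate the stored features.
        return dict(self._static[entity_id])


def build_feature_vector(store, entity_id, event):
    """Merge looked-up static features with features computed from the live event."""
    features = store.get_static(entity_id)
    features["basket_value"] = sum(event["item_prices"])
    features["n_items"] = len(event["item_prices"])
    return features


store = InMemoryFeatureStore()
# Static features are computed once, ahead of time, not on every request.
store.put_static("customer-42", {"age": 31, "avg_monthly_spend": 120.5})

vector = build_feature_vector(store, "customer-42", {"item_prices": [10.0, 5.5]})
print(vector)
```

Only the two event-based features are computed at request time. Everything else is a lookup, which is what keeps inference latency low.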

Mistake 3: Using IP addresses to communicate between Docker containers

You have finally deployed your machine learning model with Docker, and your application is returning predictions in a production environment. Then, for some reason, you need to update the container. After making the necessary changes and restarting your containerized application, you keep getting “Error: connect ECONNREFUSED”.

Your application cannot establish a connection to the database even though it was working perfectly fine before. Each container has its own internal IP address, which can change whenever the container is restarted. The mistake data scientists make is relying on Docker's default bridge network to communicate between containers. Containers on the default bridge network can reach each other only through IP addresses. Because those addresses can change, this is not a reliable approach.

How do you communicate between Docker containers without using an IP address?

To communicate between containers, you should use environment variables to pass the host name instead of the IP address. You can do this by creating a user-defined bridge network, on which Docker resolves container names to their current addresses.

*How to create a user-defined bridge network | Source*

You will need to create your own custom bridge network. You can do this by running the docker network create command. Here we create a network with the name “dummy-network”.

docker network create dummy-network

Run your container as usual with the docker run command. Add it to your user-defined bridge network with the --net option. The --name option gives the container a name that other containers on the network can use as its host name.

docker run --rm --net dummy-network --name tulipnginx -d nginx

Connect another container to the custom bridge network you created.

docker run --net dummy-network -it busybox

Now you can connect to any container using the container host names provided they are on the same custom bridge network, without worrying about restarts.
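Inside the application, the host name can then be read from the environment instead of being hard-coded as an IP address. The following is a minimal sketch: the variable names `NGINX_HOST` and `NGINX_PORT` are hypothetical, and the default host is the container name used in the example above.

```python
import os


def service_url(env=None, default_host="tulipnginx", default_port="80"):
    """Build a service URL from a host name supplied through the environment."""
    env = os.environ if env is None else env
    host = env.get("NGINX_HOST", default_host)  # hypothetical variable name
    port = env.get("NGINX_PORT", default_port)  # hypothetical variable name
    return f"http://{host}:{port}/"


print(service_url({}))
print(service_url({"NGINX_HOST": "staging-nginx", "NGINX_PORT": "8080"}))
```

Variables like these could be supplied with the -e option of docker run, so the same image works unchanged across environments.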

Mistake 4: Running your processes as root users

A lot of data scientists make the mistake of running their processes as root users. I will explain why it is wrong and recommend solutions. When designing systems, it is important to adhere to the principle of least privilege. This means that an application should only have access to the resources it requires to complete its task. Granting only the privileges a process needs to execute is one of the best strategies to protect yourself against unexpected intrusion.

Because the majority of containerized processes are application services, they do not require root access. Containers do not require root to run, even though the Docker daemon itself traditionally does. Docker images that are well-written, safe, and reusable should not expect to be run as root and should provide a predictable and simple way to limit access.

By default, the processes in your containers run as the root user. I also made the mistake of always running my processes as root or always using sudo to get things done. But I have learned that having more permissions than required can lead to catastrophic issues.

Let me demonstrate this via an example. This is a sample dockerfile I used for a project in the past.

FROM tiangolo/uvicorn-gunicorn:python3.9

RUN mkdir /fastapi

WORKDIR /fastapi

COPY requirements.txt /fastapi

RUN pip install -r /fastapi/requirements.txt

COPY . /fastapi

EXPOSE 8000

CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]

The first thing is to build a Docker image and run the Docker container. You can do this with these commands:

docker build -t getting-started .

docker run -d -p 8000:8000 getting-started

Next, obtain the container ID by listing your running containers with docker ps. You can then run the whoami command inside the container (for example, docker exec <container-id> whoami) to see which user the process runs as.

Running your processes as root users — *Source: Author*

If the application has a vulnerability, an attacker can get root access to the container. With root privileges inside the container, the attacker can do whatever they want: not only interfere with the program but also install extra tools that can be used to pivot to other devices or containers.

How to run Docker as non-root user

Using a dockerfile:

##########################################
# Dockerfile to change from root to
# non-root privilege
###########################################

FROM debian:stretch

# You can add a new user "User-tesla" with user id 1099
RUN useradd -u 1099 user-tesla
# Change to non-root privilege
USER user-tesla

As a container user, the level of support for changing users is at the mercy of the container maintainers. Docker lets you change the user with the --user parameter (or the user key in docker-compose.yml). The user ID that the process should run as is supplied as an argument. This limits any unwanted access.
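As an extra safeguard, the service itself can refuse to start with root privileges. This is a sketch only: `running_as_root` and `ensure_not_root` are hypothetical helper names, and the check relies on `os.geteuid`, which is available on Unix-like systems such as Linux containers.

```python
import os


def running_as_root(euid=None):
    """Return True when the effective user id is 0 (root)."""
    if euid is None:
        euid = os.geteuid()  # Unix-only; fine inside Linux containers
    return euid == 0


def ensure_not_root(euid=None):
    """Raise an error instead of starting the service with root privileges."""
    if running_as_root(euid):
        raise PermissionError(
            "refusing to start as root; set USER in the Dockerfile or pass --user"
        )
    return True


# The explicit ids below are for illustration; call ensure_not_root() at startup.
print(running_as_root(1099))  # False
print(running_as_root(0))     # True
```

A check like this complements the USER instruction and the --user option; it does not replace them.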

Mistake 5: Not monitoring model versions when serving ML models with Docker

One operational mistake data scientists make is not tracking changes or updates made to an ML system before deploying it to production. Model versioning helps ML engineers understand what has changed in the model, what features have been updated by researchers, and how features have changed. Knowing what changes have been made, and how they could affect the speed and ease of deployment, matters most when integrating multiple features.

Advantages of model versioning

Model versioning helps track the different model files that you have previously deployed to production, and by doing so, you can achieve:

Model lineage traceability: if a recently deployed model was performing poorly in production, you could redeploy a previous version of the model that was performing better.

Model registry: tools like neptune.ai and MLflow can serve as a model registry, making it easy for you to log your model files. Whenever you need the model for serving, you can fetch the model and the specific version.
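To illustrate what these two benefits mean in practice, here is a toy in-memory registry that remembers every version and lets you roll back to the previously deployed one. It is a hypothetical sketch, not the API of neptune.ai or MLflow, and the version ids and accuracy values are made up.

```python
class ModelRegistry:
    """Toy in-memory registry that remembers every version and which one is live."""

    def __init__(self):
        self.versions = {}  # version id -> metadata
        self.history = []   # order in which versions were promoted to production

    def register(self, version_id, metadata):
        self.versions[version_id] = metadata

    def promote(self, version_id):
        if version_id not in self.versions:
            raise KeyError(f"unknown model version: {version_id}")
        self.history.append(version_id)

    def production(self):
        return self.history[-1] if self.history else None

    def rollback(self):
        """Return to the previously promoted version."""
        if len(self.history) < 2:
            raise RuntimeError("no earlier production version to roll back to")
        self.history.pop()
        return self.history[-1]


registry = ModelRegistry()
registry.register("v1", {"accuracy": 0.93})
registry.register("v2", {"accuracy": 0.88})
registry.promote("v1")
registry.promote("v2")

# v2 performs worse in production, so go back to v1.
print(registry.rollback())  # v1
```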

Using neptune.ai for model versioning and deploying with Docker

neptune.ai allows you to keep track of your experiments, hyperparameter values, the dataset used for a particular experiment run, and the model artifacts. neptune.ai provides a Python SDK that you can use when building your machine learning models.

The first step is ensuring you have the Neptune python client installed. Depending on your operating system, open your terminal and run this command:

pip install neptune

After training your model, you can register it in Neptune to track any related metadata. First, you need to initialize a Neptune model object. The model object is appropriate for holding generic metadata shared by all model versions during the training process.

import neptune
model = neptune.init_model(project='<project name>',
    name="<MODEL_NAME>",
    key="<MODEL>",
    api_token="<token>"
)

This will generate a URL to the Neptune dashboard where you can see the different models you have created. Check out the workspace.

How to create a model version in neptune.ai — *ML model logged in neptune.ai | Source*

To create a model version in Neptune, you need to have registered your model in the same Neptune project, and you can find your models under the Models tab on the dashboard.

To create a model version on Neptune, you need to run this code:

import neptune
model_version = neptune.init_model_version(
    model="MODEL_ID",
)

The next thing is storing any associated model metadata and artifacts, and you can do this by assigning them to the model object you created. To understand how you can log your model metadata, check out this documentation page.

Now you can see the different versions of the model you have created, the associated metadata for each model version, and the model metrics. You can also manage the model stages for each model version. From the image above, DOC-MODEL-1 has been deployed to production. This way, you can see what model version is currently deployed to production and the associated metadata for such model.

When building your machine learning model, you should not store associated metadata like hyperparameters, comments, and config data as files in the Docker container. When a container is stopped, destroyed, and replaced, you can lose all the associated data in the container. Using the Neptune client, you can log and store all your associated metadata for every run.

How to monitor model versions when serving in Docker with Neptune

Since Neptune manages your data by creating models, creating model versions, and managing model staging transitions, you can query and download your stored models, using Neptune as a model registry.

Create a new script for serving and import the necessary dependencies. All you need to do is specify the model version you want to serve in production. You can run your Docker container by passing your NEPTUNE_API_TOKEN and your MODEL_VERSION as Docker environment variables:

import os
import pickle

import neptune
import requests

api_token = os.environ['NEPTUNE_API_TOKEN']
model_version_id = os.environ['MODEL_VERSION']


def load_pickle(fp):

   """
   Load pickle file(data, model or pipeline object).
   Parameters:
       fp: the file path of the pickle files.

   Returns:
       Loaded pickle file
   """
   with open(fp, 'rb') as f:
       return pickle.load(f)

def predict(data):
   #####
   # fetch the input data; convert the response into the format your model expects
   input_data = requests.get(data)
   #####
   # connect to the model version stored in Neptune
   model_version = neptune.init_model_version(
       with_id=model_version_id,
       project='docker-demo',
       api_token=api_token
   )
   # download the pickled model to a known file name, then load it
   model_version['classifier']['pickled_model'].download('xgb-model.pkl')
   model = load_pickle('xgb-model.pkl')
   predictions = model.predict(input_data)
   return predictions

You can containerize your machine learning model service using Docker by creating a Dockerfile and providing a list of dependencies on your requirements.txt file.

neptune
requests
scikit-learn==1.0.2

# syntax=docker/dockerfile:1
FROM python:3.8-slim-buster

RUN apt-get update
RUN apt-get -y install gcc

COPY requirements.txt requirements.txt
RUN pip3 install -r requirements.txt

COPY . .
CMD ["python3", "-W", "ignore", "src/serving.py"]

To build a Docker image from the dockerfile above, you need to run this command:

docker build --tag <image-name> . # image-name: neptune-docker

docker run -e NEPTUNE_API_TOKEN="<YOUR_API_TOKEN>" -e MODEL_VERSION="<YOUR_MODEL_VERSION>" <image-name>

There are several alternatives for managing data in a Docker container. For example, you can bind-mount directories during development, which is a good option for debugging your code. You can do this by running this command:

docker run -it -v /home/<user>/my_code:/code <image-name>:<image-version>

You may now debug and execute the code in the container at the same time, and the changes will be mirrored on the host. This brings us back to the advantage of utilizing the same host user ID and group ID across your container. All of the modifications you make will appear to have come from the host’s user.

To start up your Docker containers, you will need to run this command:

docker run -d -e NEPTUNE_API_TOKEN="<YOUR_API_TOKEN>" -e MODEL_VERSION="<YOUR_MODEL_VERSION>" <image-name>

The -d option specifies that the container should be launched in detached mode, in the background.
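If you launch containers from a script, it helps to assemble the command programmatically so that every option lands before the image name and environment variables are written without spaces around the equals sign. The sketch below only builds and prints the command. `docker_run_command` is a hypothetical helper, and the host path is an example value.

```python
import shlex


def docker_run_command(image, env=None, volumes=None, detach=False):
    """Assemble a `docker run` command with every option placed before the image name."""
    args = ["docker", "run"]
    if detach:
        args.append("-d")
    for key, value in (env or {}).items():
        args += ["-e", f"{key}={value}"]  # no spaces around '='
    for host_path, container_path in (volumes or {}).items():
        args += ["-v", f"{host_path}:{container_path}"]
    args.append(image)  # the image name comes last, after all options
    return args


command = docker_run_command(
    "neptune-docker",
    env={"MODEL_VERSION": "DOC-MODEL-1"},
    volumes={"/home/me/my_code": "/code"},  # example host path
    detach=True,
)
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run` as is. Secrets such as the API token are better read from the environment than written into a script.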

Final thoughts

Reproducibility and collaborative development are the most important reasons why data scientists should deploy their models with Docker containers. TensorFlow Serving is one of the popular tools for model serving, and you can extend it to serve other types of models and data. Also, when serving machine learning models with TensorFlow Serving, you need to understand the different client APIs and choose the one most suitable for your use case.

Docker is a good tool for deploying and serving models in production. Nonetheless, it is crucial to identify the mistakes many data scientists make and avoid making similar mistakes.

The mistakes data scientists make when serving machine learning models with Docker revolve around model latency, application security, and monitoring. Model latency and model management are significant parts of your ML system. A good ML application should return predictions quickly when it receives a request. By avoiding these mistakes, you should be able to deploy a working ML system with Docker effectively.