Machine learning (ML) has the potential to greatly improve businesses, but this can only happen when models are put in production and users can interact with them.
Global companies like Amazon, Microsoft, Google, Apple, and Facebook have hundreds of ML models in production. From better search to recommendation engines, and even a 40% reduction in data centre cooling costs, these companies have come to rely on ML for many key aspects of their business. Putting models in production is not an easy feat, and while the process is similar to traditional software deployment, it has some subtle differences, like model retraining, data skew, and data drift, that should be taken into consideration.
The process of putting ML models in production is not a single task, but a combination of numerous sub-tasks, each important in its own right. One such sub-task is model serving.
“Model serving is simply the exposure of a trained model so that it can be accessed by an endpoint. Endpoint here can be a direct user or other software.”
In this tutorial, I’m going to show you how to serve ML models using Tensorflow Serving, an efficient, flexible, high-performance serving system for machine learning models, designed for production environments.
Specifically, you will learn:
- How to install TensorFlow Serving with Docker
- How to train and save a simple image classifier with TensorFlow
- How to serve the saved model using TensorFlow Serving
At the end of this tutorial, you will be able to take any saved Tensorflow model and make it accessible for others to use.
Before you start (prerequisite)
In order to fully understand this tutorial, it is assumed that you:
- Have a basic understanding of deep learning with Tensorflow/Keras
- Have Python (version 3.5 or higher) and TensorFlow (version 2.0 or higher) installed on your local system
Now, let’s talk briefly about Tensorflow Serving (TF Serving).
Introduction to Tensorflow Serving
“TensorFlow Serving is a flexible, high-performance serving system for machine learning models, designed for production environments. TensorFlow Serving makes it easy to deploy new algorithms and experiments while keeping the same server architecture and APIs. TensorFlow Serving provides out of the box integration with TensorFlow models but can be easily extended to serve other types of models.”
— Source
Put simply, TF Serving allows you to easily expose a trained model via a model server. It provides a flexible API that can be easily integrated with an existing system.
Most model serving tutorials show how to use web apps built with Flask or Django as the model server. While this is okay for demonstration purposes, it is highly inefficient in production scenarios.
According to O'Reilly's "Building Machine Learning Pipelines" book, some of the reasons why you should not rely on traditional web apps to serve ML models include:
- Lack of efficient model version control: Properly versioning trained models is very important, and most web apps built to serve models either miss this part or, where it exists, make it very complicated to manage.
- Lack of code separation: Data science/machine learning code becomes intertwined with software/DevOps code. This is bad because the data science team is usually separate from the software/DevOps team, so proper code management becomes a burden when both teams work on the same codebase.
- Inefficient model inference: Model inference in web apps built with Flask/Django is usually inefficient.
TensorFlow Serving solves these problems for you. It handles model serving and version management, lets you serve models based on policies, and allows you to load models from different sources. It is used internally at Google and by numerous organizations worldwide.
Did you know that you can keep track of your model training thanks to TensorFlow + Neptune integration? Read more about it in our docs.
TensorFlow Serving architecture
In the image below, you can see an overview of TF Serving architecture. This high-level architecture shows the important components that make up TF Serving.

From right to left in the image above, let’s start with the model source:
- The model source provides plugins and functionality to help you load models, or servables in TF Serving terms, from numerous locations (e.g. a GCS or AWS S3 bucket). Once a model is loaded, the next component, the model loader, is notified.
- The model loaders provide the functionality to load models from a given source independent of the model type, the data type, or even the use case. In short, model loaders provide efficient functions to load and unload a model (servable) from the source.
- The model manager handles the full life cycle of a model. That is, it manages when model updates are made, which version of a model to use for inference, the rules and policies for inference, and so on.
- The servable handlers provide the necessary APIs and interfaces for communicating with TF Serving. TF Serving provides two important types of servable handlers: REST and gRPC. You'll learn the difference between the two in later parts of this tutorial.
For a more in-depth look at the TF Serving architecture, visit the official guide below:
In the next section, I’m going to take a little detour to quickly introduce Docker. This is an optional section, and I introduce it here because you’ll be using Docker when installing TF Serving.
Brief introduction to Docker and installation guide
Docker is a computer program that makes it easy for developers to package applications or software in a manner that is easily reproducible on another machine. Docker makes use of containers, which allows you to package an application along with its libraries and dependencies as a single package that can be deployed in another environment.
Docker is similar to a virtual machine, with a few minute differences, one of which is that Docker containers share the host Operating System, whereas VMs use separate OS instances to isolate applications from one another.

While I’ll not be diving deep into Docker, I’ll walk you through installing it and pulling a simple hello world Docker image.
First, download and install Docker for your OS:
After downloading and installing Docker via the respective installer, run the command below in a terminal/command prompt to confirm that Docker has been successfully installed:
docker run hello-world
which should output:
Unable to find image 'hello-world:latest' locally
latest: Pulling from library/hello-world
0e03bdcc26d7: Pull complete
Digest: sha256:4cf9c47f86df71d48364001ede3a4fcd85ae80ce02ebad74156906caff5378bc
Status: Downloaded newer image for hello-world:latest
Hello from Docker!
This message shows that your installation appears to be working correctly.
To generate this message, Docker took the following steps:
1. The Docker client contacted the Docker daemon.
2. The Docker daemon pulled the “hello-world” image from the Docker Hub. (amd64)
3. The Docker daemon created a new container from that image which runs the executable that produces the output you are currently reading.
4. The Docker daemon streamed that output to the Docker client, which sent it to your terminal.
If you get the output above, then Docker has been successfully installed on your system.
Installing Tensorflow Serving
Now that you have Docker properly installed, you’re going to use it to download TF Serving.
Note:
You can install TensorFlow Serving without Docker, but using Docker is recommended, and is certainly the easiest option.
In your terminal, run the command below:
docker pull tensorflow/serving

This will download the TensorFlow Serving image from Docker Hub, which may take some time.
If you are running Docker on an instance with GPU, you can install the GPU version as well:
docker pull tensorflow/serving:latest-gpu
Congrats! Tensorflow Serving has been installed. In the next section, you will train and save a simple image classifier using TensorFlow Keras.
Building, training, and saving an Image classification model
In order to demonstrate model serving, you’re going to create a simple Image classifier for handwritten digits using Tensorflow. If you don’t have TensorFlow installed, follow this guide here.
Note:
This is not a model optimization tutorial, so the focus is on simplicity. You won't be doing any extensive hyper-parameter tuning, and the model built may not be optimal.
The MNIST handwritten digit classification dataset is a very popular image classification task. It contains images of handwritten digits, and the task is to classify each digit as a number between 0 and 9. The dataset is so popular that TensorFlow comes prepackaged with it, so you can easily load it.

Below, I’ll walk you through loading the dataset, and then build a simple deep learning classifier.
Step 1: Create a new project directory and open it in your code editor. I call mine tf-server, and I have it open in VS Code.
Step 2: In the project folder, create a new script called model.py, and paste the code below:
import matplotlib.pyplot as plt
import time
from numpy import asarray
from numpy import unique
from numpy import argmax
from tensorflow.keras.datasets.mnist import load_data
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Conv2D
from tensorflow.keras.layers import MaxPool2D
from tensorflow.keras.layers import Flatten
from tensorflow.keras.layers import Dropout
#load MNIST dataset
(x_train, y_train), (x_test, y_test) = load_data()
print(f'Train: X={x_train.shape}, y={y_train.shape}')
print(f'Test: X={x_test.shape}, y={y_test.shape}')
# reshape data to have a single channel
x_train = x_train.reshape((x_train.shape[0], x_train.shape[1], x_train.shape[2], 1))
x_test = x_test.reshape((x_test.shape[0], x_test.shape[1], x_test.shape[2], 1))
# normalize pixel values
x_train = x_train.astype('float32') / 255.0
x_test = x_test.astype('float32') / 255.0
# set input image shape
input_shape = x_train.shape[1:]
# set number of classes
n_classes = len(unique(y_train))
# define model
model = Sequential()
model.add(Conv2D(64, (3,3), activation='relu', input_shape=input_shape))
model.add(MaxPool2D((2, 2)))
model.add(Conv2D(32, (3,3), activation='relu'))
model.add(MaxPool2D((2, 2)))
model.add(Flatten())
model.add(Dense(50, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(n_classes, activation='softmax'))
# define loss and optimizer
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
# fit the model
model.fit(x_train, y_train, epochs=10, batch_size=128, verbose=1)
# evaluate the model
loss, acc = model.evaluate(x_test, y_test, verbose=0)
print('Accuracy: %.3f' % acc)
#save model
ts = int(time.time())
file_path = f"./img_classifier/{ts}/"
model.save(filepath=file_path, save_format='tf')
The code above is pretty straightforward. First, you import the necessary packages and load the MNIST dataset prepackaged with TensorFlow. Then, you reshape the data to have a single channel (black & white) and normalize the pixel values by dividing by 255.0.
Next, you create a simple Convolutional Neural Network (CNN) with 10 units in the output layer, because you're predicting 10 classes (0–9). Then you compile the model by specifying an optimizer, a loss function, and a metric.
Next, you fit the model for 10 epochs using a batch size of 128. After fitting, you evaluate it on the test data, print the accuracy, and finally save the model.

The model is saved into a folder named with the current timestamp. Versioning saved models by timestamp like this is a good practice and is highly recommended.
You can inspect the saved model in the folder. It should be similar to the one shown below:
├── img_classifier
│ ├── 1600788643
│ │ ├── assets
│ │ ├── saved_model.pb
│ │ └── variables
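Optionally, before serving, you can load the SavedModel back in Python and confirm that it exposes a serving_default signature, which is the signature TF Serving calls by default. Below is a minimal sketch of such a sanity check; the timestamp folder name is only an example, so replace it with the one generated on your machine.
import tensorflow as tf
# load the exported SavedModel (replace the timestamp with your own folder name)
loaded = tf.saved_model.load("./img_classifier/1600788643")
# list the available signatures; TF Serving uses 'serving_default' by default
print(list(loaded.signatures.keys()))
# inspect the inputs and outputs of the serving signature
infer = loaded.signatures["serving_default"]
print(infer.structured_input_signature)
print(infer.structured_outputs)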
Serving saved model with Tensorflow Serving
Once you have your model saved, and Tensorflow Serving correctly installed with Docker, you are going to serve it as an API Endpoint.
It is worth mentioning that Tensorflow Serving allows two types of API Endpoint — REST and gRPC.
- REST is a communication "protocol" used by web applications. It defines a communication style for how clients communicate with web services. REST clients communicate with the server using standard HTTP methods like GET, POST, DELETE, etc. The payloads of the requests are mostly encoded in JSON format.
- gRPC, on the other hand, is a communication protocol initially developed at Google. The standard data format used with gRPC is the protocol buffer. gRPC provides low-latency communication and smaller payloads than REST, and is preferred when working with extremely large files during inference.
In this tutorial, you’ll use a REST Endpoint, since it is easier to use and inspect. It should also be noted that Tensorflow Serving will provision both Endpoints when you run it, so you do not need to worry about extra configuration and setup.
Follow the steps below to serve your model:
First, in your project folder, open a terminal and run the Docker command below:
docker run -p 8501:8501 --name tfserving_classifier \
--mount type=bind,source=/Users/tf-server/img_classifier/,target=/models/img_classifier \
-e MODEL_NAME=img_classifier -t tensorflow/serving
Let’s understand each argument:
- -p 8501:8501: This is the REST Endpoint port. Every prediction request will be made to this port. For instance, you can make a prediction request to http://localhost:8501.
- --name tfserving_classifier: This is a name given to the Docker container running TF Serving. It can be used to start and stop the container instance later.
- --mount type=bind,source=/Users/tf-server/img_classifier/,target=/models/img_classifier: The mount option makes the saved model available inside the Docker container by binding the host path (/Users/tf-server/img_classifier/) to the container path (/models/img_classifier), so that TF Serving can access it.
Note:
If you encounter the path error:
docker: Error response from daemon: invalid mount config for type "bind": bind source path does not exist: /User/tf-server/img_classifier/.
Then specify the full path to the model folder. Remember, not the model itself, but the model folder.
- -e MODEL_NAME=img_classifier: The name of the model to run. This is the name you used to save your model.
- -t tensorflow/serving: The TF Serving Docker image to run.
Running the command above starts the Docker container and TF Serving exposes the gRPC (0.0.0.0:8500) and REST (localhost:8501) Endpoints.

Now that the Endpoint is up and running, you can make inference calls to it via an HTTP request. Let’s demonstrate this below.
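Before writing the prediction script, you can quickly confirm that the model loaded correctly by querying TF Serving's model status endpoint. The snippet below is a minimal check with the requests library, assuming the default REST port and the model name used above; a successfully loaded model reports a state of AVAILABLE in the response.
import requests
# query the model status endpoint; a loaded model should report state "AVAILABLE"
status = requests.get("http://localhost:8501/v1/models/img_classifier")
print(status.json())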
Create a new script in your project folder called predict.py, and add the following lines of code to import some packages:
import matplotlib.pyplot as plt
import requests
import json
import numpy as np
from tensorflow.keras.datasets.mnist import load_data
The requests package is used to construct and send an HTTP call to a server, while the json package will be used to parse the data (image) before sending it.
Next, you’ll load the data and preprocess it:
#load MNIST dataset
(_, _), (x_test, y_test) = load_data()
# reshape data to have a single channel
x_test = x_test.reshape((x_test.shape[0], x_test.shape[1], x_test.shape[2], 1))
# normalize pixel values
x_test = x_test.astype('float32') / 255.0
Notice that we are only concerned with the test data here. You will load it and perform the same preprocessing steps you did during model training.
Next, define the REST Endpoint URL:
#server URL
url = 'http://localhost:8501/v1/models/img_classifier:predict'
The prediction URL is made up of a few important parts. A general structure may look like the one below:
http://{HOST}:{PORT}/v1/models/{MODEL_NAME}:{VERB}
- HOST: The domain name or IP address of your model server
- PORT: The server port for your URL. By default, TF Serving uses 8501 for the REST Endpoint.
- MODEL_NAME: The name of the model you’re serving.
- VERB: The verb has to do with your model signature. You can specify one of predict, classify or regress.
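TF Serving also exposes a metadata endpoint next to the prediction URL, which describes the served model's signature (its input and output tensors). This can be handy if you are unsure what the model expects. A small sketch, assuming the same host, port, and model name as above:
import requests
# the metadata endpoint returns the signature_def of the served model
metadata_url = "http://localhost:8501/v1/models/img_classifier/metadata"
response = requests.get(metadata_url)
print(response.json())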
Next, add a function to make a request to the Endpoint:
def make_prediction(instances):
    data = json.dumps({"signature_name": "serving_default", "instances": instances.tolist()})
    headers = {"content-type": "application/json"}
    json_response = requests.post(url, data=data, headers=headers)
    predictions = json.loads(json_response.text)['predictions']
    return predictions
In the prediction code above, you first define a JSON data payload. TF Serving expects the data as JSON, in the format:
{"signature_name": "<string>", "instances": <value>}
The "signature_name" is optional and can be ignored. "instances", on the other hand, is the data/input you want to predict on, and it should be passed as a list.
After constructing the parameters, you send a request to the Endpoint and load the returned response.
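If you would rather have the request fail loudly when TF Serving reports a problem (for example, when the payload shape doesn't match the model's signature), a slightly more defensive variant of the function can help. The sketch below reuses the url, json, and requests already defined in predict.py; make_prediction_safe is just an illustrative name, and it relies on TF Serving's REST API returning a JSON body with an "error" field on failure.
def make_prediction_safe(instances):
    data = json.dumps({"signature_name": "serving_default", "instances": instances.tolist()})
    headers = {"content-type": "application/json"}
    response = requests.post(url, data=data, headers=headers)
    body = response.json()
    if response.status_code != 200 or "error" in body:
        # TF Serving reports problems via an "error" field in the JSON response
        raise RuntimeError(f"Prediction request failed: {body.get('error', response.text)}")
    return body["predictions"]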
To test this, you’ll make predictions on 4 test images as shown below:
Note:
To run the predict.py file, ensure the TF Serving container is still active, then run python predict.py in a new terminal window.
predictions = make_prediction(x_test[0:4])
# output
[[1.55789715e-12, 1.01289466e-08, 1.07480628e-06, 1.951177e-08, 1.01430878e-10,
5.59054842e-12, 1.90570039e-17, 0.999998927, 4.16908175e-10, 5.94038907e-09],
[6.92498414e-09, 1.17453965e-07, 0.999999762, 5.34944755e-09, 2.81366846e-10,
1.96253143e-13, 9.2470593e-08, 3.83119664e-12, 5.33368405e-10, 1.53420621e-14],
[3.00994889e-11, 0.999996185, 4.14686845e-08, 3.98606517e-08, 3.23575978e-06,
1.82125728e-08, 2.17237588e-08, 1.60862257e-07, 2.42824342e-07, 4.56675897e-09],
[0.999992132, 5.11100086e-11, 2.94807769e-08, 1.22479553e-11, 1.47668822e-09,
4.50467552e-10, 7.61841738e-06, 2.56232635e-08, 6.94065747e-08, 2.13664606e-07]]
This returns a 4 by 10 array corresponding to the 4 images you predicted on, and a probability value for each class (0–9).
To get the actual predicted class, you can use the `np.argmax` function as shown below:
for pred in predictions:
    print(np.argmax(pred))
# output
7
2
1
0
You can also check how correct the predictions are by comparing with the true values as shown below:
for i, pred in enumerate(predictions):
    print(f"True Value: {y_test[i]}, Predicted Value: {np.argmax(pred)}")
# output
True Value: 7, Predicted Value: 7
True Value: 2, Predicted Value: 2
True Value: 1, Predicted Value: 1
True Value: 0, Predicted Value: 0
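Since matplotlib is already imported at the top of predict.py, you can also sanity-check a prediction visually by plotting the digit itself. A minimal sketch, reusing x_test and predictions from above:
# display the first test image alongside its predicted label
plt.imshow(x_test[0].reshape(28, 28), cmap='gray')
plt.title(f"Predicted: {np.argmax(predictions[0])}")
plt.axis('off')
plt.show()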
The complete code for predict.py is shown below:
import matplotlib.pyplot as plt
import requests
import json
import numpy as np
from tensorflow.keras.datasets.mnist import load_data

#load MNIST dataset
(_, _), (x_test, y_test) = load_data()
# reshape data to have a single channel
x_test = x_test.reshape((x_test.shape[0], x_test.shape[1], x_test.shape[2], 1))
# normalize pixel values
x_test = x_test.astype('float32') / 255.0

#server URL
url = 'http://localhost:8501/v1/models/img_classifier:predict'

def make_prediction(instances):
    data = json.dumps({"signature_name": "serving_default", "instances": instances.tolist()})
    headers = {"content-type": "application/json"}
    json_response = requests.post(url, data=data, headers=headers)
    predictions = json.loads(json_response.text)['predictions']
    return predictions

predictions = make_prediction(x_test[0:4])

for i, pred in enumerate(predictions):
    print(f"True Value: {y_test[i]}, Predicted Value: {np.argmax(pred)}")
And that’s it! You have been able to:
- save a trained model,
- start a TF serving server,
- and send prediction requests to it.
TF Serving handles all the model and API infrastructure for you so that you can focus on model optimization.
Note that TF Serving will automatically load a new model once it becomes available in the model folder. For instance, change some parameters of the model, like the number of epochs, and then retrain it.
Once training is done and you have saved the model, TF Serving automatically detects this new model, unloads the old one, and loads the newer version.

This automatic hot-swapping of models is very efficient, and can easily be built into an ML CI/CD pipeline, so that you focus more on model optimization instead of model serving infrastructure.
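Related to version management: if you ever need to query one specific version explicitly, TF Serving's REST API accepts a version segment in the prediction URL. The sketch below is an illustration only; the version number is an example and must match a version that is currently loaded (by default, TF Serving serves only the latest version found in the model folder).
import json
import numpy as np
import requests
# pin the request to a specific model version (example timestamp folder name)
version = 1600788643
url = f"http://localhost:8501/v1/models/img_classifier/versions/{version}:predict"
# a single dummy 28x28x1 instance, just to illustrate the request shape
instance = np.zeros((1, 28, 28, 1), dtype='float32')
data = json.dumps({"instances": instance.tolist()})
response = requests.post(url, data=data, headers={"content-type": "application/json"})
print(response.json()["predictions"])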
Best practices of using Tensorflow Serving
- It is advisable, and easier, to use TF Serving via Docker containers, as this can be easily integrated with existing systems. If you need a more custom build or installation, you can build TF Serving from source. Follow the guide here.
- When working with large datasets during inference, it is more efficient to use gRPC as your Endpoint; a hedged gRPC sketch is shown after this list. Also, see setting up TF Serving on Kubernetes.
- Path errors may occur when loading models during the docker run stage. You can fix this by specifying the full path to the model folder instead of a relative path.
- It is advisable to build TF Serving into TFX pipelines, so that a model is automatically vetted before it is served by TF Serving.
- Sometimes the default port 8501 may be unavailable or in use by another system process; you can easily change this to another port when running the Docker image.
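As mentioned in the gRPC bullet above, a prediction over the gRPC Endpoint (port 8500) goes through the tensorflow-serving-api package instead of plain HTTP. The sketch below is an assumption-heavy illustration rather than a drop-in script: the input key ("conv2d_input") depends on your model's serving_default signature, so check it first (for example via the metadata endpoint shown earlier), and note that the docker run command used in this tutorial only published port 8501, so you would also need to publish port 8500 (-p 8500:8500) for this to work from the host.
import grpc
import numpy as np
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc
# connect to TF Serving's gRPC Endpoint
channel = grpc.insecure_channel("localhost:8500")
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)
# build the prediction request
request = predict_pb2.PredictRequest()
request.model_spec.name = "img_classifier"
request.model_spec.signature_name = "serving_default"
# a single dummy 28x28x1 instance; "conv2d_input" is an assumed input key
instance = np.zeros((1, 28, 28, 1), dtype=np.float32)
request.inputs["conv2d_input"].CopyFrom(tf.make_tensor_proto(instance))
# send the request with a 10-second timeout and print the raw output tensors
result = stub.Predict(request, 10.0)
print(result.outputs)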
Conclusion
In this tutorial, you have learned how to:
- Install Tensorflow Serving via Docker
- Train and save a Tensorflow image classifier
- Serve the saved model via REST Endpoint
- Make inference with the model via the TF Serving Endpoint
Armed with this knowledge, you can build efficient model pipelines for production environments that not only scale, but scale properly!
Link to the project code on GitHub