NLP is currently one of the most exciting areas of ML, as the advent of Transformers and large language models such as GPT and BERT has redefined what is possible in the field. However, much of the focus in blogs and the popular media is on the models themselves rather than on highly practical details such as how to deploy these models in production. This article seeks to bridge that gap and explain some best practices for NLP model deployments.
We will discuss many of the critical aspects of the model deployment process, such as:
- choosing a model framework,
- deciding on an API backend,
- creating a microservice using Flask,
- containerizing your model using tools such as Docker,
- monitoring deployments,
- and scaling up cloud infrastructure using tools such as Kubernetes and services such as AWS Lambda.
In doing so, we will walk through a small example of deploying a text classification model from start to finish and offer some thoughts on model deployment best practices.
Read also
Building MLOps Pipeline for NLP: Machine Translation Task [Tutorial]
Model training frameworks vs model deployment
Your choice of NLP framework will have an impact on how the model is deployed. Scikit-learn is a popular choice for simple classification models such as SVMs, Naive Bayes, or logistic regression, and it integrates well with Python backends. spaCy is also well recognized for its all-in-one language processing features such as sentence parsing, part-of-speech tagging, and named entity recognition. It, too, is a Python package.
Deep learning-based models are often written in the PyTorch library for Python, as its define-by-run autograd interface is ideal for building models which might create computational graphs in response to dynamic inputs, such as the parsing of variable-length sentences. Many popular libraries are also built on top of PyTorch such as HuggingFace Transformers, which is a respected go-to for working with pre-trained transformer models. Clearly, the Python ecosystem is extremely popular for ML and NLP; however, there are alternatives.
Pre-trained word embeddings such as FastText, GloVe, and Word2Vec can simply be read in from a text file and used with any backend language and framework. TensorFlow.js is an extension of TensorFlow that allows deep learning models to be written directly in JavaScript and deployed in backends using Node.js. The Microsoft CNTK framework can be easily integrated into .NET- and C#-based backends. Similar machine learning packages can be found for many other languages and frameworks, although their quality varies. Still, Python is the de facto standard for creating and deploying machine learning and NLP models.
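To illustrate how framework-agnostic pre-trained embeddings are, here is a minimal sketch of reading GloVe vectors from their plain-text format. The file name is just an example; any of the published GloVe files follows the same `word value value ...` layout.

import numpy as np

def load_glove(path):
    # Read GloVe vectors from a plain-text file into a {word: vector} dict
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, *values = line.rstrip().split(" ")
            embeddings[word] = np.asarray(values, dtype=np.float32)
    return embeddings

# Example usage (file name is illustrative):
# vectors = load_glove("glove.6B.100d.txt")
# print(vectors["language"].shape)  # (100,)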
May be useful
How to Structure and Manage Natural Language Processing (NLP) Projects
Backend frameworks vs model deployment
Your choice of backend framework is critical to a successful model deployment. While any combination of language and framework can technically work, it’s often nice to be able to use a backend developed in the same language as your model. This makes it easy to just import your model into your backend system without having to serve requests between different interacting backend services or port between different systems. It also reduces the chance of introducing errors and keeps the backend code clean, free from clutter and unnecessary libraries.
The two major backend solutions within the Python ecosystem are Django and Flask. Flask is recommended for quickly prototyping model microservices as it makes it easy to get a simple server up and running in a few lines of code. However, if you are going to be building a production system, Django is more fully-featured and integrates the popular Django REST Framework for building complex, API-driven backends.
HuggingFace, a popular NLP library, also offers an easy way to deploy models via their Inference API. When you build a model using the HuggingFace library, you can then train it and upload it to their Model Hub. From there, they offer a scalable compute backend which serves models hosted in the hub. With just a few lines of code, and for the price of a few dollars per day, anyone can deploy secure, scalable NLP models built with the HuggingFace library.
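To give a feel for how lightweight this is, here is a hedged sketch of calling the Inference API over HTTP with the `requests` library. The model ID and token below are placeholders, and the exact payload format depends on your model's task, so consult the Inference API docs for specifics.

import requests

API_URL = "https://api-inference.huggingface.co/models/distilbert-base-uncased-finetuned-sst-2-english"
headers = {"Authorization": "Bearer <YOUR_HF_API_TOKEN>"}  # placeholder token

def query(text):
    # The Inference API accepts a JSON payload with an "inputs" field
    response = requests.post(API_URL, headers=headers, json={"inputs": text})
    return response.json()

# print(query("I love this movie!"))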
Another great, NLP-specific deployment solution is Novetta’s AdaptNLP:
- They provide a variety of easy-to-use integrations for rapidly prototyping and deploying NLP models. For example, they have a series of methods that integrate training of different types of HuggingFace NLP models using FastAI callbacks and functionality, thereby speeding up both training and inference in deployment.
- They also provide ready-to-use REST API microservices, packaged as Docker containers, around a variety of HuggingFace model types such as question answering, token tagging, and sequence classification. These APIs have full-fledged Swagger UIs that provide a clean interface for testing the models.
Hands-on deployment of the NLP model
Now, let’s take a look at how a logistic regression text classifier can be deployed using Flask. We’ll be training the classifier to predict whether an email is “spam” or “ham”.
You can visit this Kaggle page and download the dataset. Then, run the following command to create a conda environment to host your Python and library installs for this tutorial.
conda create -n model-deploy python=3.9.7
Once the setup has finished, activate the environment by running:
conda activate model-deploy
Then, install our needed libraries by running:
pip install Flask scikit-learn
While you’re waiting, go ahead and take a look at the CSV dataset that you downloaded. It has a header that specifies two fields: “Category” (which will be our label) and “Message” (which will be our model input).
Now, open your code editor and start typing. First, we’ll build the classification model. Since this post is intended to be a tutorial on deployment, we won’t walk through all the model details, but we provide its code below.
Making the required imports.
import csv
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
Creating the required functions.
def load_data(fpath):
    # map ham -> 0, spam -> 1
    cat_map = {
        "ham": 0,
        "spam": 1
    }
    tfidf = TfidfVectorizer()
    msgs, y = [], []
    with open(fpath, "r") as filein:
        reader = csv.reader(filein)
        for i, line in enumerate(reader):
            if i == 0:
                # skip over the header
                continue
            cat, msg = line
            y.append(cat_map[cat])
            msg = msg.strip()  # remove newlines
            msgs.append(msg)
    X = tfidf.fit_transform(msgs)
    return X, y, tfidf

def featurize(text, tfidf):
    features = tfidf.transform(text)
    return features

def train(X, y, model):
    model.fit(X, y)
    return model

def predict(X, model):
    return model.predict(X)

clf = LogisticRegression()
X, y, tfidf = load_data('spamorham.csv')
train(X, y, clf)
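Before wiring this into an API, you can sanity-check the trained classifier directly, assuming the code above is saved as `model.py` (matching the `import model` in the next section) or simply run in the same session:

# quick sanity check: a spammy-looking message should map to class 1
sample = ["Congratulations! You have won a free prize, call now!"]
X_sample = featurize(sample, tfidf)
print(predict(X_sample, clf))  # likely [1], i.e., spam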
Now let’s set up Flask and create the endpoints for our model serving microservice. We’ll first want to import Flask and create a simple app.
import model
import json
from flask import (
Flask,
request
)
app = Flask(__name__)
app.config["DEBUG"] = True
@app.route('/predict', methods=['POST'])
def predict():
    args = request.json
    X = model.featurize([args['text']], model.tfidf)
    labels = model.predict(X, model.clf).tolist()
    return json.dumps({'predictions': labels})

app.run()
As you can see, we constructed a Flask app and ran it in “DEBUG” mode to start with so that it will alert us if any errors arise. Our app has a single `route` defined with the endpoint `/predict`. This is a POST endpoint that takes in a string of text and classifies it as either “ham” or “spam”.
We access the POST arguments as `request.json`. There is a single argument, ‘text’, which specifies the text of the email message we’d like to classify. For efficiency, we could also rewrite this endpoint to classify multiple pieces of text at once; a sketch of one way to do that follows the next paragraph, and you can try adding the feature yourself if you want to :).
The predict function is simple. It takes in an email message, converts it into a TF-IDF feature vector, then runs the trained logistic regression classifier to predict whether it is spam or ham.
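As mentioned above, here is one hedged sketch of a batch variant. The `/predict_batch` route and the `texts` field are illustrative additions, not part of the original service:

@app.route('/predict_batch', methods=['POST'])
def predict_batch():
    # expects {"texts": ["message 1", "message 2", ...]}
    args = request.json
    X = model.featurize(args['texts'], model.tfidf)
    labels = model.predict(X, model.clf).tolist()
    return json.dumps({'predictions': labels})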
Let’s now test out the app to make sure it works! To do so, run the following at your command line:
python deploy.py
This will spin up the Flask server on http://localhost:5000. Now, open a separate Python prompt (make sure the `requests` library is installed, e.g. via `pip install requests`) and run the following.
import requests
res = requests.post('http://127.0.0.1:5000/predict', json={"text": "You are a winner U have been specially selected 2 receive £1000 or a 4* holiday (flights inc) speak to a live operator 2 claim 0871277810910p/min (18+)"})
You can see that we’re making a POST request to the `/predict` endpoint with a json field that specifies the email message under the argument “text”. This is clearly a spam message. Let’s see what our model returns. To get the response from the API, just run `res.json()`. You should see the following result:
{'predictions': [1]}
You can also test your request by sending it from Postman, as shown here:
All you need to do is type in your URL, set the request type to POST, and put the JSON for your request in the “Body” field of the request. Then you should see your predictions returned as in the lower window.
As you can see, the model returned a prediction with value 1, meaning it classified the message as spam. Huzzah! And there you have it, the basics of deploying an NLP model with Flask.
In the following sections, we’ll discuss more advanced concepts, such as how to scale a deployment up to handle larger request loads.
Containerization in the context of model deployment
A crucial part of any model deployment is containerization. Tools such as Docker allow you to package your code within a container which is basically a virtual runtime that contains system tools, program installs, libraries, and anything else that’s needed to run your code.
Containerizing your services makes them more modular and allows them to run on any system that has Docker installed. With containers, your code should always just work without any pre-configuration or messy install steps. Containers also make it easy to handle large-scale deployments of your services across many machines using orchestration tools such as Docker Compose and Kubernetes (we will touch on Kubernetes later in this tutorial).
Here we walk through a Dockerfile for our text classification microservice. To get this container to work, you’ll need to create a `requirements.txt` file which specifies the packages needed to run our microservice.
You can create it by running this command in your terminal in that directory.
pip freeze > requirements.txt
We’ll also need to make one change to our Flask script to get it working inside Docker. Just change the line that says `app.run()` to `app.run(host='0.0.0.0')`.
Now onto the Dockerfile.
FROM python:3.9.7-slim
COPY requirements.txt /app/requirements.txt
RUN cd /app && \
    pip install -r requirements.txt
ADD . /app
WORKDIR /app
ENTRYPOINT ["python", "deploy.py"]
Let’s try to understand what these lines mean.
- The first line, `FROM python:3.9.7-slim`, specifies the base image for our container. You can think of it as the image that our image inherits libraries, system configurations, and other elements from. The base image we use provides a minimal installation of Python v3.9.7.
- The next line copies our `requirements.txt` file into the Docker image under the `/app` directory. `/app` will house our application files and related resources.
- In the following lines, we `cd` into `/app` and install our needed Python libraries by running `pip install -r requirements.txt`.
- Now, we add the contents of our current build directory into the /app folder with `ADD . /app`. This will copy over all of our Flask and model scripts.
- Finally, we set the container’s working directory to /app by running `WORKDIR /app`. We then specify the ENTRYPOINT, which is the command that the container will run when it is launched. We set it to run `python deploy.py`, which launches our Flask server.
To build your Docker image, run `docker build -t spam-or-ham-deploy .` from the directory that contains your Dockerfile. Assuming everything is working correctly, you should get a readout of the build process that looks something like this:

We can also now see our container image listed in Docker Desktop:

Next, to run your Docker container containing the Flask deployment script, type:
docker run -p 5000:5000 -t spam-or-ham-deploy
The `-p 5000:5000` flag maps port 5000 inside the container to port 5000 on your host, making the container’s service accessible from your machine. Now that the container is running, we can view some of its stats in Docker Desktop:

We can also try running the same request again in POSTMAN just as before:

Cloud deployment
So far, our API has just been designed to handle moderate request loads. If you’re deploying a large-scale service to millions of customers, you will need to make many adjustments to how you deploy the model.
Kubernetes
Kubernetes is a tool for orchestrating containers across large deployments. With Kubernetes, you can effortlessly deploy multiple containers across many machines and monitor all of these deployments. Learning to work with Kubernetes is an essential skill for scaling to larger deployments.
Check also
Kubernetes vs Docker: What You Should Know as a Machine Learning Engineer
To run Kubernetes locally, you will have to install minikube.
Once you’ve done that, run `minikube start` at your terminal. It will take a few minutes to download Kubernetes and the base image. You will get a readout like this:

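One practical note before creating the deployment: kubectl pulls container images from a registry by default, so a locally built image such as `spam-or-ham-deploy` usually needs to be loaded into the minikube cluster first (and the deployment configured so it does not try to pull the image remotely). Something along these lines should work:

minikube image load spam-or-ham-deploy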
Next we’ll want to create a deployment by running:
kubectl create deployment hello-minikube --image=spam-or-ham-deploy
We then want to expose our deployment using:
kubectl expose deployment hello-minikube --type=NodePort --port=5000
If we then run `kubectl get services hello-minikube`, it will display some useful information about our service:

We can then launch the service in a browser by running `minikube service hello-minikube`.

You can also view your service in the dashboard by running `minikube dashboard`.

For more information, view the Kubernetes getting started docs.
AWS Lambda
If you prefer a more automated solution, serverless compute services such as AWS Lambda can be quite useful. These are event-driven services, meaning they automatically spin up and manage compute resources in response to the request load they’re experiencing. All you need to do is define Lambda functions that run your model inference code, and AWS Lambda will handle the deployment and scaling process for you.
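As a rough illustration, a Lambda handler for our spam classifier might look something like the sketch below. It assumes the trained vectorizer and classifier have been serialized and packaged with the function (the `tfidf.joblib` and `clf.joblib` file names are hypothetical) and that requests arrive through an API Gateway proxy integration:

import json
import joblib

# loaded once per container, outside the handler, so warm invocations reuse it
tfidf = joblib.load("tfidf.joblib")  # hypothetical artifact names
clf = joblib.load("clf.joblib")

def handler(event, context):
    body = json.loads(event["body"])  # assumes an API Gateway proxy event
    X = tfidf.transform([body["text"]])
    pred = int(clf.predict(X)[0])
    return {"statusCode": 200, "body": json.dumps({"prediction": pred})}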
You can learn some more about deploying models on AWS here.
TorchServe
If you’re working with deep learning NLP models such as Transformers, the TorchServe library for PyTorch is a great resource for scaling and managing your PyTorch deployments. It has a REST API as well as a gRPC API for defining remote procedure calls. It also includes helpful tools for handling logging, tracking metrics, and monitoring a deployment.
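Once a model has been archived and started with TorchServe, making a prediction is just an HTTP call to its inference API (by default on port 8080). The sketch below assumes a model registered under the placeholder name `my_transformer`; the exact payload format depends on the handler you configure:

import requests

# TorchServe's REST inference API exposes /predictions/<model_name>
resp = requests.post(
    "http://localhost:8080/predictions/my_transformer",
    json={"text": "This movie was fantastic!"},
)
print(resp.json())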
Challenges in NLP model deployment
1. A crucial aspect of NLP model deployment is ensuring a proper MLOps workflow. MLOps tools allow you to ensure model reproducibility by tracking the steps involved in the training and inference of a model. This includes versioning data, code, hyperparameters, and validation metrics.
MLOps tools such as neptune.ai and MLflow provide APIs for tracking and logging parameters (such as attention coefficients), metrics (such as NLP model perplexity), code versions (really, anything managed via Git), model artifacts, and training runs. Monitoring the training and deployment of your NLP model with such tools is crucial for preventing model drift and ensuring that the model continues to accurately reflect the entirety of the data in the system.
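As a small, hedged example of what such tracking looks like in practice with MLflow (the parameter names and metric value are illustrative):

import mlflow

with mlflow.start_run(run_name="spam-classifier"):
    # log hyperparameters and evaluation metrics for this training run
    mlflow.log_param("vectorizer", "tfidf")
    mlflow.log_param("model", "logistic_regression")
    mlflow.log_metric("val_accuracy", 0.97)  # illustrative value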
May be useful
Check what metadata you can log, track and compare in Neptune
2. Another challenge is that NLP models might need to be periodically retrained. For example, consider the use case of a translation model deployed in production. As a business adds more customers in different countries, it might want to add more language translation pairs to the model. In such cases, it’s important to ensure that adding new training data and retraining does not degrade the existing model quality. For this reason, continuous model monitoring, as described above, of various NLP metrics is really important.
3. NLP models might also need to be trained incrementally and online in production. For example, if deploying a model for emotion detection from raw text, you might suddenly get some new data corresponding to the “sarcasm” emotion.
Usually, in such cases, you would not want to retrain the entire model from scratch, particularly if the model is large. Luckily, there are many algorithms and libraries that can be used to deploy streaming NLP models in production. For example, scikit-multiflow implements classification algorithms such as Hoeffding trees, which are designed to be trained incrementally in sublinear time.
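To make the incremental-update pattern concrete, here is a minimal sketch using scikit-learn’s `SGDClassifier` and `HashingVectorizer` (scikit-multiflow’s Hoeffding tree classifiers expose a similar `partial_fit`-style interface); the example data is, of course, illustrative:

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

vectorizer = HashingVectorizer(n_features=2**16)  # stateless, so no refitting needed
clf = SGDClassifier()                             # a linear classifier trained online

# initial mini-batch; classes must be declared up front for partial_fit
X0 = vectorizer.transform(["free prize, call now", "see you at lunch"])
clf.partial_fit(X0, [1, 0], classes=[0, 1])

# later, as new labeled messages stream in, update the model in place
X_new = vectorizer.transform(["win cash instantly"])
clf.partial_fit(X_new, [1])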
See also
Deploying Large NLP Models: Infrastructure Cost Optimization
Conclusion
There are a number of considerations that must be taken into account when deploying an NLP model such as the scale of the deployment, the type of NLP model being deployed, inference latency and server load, and more.
Best practices for deploying NLP models include using a Python backend such as Django or Flask, containerization with Docker, MLOps management with MLflow or Kubeflow, and scaling with services such as AWS Lambda or Kubernetes.
For those who don’t want the hassle of handling large-scale deployments themselves, there are easy-to-use paid services such as HuggingFace’s Inference API which handle the deployments for you. While it takes some time to get up to speed on how to optimally deploy NLP models, it’s an investment well worth making as it ensures that you can make your model available to the rest of the world!
References
- https://scikit-multiflow.github.io/
- https://www.tutorialspoint.com/what-is-hoeffding-tree-algorithm
- https://huggingface.co/docs/transformers/index
- https://huggingface.co/inference-api
- https://scikit-learn.org/0.21/documentation.html
- https://docs.aws.amazon.com/lambda/index.html
- https://pytorch.org/serve/
- https://minikube.sigs.k8s.io/docs/start/
- https://github.com/Novetta/adaptnlp
- https://docs.docker.com/
- https://flask.palletsprojects.com/en/2.1.x/