
MLOps Tools for NLP Projects

11 min
5th September, 2023

Machine learning chatbots, summarizing apps, Siri, Alexa – these are just a few cool Natural Language Processing (NLP) projects which are already adopted at mass scale. Have you ever wondered how they’re managed, continuously improved, and maintained? This is exactly the question that we’re going to answer in this article.

For example, Google’s autocorrect gets better every time, but not because they came up with a super good model that doesn’t need any maintenance. It gets better every time because there’s a pipeline, put in place early on for automating and improving the model by performing all ML tasks over and over again when it gets new data. It’s an example of MLOps at its finest.

In this article, I’ll tell you about various MLOps tools you can use for NLP projects. This includes cool open-source MLOps platforms, along with some code to help you get started. I’ll also do a comparison of all the tools, to help you navigate and choose the best tool for any framework you want to use.

Here are the assumptions I made when writing the article, just so we’re on the same page:

  • You understand what NLP is. You don’t need to know much; the basics and a rough idea of the process are enough. 
  • You’re familiar with the process involved in building machine learning projects. Again, you don’t need to know too much. You should have built at least one machine learning project before, just so you know the terms I’ll be using.
  • You’re open-minded and ready to learn! 

If you’re an MLOps expert, you can skip the introduction and go straight to the tools.

What is MLOps?

Data changes over time, which makes machine learning models stale. ML models learn patterns in data, but these patterns change as the trends and behaviors change.

We can’t prevent data from always changing, but we can keep our model updated with the new trends and changes. To do this, we need an automated pipeline. This automated process is known as MLOps.

MLOps is a set of practices for collaboration and communication between data scientists and operations professionals.

Please note that MLOps is not fully automated, at least not yet. You still have to do some things manually, but it’s far easier than having no workflow at all.
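To make the idea of a “stale” model concrete, here’s a minimal sketch of one possible staleness check for text data: it measures how many tokens in newly arrived text were never seen in the training text, and flags the model for retraining above a threshold. The function names, the whitespace tokenizer, and the 0.3 threshold are all assumptions made for illustration; they don’t come from any particular MLOps tool.

```python
def tokenize(text):
    # Deliberately naive whitespace tokenizer (an assumption for this sketch);
    # a real project would reuse the tokenizer of the deployed model.
    return text.lower().split()


def oov_rate(train_texts, new_texts):
    """Fraction of tokens in new_texts that never appear in train_texts."""
    vocab = {tok for text in train_texts for tok in tokenize(text)}
    new_tokens = [tok for text in new_texts for tok in tokenize(text)]
    if not new_tokens:
        return 0.0
    unseen = sum(1 for tok in new_tokens if tok not in vocab)
    return unseen / len(new_tokens)


def needs_retraining(train_texts, new_texts, threshold=0.3):
    # Hypothetical rule: retrain once too many incoming tokens are unknown.
    return oov_rate(train_texts, new_texts) > threshold


train_texts = ["the flight was late", "the food was great"]
new_texts = ["the flight was great", "my vaccine passport was rejected"]

print(round(oov_rate(train_texts, new_texts), 3))
print(needs_retraining(train_texts, new_texts))
```

In a real pipeline, a check like this would run on a schedule, and a positive result would start the retraining job instead of just printing a flag.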

How does MLOps work?

MLOps, or Machine Learning Operations, is different from DevOps. 

DevOps is a popular practice in developing and operating large-scale software systems. It has two concepts in software system development: continuous integration (CI) and continuous delivery (CD).

A typical DevOps cycle is:

  • Code,
  • Test,
  • Deploy,
  • Monitor.

In ML projects, there are a lot of other processes like data collection and processing, feature engineering, training, and evaluating ML models, and DevOps can’t handle all of this. 

MLOps Lifecycle | Source

In MLOps, you have:

  • data coming into the system, which is usually the entry point,
  • code to preprocess the data and select useful features,
  • code to train the model and evaluate it,
  • code to test and validate it,
  • code to deploy it,
  • and so on.
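The stages above can be sketched as a chain of plain Python functions. This is a toy illustration under heavy assumptions: the “model” is just a word-to-label lookup, “deployment” is a boolean flag, and every function name and the 0.5 accuracy gate are made up for the example.

```python
from collections import Counter


def preprocess(rows):
    # Lowercase the text and drop rows whose text is empty.
    return [(text.lower().strip(), label) for text, label in rows if text.strip()]


def train(rows):
    # Toy "model": map each word to the label it appeared with most often.
    counts = {}
    for text, label in rows:
        for word in text.split():
            counts.setdefault(word, Counter())[label] += 1
    return {word: by_label.most_common(1)[0][0] for word, by_label in counts.items()}


def predict(model, text, default="neg"):
    # Majority vote over the labels of known words; fall back to a default.
    votes = Counter(model[word] for word in text.lower().split() if word in model)
    return votes.most_common(1)[0][0] if votes else default


def evaluate(model, rows):
    correct = sum(predict(model, text) == label for text, label in rows)
    return correct / len(rows)


def run_pipeline(train_rows, test_rows, min_accuracy=0.5):
    model = train(preprocess(train_rows))
    accuracy = evaluate(model, preprocess(test_rows))
    # Validation gate: only "deploy" if the model is good enough.
    return {"accuracy": accuracy, "deployed": accuracy >= min_accuracy}


train_rows = [("Great movie", "pos"), ("Terrible plot", "neg"), ("   ", "pos")]
test_rows = [("great movie twist", "pos"), ("terrible", "neg")]
result = run_pipeline(train_rows, test_rows)
print(result)
```

The point is the shape rather than the model: each stage takes the previous stage’s output, so the whole chain can be rerun automatically whenever new data arrives.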

To deploy your model to production, you need to push it through a CI/CD pipeline. 

Once it’s in production:

  • You need to always check performance and make sure it’s reliable,
  • You need an automated alert or triggering system to inform you of issues and to make sure the changes fix the issues raised.
MLOps automated pipeline | Source
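As a rough sketch of the monitoring and alerting idea, the class below keeps a rolling window of recent prediction outcomes and raises an alert flag when accuracy in that window drops below a threshold. The class name, the window size, and the 0.75 threshold are hypothetical choices for this example, not part of any specific monitoring product.

```python
from collections import deque


class AccuracyMonitor:
    def __init__(self, window=4, min_accuracy=0.75):
        # Only the most recent `window` outcomes are kept.
        self.outcomes = deque(maxlen=window)
        self.min_accuracy = min_accuracy

    def record(self, prediction, truth):
        """Store one outcome and report whether an alert should fire."""
        self.outcomes.append(prediction == truth)
        return self.should_alert()

    def accuracy(self):
        return sum(self.outcomes) / len(self.outcomes)

    def should_alert(self):
        # Wait until the window is full to avoid noisy early alarms.
        window_full = len(self.outcomes) == self.outcomes.maxlen
        return window_full and self.accuracy() < self.min_accuracy


monitor = AccuracyMonitor(window=4, min_accuracy=0.75)
stream = [("pos", "pos"), ("neg", "neg"), ("pos", "pos"),
          ("pos", "neg"), ("neg", "pos"), ("pos", "pos")]
alerts = [monitor.record(pred, truth) for pred, truth in stream]
print(alerts)
```

In production, the alert would typically notify someone or trigger retraining; here it is just a boolean so the logic stays easy to follow.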

Why do we need MLOps?

Whatever kind of solution you’re trying to deploy, MLOps is fundamental to the success of your project.

MLOps not only helps teams collaborate and integrate ML into their technologies, it also lets data scientists do what they do best: develop models. MLOps automates the retraining, testing, and deployment that data scientists used to do manually.

Machine learning helps deploy solutions that unlock previously untapped sources of revenue, save time, and reduce cost by creating more efficient workflows, leveraging data analytics for decision-making, and improving customer experience. These goals are hard to accomplish without a solid framework like MLOps to follow.

How to choose a good MLOps tool

Choosing a suitable MLOps tool for your NLP project depends on the architecture of your solution. 

Your choice also depends on your project’s needs, maturity, and scale of deployment. Your project must be properly structured (Cookiecutter is a good project structuring tool that will help you do that).

Manasi Vartak, founder and CEO of Verta, pointed out some criteria you should check before selecting any MLOps tool:

  • It should be data scientist-friendly, not restricting your data science teams to work on specific tools and frameworks.
  • It should be easy to install, easy to set up, and easy to customize.
  • It should integrate freely with your existing platform.
  • It should be able to reproduce results; reproducibility is critical whether you are collaborating with team members, debugging a production failure, or iterating an existing model. 
  • It should scale well; choose a platform that meets your current needs and can scale for the future for both real-time and batch workloads, serving high-throughput scenarios, scaling automatically with the increasing traffic, with easy cost management and safe deployment and release practices.

Best open-source MLOps tools for your NLP projects

Every MLOps tool has its own architecture. The open-source platforms listed below are specific to NLP projects and are rated by the number of GitHub stars they have. Some of the commercial platforms are specifically for NLP projects, but others can generally be used for any ML project. 

AdaptNLP (329 GitHub stars)

It’s a high-level framework and library for running, training, and deploying state-of-the-art Natural Language Processing (NLP) models for end-to-end tasks. It was built on top of Zalando Research’s Flair and Hugging Face’s Transformers library. 

AdaptNLP provides machine learning researchers and scientists with a modular and adaptive approach to a variety of NLP tasks, with an easy API for training, inference, and deploying NLP-based microservices. You can deploy your AdaptNLP models with FastAPI, either locally or using Docker.

AdaptNLP features:

  • The API is unified for NLP tasks with SOTA pretrained models. You can use it with Flair and Transformer models.
  • Provides an interface for training and fine-tuning your models.
  • Easily and instantly deploy your NLP model with the FastAPI framework.
  • You can easily build and run AdaptNLP containers on GPUs using Docker.

Installation Requirement for Linux/Mac:

I advise installing it in a new virtual environment to prevent dependency conflicts. If you have Python 3.7 installed, you’ll need to install the latest stable version of PyTorch (v1.7), and if you have Python 3.6, you’ll have to downgrade PyTorch to a version <=1.6. 

Installation Requirement for Windows:

If you don’t already have PyTorch installed, you’ll have to install it manually from the PyTorch website.

Using pip,

pip install adaptnlp

or if you want to contribute to the development, 

pip install adaptnlp[dev]

AutoGluon (3.5k GitHub stars)

AutoGluon is simply AutoML for text, image, and tabular data. It enables you to easily extend AutoML to areas like deep learning, stack ensembling, and other real-world applications. It automates machine learning tasks and gives your model strong predictive performance in your applications.

In just a few lines of code, you can train and deploy high-accuracy machine learning and deep learning models on text, image, and tabular data. Currently, it provides support for only Linux and MacOS users.

AutoGluon features:

  • Create a quick prototype of your deep learning and ML solutions with just a few lines of code.
  • Use state-of-the-art techniques automatically without having expert knowledge.
  • You can perform data preprocessing, architecture search, model selection/ensembling, and hyperparameter tuning automatically.
  • AutoGluon is totally customizable for your use case.

Installation:

It requires you to have Python 3.6, 3.7, or 3.8. Currently, it supports only Linux and MacOS. Depending on your system, you can either download the CPU version or GPU version.

Using pip:

For MacOS:

  • Pip install for CPU
python3 -m pip install -U pip
python3 -m pip install -U setuptools wheel
python3 -m pip install -U "mxnet<2.0.0"
python3 -m pip install autogluon
  • Pip install for GPU

Currently unavailable

For Linux:

  • Pip install for CPU
python3 -m pip install -U pip
python3 -m pip install -U setuptools wheel
python3 -m pip install -U "mxnet<2.0.0"
python3 -m pip install autogluon
  • Pip install for GPU
python3 -m pip install -U pip
python3 -m pip install -U setuptools wheel

# Here we assume CUDA 10.1 is installed.  You should change the number
# according to your own CUDA version (e.g. mxnet_cu100 for CUDA 10.0).
python3 -m pip install -U "mxnet_cu101<2.0.0"
python3 -m pip install autogluon
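
The comments in the GPU commands say to change the number in the package name to match your CUDA version. A tiny helper, sketched here as an assumption about the naming pattern shown above (mxnet_cu101 for CUDA 10.1, mxnet_cu100 for CUDA 10.0), makes that mapping explicit. The function name is hypothetical, and the helper only builds the string; it doesn’t check that a package with that name is actually published.

```python
def mxnet_requirement_for_cuda(cuda_version):
    """Build the pip requirement string for a CUDA version such as '10.1'."""
    digits = cuda_version.replace(".", "")  # '10.1' -> '101'
    return f"mxnet_cu{digits}<2.0.0"


print(mxnet_requirement_for_cuda("10.1"))
print(mxnet_requirement_for_cuda("10.0"))
```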

GluonNLP (2.3k GitHub stars)

It’s a framework that supports NLP processes such as loading text data, preprocessing text data, and training NLP models. It’s available on Linux and MacOS. You can also convert other NLP models to GluonNLP. A few examples of models you can convert include BERT, ALBERT, ELECTRA, MobileBERT, RoBERTa, XLM-R, BART, GPT-2, and T5.

GluonNLP features: 

  • Easy to use Text Processing Tools and Modular APIs
  • Pretrained Model Zoo
  • Write Models with NumPy-like APIs
  • Fast Inference via Apache TVM (incubating) (Experimental)
  • AWS Integration with SageMaker

Installation

Before you start the installation, make sure you have the MXNet 2 release on your system. If you don’t, you can install it from your terminal. Choose one of the following options:

# Install the version with CUDA 10.2
python3 -m pip install -U --pre "mxnet-cu102>=2.0.0a"

# Install the version with CUDA 11
python3 -m pip install -U --pre "mxnet-cu110>=2.0.0a"

# Install the cpu-only version
python3 -m pip install -U --pre "mxnet>=2.0.0a"

Now you can go ahead and install GluonNLP. Open your terminal in the root of a local clone of the GluonNLP repository and type:

python3 -m pip install -U -e .

You can also install all the extra requirements by typing: 

python3 -m pip install -U -e ."[extras]"

If you come across any issue while installing related to user permissions, please refer to this guide.

Kashgari (2.1k GitHub stars)

Kashgari is a powerful NLP transfer learning framework that you can use to build state-of-the-art models in 5 minutes for Named Entity Recognition (NER), part-of-speech tagging (POS), and text classification. It can be used by beginners, academics, and researchers. 

Kashgari features:

  • Easy to customize, well documented, and straightforward.
  • Kashgari allows you to use state-of-the-art models for your Natural Language Processing projects.
  • It allows you to build multi-label classification models, create custom models, and so much more.
  • Allows you to adjust your model’s hyperparameters, use custom optimizers and callbacks, create custom models, and others.
  • Kashgari has built-in pretrained models, which makes transfer learning very easy.
  • Kashgari is simple, fast, and scalable.
  • You can export your models and directly deploy them to the cloud using TensorFlow Serving.

Installation

Kashgari requires you to have Python 3.6+ installed on your system.

Using pip

  • For TensorFlow 2.x:
pip install 'kashgari>=2.0.0'
  • For TensorFlow 1.14+:
pip install 'kashgari>=1.0.0,<2.0.0'
  • For Keras:
pip install 'kashgari<1.0.0'

LexNLP (460 GitHub stars)

LexNLP, developed by LexPredict, is a Python library for working with real, unstructured legal text, including contracts, policies, procedures, and other types of material. It comes with classifiers for document and clause type, tools for building new clustering and classification methods, and hundreds of unit tests based on real legal documents.

Features:

  • It provides pre-trained models for segmentation, word embedding and topic models, classifiers for document and clause type.
  • Fact extraction.
  • Tools for building new clustering and classification methods.

Installation:

Requires Python 3.6.

pip install lexnlp

TensorFlow Text (770 GitHub stars)

TensorFlow Text provides a collection of text-related classes and ops ready to use with TensorFlow 2.0. The library can perform the preprocessing regularly required by text-based models and includes other features useful for sequence modeling not provided by core TensorFlow.

The benefit of using these ops in your text preprocessing is that they are done in the TensorFlow graph. You don’t need to worry about tokenization in training being different than the tokenization at inference, or managing preprocessing scripts.

TensorFlow Text features:

  • Provides a large toolkit for working with text
  • Allows integration with a large suite of TensorFlow tools to support projects from problem definition through training, evaluation, and launch
  • Reduces complexity at serving time and prevents training-serving skew

Installation:

Using Pip

Please note: when installing TF Text with pip, check which version of TensorFlow you are running, as you should specify the corresponding minor version of TF Text (e.g. for tensorflow==2.3.x use tensorflow_text==2.3.x).

pip install -U tensorflow-text==<version>
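
To avoid picking the wrong version by hand, you could derive the pin from the TensorFlow version string. The helper below is a minimal sketch of that idea; the function name is my own, and in a real environment you’d pass in `tensorflow.__version__` instead of a hard-coded string.

```python
def tf_text_requirement(tf_version):
    """Return a pip requirement pinning tensorflow-text to TensorFlow's minor release."""
    major, minor = tf_version.split(".")[:2]
    # '==2.3.*' matches any patch release within the 2.3 series.
    return f"tensorflow-text=={major}.{minor}.*"


print(tf_text_requirement("2.3.1"))
print(tf_text_requirement("2.5.0rc0"))
```

The resulting string can be passed straight to pip, quoted so the shell doesn’t expand the asterisk.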

Installing from source

Please note that TF Text needs to be built in the same environment as TensorFlow. Thus, if you manually build TF Text, it is highly recommended that you also build TensorFlow.

If building on MacOS, you must have coreutils installed. It is probably easiest to do with Homebrew.

  • Build and install TensorFlow.
  • Clone the TF Text repo: git clone https://github.com/tensorflow/text.git
  • Run the build script ./oss_scripts/run_build.sh to create a pip package. After this step, there should be a *.whl file in the current directory, with a file name similar to tensorflow_text-2.5.0rc0-cp38-cp38-linux_x86_64.whl.
  • Install the package to your environment: pip install ./tensorflow_text-*-*-*-os_platform.whl

Tutorials:

Text preprocessing

Text Classification

Text Generation

Snorkel (4.7k GitHub stars)

Snorkel is a data labeling tool that lets you label, build, and manage training data programmatically. The first component of a Snorkel pipeline is a set of labeling functions, which are designed to be weak heuristic functions that predict a label given unlabeled data.

Features:

  • It supports TensorFlow/Keras, PyTorch, Spark, Dask, and Scikit-Learn.
  • It provides APIs for labeling, analysis, preprocessing, slicing, mapping, utils, and classification.
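
To show what “weak heuristic functions that predict a label” means, here’s a sketch in plain Python. It deliberately does not use the Snorkel API: the three labeling functions, the label constants, and the simple majority vote are all assumptions made for illustration, and a real project would combine the votes with Snorkel’s own tooling instead.

```python
from collections import Counter

# Label values chosen for this sketch; ABSTAIN means "no opinion".
ABSTAIN, HAM, SPAM = -1, 0, 1


def lf_contains_link(text):
    return SPAM if "http" in text.lower() else ABSTAIN


def lf_mentions_prize(text):
    return SPAM if "prize" in text.lower() else ABSTAIN


def lf_short_message(text):
    return HAM if len(text.split()) <= 3 else ABSTAIN


LABELING_FUNCTIONS = [lf_contains_link, lf_mentions_prize, lf_short_message]


def majority_label(text):
    # Collect every non-abstaining vote and keep the most common one.
    votes = [lf(text) for lf in LABELING_FUNCTIONS]
    votes = [vote for vote in votes if vote != ABSTAIN]
    if not votes:
        return ABSTAIN
    return Counter(votes).most_common(1)[0][0]


texts = [
    "Claim your prize at http://example.com now",
    "see you soon",
    "The meeting moved to Thursday afternoon",
]
labels = [majority_label(text) for text in texts]
print(labels)
```

Each function is cheap and noisy on its own; the value comes from combining many of them over a large unlabeled dataset.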

Installation:

Snorkel requires Python 3.6 or later. 

Using pip (Recommended)

pip install snorkel

Using conda

conda install snorkel -c conda-forge

Please note: If you’re using Windows, it’s highly recommended to use Docker or the Linux subsystem.

TensorFlow Lingvo (2.3k GitHub stars)

Lingvo is a framework for building neural networks in TensorFlow, particularly sequence models. 

TensorFlow Lingvo features:

  • Lingvo supports natural language processing (NLP) tasks but it is also applicable to models used for tasks such as image segmentation and point cloud classification.
  • Lingvo can be used to train on production-scale datasets.
  • Lingvo provides additional support for synchronous and asynchronous distributed training.
  • Quantization support has been built directly into the Lingvo framework.

Installation:

Using pip:

pip3 install lingvo

Installing from sources:

Check if you’ve met the following prerequisites 

  • TensorFlow 2.5 installed on your system
  • C++ compiler (only g++ 7.3 is officially supported)
  • The bazel build system.

Refer to docker/dev.dockerfile for a set of working requirements.

Now, git clone the repository, then use bazel to build and run targets directly. The python -m module commands in the codelab need to be mapped onto bazel run commands.

Using docker:

Docker configurations are available for both situations. Instructions can be found in the comments on the top of each file.

lib.dockerfile has the Lingvo pip package preinstalled.

dev.dockerfile can be used to build Lingvo from sources.

spaCy (21k GitHub stars)

spaCy is a library for advanced Natural Language Processing in Python and Cython. It’s built on the very latest research and was designed from day one to be used in real products.

spaCy comes with pretrained pipelines and currently supports tokenization and training for 60+ languages. It features state-of-the-art speed and neural network models for tagging, parsing, named entity recognition, text classification, and more, multi-task learning with pre-trained transformers like BERT, as well as a production-ready training system and easy model packaging, deployment, and workflow management.

Features:

  • Support for custom models in PyTorch, TensorFlow, and other frameworks.
  • Support for 60+ languages.
  • Support for pre-trained word vectors and embeddings.
  • Easy model packaging, deployment, and workflow management.
  • Linguistically-motivated tokenization.

Installation:

It supports macOS / OS X, Linux, and Windows (Cygwin, MinGW, Visual Studio). You also need Python 3.6+ (64-bit only) installed on your system.

Using pip

Before you continue with the installation, make sure that your pip, setuptools, and wheel are up to date.

pip install -U pip setuptools wheel
pip install spacy

Using conda

conda install -c conda-forge spacy

Flair (11k GitHub stars)

Flair is a simple framework for state-of-the-art NLP. It allows you to use state-of-the-art models for your NLP tasks, such as Named Entity Recognition (NER), part-of-speech tagging (POS), sense disambiguation, and classification. It provides special support for biomedical data and also supports a rapidly growing number of languages.

Flair features:

  • It’s built entirely on PyTorch, so you can easily build and train your Flair models.
  • State-of-the-art NLP models that you can use for your text.
  • Allows you to combine different word and document embeddings with simple interfaces.

Installation:

It requires PyTorch 1.5+ and currently supports Python 3.6. Here is how to install it on Ubuntu 16.04:

pip install flair

Open-source MLOps tools for your NLP projects – comparison

Compared on: GitHub stars, Windows, Linux, MacOS, TensorFlow, PyTorch, other frameworks, data labelling, data preprocessing, model development, model deployment.

Tool              | GitHub stars
AdaptNLP          | 329
Flair             | 11k
spaCy             | 21k
TensorFlow Lingvo | 2.3k
Snorkel           | 4.7k
TensorFlow Text   | 770
LexNLP            | 460
Kashgari          | 2.1k
GluonNLP          | 2.3k
AutoGluon         | 3.5k

Best MLOps as a service tools for NLP projects

Neu.ro

The Neu.ro MLOps platform provides a complete solution for managing the infrastructure and processes you need for successful ML development at scale. It covers the complete MLOps lifecycle, which includes data collection, model development, model training, experiment tracking, deployment, and monitoring.

Setup

Installation

It’s advisable to create a new virtual environment first. It requires you to have Python 3.7 installed.

pip install -U neuromation

How to:

  • Sign up at neu.ro
  • Upload data either with webUI or CLI
  • Set up a development environment (allows you to use GPU)
  • Train a model or download a pretrained model
  • Run a Jupyter notebook

Check out this ML Cookbook to help you get started with an NLP project.

AutoNLP

AutoNLP provides an automatic way to train, evaluate, and deploy state-of-the-art NLP models for different tasks, seamlessly integrated with the Hugging Face ecosystem and deployed in a scalable environment. It automatically fine-tunes a working model for deployment based on the dataset that you provide.

Setup

Installation:

To use pip:

pip install -U autonlp

Please note: you need to install Git LFS to use the CLI.

How to:

  •  Sign in to your account
  •  Create a new model
  •  Upload your dataset
  •  Train your autonlp model
  •  Track model progress
  •  Make predictions
  •  Deploy your model

Check out the AutoNLP documentation for your specific use case.

neptune.ai

Neptune tracks machine learning experiments, stores your model’s metadata (logged metrics, performance charts, video, audio, text, records of data exploration), provides a model registry where you can version, store, and query your models anytime, and gives your team an effective way to collaborate. Neptune lets you customize the UI and manage users in an on-prem environment or in the cloud.

Setup

Installation

pip install neptune

How to log your project metadata:

  • Create a Neptune account
  • Create a new project in Neptune

In your code editor:

  • Initialize a run with your API token and log the model metadata you want to track.
  • Run your code, and your project on Neptune will be updated automatically!

Check out the Neptune docs to explore more and run your experiments risk-free!

Neptune dashboard | See in the app

DataRobot

DataRobot, which has now acquired Algorithmia, is a platform that automates the end-to-end process of building, deploying, and maintaining machine learning (ML) and artificial intelligence (AI) at scale. It’s a no-code app builder, and a platform where you can deploy, monitor, manage, and govern all your models in production, regardless of how they were created or when and where they were deployed.

Setup

Installation

  • It currently supports Python 2.7 and >=3.4:
pip3 install datarobot
  • With Python 3.6+:
pip3 install requests requests-toolbelt

How to create a new project:

  • Sign in to your account
  • Install dependencies
  • Load and Profile your data
  • Start modelling
  • Review and interpret model
  • Deploy model
  • Choose an application

Check this doc for a proper walkthrough of these steps.

AWS MLOps Framework

The AWS MLOps Framework helps you streamline and enforce architecture best practices for productionizing your machine learning models. It’s an extendable framework that provides a standard interface for managing ML pipelines for AWS ML services and third-party services. The solution template lets you upload your trained models, configure the orchestration of the pipeline, and monitor pipeline operations. It allows you to leverage a preconfigured ML pipeline and also automatically deploy a trained model with an inference endpoint.

How to set up a new project:

  • Sign in to your AWS account
  • Create a new SageMaker Studio
  • Create a new project
  • Select the MLOps architecture (development, evaluation, or deployment) you want.
  • Add data to an AWS S3 bucket
  • Create pipeline and training files.

Check out these docs on how to set up a new project. You can also check out this tutorial on how to create a simple project.

Azure Machine Learning MLOps

Azure MLOps allows you to experiment, develop, and deploy models into production with end-to-end lineage tracking. It allows you to create reproducible ML pipelines, reusable software environments, deploy models from anywhere, govern the ML lifecycle, and closely monitor models in production for any issues. It allows you to automate the end-to-end ML lifecycle with pipelines which lets you update models, test new models, and continuously deploy new ML Models.

Setup

Installation

You need to install the Azure CLI.

How to:

  • Sign in to Azure DevOps
  • Create a new project
  • Import the project repository
  • Set up the project environment
  • Create a pipeline
  • Train and deploy the model
  • Set up a continuous integration pipeline 

Check out this doc on how to go about these processes.

Vertex AI | Google Cloud AI Platform

Vertex AI is a machine learning platform where you can access all Google Cloud services in one place to deploy and maintain AI models. It brings together the Google Cloud services for building ML under one, unified UI and API. You can use Vertex AI to easily train and compare models using AutoML or your custom code, with all your models stored in one central model repository. 

Setup

Installation

You can either use the Google Cloud console and Cloud Shell, or install the Cloud SDK on your system.

How to create a new project (using Cloud Shell):

  • Sign in to your account
  • Create a new project (ensure billing is enabled for your account)
  • Activate Cloud Shell
  • Create a storage bucket
  • Train your model
  • Deploy to Google Cloud

Check out this doc for a walkthrough on how to follow these steps.

MLOps as a service tools for NLP projects – comparison

 
Tool        | Data collection and management | Data preparation and feature engineering | Model training and deployment | Model monitoring and experiment tracking | ML metadata store | Model registry & management
AutoNLP     | No  | No  | Yes | No  | No  | No
Azure MLOps | No  | No  | Yes | Yes | Yes | Yes
AWS MLOps   | No  | No  | Yes | Yes | No  | No
DataRobot   | No  | No  | Yes | Yes | No  | Yes
Neptune     | No  | No  | No  | Yes | Yes | Yes
Neu.ro      | Yes | No  | Yes | Yes | No  | No
Vertex AI   | No  | Yes | Yes | Yes | Yes | Yes
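
If you’d rather query the comparison above than scan it by eye, the same Yes/No values can be written down as data. This is just a sketch: the short column keys and the `tools_with` helper are names I made up for the example, while the True/False values mirror the comparison as given.

```python
# Short, hypothetical keys for the six comparison columns, in order.
COLUMNS = (
    "data_collection",
    "data_preparation",
    "training_deployment",
    "monitoring_tracking",
    "metadata_store",
    "model_registry",
)

# One tuple of flags per tool, in the same order as COLUMNS.
TOOLS = {
    "AutoNLP": (False, False, True, False, False, False),
    "Azure MLOps": (False, False, True, True, True, True),
    "AWS MLOps": (False, False, True, True, False, False),
    "DataRobot": (False, False, True, True, False, True),
    "Neptune": (False, False, False, True, True, True),
    "Neu.ro": (True, False, True, True, False, False),
    "Vertex AI": (False, True, True, True, True, True),
}


def tools_with(*required):
    """Return the tools that support every capability named in `required`."""
    indexes = [COLUMNS.index(name) for name in required]
    return [tool for tool, flags in TOOLS.items() if all(flags[i] for i in indexes)]


print(tools_with("metadata_store", "model_registry"))
print(tools_with("data_collection"))
```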

Conclusion

I’ve talked about why you need MLOps and how you can choose a good tool for your project. I also listed some NLP MLOps tools and highlighted some cool features about them. Not sure which tool to try out? Check the comparison table I made to see which best fits your project. I hope you try out some of the listed tools and do let me know what you think. Thanks for reading!
