Model serving tools can solve many data engineering and DevOps concerns in machine learning. They offer functionality that makes it easier to manage your models.
You can use them throughout the entire lifecycle of your ML project: from building a trained model, through deployment and monitoring, to making models easily accessible in production. They automate and optimize your work, help prevent errors, make it easy to collaborate with others, and track changes in real time.
Let’s take a look at the best tools that can help you in model serving!
1. Neptune

Neptune is a lightweight experiment management and collaboration tool. It is very flexible, integrates with many other frameworks, and, thanks to its stable user interface, scales well.
It’s a robust software that can store, retrieve, and analyze a large amount of data. Neptune has all the tools for efficient team collaboration and project supervision.
- Fast and intuitive UI with a lot of capabilities to organize runs in groups, save custom dashboard views and share them with the team
- Provides user and organization management with different organization, project, and user roles
- You can use a hosted app to avoid all the hassle of maintaining yet another tool (or have it deployed on your on-prem infrastructure)
- Your team can track experiments which are executed in scripts (Python, R, other), notebooks (local, Google Colab, AWS SageMaker) and do that on any infrastructure (cloud, laptop, cluster)
- Extensive experiment tracking and visualization capabilities (resource consumption, scrolling through lists of images).
2. BentoML

BentoML is a framework for serving, managing, and deploying machine learning models. Its aim is to bridge the gap between Data Science and DevOps, and enable teams to deliver prediction services in a fast, repeatable, and scalable way.
It’s a tool that helps to build and ship prediction services, instead of uploading pickled model files or Protobuf files to a server.
Here’s a summary of BentoML:
- Package models trained with any ML frameworks and reproduce them for model serving in production
- Deploy anywhere for online API serving or offline batch serving
- High-Performance API model server with adaptive micro-batching support
- Works as a central hub for managing models and deployment process via Web UI and APIs
- Modular and flexible design allows you to adapt the tool to your infrastructure
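Adaptive micro-batching groups incoming requests so the model runs once per batch instead of once per request, which is where most of the throughput gain comes from. The sketch below is not BentoML's API; it's a minimal pure-Python illustration of the idea, and the `micro_batch` helper and toy `score_batch` model are invented for this example:

```python
import time
from queue import Queue, Empty

def micro_batch(queue, handler, max_batch=8, max_wait=0.01):
    """Drain up to max_batch requests, waiting at most max_wait
    seconds for more to arrive, then run them as one batch."""
    batch = [queue.get()]                # block for the first request
    deadline = time.monotonic() + max_wait
    while len(batch) < max_batch:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(queue.get(timeout=remaining))
        except Empty:
            break
    return handler(batch)                # one model call for the whole batch

# Toy "model" that scores a whole batch of inputs in one call
def score_batch(xs):
    return [x * 2 for x in xs]

q = Queue()
for x in [1, 2, 3]:
    q.put(x)
print(micro_batch(q, score_batch))       # → [2, 4, 6]
```

The trade-off is the `max_wait` knob: a longer wait fills bigger batches but adds latency to the first request in each batch.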
3. Cortex

Cortex is an open-source platform for deploying, managing, and scaling machine learning models. It's a multi-framework tool that lets you deploy all types of models.
Cortex is built on top of Kubernetes to support large-scale machine learning workloads.
Cortex – summary:
- Automatically scale APIs to handle production workloads
- Run inference on any AWS instance type
- Deploy multiple models in a single API and update deployed APIs without downtime
- Monitor API performance and prediction results
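Request-based autoscaling of the kind described above boils down to a simple calculation: run enough replicas that each handles a target number of concurrent requests, within configured bounds. This is a toy illustration of the idea, not Cortex's implementation; all names are hypothetical:

```python
import math

def desired_replicas(in_flight, target_per_replica,
                     min_replicas=1, max_replicas=10):
    """Pick enough replicas so each handles roughly
    target_per_replica concurrent requests, within bounds."""
    if in_flight <= 0:
        want = min_replicas
    else:
        want = math.ceil(in_flight / target_per_replica)
    return max(min_replicas, min(max_replicas, want))

print(desired_replicas(in_flight=37, target_per_replica=8))  # → 5
```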
4. TensorFlow Serving

TensorFlow Serving is a flexible, high-performance serving system for machine learning models, designed for production environments. It deals with the inference aspect of machine learning.
It takes models after training and manages their lifetimes to provide you with versioned access via a high-performance, reference-counted lookup table.
Here are some of the most important features:
- Can serve multiple models, or multiple versions of the same model at the same time
- Exposes both gRPC and HTTP inference endpoints
- Allows deployment of new model versions without changing your code
- Lets you flexibly test experimental models
- Its efficient, low-overhead implementation adds minimal latency to inference time
- Supports many servables: Tensorflow models, embeddings, vocabularies, feature transformations, and non-Tensorflow-based machine learning models
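The versioned lookup table mentioned above can be sketched in a few lines: several versions of a model stay loaded at once, and clients can pin a version or take the newest. This toy `ModelRegistry` illustrates the concept only; it is not TensorFlow Serving's actual implementation:

```python
class ModelRegistry:
    """Toy versioned model lookup table: multiple versions of a model
    can be live at once, and the default resolves to the newest."""
    def __init__(self):
        self._versions = {}            # version number -> model callable

    def load(self, version, model):
        self._versions[version] = model

    def unload(self, version):
        self._versions.pop(version, None)

    def get(self, version=None):
        if version is None:            # default to the newest version
            version = max(self._versions)
        return self._versions[version]

reg = ModelRegistry()
reg.load(1, lambda x: x + 1)           # v1 of the model
reg.load(2, lambda x: x * 10)          # v2 served alongside v1
print(reg.get(1)(5), reg.get()(5))     # → 6 50
```

Serving both versions at once is what makes gradual rollouts possible: traffic can move to v2 while v1 stays available for rollback.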
5. TorchServe

TorchServe is a flexible and easy-to-use tool for serving PyTorch models. It's an open-source framework that makes it easy to deploy trained PyTorch models performantly at scale, without having to write custom code. TorchServe delivers lightweight serving with low latency, so you can deploy your models for high-performance inference.
TorchServe is experimental and may still change, but it already offers some interesting functionality:
- Multi-model serving
- Model versioning for A/B testing
- Metrics for monitoring
- RESTful endpoints for application integration
- Supports any machine learning environment, including Amazon SageMaker, Kubernetes, Amazon EKS, and Amazon EC2
- TorchServe can be used for many types of inference in production settings
- Provides an easy-to-use command-line interface
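Model versioning for A/B testing usually means splitting live traffic between two deployed versions. Below is a minimal, hypothetical sketch of such a sticky traffic split, not TorchServe code; the `pick_version` helper and the weights are made up for illustration:

```python
import zlib

def pick_version(request_id, weights=(("v1", 90), ("v2", 10))):
    """Sticky A/B split: hash the request (or user) id into [0, 100)
    and map it onto the configured traffic weights."""
    bucket = zlib.crc32(request_id.encode()) % 100
    edge = 0
    for version, weight in weights:
        edge += weight
        if bucket < edge:
            return version
    return version                      # fallback: last listed version

# The same id always lands on the same version, so users see
# consistent behavior during the experiment.
assert pick_version("user-42") == pick_version("user-42")
```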
6. KFServing

KFServing provides a Kubernetes Custom Resource Definition (CRD) for serving machine learning models on arbitrary frameworks. It aims to solve production model serving use cases by providing performant, high-abstraction interfaces for common ML frameworks like TensorFlow, XGBoost, scikit-learn, PyTorch, and ONNX.
The tool provides a serverless machine learning inference solution that allows a consistent and simple interface to deploy your models.
Main features of KFServing:
- Provides a simple, pluggable, and complete story for your production ML inference server by providing prediction, pre-processing, post-processing and explainability
- Customizable InferenceService to add your resource requests for CPU, GPU, TPU and memory requests and limits
- Batching individual model inference requests
- Traffic management
- Scale to and from Zero
- Revision management
- Request/Response logging
- Scalable Multi Model Serving
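The prediction, pre-processing, and post-processing story mentioned in the first bullet boils down to a small per-request pipeline. Here is a toy sketch of that flow; the function names and the demo handlers are invented for illustration, not KFServing's API:

```python
def serve(raw, preprocess, predict, postprocess):
    """One request through an inference service: transform the raw
    payload, run the model, then shape the model output for the client."""
    features = preprocess(raw)
    outputs = predict(features)
    return postprocess(outputs)

result = serve(
    raw={"text": " Hello "},
    preprocess=lambda r: r["text"].strip().lower(),   # clean the input
    predict=lambda t: {"length": len(t)},             # toy "model"
    postprocess=lambda o: {"prediction": o["length"], "model": "demo"},
)
print(result)   # → {'prediction': 5, 'model': 'demo'}
```

Keeping the three stages separate is what lets a serving platform swap, scale, or explain each stage independently.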
7. Multi Model Server

Multi Model Server (MMS) is a flexible and easy-to-use tool for serving deep learning models trained with any ML/DL framework. It can be used for many types of inference in production settings. It provides an easy-to-use command-line interface and utilizes REST-based APIs to handle prediction requests.
You can use the MMS Server CLI, or the pre-configured Docker images, to start a service that sets up HTTP endpoints to handle model inference requests.
- Advanced configurations allow deep customization of MMS's behavior
- Ability to develop custom inference services
- Housekeeping unit tests for MMS
- JMeter to run MMS through the paces and collect benchmark data
- Multi model server benchmarking
- Model serving with Amazon Elastic Inference
- The ONNX model export feature supports models from different deep learning frameworks
8. Triton Inference Server

Triton Inference Server provides an optimized cloud and edge inferencing solution. It's optimized for both CPUs and GPUs. Triton supports HTTP/REST and gRPC protocols that allow remote clients to request inference for any model managed by the server.
For edge deployments, Triton is available as a shared library with a C API that allows the full functionality of Triton to be included directly in an application.
Key features of Triton:
- Supports multiple deep-learning frameworks (TensorRT, TensorFlow GraphDef, TensorFlow SavedModel, ONNX, and PyTorch TorchScript)
- Simultaneous model execution on the same GPU or on multiple GPUs
- Dynamic batching
- Extensible backends
- Supports model ensemble
- Metrics in Prometheus data format indicating GPU utilization, server throughput, and server latency
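A model ensemble chains several models (plus pre- and post-processing steps) so that a single request runs the whole pipeline server-side, without round trips to the client. The sketch below is a toy illustration of that scheduling idea, not Triton's ensemble configuration format:

```python
def ensemble(steps):
    """Model ensemble as a pipeline: the output of each step feeds
    the next, so one request runs preprocess -> model -> postprocess."""
    def run(x):
        for step in steps:
            x = step(x)
        return x
    return run

pipeline = ensemble([
    lambda img: [p / 255 for p in img],        # normalize pixel values
    lambda xs: sum(xs) / len(xs),              # toy "model": mean score
    lambda score: "cat" if score > 0.5 else "dog",   # decode the label
])
print(pipeline([200, 220, 240]))   # → cat
```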
9. ForestFlow

ForestFlow is an LF AI Foundation incubation project licensed under the Apache 2.0 license.
It is a scalable, policy-based, cloud-native machine learning model server for easily deploying and managing ML models.
It gives data scientists a simple way to deploy models to a production system with minimal friction, accelerating the path from development to production.
Here are ForestFlow's main features:
- Can be run as a single instance (laptop or server) or deployed as a cluster of nodes that work together and automatically manage and distribute work.
- Offers native Kubernetes integration for easily deploying on Kubernetes clusters with little configuration
- Allows for model deployment in Shadow Mode
- Automatically scales down (dehydrates) models and resources when not in use, and automatically re-hydrates models back into memory to keep things efficient
- Lets you deploy models for multiple use cases and choose between different routing policies to direct inference traffic between the model variants serving each use case
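Shadow Mode means a candidate model sees real production traffic, but its answers are only logged, never returned to callers. A minimal, hypothetical sketch of the routing logic (all names are invented for illustration):

```python
def route(request, live_model, shadow_model, shadow_log):
    """Shadow deployment: score every request with both models, but
    return only the live model's answer; the shadow's answer is
    logged for offline comparison."""
    shadow_log.append((request, shadow_model(request)))
    return live_model(request)

log = []
answer = route(
    request=3,
    live_model=lambda x: x + 1,      # current production model
    shadow_model=lambda x: x * 2,    # candidate being evaluated
    shadow_log=log,
)
print(answer, log)   # → 4 [(3, 6)]
```

Because the shadow's output never reaches users, a bad candidate can be evaluated on live traffic at zero user-facing risk.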
10. DeepDetect

DeepDetect is a deep learning API and server written in C++11, along with a pure web platform for training and managing models.
DeepDetect aims to make state-of-the-art deep learning easy to work with and integrate into existing applications. It supports the backend machine learning libraries Caffe, Caffe2, TensorFlow, XGBoost, Dlib, and NCNN.
DeepDetect’s main features:
- Ready for applications of image tagging, object detection, segmentation, OCR, audio, video, and text classification, and CSV for tabular data and time series
- Web UI for training and managing models
- Fast training thanks to over 25 pre-trained models
- Fast Server written in pure C++, a single codebase for Cloud, Desktop and Embedded
- Neural network templates for the most effective architectures for GPU, CPU and Embedded devices
- Comes with ready-to-use models for a range of tasks, from object detection to OCR and sentiment analysis
11. Seldon Core
Seldon Core is an open-source platform with a framework that makes it easier and faster to deploy your machine learning models and experiments at scale on Kubernetes.
It’s a cloud-agnostic, secure, reliable and robust system maintained through a consistent security and updates policy.
- An easy way to containerize ML models using pre-packaged inference servers, custom servers, or language wrappers
- Powerful and rich inference graphs made out of predictors, transformers, routers, combiners, and more
- Metadata provenance to ensure each model can be traced back to its respective training system, data, and metrics
- Advanced and customizable metrics with integration to Prometheus and Grafana.
- Full auditability through model input-output request (logging integration with Elasticsearch)
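An inference graph with a combiner node fans a request out to several predictors and merges their outputs into one response. The sketch below is a toy illustration of the idea, not Seldon Core's API; the combiner here is a simple majority vote:

```python
def combiner(predictors, combine):
    """Inference-graph combiner: send the request to every predictor
    and merge their outputs into a single response."""
    def run(x):
        return combine([p(x) for p in predictors])
    return run

graph = combiner(
    predictors=[                       # three toy models voting on a label
        lambda x: "cat",
        lambda x: "cat",
        lambda x: "dog",
    ],
    combine=lambda votes: max(set(votes), key=votes.count),  # majority vote
)
print(graph("image"))   # → cat
```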
To wrap it up
There are plenty of tools for machine learning model serving to choose from. Before you go for your favorite, make sure it meets all your needs. Although similar, every tool offers different functionalities that may not be suitable for every ML practitioner.
Get started with Neptune in 5 minutes
If you are looking for an experiment tracking tool you may want to take a look at Neptune.
It takes literally 5 minutes to set up and as one of our happy users said:
“Within the first few tens of runs, I realized how complete the tracking was – not just one or two numbers, but also the exact state of the code, the best-quality model snapshot stored to the cloud, the ability to quickly add notes on a particular experiment. My old methods were such a mess by comparison.” – Edward Dixon, Data Scientist @intel
To get started, follow these 4 simple steps.
Install the client library.
pip install neptune-client
Connect to the tool by adding a snippet to your training code.
import neptune

neptune.init(...)  # credentials
neptune.create_experiment()  # start logger
Specify what you want to log:
neptune.log_metric('accuracy', 0.92)

for prediction_image in worst_predictions:
    neptune.log_image('worst predictions', prediction_image)
Run your experiment as you normally would.
And that’s it!
Your experiment is logged to a central experiment database and displayed in the experiment dashboard, where you can search, compare, and drill down to whatever information you need.