Best Tools to Do ML Model Serving
Tools for model deployment and model serving address many of the concerns that data engineers and DevOps teams run into. They offer functionality that makes your models easier to manage.
You can use them throughout the lifecycle of your ML project, from packaging a trained model to deploying it, monitoring it, and making it accessible in production. They automate and optimize your work, help reduce errors, make it easier to collaborate with others, and let you track changes in real time.
Let’s take a look at the best tools that can help you in model serving!
BentoML standardizes model packaging and provides a simple way for users to deploy prediction services in a wide range of deployment environments. The company’s open-source framework aims to bridge the gap between Data Science and DevOps, enabling teams to deliver prediction services in a fast, repeatable, and scalable way.
Here’s a summary of BentoML:
- Standardized “Bento” format packages models, dependencies, and code
- Manages dependencies and packages for all major ML frameworks
- Deployable in any cloud environment with BentoCtl
- Online API serving via REST/gRPC or offline batch serving
- Automatically generates and provisions Docker images for deployment
- High-performance API model server with adaptive micro-batching support
- Native Python support which scales inference workers separately from business logic
- Works as a central hub for managing models and deployment process via Web UI and APIs
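The micro-batching idea from the list above can be illustrated with a minimal sketch. It uses only the Python standard library and is not BentoML's implementation or API: the `MicroBatcher` class, its `adapt` rule (halve the batch size when latency exceeds a target, otherwise grow it by one), and the `double_all` stand-in model are all assumptions made for illustration.

```python
from typing import Callable, List, Sequence


class MicroBatcher:
    """Groups individual requests into batches before calling the model.

    Illustrative only: this is not how BentoML implements batching.
    """

    def __init__(self, predict_batch: Callable[[List[float]], List[float]],
                 max_batch_size: int = 4) -> None:
        self.predict_batch = predict_batch
        self.max_batch_size = max_batch_size
        self.calls = 0  # number of batched model invocations so far

    def run(self, requests: Sequence[float]) -> List[float]:
        """Serve all requests, calling the model once per batch."""
        results: List[float] = []
        for start in range(0, len(requests), self.max_batch_size):
            batch = list(requests[start:start + self.max_batch_size])
            results.extend(self.predict_batch(batch))
            self.calls += 1
        return results

    def adapt(self, observed_ms: float, target_ms: float) -> None:
        """Hypothetical adaptation rule: shrink when too slow, else grow."""
        if observed_ms > target_ms:
            self.max_batch_size = max(1, self.max_batch_size // 2)
        else:
            self.max_batch_size += 1


def double_all(batch: List[float]) -> List[float]:
    """Stand-in for a model that is cheaper to call on a whole batch."""
    return [2 * x for x in batch]


batcher = MicroBatcher(double_all, max_batch_size=4)
outputs = batcher.run(list(range(10)))  # ten requests, three model calls
```

The point of the design is that the model is invoked once per batch rather than once per request, while the batch size is tuned against a latency budget.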
Cortex is an open-source platform for deploying, managing, and scaling machine learning models. It's a multi-framework tool that lets you deploy all types of models.
Cortex is built on top of Kubernetes to support large-scale machine learning workloads.
Cortex – summary:
- Automatically scale APIs to handle production workloads
- Run inference on any AWS instance type
- Deploy multiple models in a single API and update deployed APIs without downtime
- Monitor API performance and prediction results
TensorFlow Serving is a flexible system for machine learning models, designed for production environments. It deals with the inference aspect of machine learning.
It takes models after training and manages their lifetimes to provide you with versioned access via a high-performance, reference-counted lookup table.
Here are some of the most important features:
- Can serve multiple models, or multiple versions of the same model at the same time
- Exposes both gRPC and HTTP inference endpoints
- Allows deployment of new model versions without changing your code
- Lets you flexibly test experimental models
- Its efficient, low-overhead implementation adds minimal latency to inference time
- Supports many servables: TensorFlow models, embeddings, vocabularies, feature transformations, and non-TensorFlow-based machine learning models
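To make the idea of versioned access through a reference-counted lookup table more concrete, here is a small conceptual sketch in plain Python. It is not TensorFlow Serving's code or API: the `VersionedModelTable` class and its method names are hypothetical, and the models are plain strings. It only shows why reference counting lets an old version be retired without interrupting requests that are still using it.

```python
from typing import Any, Dict, List, Optional, Set, Tuple


class VersionedModelTable:
    """Toy lookup table: version -> model, with per-version reference counts."""

    def __init__(self) -> None:
        self._models: Dict[int, Any] = {}
        self._refs: Dict[int, int] = {}   # in-flight requests per version
        self._retired: Set[int] = set()   # versions waiting to be unloaded

    def load(self, version: int, model: Any) -> None:
        self._models[version] = model
        self._refs[version] = 0

    def acquire(self, version: Optional[int] = None) -> Tuple[int, Any]:
        """Return (version, model); defaults to the newest live version."""
        live = [v for v in self._models if v not in self._retired]
        if version is None:
            if not live:
                raise LookupError("no model version is loaded")
            version = max(live)
        elif version not in live:
            raise LookupError(f"version {version} is not available")
        self._refs[version] += 1
        return version, self._models[version]

    def release(self, version: int) -> None:
        self._refs[version] -= 1
        self._maybe_unload(version)

    def retire(self, version: int) -> None:
        """Stop handing out this version; unload once nobody uses it."""
        self._retired.add(version)
        self._maybe_unload(version)

    def _maybe_unload(self, version: int) -> None:
        if version in self._retired and self._refs[version] == 0:
            del self._models[version]
            del self._refs[version]
            self._retired.discard(version)

    def loaded_versions(self) -> List[int]:
        return sorted(self._models)


table = VersionedModelTable()
table.load(1, "model-v1")
table.load(2, "model-v2")
version, model = table.acquire()  # no version given: newest one is used
```

In this sketch, a retired version stays in memory until the last request holding it calls `release`, which is the behavior reference counting is meant to provide.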
TorchServe is a flexible and easy-to-use tool for serving PyTorch models. It's an open-source framework that makes it easy to deploy trained PyTorch models performantly at scale without having to write custom code. TorchServe delivers lightweight serving with low latency, so you can deploy your models for high-performance inference.
TorchServe is experimental and may still change, but it already offers some useful functionality:
- Multi-model serving
- Model versioning for A/B testing
- Metrics for monitoring
- RESTful endpoints for application integration
- Supports any machine learning environment, including Amazon SageMaker, Kubernetes, Amazon EKS, and Amazon EC2
- TorchServe can be used for many types of inference in production settings
- Provides an easy-to-use command-line interface
KFServing provides a Kubernetes Custom Resource Definition (CRD) for serving machine learning models on arbitrary frameworks. It aims to solve production model serving use cases by providing performant, high-abstraction interfaces for common ML frameworks like TensorFlow, XGBoost, scikit-learn, PyTorch, and ONNX.
The tool provides a serverless machine learning inference solution with a consistent, simple interface for deploying your models.
Main features of KFServing:
- Provides a simple, pluggable, and complete story for your production ML inference server by providing prediction, pre-processing, post-processing and explainability
- Customizable InferenceService that lets you set CPU, GPU, TPU, and memory requests and limits
- Batching individual model inference requests
- Traffic management
- Scale to and from Zero
- Revision management
- Request/Response logging
- Scalable Multi Model Serving
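As a rough illustration of the traffic management item above, the following sketch splits requests between a default and a canary revision by percentage. It is a conceptual example in plain Python, not KFServing's routing logic: the function names, the revision labels, and the hash-based bucketing are assumptions chosen so that the same request ID always lands on the same revision.

```python
import hashlib
from typing import Dict, Iterable


def pick_revision(request_id: str, canary_percent: int) -> str:
    """Map a request ID to 'canary' or 'default' in a stable way."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # value in 0..99
    return "canary" if bucket < canary_percent else "default"


def split_counts(request_ids: Iterable[str], canary_percent: int) -> Dict[str, int]:
    """Count how many requests each revision would receive."""
    counts = {"default": 0, "canary": 0}
    for request_id in request_ids:
        counts[pick_revision(request_id, canary_percent)] += 1
    return counts


ids = [f"req-{i}" for i in range(1000)]
counts = split_counts(ids, canary_percent=10)  # roughly 10% go to the canary
```

Raising `canary_percent` step by step moves more traffic to the new revision, and setting it to 0 or 100 sends everything to one side.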
Multi Model Server
Multi Model Server (MMS) is a flexible and easy-to-use tool for serving deep learning models trained using any ML/DL framework. The tool can be used for many types of inference in production settings. It provides an easy-to-use command-line interface and uses REST-based APIs to handle prediction requests.
You can use the MMS Server CLI, or the pre-configured Docker images, to start a service that sets up HTTP endpoints to handle model inference requests.
- Advanced configuration options let you deeply customize MMS's behavior
- Ability to develop custom inference services
- Housekeeping unit tests for MMS
- JMeter test plans for putting MMS through its paces and collecting benchmark data
- Multi model server benchmarking
- Model serving with Amazon Elastic Inference
- ONNX model export supports models from different deep learning frameworks
Triton Inference Server
Triton Inference Server provides an optimized cloud and edge inferencing solution. It's optimized for both CPUs and GPUs. Triton supports HTTP/REST and gRPC protocols that allow remote clients to request inference for any model managed by the server.
For edge deployments, Triton is available as a shared library with a C API that allows the full functionality of Triton to be included directly in an application.
Key features of Triton:
- Supports multiple deep-learning frameworks (TensorRT, TensorFlow GraphDef, TensorFlow SavedModel, ONNX, and PyTorch TorchScript)
- Simultaneous model execution on the same GPU or on multiple GPUs
- Dynamic batching
- Extensible backends
- Supports model ensemble
- Metrics in Prometheus data format indicating GPU utilization, server throughput, and server latency
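To show what a remote inference request over HTTP/REST might look like, here is a hedged sketch that only builds the URL and JSON body in the style of the KServe v2 inference protocol that Triton implements. Nothing is sent over the network. The host `localhost:8000`, the model name `my_model`, and the input name `INPUT0` are placeholders, and the `FP32` datatype is an assumption about the model's input.

```python
import json
from typing import List, Tuple


def build_infer_request(host: str, model_name: str, input_name: str,
                        rows: List[List[float]]) -> Tuple[str, str]:
    """Return (url, json_body) for a v2-style inference request."""
    if not rows or any(len(row) != len(rows[0]) for row in rows):
        raise ValueError("rows must be a non-empty rectangular batch")
    url = f"http://{host}/v2/models/{model_name}/infer"
    body = {
        "inputs": [
            {
                "name": input_name,                     # must match the model config
                "shape": [len(rows), len(rows[0])],     # batch size x features
                "datatype": "FP32",                     # assumed input type
                "data": [value for row in rows for value in row],  # row-major
            }
        ]
    }
    return url, json.dumps(body)


url, payload = build_infer_request(
    "localhost:8000", "my_model", "INPUT0", [[1.0, 2.0], [3.0, 4.0]]
)
```

The resulting `payload` could then be sent as the body of a POST request to `url` with any HTTP client.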
ForestFlow is an LF AI Foundation incubation project licensed under the Apache 2.0 license.
It is a scalable policy-based cloud-native machine learning model server for easily deploying and managing ML models.
It gives data scientists a simple way to deploy models to a production system with minimal friction, shortening the path from development to production value.
Here are ForestFlow's main features:
- Can be run as a single instance (laptop or server) or deployed as a cluster of nodes that work together and automatically manage and distribute work.
- Offers native Kubernetes integration for easily deploying on Kubernetes clusters with little configuration
- Allows for model deployment in Shadow Mode
- Automatically scales down (dehydrates) models and resources when not in use and automatically re-hydrates models back into memory to stay efficient
- Lets you deploy models for multiple use cases and choose between different routing policies to direct inference traffic between the model variants serving each use case
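Shadow Mode, mentioned in the list above, generally means that a candidate model receives a copy of live traffic while only the primary model's answer is returned to the caller. The sketch below illustrates that idea in plain Python. It is not ForestFlow's API: `serve_with_shadow`, the two threshold "models", and the log format are hypothetical.

```python
from typing import Any, Callable, Dict, List


def serve_with_shadow(request: Any,
                      primary: Callable[[Any], Any],
                      shadow: Callable[[Any], Any],
                      log: List[Dict[str, Any]]) -> Any:
    """Answer with the primary model; run the shadow model only for comparison."""
    response = primary(request)
    try:
        shadow_response = shadow(request)
        log.append({
            "request": request,
            "primary": response,
            "shadow": shadow_response,
            "agree": response == shadow_response,
        })
    except Exception as exc:
        # A failing shadow model must never affect the live response.
        log.append({"request": request, "primary": response,
                    "shadow_error": repr(exc)})
    return response


def primary_model(score: float) -> bool:
    return score >= 0.5


def shadow_model(score: float) -> bool:
    return score >= 0.7


comparison_log: List[Dict[str, Any]] = []
answers = [serve_with_shadow(x, primary_model, shadow_model, comparison_log)
           for x in [0.2, 0.6, 0.9]]
```

Reviewing `comparison_log` afterwards shows where the candidate disagrees with the model currently in production, without exposing users to the candidate's output.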
DeepDetect is a deep learning API and server written in C++11, along with a pure Web Platform for training and managing models.
DeepDetect aims to make state-of-the-art deep learning easy to work with and integrate into existing applications. It supports the backend machine learning libraries Caffe, Caffe2, TensorFlow, XGBoost, Dlib, and NCNN.
DeepDetect’s main features:
- Ready for applications of image tagging, object detection, segmentation, OCR, Audio, Video, Text classification, CSV for tabular data and time-series
- Web UI for training and managing models
- Fast training thanks to over 25 pre-trained models
- Fast Server written in pure C++, a single codebase for Cloud, Desktop and Embedded
- Neural network templates for the most effective architectures for GPU, CPU and Embedded devices
- Comes with ready-to-use models for a range of tasks, from object detection to OCR and sentiment analysis
Seldon Core is an open-source platform with a framework that makes it easier and faster to deploy your machine learning models and experiments at scale on Kubernetes.
It’s a cloud-agnostic, secure, reliable and robust system maintained through a consistent security and updates policy.
- Easy way to containerize ML models using its pre-packaged inference servers, custom servers, or language wrappers
- Powerful and rich inference graphs made out of predictors, transformers, routers, combiners, and more
- Metadata provenance to ensure each model can be traced back to its respective training system, data, and metrics
- Advanced and customizable metrics with integration to Prometheus and Grafana
- Full auditability through model input-output request logging (integration with Elasticsearch)
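The inference graph idea from the list above (transformers, routers, and combiners wired around models) can be sketched with ordinary function composition. This is a conceptual illustration in plain Python, not Seldon Core's graph specification or SDK: the helper names `transform_then`, `route`, and `combine`, and the two toy models, are assumptions.

```python
from statistics import mean
from typing import Callable, List, Sequence

Node = Callable[[float], float]  # every graph node maps an input to an output


def transform_then(transformer: Node, child: Node) -> Node:
    """Apply a pre-processing step, then pass the result to the child node."""
    return lambda x: child(transformer(x))


def route(router: Callable[[float], int], children: Sequence[Node]) -> Node:
    """Send the input to exactly one child, chosen by the router."""
    return lambda x: children[router(x)](x)


def combine(children: Sequence[Node],
            combiner: Callable[[List[float]], float]) -> Node:
    """Send the input to all children and merge their outputs."""
    return lambda x: combiner([child(x) for child in children])


def model_a(x: float) -> float:
    return x + 1.0


def model_b(x: float) -> float:
    return x * 10.0


# Halve the input, then route small values to an averaged ensemble
# and larger values to model_b alone.
graph = transform_then(
    lambda x: x / 2.0,
    route(lambda x: 0 if x < 1.0 else 1,
          [combine([model_a, model_b], mean), model_b]),
)

small = graph(1.0)  # 0.5 -> ensemble -> mean of 1.5 and 5.0
large = graph(4.0)  # 2.0 -> model_b
```

Because each helper returns another node, graphs of arbitrary depth can be built from the same three building blocks.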
DeepSparse is an inference runtime offering GPU-class performance on CPUs and APIs to integrate ML into your application. It uses sparsification, a powerful technique for optimizing models for inference, reducing the compute needed with a limited accuracy tradeoff.
DeepSparse is designed to take advantage of model sparsity. It lets you deploy models with the flexibility and scalability of software on commodity CPUs while aiming for the performance of hardware accelerators, which helps you standardize operations and reduce infrastructure costs.
Key features of DeepSparse:
- Use deepsparse.server to start deploying models immediately
- Provides NLP and computer vision pipelines similar to Hugging Face Pipelines
- Offers cost-effective deployment because it does not need GPUs to reach high performance
- Can be deployed on-premise, in the cloud, or on the edge
- Optimizes model input by bucketing sequences, which leads to faster inference
- Can change the batch size of inputs based on the current workload, which improves resource utilization
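Sequence bucketing, mentioned in the list above, usually means padding each input to the smallest of a few fixed lengths instead of always padding to the maximum. The following sketch shows that idea in plain Python. It does not use the DeepSparse API: `pad_to_bucket`, the bucket sizes, the padding ID, and the choice to truncate inputs longer than the largest bucket are all assumptions.

```python
from bisect import bisect_left
from typing import List, Sequence


def pad_to_bucket(tokens: Sequence[int], buckets: Sequence[int],
                  pad_id: int = 0) -> List[int]:
    """Pad tokens to the smallest bucket length that fits them."""
    ordered = sorted(buckets)
    # Index of the first bucket whose length is >= len(tokens).
    index = bisect_left(ordered, len(tokens))
    # Inputs longer than the largest bucket are truncated (sketch assumption).
    length = ordered[min(index, len(ordered) - 1)]
    clipped = list(tokens[:length])
    return clipped + [pad_id] * (length - len(clipped))


bucket_lengths = [4, 8, 16]
short = pad_to_bucket([1, 2, 3], bucket_lengths)          # padded to length 4
medium = pad_to_bucket(list(range(5)), bucket_lengths)    # padded to length 8
```

Short inputs then run through a short, cheap input shape, while only long inputs pay for the largest one.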
To wrap it up
There are plenty of tools for machine learning model serving to choose from. Before you go for your favorite, make sure it meets all your needs. Although the tools are similar, each offers different functionality, and not every tool will suit every ML practitioner.