Best Tools to Do ML Model Serving

Model serving tools address many of the concerns that data engineers and DevOps teams face, and they offer features that make your models easier to manage.

You can use them throughout the lifecycle of your ML project: from packaging a trained model, through deployment and monitoring, to making the model easily accessible in production. They automate and optimize your work, help you catch errors, make it easier to collaborate with others, and let you track changes in real time.

Let’s take a look at the best tools that can help you in model serving!

1. BentoML


BentoML is a framework for serving, managing, and deploying machine learning models. Its aim is to bridge the gap between data science and DevOps, and to enable teams to deliver prediction services in a fast, repeatable, and scalable way.

It helps you build and ship prediction services, rather than uploading pickled model files or Protobuf files to a server.

Here’s a summary of BentoML:

  • Package models trained with any ML framework and reproduce them for model serving in production
  • Deploy anywhere for online API serving or offline batch serving
  • High-Performance API model server with adaptive micro-batching support
  • Works as a central hub for managing models and the deployment process via a web UI and APIs
  • Modular and flexible design allows you to adapt the tool to your infrastructure
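
To make the micro-batching bullet more concrete, here is a minimal, framework-independent sketch in Python. It is not BentoML's implementation: the names `drain_batch`, `predict_batch`, `max_batch_size`, and `max_wait_s` are hypothetical, and the limits are fixed, whereas an adaptive batcher would tune them based on observed latency.

```python
import queue
import time


def drain_batch(requests, max_batch_size=4, max_wait_s=0.05):
    """Collect up to max_batch_size items, waiting at most max_wait_s in total."""
    batch = []
    deadline = time.monotonic() + max_wait_s
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(requests.get(timeout=remaining))
        except queue.Empty:
            break
    return batch


def predict_batch(inputs):
    # Stand-in for a real model: one call handles the whole batch.
    return [x * 2 for x in inputs]


pending = queue.Queue()
for value in range(6):
    pending.put(value)

first = drain_batch(pending)    # fills up to the batch size limit
second = drain_batch(pending)   # returns a partial batch once the wait expires
first_out = predict_batch(first)
second_out = predict_batch(second)
```

The trade-off sits in the two limits: a larger batch improves throughput, while a shorter wait keeps latency low for requests that arrive when traffic is light.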

2. Cortex

Cortex is an open-source platform for deploying, managing, and scaling machine learning models. It’s a multi-framework tool that lets you deploy all types of models.

Cortex is built on top of Kubernetes to support large-scale machine learning workloads.

Cortex – summary:

  • Automatically scale APIs to handle production workloads
  • Run inference on any AWS instance type
  • Deploy multiple models in a single API and update deployed APIs without downtime
  • Monitor API performance and prediction results

3. TensorFlow Serving

TensorFlow Serving is a flexible serving system for machine learning models, designed for production environments. It deals with the inference aspect of machine learning.

It takes models after training and manages their lifetimes to provide you with versioned access via a high-performance, reference-counted lookup table. 

Here are some of the most important features:

  • Can serve multiple models, or multiple versions of the same model at the same time
  • Exposes both gRPC and HTTP inference endpoints
  • Allows deployment of new model versions without changing your code
  • Lets you flexibly test experimental models
  • Its efficient, low-overhead implementation adds minimal latency to inference time 
  • Supports many servables: TensorFlow models, embeddings, vocabularies, feature transformations, and non-TensorFlow-based machine learning models
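
As an illustration of the HTTP endpoint and of versioned access, the sketch below builds a predict request using only the Python standard library. The URL pattern and the `instances` request body follow the TensorFlow Serving REST API as I understand it. The helper names, the model name `my_model`, and the host are hypothetical, and port 8501 is just the one commonly used in examples.

```python
import json


def predict_url(host, model_name, version=None, port=8501):
    """Build the REST predict URL, optionally pinned to one model version."""
    path = f"/v1/models/{model_name}"
    if version is not None:
        path += f"/versions/{version}"
    return f"http://{host}:{port}{path}:predict"


def predict_body(instances):
    """Serialize inputs in the row-oriented "instances" request format."""
    return json.dumps({"instances": instances})


latest_url = predict_url("localhost", "my_model")
pinned_url = predict_url("localhost", "my_model", version=2)
body = predict_body([[1.0, 2.0], [3.0, 4.0]])
```

Sending `body` as a POST request to `latest_url` would target whichever version the server currently treats as latest, which is why a new version can be rolled out without changing client code. Pinning a version is useful when you want to test an experimental model side by side.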

4. TorchServe

TorchServe is a flexible and easy-to-use tool for serving PyTorch models. It’s an open-source framework that makes it easy to deploy trained PyTorch models performantly at scale without having to write custom code. TorchServe delivers lightweight serving with low latency, so you can deploy your models for high-performance inference.

TorchServe is still experimental and may change, but it already offers some interesting functionality.

TorchServe – main features:

  • Multi-model serving
  • Model versioning for A/B testing
  • Metrics for monitoring
  • RESTful endpoints for application integration
  • Supports any machine learning environment, including Amazon SageMaker, Kubernetes, Amazon EKS, and Amazon EC2
  • TorchServe can be used for many types of inference in production settings
  • Provides an easy-to-use command-line interface
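
Model versioning makes A/B testing possible, but the traffic split itself usually lives in the client or a gateway. Below is a minimal sketch of one way to do it, assuming a hypothetical `pick_version` helper that hashes a user ID so each user consistently sees the same version. The `/predictions/{model_name}/{version}` path follows the TorchServe inference API as I understand it. The model name and version labels are made up.

```python
import hashlib


def pick_version(user_id, versions, weights):
    """Map a user to a model version; the choice is stable across requests."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    # Turn the first 8 bytes of the hash into a number in [0, 1).
    point = int.from_bytes(digest[:8], "big") / 2**64
    cumulative = 0.0
    for version, weight in zip(versions, weights):
        cumulative += weight
        if point < cumulative:
            return version
    return versions[-1]


def prediction_path(model_name, version):
    """Build the inference path for one specific model version."""
    return f"/predictions/{model_name}/{version}"


versions = ["1.0", "2.0"]
chosen = pick_version("alice", versions, [0.9, 0.1])
path = prediction_path("resnet", chosen)
```

Hashing instead of random sampling keeps the experience consistent per user, which makes the comparison between the two versions easier to interpret.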

5. KFServing

KFServing provides a Kubernetes Custom Resource Definition (CRD) for serving machine learning models on arbitrary frameworks. It aims to solve production model serving use cases by providing performant, high-abstraction interfaces for common ML frameworks like TensorFlow, XGBoost, scikit-learn, PyTorch, and ONNX.

The tool provides a serverless machine learning inference solution that allows a consistent and simple interface to deploy your models.

Main features of KFServing:

  • Provides a simple, pluggable, and complete story for your production ML inference server by providing prediction, pre-processing, post-processing and explainability
  • Customizable InferenceService that lets you set resource requests and limits for CPU, GPU, TPU, and memory
  • Batching individual model inference requests
  • Traffic management
  • Scale to and from Zero
  • Revision management
  • Request/Response logging
  • Scalable Multi Model Serving
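
To show what a customized InferenceService might look like, the sketch below assembles a manifest as a plain Python dict and prints it as JSON, a format `kubectl apply -f` accepts alongside YAML. The field names follow the `v1beta1` schema as I understand it and should be checked against the CRD version installed in your cluster. The service name, the storage URI, and the resource values are hypothetical.

```python
import json


def inference_service(name, storage_uri, cpu="1", memory="2Gi"):
    """Return an InferenceService manifest for a scikit-learn model."""
    return {
        "apiVersion": "serving.kubeflow.org/v1beta1",
        "kind": "InferenceService",
        "metadata": {"name": name},
        "spec": {
            "predictor": {
                "sklearn": {
                    # Location of the saved model (hypothetical bucket).
                    "storageUri": storage_uri,
                    "resources": {
                        "requests": {"cpu": cpu, "memory": memory},
                        "limits": {"cpu": cpu, "memory": memory},
                    },
                }
            }
        },
    }


manifest = inference_service("sklearn-iris", "gs://example-bucket/models/iris")
print(json.dumps(manifest, indent=2))
```

Generating the manifest from code is optional; the point is that the whole deployment, including resource limits, is described declaratively in one resource.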

6. Multi Model Server

Multi Model Server (MMS) is a flexible and easy-to-use tool for serving deep learning models trained using any ML/DL framework. The tool can be used for many types of inference in production settings. It provides an easy-to-use command-line interface and utilizes REST-based APIs to handle state prediction requests.

You can use the MMS Server CLI, or the pre-configured Docker images, to start a service that sets up HTTP endpoints to handle model inference requests.

Key features:

  • Advanced configurations allow you to deeply customize MMS’s behavior
  • Ability to develop custom inference services
  • Housekeeping unit tests for MMS
  • JMeter to run MMS through the paces and collect benchmark data
  • Multi model server benchmarking
  • Model serving with Amazon Elastic Inference
  • ONNX model export feature supports models from different deep learning frameworks

7. Triton Inference Server

Triton Inference Server provides an optimized cloud and edge inferencing solution. It’s optimized for both CPUs and GPUs. Triton supports HTTP/REST and gRPC protocols that allow remote clients to request inferencing for any model being managed by the server.

For edge deployments, Triton is available as a shared library with a C API that allows the full functionality of Triton to be included directly in an application.

Key features of Triton:

  • Supports multiple deep-learning frameworks (TensorRT, TensorFlow GraphDef, TensorFlow SavedModel, ONNX, and PyTorch TorchScript)
  • Simultaneous model execution on the same GPU or on multiple GPUs
  • Dynamic batching
  • Extensible backends
  • Supports model ensemble
  • Metrics in Prometheus data format indicating GPU utilization, server throughput, and server latency
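
Because the metrics use the Prometheus text format, they are easy to inspect even without a Prometheus server. The sketch below is a deliberately simple parser using only the standard library. The sample text is made up for illustration, and the metric names (`nv_inference_request_success`, `nv_inference_request_failure`, `nv_gpu_utilization`) reflect Triton's metrics as I understand them. The helper names `parse_metrics` and `total` are hypothetical.

```python
import re

SAMPLE = """\
# HELP nv_inference_request_success Number of successful inference requests
# TYPE nv_inference_request_success counter
nv_inference_request_success{model="resnet",version="1"} 120
nv_inference_request_failure{model="resnet",version="1"} 3
nv_gpu_utilization{gpu_uuid="GPU-0"} 0.75
"""

LINE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{([^}]*)\})?\s+(\S+)')
LABEL = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text):
    """Return (name, labels, value) tuples, skipping comments and blank lines."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = LINE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(LABEL.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


def total(samples, name):
    """Sum one metric across all label combinations."""
    return sum(value for metric, _labels, value in samples if metric == name)


samples = parse_metrics(SAMPLE)
ok = total(samples, "nv_inference_request_success")
failed = total(samples, "nv_inference_request_failure")
success_rate = ok / (ok + failed)
```

In production you would normally let Prometheus scrape the endpoint and build dashboards on top. A parser like this is mainly useful for quick checks or smoke tests.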

8. ForestFlow


ForestFlow is an LF AI Foundation incubation project licensed under the Apache 2.0 license.

It is a scalable policy-based cloud-native machine learning model server for easily deploying and managing ML models.

It gives data scientists a simple, low-friction way to deploy models to a production system, shortening the path from development to production value.

Here are ForestFlow’s main features:

  • Can be run as a single instance (laptop or server) or deployed as a cluster of nodes that work together and automatically manage and distribute work.
  • Offers native Kubernetes integration for easily deploying on Kubernetes clusters with little configuration
  • Allows for model deployment in Shadow Mode
  • Automatically scales down (dehydrates) models and resources when not in use and automatically re-hydrates models back into memory to stay efficient
  • Multi-tenancy
  • Lets you deploy models for multiple use cases and choose between different routing policies to direct inference traffic between model variants serving each use case
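
Shadow mode means a candidate model receives real traffic while only the primary model's answer is returned to the caller. The following is a minimal sketch of the idea in plain Python, not ForestFlow's API: `serve_with_shadow` and the toy models are hypothetical, and a real server would typically run the shadow call asynchronously.

```python
def serve_with_shadow(request, primary, shadow, shadow_log):
    """Answer from the primary model; run the shadow model only for comparison."""
    response = primary(request)
    record = {"request": request, "primary": response}
    try:
        record["shadow"] = shadow(request)
    except Exception as error:
        # A failing shadow model must never affect the caller.
        record["error"] = repr(error)
    shadow_log.append(record)
    return response


def current_model(x):
    return x + 1


def candidate_model(x):
    return x + 2


def broken_model(x):
    raise ValueError("candidate crashed")


log = []
answer = serve_with_shadow(10, current_model, candidate_model, log)
safe_answer = serve_with_shadow(10, current_model, broken_model, log)
```

Comparing the `primary` and `shadow` entries in the log afterwards shows how the candidate would have behaved on production traffic, without exposing users to it.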

9. DeepDetect


DeepDetect is a deep learning API and server written in C++11, along with a pure Web Platform for training and managing models.

DeepDetect aims to make state-of-the-art deep learning easy to work with and integrate into existing applications. It supports the backend machine learning libraries Caffe, Caffe2, TensorFlow, XGBoost, Dlib, and NCNN.

DeepDetect’s main features:

  • Ready for applications such as image tagging, object detection, segmentation, OCR, audio, video, and text classification, as well as CSV input for tabular data and time series
  • Web UI for training and managing models
  • Fast training thanks to over 25 pre-trained models
  • Fast Server written in pure C++, a single codebase for Cloud, Desktop and Embedded
  • Neural network templates for the most effective architectures for GPU, CPU and Embedded devices
  • Comes with ready-to-use models for a range of tasks, from object detection to OCR and sentiment analysis

10. Seldon Core

Seldon Core is an open-source platform with a framework that makes it easier and faster to deploy your machine learning models and experiments at scale on Kubernetes.

It’s a cloud-agnostic, secure, reliable and robust system maintained through a consistent security and updates policy.

Seldon Core – summary:

  • Easy way to containerize ML models using Seldon’s pre-packaged inference servers, custom servers, or language wrappers
  • Powerful and rich inference graphs made out of predictors, transformers, routers, combiners, and more
  • Metadata provenance to ensure each model can be traced back to its respective training system, data, and metrics
  • Advanced and customizable metrics with integration to Prometheus and Grafana
  • Full auditability through model input-output request logging (integration with Elasticsearch)
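
An inference graph chains components so that a request flows through preprocessing, one or more models, and a step that merges their outputs. The sketch below shows that composition in plain Python. It is only a conceptual illustration, not Seldon Core's API, and every function name in it is hypothetical.

```python
def scale_input(features):
    """Transformer: preprocess the request before it reaches any model."""
    return [value / 10.0 for value in features]


def model_a(features):
    """Toy model standing in for a real predictor."""
    return sum(features)


def model_b(features):
    """A second toy model with different behavior."""
    return max(features)


def average(outputs):
    """Combiner: merge the outputs of several models into one response."""
    return sum(outputs) / len(outputs)


def run_graph(features, transformer, models, combiner):
    """Send one request through transformer -> models -> combiner."""
    transformed = transformer(features)
    return combiner([model(transformed) for model in models])


result = run_graph([10, 20, 30], scale_input, [model_a, model_b], average)
```

A router would fit into the same structure by choosing a single model from `models` per request instead of calling all of them.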

To wrap it up

There are plenty of tools for machine learning model serving to choose from. Before you go for your favorite, make sure it meets all your needs. Although similar, every tool offers different functionalities that may not be suitable for every ML practitioner.

