MLOps Blog

The Real Cost of Self-Hosting MLflow

5 min read
8th April, 2024

TL;DR

MLflow is a popular experiment-tracking and end-to-end ML platform

Since MLflow is open source, it’s free to download, and hosting an instance does not incur license fees

Hosting MLflow requires multiple infrastructure components and comes with maintenance responsibilities, the cost of which can be difficult to estimate

On AWS, which offers various options for hosting MLflow, a medium-sized instance comes in at about $200 per month, plus storage and data transfer costs

MLflow is well-regarded as an experiment-tracking platform. Since it’s open source, you can download it for free and host as many instances as you want without incurring license fees. This, together with MLflow’s extensibility, leads many data science teams to adopt it as their end-to-end machine learning solution.

However, hosting and operating an MLflow instance is not free. You need to provide the necessary computing and database infrastructure, which someone has to set up and manage. Further, your team will have to configure MLflow, keep it updated, and troubleshoot any issues.

Estimating the costs of hosting MLflow for a data science team can be difficult. So, let’s look at the cost of different deployment options to arrive at a realistic estimate. To be able to give specific numbers, I’ll focus on hosting options on AWS, but the general considerations apply to cloud platforms and on-premise options.

MLflow components

As a platform, MLflow is composed of three main components:

  • The tracking server exposes the user interface (UI) and acts as an intermediary between the MLflow client in your scripts and the backend and artifact stores.
  • The metadata store is where MLflow keeps the experiment and model metadata.
  • The artifact store is where models and other large binary artifacts are saved.

While it’s possible to use MLflow without the tracking server, teams that look to collaborate on experiments and share models will need this centralized hub. In my experience, even solo data scientists prefer setting up a tracking server rather than directly interfacing with metadata and artifact stores.

The canonical MLflow setup for teams: The MLflow client embedded in the training code communicates with the MLflow tracking server, which handles access to cloud storage (artifact store) and a database (metadata store). Team members use the MLflow tracking server’s UI to access experiment and model data and collaborate on projects. | Modified based on: source

Deploying the MLflow tracking server

MLflow’s tracking server is relatively lightweight. The application is stateless, i.e., it does not store any data. So you can turn it on and off as you’d like without losing data or even run several replicas simultaneously.

From the users’ perspective, it’s important that the tracking server is always available. After all, it exposes the UI, collects the metadata, and provides access to the model artifacts. For this reason, running on so-called spot instances (cheaper VMs that might be reallocated to other customers paying the full rate at any time) is not advisable.

With this in mind, there are three main options for deploying the MLflow tracking server on AWS:

  • Deploying the MLflow tracking server on AWS EC2.

    The most straightforward option is to run the MLflow tracking server on a dedicated EC2 instance. An m5.large instance (2 vCPUs, 8 GB of memory) is sufficient for a typical team and comes to about $69 per month when running around the clock:

$0.096 (on-demand hourly rate in us-east-1) x 24h x 30 days = $69.12

Note that the hourly rate differs between regions. By reserving an instance for a year, you can bring down this cost by about 40% to around $42 per month. If your company runs all its infrastructure on AWS, it’s likely that you won’t have to pay list prices.
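This arithmetic is easy to reproduce and adjust to current rates. Here is a minimal Python sketch using the us-east-1 list prices quoted above (which will drift over time, so treat the numbers as illustrative):

```python
# Rough monthly cost of an always-on m5.large tracking server (us-east-1 list price).
HOURLY_RATE_ON_DEMAND = 0.096  # USD per hour, subject to change
RESERVED_DISCOUNT = 0.40       # ~40% off with a one-year reservation

def monthly_cost(hourly_rate: float, hours_per_day: int = 24, days: int = 30) -> float:
    """Monthly cost of an instance running hours_per_day hours on `days` days."""
    return hourly_rate * hours_per_day * days

on_demand = monthly_cost(HOURLY_RATE_ON_DEMAND)
reserved = on_demand * (1 - RESERVED_DISCOUNT)

print(f"on-demand: ${on_demand:.2f}/month")  # $69.12
print(f"reserved:  ${reserved:.2f}/month")   # ~$41.47
```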

  • Deploying the MLflow tracking server on AWS ECS backed by AWS Fargate.

    If you do not want to maintain an EC2 instance yourself, or you expect to only utilize the MLflow tracking server for parts of the day, ECS in combination with Fargate is an interesting option.

    Fargate is the serverless container option on AWS, spinning up and providing a Docker container only if requests are coming in. Thus, you’ll only pay when users are accessing the MLflow tracking server’s UI or are sending metadata. AWS provides a detailed tutorial for setting up MLflow on ECS/Fargate on their machine-learning blog.

    Whether this option is actually cheaper depends on access and load patterns. If you need the equivalent of an m5.large instance for five days a week, eight hours per day, it will cost you about $19 per month:

(2 x $0.04048 (per vCPU-hour in us-east-1) + 8 x $0.004445 (per GB-hour in us-east-1)) x 8h x 5 days x 4 weeks = $18.64

Keep in mind, however, that you might want to have multiple replicas operating at the same time during peak times and that your team or applications might need access outside of regular business hours.
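Whether Fargate beats a dedicated instance depends entirely on how many hours you actually need. Parameterizing the calculation above makes this easy to explore (again using the us-east-1 list prices quoted above, which may change):

```python
# Fargate cost for an m5.large-equivalent task (2 vCPUs, 8 GB), us-east-1 list prices.
VCPU_HOUR = 0.04048  # USD per vCPU-hour, subject to change
GB_HOUR = 0.004445   # USD per GB-hour, subject to change

def fargate_monthly(vcpus: float, gb: float, hours_per_day: float,
                    days_per_week: float, weeks: float = 4) -> float:
    """Monthly Fargate cost for a task running on the given schedule."""
    hourly = vcpus * VCPU_HOUR + gb * GB_HOUR
    return hourly * hours_per_day * days_per_week * weeks

business_hours = fargate_monthly(2, 8, hours_per_day=8, days_per_week=5)
always_on = fargate_monthly(2, 8, hours_per_day=24, days_per_week=7)
print(f"8h x 5 days: ${business_hours:.2f}/month")  # ~$18.64
print(f"24/7:        ${always_on:.2f}/month")       # ~$78.30
```

Note what the second number shows: if the tracking server effectively runs around the clock, Fargate ends up more expensive than the on-demand EC2 instance from the previous option.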

  • Deploying the MLflow tracking server on Kubernetes.

    If your organization already runs a Kubernetes cluster (either through AWS EKS or a custom setup on AWS EC2), it’s worth exploring whether you can host the MLflow tracking server on it.

    The main benefit is that you can share resources with other applications. Even if you require the equivalent of an m5.large when the MLflow tracking server is fully utilized, you don’t need to reserve this capacity permanently. For example, you could set the resource requests to “cpu: 0.5, memory: 2Gi” and the limits to “cpu: 2, memory: 8Gi”. Helm charts for deploying MLflow on Kubernetes are available through Bitnami and community-charts.

    Another benefit of deploying the MLflow tracking server on Kubernetes is that there’s typically already someone who maintains and updates the applications on the cluster. Deploying on Kubernetes also gives you the flexibility to either use AWS-managed services for the metadata and artifact stores (as with the AWS EC2 and AWS ECS options) or to resort to a database and object store directly deployed to the cluster.

Deploying a database as the metadata store 

The second significant cost in an MLflow deployment is the database used to store experiment metadata and server settings.

The natural choice on AWS is a MySQL database managed through Amazon RDS. A single db.m5.large instance is sufficient for relatively large MLflow deployments and costs around $123 per month:

$0.171 (on-demand price per hour in us-east-1) x 24h x 30 days = $123.12

Note that prices might differ between regions. You should also keep in mind that as you scale up, you might have to move to larger machines.

In addition to the database instance, you’ll also have to pay for storage. There are several options available with different access speeds. A general-purpose SSD (gp2) is the default choice and will cost you $0.115 per GB per month in us-east-1. Since MLflow keeps all larger objects in the artifact store, you’re probably not looking at more than a few tens of GB here, even if you run a lot of experiments.
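Putting the two RDS cost components together is straightforward. In this sketch, the 50 GB storage figure is just an assumption for illustration; the rates are the us-east-1 list prices quoted above:

```python
# Monthly RDS cost: db.m5.large instance plus gp2 storage (us-east-1 list prices).
DB_INSTANCE_HOUR = 0.171  # USD per hour for db.m5.large, subject to change
GP2_GB_MONTH = 0.115      # USD per GB-month for general-purpose SSD

def rds_monthly(storage_gb: float) -> float:
    """Monthly cost of an always-on db.m5.large with the given storage."""
    instance = DB_INSTANCE_HOUR * 24 * 30
    return instance + storage_gb * GP2_GB_MONTH

# With an assumed 50 GB of metadata, storage adds only a few dollars:
print(f"${rds_monthly(50):.2f}/month")  # $128.87
```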

You can also look into Amazon Aurora or consider self-hosting a database on EC2 or Kubernetes. If you opt to manage the database service yourself, you’ll need to handle operations like backups and updates, which can add significantly to the maintenance costs unless you already have a team in place that’s doing this work across the organization.

Setting up an artifact store

The artifact store is the third relevant cost item in an MLflow deployment. While the cost for the tracking server and the metadata store is typically independent of the types and size of models you work with, the costs associated with the artifact store will depend on it heavily.

Let’s assume that your team needs 1 TB of storage to keep model versions.

On AWS, the standard choice is to use AWS S3 as the artifact store. Storing 1TB of data will cost you around $23 per month:

$0.023 (standard price per GB per month in us-east-1) x 1024 = $23.55

Again, prices will vary between regions, and there is a discount if you store more than 50 TB.

You also have to consider transfer costs. While AWS does not charge extra for transferring data into S3, transferring data out costs $0.09 per GB for the first 10 TB per month, with an AWS-wide free tier of 100 GB per month and a small discount once you transfer more than 10 TB. This charge does not apply when transferring data within the AWS ecosystem, with transfers within the same region often being free of charge.

On top of storage and transfer costs, AWS will also charge for every read and write request.

Whether storage, transfer, and access costs are significant items on your AWS cloud bill depends on your usage pattern and infrastructure setup. If you work with small models that you update and deploy only occasionally, it’ll cost you a few dollars per month at most. However, if you’re fine-tuning LLMs for hundreds of customers each day and are deploying them outside of the AWS environment, storage and transfer costs can easily become the dominant item.
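A quick estimator makes it easy to see when transfer costs start to dominate. The egress volumes below are made-up examples, not measured figures, and the rates are the us-east-1 list prices quoted above:

```python
# Monthly S3 cost: storage plus data transfer out (us-east-1 list prices).
S3_GB_MONTH = 0.023   # USD per GB-month, standard storage tier
EGRESS_GB = 0.09      # USD per GB transferred out, first 10 TB tier
FREE_EGRESS_GB = 100  # AWS-wide monthly free tier for data transfer out

def s3_monthly(storage_gb: float, egress_gb: float) -> float:
    """Monthly S3 bill for the given stored volume and external egress."""
    storage = storage_gb * S3_GB_MONTH
    billable_egress = max(0.0, egress_gb - FREE_EGRESS_GB)
    return storage + billable_egress * EGRESS_GB

print(f"{s3_monthly(1024, 0):.2f}")     # 23.55 — 1 TB stored, no external egress
print(f"{s3_monthly(1024, 2000):.2f}")  # 194.55 — assumed 2 TB/month deployed externally
```

With the assumed 2 TB of monthly external egress, transfer fees alone are several times the storage cost, which is the pattern behind the LLM fine-tuning scenario above.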

Alternatives to using AWS S3 as the artifact store include attaching storage volumes to the EC2 instance hosting the MLflow tracking server or using an object store like MinIO when hosting MLflow on Kubernetes. Depending on your ML infrastructure setup and usage patterns, these solutions can be cheaper but will require more manual configuration and maintenance effort.

Maintaining an MLflow deployment

The majority of maintenance effort required for an MLflow deployment is associated with the infrastructure and resources we just discussed. In particular, you’ll want to monitor resource utilization to see if you need to upgrade to maintain the performance level or can downgrade to save costs. The more custom your setup is, the more often you’ll have to resolve issues around connectivity between components.

Maintenance of MLflow itself is usually limited to updating the software to a new version, which most teams typically do once or twice a year. However, if there’s a critical security issue, you’ll want to update to a patched version as soon as possible.

Depending on the salaries of the people doing the work, the costs of maintaining MLflow can easily outgrow the hosting costs. This is particularly true if you cannot rely on a dedicated DevOps or infrastructure support team, but your data science or ML team using MLflow has to do all the work. In that case, you have to not only factor in the relative lack of operations experience, but also keep in mind that every hour working on MLflow maintenance is one less hour spent on your team’s primary task.

User management and compliance

MLflow only provides password-based authentication by default. You can integrate it with authentication protocols like OAuth or LDAP, but you’ll have to do this on your own.

Further, everyone who has access to your MLflow tracking server will be able to see and modify all experiments and models. If you have to ensure that specific resources, such as experiments and models, can only be accessed by authorized individuals, you’ll have to add role-based access control (RBAC) yourself or host several fully separate MLflow deployments.

If your company’s policies require that data remains encrypted, you’ll have to do that yourself as well. You are also responsible for regularly conducting vulnerability assessments and mitigating potential risks.

Conclusion

To sum up, the primary costs associated with deploying and hosting MLflow revolve around the server, the metadata store, and the artifact store.

In total, based on our estimates above, an MLflow deployment for a small data science team will come in at about $200 per month for running the tracking server and the database, plus storage and data transfer costs.
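As a sanity check on that figure, we can sum the individual estimates from the sections above (on-demand us-east-1 list prices, always-on instances):

```python
# Total monthly estimate, combining the per-component calculations above.
tracking_server = 0.096 * 24 * 30  # m5.large on EC2, on-demand
database = 0.171 * 24 * 30         # db.m5.large on RDS, on-demand
artifact_store = 1024 * 0.023      # 1 TB in S3, before transfer costs

base = tracking_server + database
print(f"server + database:       ${base:.2f}/month")                   # $192.24
print(f"with 1 TB of artifacts:  ${base + artifact_store:.2f}/month")  # $215.79
```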

The costs of self-hosting MLflow can be minimized by using reserved instances, resorting to serverless services, or self-managing the database. Whether this is viable for you depends on the DevOps support in your organization and your usage and load patterns.

In any case, we have seen that while MLflow is freely available as open-source software, hosting it is far from free and can place significant responsibilities on your team. Instead of self-hosting, a managed platform offered as SaaS might turn out to be cheaper at the end of the day. Ultimately, you need to balance the money you spend against what your organization needs, the resources you have at your disposal, and the operations expertise of your team.
