“…developing and deploying ML systems is relatively fast and cheap, but maintaining them over time is difficult and expensive.” – D. Sculley et al., “Hidden Technical Debt in Machine Learning Systems,” NIPS 2015
Every data scientist can relate to this quote. Perhaps you have encountered it in your search to solve a problem in one of the many moving parts of your machine learning system: data, model, or code.
Hacking together a solution usually means incurring technical debt, which grows as your system ages and/or grows in complexity. Worse, you could lose time, waste compute resources and cause production issues.
MLOps can be daunting. Thousands of courses are available to help engineers improve their machine learning skills. While it’s relatively easy to develop a model to achieve business objectives (item classification or predicting a continuous variable) and deploy it to production, operating that model in production comes with a myriad of issues.
Model performance may degrade in production for reasons such as data drift, or you might need to change the preprocessing technique. This means new models need to be shipped into production regularly to address performance decline or to improve model fairness.
This calls for continuous training and continuous monitoring in addition to the DevOps practices of continuous integration and continuous delivery. So, in this article, we’re going to explore some of the best practices engineers can follow to consistently deliver the machine learning systems their organizations need.
Naming conventions
Naming conventions aren’t new. For example, Python’s recommendations for naming conventions are included in PEP 8: Style Guide for Python Code. As machine learning systems grow, so does the number of variables.
So, if you establish a clear naming convention for your project, engineers will understand the roles of different variables, and conform to this convention as the project grows in complexity.

This practice helps mitigate the challenge posed by the Changing Anything Changes Everything (CACE) principle. It also helps team members become familiar with your project quickly. Here’s an example from a project that builds Azure Machine Learning pipelines.
from azureml.pipeline.core import PipelineData
from azureml.pipeline.steps import PythonScriptStep

# blob_datastore, aml_compute, aml_runconfig, the cleansed datasets
# (cleansed_client_data, cleansed_transactions_data) and their intermediate names
# (intermediate_data_name_client, intermediate_data_name_transactions)
# are defined earlier in the project.

intermediate_data_name_merge = "merged_ibroka_data"
merged_ibroka_data = (PipelineData(intermediate_data_name_merge, datastore=blob_datastore)
                      .as_dataset()
                      .parse_parquet_files()
                      .register(name=intermediate_data_name_merge, create_new_version=True))

mergeDataStep = PythonScriptStep(
    name="Merge iBroka Data",
    script_name="merge.py",
    arguments=[
        merged_ibroka_data,
        "--input_client_data", intermediate_data_name_client,
        "--input_transactions_data", intermediate_data_name_transactions,
    ],
    inputs=[cleansed_client_data.as_named_input(intermediate_data_name_client),
            cleansed_transactions_data.as_named_input(intermediate_data_name_transactions)],
    outputs=[merged_ibroka_data],
    compute_target=aml_compute,
    runconfig=aml_runconfig,
    source_directory="scripts/",
    allow_reuse=True,
)
print("mergeDataStep created")

intermediate_data_name_featurize = "featurized_ibroka_data"
featurized_ibroka_data = (PipelineData(intermediate_data_name_featurize, datastore=blob_datastore)
                          .as_dataset()
                          .parse_parquet_files()
                          .register(name=intermediate_data_name_featurize, create_new_version=True))

featurizeDataStep = PythonScriptStep(
    name="Featurize iBroka Data",
    script_name="featurize.py",
    arguments=[
        featurized_ibroka_data,
        "--input_merged_data", intermediate_data_name_merge,
    ],
    inputs=[merged_ibroka_data.as_named_input(intermediate_data_name_merge)],
    outputs=[featurized_ibroka_data],
    compute_target=aml_compute,
    runconfig=aml_runconfig,
    source_directory="scripts/",
    allow_reuse=True,
)
print("featurizeDataStep created")
Here, the intermediate outputs of the two steps of the pipeline are named intermediate_data_name_merge and intermediate_data_name_featurize. They follow an easily recognizable naming convention.
If another such variable, say intermediate_data_name_clean, were encountered in another part of the project, this naming convention would make it easy to understand what role it plays in the larger project.
Code quality checks
Alexander Van Tol’s article on code quality puts forward three widely accepted markers of high-quality code:
- It does what it is supposed to do
- It does not contain defects or problems
- It is easy to read, maintain and extend
These three identifiers are especially important for machine learning systems because of the CACE principle.
Frequently, real-world data fed into training pipelines doesn’t explicitly contain the outcome variable. As an example, think of an SQL database containing subscription transactions. There may not be a column that says whether a particular subscription was renewed or not. However, it’s easy to look through subsequent transactions and see whether said subscription was discontinued upon expiry.
This computation of the outcome variable may happen in one step of the training pipeline. If there’s any issue with the function that performs this computation, the model will be fitted on the wrong training data, and won’t do well in production. Code quality checks (unit testing, in this case) keep crucial functions like this doing what they’re supposed to.
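As a minimal sketch of such a check, here is what a unit test for an outcome-variable computation could look like (the compute_renewal_label helper, the column names, and the sample data are all hypothetical):

import pandas as pd


def compute_renewal_label(transactions: pd.DataFrame) -> pd.Series:
    # Hypothetical helper: a transaction counts as "renewed" (1) if a later
    # transaction exists for the same subscription, otherwise 0.
    last_dates = transactions.groupby("subscription_id")["transaction_date"].transform("max")
    return (transactions["transaction_date"] < last_dates).astype(int)


def test_compute_renewal_label():
    transactions = pd.DataFrame({
        "subscription_id": [1, 1, 2],
        "transaction_date": pd.to_datetime(["2021-01-01", "2021-02-01", "2021-01-15"]),
    })
    # Subscription 1 was renewed once (its first transaction has a successor);
    # its latest transaction and subscription 2's only transaction were not renewed.
    assert compute_renewal_label(transactions).tolist() == [1, 0, 0]

Run as part of your CI pipeline (with pytest, for example), a test like this catches regressions in the labelling logic before a mislabelled dataset ever reaches training.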
Still, code quality checks extend past unit testing. Your team stands to benefit from using linters and formatters to enforce a particular code style on your machine learning project. This way you eliminate bugs before they reach production, detect code smells (dead code, duplicate code etc.), and speed up the code review. This is a boost for your CI process.
It’s good practice to include this code quality check as the first step of a pipeline triggered by a pull request. You can see an example of this in the MLOps with AzureML template project. If you’d like to embrace linters as a team, here’s a great article to get you started – Linters aren’t in your way. They’re on your side.
Experiment — and track your experiments!
Feature engineering, model architecture, and hyperparameter search all keep evolving. ML teams always aim to deliver the best possible system given the current state of technology and the evolving patterns in the data.
On one hand, this means staying on top of the latest ideas and baselines; on the other, it means experimenting with those ideas to see if they improve the performance of your machine learning system.
Experimenting may involve trying out different combinations of code (preprocessing, training and evaluation methods), data, and hyperparameters. Each unique combination produces metrics that you need to compare to your other experiments. Additionally, changes in the conditions (the environment) the experiment is run in may change the metrics you obtain.
It can quickly become tedious to recall which change offered which benefit, and what actually worked. Using a modern tool (Neptune is a great one!) to track your experiments improves your productivity when you try out new processes, and it makes your work reproducible.
Want to get started with experiment tracking with Neptune? Read this article – ML Experiment Tracking: What It Is, Why It Matters, and How to Implement It.
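For instance, a minimal sketch of experiment tracking with Neptune could look like the snippet below (the project name, parameters, and metric values are hypothetical, and the snippet assumes the neptune client is installed and NEPTUNE_API_TOKEN is set in your environment):

import neptune

# Hypothetical project; the API token is read from the NEPTUNE_API_TOKEN environment variable.
run = neptune.init_run(project="my-workspace/ibroka-churn")

# Log the exact combination of code, data and hyperparameters used in this experiment
run["parameters"] = {"learning_rate": 0.01, "max_depth": 6, "preprocessing": "v2"}
run["data/train_dataset_version"] = "merged_ibroka_data:12"

# Log metrics as they are produced so experiments can be compared later
for val_auc in [0.71, 0.74, 0.78]:
    run["metrics/val_auc"].append(val_auc)

run.stop()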
Data validation
In production, data can create a variety of issues. The statistical properties of serving data may drift away from those of the training data, or the training data (or the sampling process that produced it) may have been faulty in the first place. The data might contain unexpected features, some features may be passed in the wrong format or, as in the example in Eric Breck et al.’s paper, a feature may be erroneously pinned to a specific value!
Serving data becomes training data eventually, so detecting errors in data is crucial to the long-term performance of ML models. Detecting errors as soon as they happen will let your team investigate and take appropriate action.
Pandera is a data validation library that helps you do this, and it also supports more complex statistical validation like hypothesis testing. Here’s an example of a data schema defined using Pandera.
import pandera as pa
from azureml.core import Run

run = Run.get_context(allow_offline=True)

if run.id.startswith("OfflineRun"):
    import os
    from azureml.core.dataset import Dataset
    from azureml.core.workspace import Workspace
    from dotenv import load_dotenv

    load_dotenv()
    ws = Workspace.from_config(path=os.getenv("AML_CONFIG_PATH"))
    liko_data = Dataset.get_by_name(ws, "liko_data")
else:
    liko_data = run.input_datasets["liko_data"]

df = liko_data.to_pandas_dataframe()

# ---------------------------------
# Include code to prepare data here
# ---------------------------------

liko_data_schema = pa.DataFrameSchema({
    "Id": pa.Column(pa.Int, nullable=False),
    "AccountNo": pa.Column(pa.String, nullable=False),
    "BVN": pa.Column(pa.Bool, nullable=True, required=False),
    "IdentificationType": pa.Column(pa.String, checks=pa.Check.isin([
        "NIN", "Passport", "Driver's license"
    ])),
    "Nationality": pa.Column(pa.String, pa.Check.isin([
        "NG", "GH", "UG", "SA"
    ])),
    "DateOfBirth": pa.Column(
        pa.DateTime,
        nullable=True,
        checks=pa.Check.less_than_or_equal_to('2000-01-01')
    ),
    ".*_Risk": pa.Column(   # regex key: applies to every column whose name contains "_Risk"
        pa.Float,
        coerce=True,
        regex=True
    )
}, ordered=True, strict=True)

run.log_table("liko_data_schema", liko_data_schema)
run.parent.log_table("liko_data_schema", liko_data_schema)

# -----------------------------------------------
# Include code to save dataframe to output folder
# -----------------------------------------------
This schema ensures that:
- ‘Id’ is an integer and never null
- ‘BVN’ is a boolean which may be absent in some data
- ‘IdentificationType’ is one of the three options listed
- ‘DateOfBirth’ is either null or less than ‘2000-01-01’
- Columns containing the string “_Risk” contain data that is coercible to float dtype.
- New data has columns in the same order as defined in this schema. This may be important, for example, when working with the XGBoost API which may throw an error for mismatched column order.
- Because the schema is strict, no column other than those defined in it can be passed as part of serving data.
This simple schema builds a lot of data validation functionality into the project. The defined schema can then be applied in downstream steps as follows.
liko_data_schema.validate(data_sample)
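If you would rather collect every violation in a batch instead of failing on the first one, pandera’s lazy validation can report them all at once. A minimal sketch, assuming data_sample is the batch being checked:

import pandera as pa

try:
    validated_sample = liko_data_schema.validate(data_sample, lazy=True)
except pa.errors.SchemaErrors as err:
    # failure_cases is a dataframe listing each failed check, the column involved
    # and the offending values, which is handy for logging and debugging.
    print(err.failure_cases)
    raise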
TensorFlow also offers a comprehensive data validation API, documented here.
Model validation across segments
Reusing models is different from reusing software. You need to tune models to fit each new scenario. To do this, you need the training pipeline. Models also decay over time, and need to be retrained in order to remain useful.
Experiment tracking can help us handle the versioning and reproducibility of models, but validating models before promoting them into production is also important.
You can validate offline or online. Offline validation involves producing metrics (e.g. accuracy, precision, normalized root mean squared error, etc.) on a test dataset to evaluate the model’s fitness for the business objectives using historical data. These metrics are compared to those of the existing production/baseline models before the promotion decision is made.
Proper experiment tracking and metadata management gives you pointers to all of these models, and you can do a rollback or promotion seamlessly. With online validation through A/B testing, as explored in this article, you then establish the adequate performance of the model on live data.
Aside from these, you should also validate the performance of the model on various segments of data to ensure that they meet requirements. The industry is increasingly noticing the bias that machine learning systems can learn from data. A popular example is the Twitter image-cropping feature, which was demonstrated to perform inadequately for some segments of users. Validating the performance of your model for different segments of users can help your team detect and correct this type of error.
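As a minimal sketch of segment-level validation (the metric, column names, and threshold below are hypothetical), you can compute your evaluation metric per segment and block promotion if any segment falls below an agreed floor:

import pandas as pd
from sklearn.metrics import roc_auc_score


def validate_across_segments(eval_df: pd.DataFrame, segment_col: str, min_auc: float = 0.75) -> None:
    # eval_df holds y_true, y_score and a segment column (e.g. "Nationality").
    # Assumes every segment contains examples of both classes.
    for segment, group in eval_df.groupby(segment_col):
        auc = roc_auc_score(group["y_true"], group["y_score"])
        print(f"{segment_col}={segment}: AUC={auc:.3f} (n={len(group)})")
        if auc < min_auc:
            raise ValueError(
                f"Model underperforms for segment {segment!r}: AUC {auc:.3f} < {min_auc}"
            )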
Resource utilization: remember that your experiments cost money
During training and in use after deployment, models require system resources — CPU, GPU, I/O and memory. Understanding the requirements of your system during the different phases can help your team optimize the cost of your experiments and maximize your budget.
This is an area of frequent concern. Companies care about profit, so they want to make the most of their resources to deliver value. Cloud service providers realize this too. Sireesha Muppala et al. share considerations for reducing training costs with Amazon SageMaker Debugger in their article. Microsoft Azure also allows engineers to determine the resource requirements of their model prior to deployment using the SDK.
This profiling tests the model with a provided dataset, and reports recommendations for resource requirements. So, it’s important that the provided dataset is representative of what might be served when the model goes into production.
Profiling models also offers advantages beyond cost. Sub-optimal resources may slow down training jobs or introduce latency into the operation of the model in production. These are bottlenecks that machine learning teams must identify and fix quickly.
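As a minimal sketch of profiling with the Azure Machine Learning SDK (assuming a workspace, a registered model, a scoring script, an environment, and a representative tabular dataset already exist; all names below are hypothetical):

from azureml.core import Dataset, Workspace
from azureml.core.model import InferenceConfig, Model

ws = Workspace.from_config()

model = Model(ws, name="ibroka-churn-model")                    # hypothetical registered model
inference_config = InferenceConfig(entry_script="score.py",     # scoring script in scripts/
                                   source_directory="scripts/",
                                   environment=scoring_environment)  # assumed AzureML Environment

# A small, representative sample of what the service will receive in production
input_dataset = Dataset.get_by_name(ws, "liko_data_sample")

profile = Model.profile(ws,
                        profile_name="ibroka-churn-profile",
                        models=[model],
                        inference_config=inference_config,
                        input_dataset=input_dataset)
profile.wait_for_completion(show_output=True)
print(profile.get_details())    # recommended CPU and memory for deployment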
Monitor predictive service performance
So far, the practices listed above can help you continuously deliver a robust machine learning system. In operation, there are other metrics that determine the performance of your deployed model, independent of training/serving data and model type. These metrics are arguably as important as the familiar project metrics (such as RMSE, AUC-ROC etc.) that evaluate a model’s performance in relation to business objectives.
Users might need the output of machine learning models in real-time to ensure that decisions can be made quickly. Here, it’s vital to monitor operational metrics such as:
- Latency: measured in milliseconds. Are users guaranteed a seamless experience?
- Scalability: measured in Queries Per Second (QPS). How much traffic can your service handle at the expected latency?
- Service Update: How much downtime (service unavailability) is introduced during the update of your service’s underlying models?
For example, when fashion companies run advertising campaigns, poor service performance of an ML recommendation system can impact the conversion rates. Customers might get frustrated with service delay and move on without making a purchase. This translates to business losses.
Apache Bench is a tool from the Apache project that lets you measure these crucial metrics and make the right provisions for your organization’s needs. It’s important to measure these metrics across the different geographical locations your service covers. Austin Gunter’s Measuring Latency with Apache Benchmark and this tutorial are also great introductions to this useful tool.
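If you want a quick latency estimate straight from Python before reaching for ab, a rough sketch against a hypothetical scoring endpoint might look like this (not a substitute for a proper load test):

import statistics
import time

import requests

ENDPOINT = "https://my-service.example.com/score"    # hypothetical scoring endpoint
payload = {"Id": 1, "AccountNo": "0123456789"}        # hypothetical request body

latencies_ms = []
for _ in range(100):
    start = time.perf_counter()
    requests.post(ENDPOINT, json=payload, timeout=5)
    latencies_ms.append((time.perf_counter() - start) * 1000)

latencies_ms.sort()
p95_index = int(0.95 * len(latencies_ms))
print(f"p50={statistics.median(latencies_ms):.1f} ms, p95={latencies_ms[p95_index]:.1f} ms")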
Think carefully about your choice of ML platforms
MLOps platforms can be difficult to compare. Still, your choice here can make or break your machine learning project. Your choice should be informed by:
- The team you have: their level of experience, and whether they are subject matter experts or technical experts.
- Whether your project uses traditional machine learning or deep learning.
- The sort of data you will be working with.
- Your business objectives and budget.
- Technical requirements, such as how involved your model monitoring needs to be.
- The platform’s features and how they might evolve in the long run.
Several comparisons of ML platforms exist online to guide your choice, like Top 12 On-Prem Tracking Tools in Machine Learning. Neptune is one of the platforms discussed. It makes collaboration easy and helps teams manage and monitor long-running experiments, whether deployed on-prem or used through the web UI. You can check out its main concepts here.
Open communication lines are important
Implementing and maintaining a machine learning system long-term means collaboration between a variety of professionals: teams of data engineers, data scientists, machine learning engineers, data visualization specialists, DevOps engineers, and software developers. UX designers and product managers can also affect how the product that serves your system interacts with users. Managers and business owners have expectations that control how the performance of teams is evaluated and appreciated, while compliance professionals ensure that operations are in line with company policy and regulatory requirements.
If your machine learning system is going to keep achieving business objectives amidst evolving user and data patterns and expectations, then the teams involved in its creation, operation, and monitoring must communicate effectively. Sriram Narayan explores how such multidisciplinary teams can adopt an outcome orientation in their setup and approach to business objectives in Agile IT Organization Design. Be sure to add it to your weekend reads!
Score your ML system periodically
If you know all the practices above, it’s clear that you (and your team) are committed to instituting the best MLOps practices in your organization. You deserve some applause!
Scoring your machine learning system is both a great starting point for your endeavour and a useful tool for continuous evaluation as your project ages. Thankfully, such a scoring system exists. Eric Breck et al. presented a comprehensive scoring system in their paper – What’s your ML Test Score? A rubric for ML production systems. The scoring system covers features and data, model development, infrastructure, as well as monitoring.
Conclusion
And that’s it! The 10 practices you should definitely consider implementing are:
- Naming conventions
- Code quality checks
- Experiment — and track your experiments!
- Data validation
- Model validation across segments
- Resource utilization: remember that your experiments cost money
- Monitor predictive service performance
- Think carefully about your choice of ML platforms
- Open communication lines are important
- Score your ML system periodically
Try them out, and you’ll definitely see some improvement in your work on ML systems.