As an MLOps practitioner, you know firsthand the challenges of deploying machine learning models in real-world production environments. Staying informed about the latest MLOps best practices adopted by other production teams is a shortcut to doing things well the first time.
These practices are essential for avoiding the hidden technical debt that accumulates in ML systems as they age or grow more complex. As D. Sculley et al. conclude in their paper “Hidden Technical Debt in Machine Learning Systems”, published at NIPS 2015:
…developing and deploying ML systems is relatively fast and cheap, but maintaining them over time is difficult and expensive.
You may be thinking that this quote applies to sophisticated systems, and that yours isn’t big enough to worry about yet. I have seen both sides. While it’s relatively easy to develop a model and get stakeholder validation, operating that model in production comes with issues like model performance degradation.
To mitigate this, newer versions of the model must be constantly shipped. This calls for continuous training and continuous monitoring in addition to the DevOps practices of CI/CD.
Then you need to monitor those models. As more models reach production, you need naming schemas and a registry for model artifacts and packages. And the list goes on.
So, in this article, we’re going to explore some of the best practices engineers need to consistently deliver the machine learning systems their organizations need.

Adhere to naming conventions for code
Naming conventions aren’t new. For example, Python’s recommendations for naming conventions are included in PEP 8: Style Guide for Python Code. As machine learning systems grow, so does the number of variables.

So, if you establish a clear naming convention for your project, engineers will understand the roles of different variables, and conform to this convention as the project grows in complexity.
Take variable names like intermediate_data_name_merge and intermediate_data_name_featurize: they follow an easily recognizable naming convention that tells you which pipeline step (merge, featurize) produced the data they hold.
Here’s an example of applying PEP 8 naming conventions in a project that builds a basic Google Cloud pipeline:
from google.cloud import storage, pubsub_v1


def process_data(event, context):
    # Extract data from the Pub/Sub message
    data = event['data'].decode('utf-8')

    # Perform data processing logic here
    processed_data = data.upper()

    # Write the processed data to Cloud Storage
    storage_client = storage.Client()
    bucket = storage_client.bucket('your_bucket_name')
    blob = bucket.blob('processed_data.txt')
    blob.upload_from_string(processed_data)

    # Log the successful processing
    print('Data processed and stored in Cloud Storage')


def create_pipeline(project_id, topic_name, subscription_name, function_name):
    # Initialize Pub/Sub clients
    publisher_client = pubsub_v1.PublisherClient()
    subscriber_client = pubsub_v1.SubscriberClient()

    # Create a Pub/Sub topic
    topic_path = publisher_client.topic_path(project_id, topic_name)
    topic = publisher_client.create_topic(request={"name": topic_path})

    # Create a Pub/Sub subscription
    subscription_path = subscriber_client.subscription_path(project_id, subscription_name)
    subscription = subscriber_client.create_subscription(
        request={"name": subscription_path, "topic": topic_path}
    )

    # Create a Cloud Function trigger for the subscription
    function_url = f"https://YOUR_REGION-YOUR_PROJECT_ID.cloudfunctions.net/{function_name}"
    subscriber_client.modify_push_config(
        request={"subscription": subscription_path, "push_config": {"push_endpoint": function_url}}
    )

    print('Pipeline created successfully.')


# Specify your project details
PROJECT_ID = 'your_project_id'
TOPIC_NAME = 'your_topic_name'
SUBSCRIPTION_NAME = 'your_subscription_name'
FUNCTION_NAME = 'your_function_name'

# Create the pipeline
create_pipeline(PROJECT_ID, TOPIC_NAME, SUBSCRIPTION_NAME, FUNCTION_NAME)
Looking closely, you’ll see that:
- Variable and function names are lowercase, with words separated by underscores. For example, storage_client, publisher_client, and subscriber_client.
- Constants are uppercase, with words separated by underscores. For example, PROJECT_ID, TOPIC_NAME, SUBSCRIPTION_NAME, and FUNCTION_NAME.
- Classes follow the CapWords convention. For example, PublisherClient and SubscriberClient.
- Indentation is done using four spaces.
By adhering to PEP 8 naming conventions, the code becomes more readable and consistent, making it easier to understand and maintain.
Code quality checks
Alexander Van Tol’s article on code quality puts forward three sensible markers of high-quality code:
- It does what it is supposed to do
- It does not contain defects or problems
- It is easy to read, maintain and extend
These three identifiers are especially important for machine learning systems because of the CACE (Change Anything Change Everything) principle.
Consider a customer churn prediction model for a telecommunications company. During the feature engineering step, a bug in the code introduces an incorrect transformation, leading to flawed features used by the model. Without proper code quality checks, this bug can go unnoticed during development and testing.
Once deployed in production, the flawed feature affects the model’s predictions, resulting in inaccurate identification of customers at risk of churn. This can lead to financial losses and decreased customer satisfaction. Code quality checks (unit testing, in this case) keep crucial functions like this doing what they’re supposed to.
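To make this concrete, here’s a minimal sketch of a unit test for a hypothetical feature-engineering function (the function, column names, and expected values are illustrative, not from a real project):

import pandas as pd


def add_tenure_years(df: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical feature-engineering step: derive tenure in years from tenure in months.
    out = df.copy()
    out["tenure_years"] = out["tenure_months"] / 12.0
    return out


def test_add_tenure_years_is_correct_and_non_destructive():
    df = pd.DataFrame({"tenure_months": [0, 6, 24]})
    result = add_tenure_years(df)

    # The transformation produces the expected values...
    assert result["tenure_years"].tolist() == [0.0, 0.5, 2.0]
    # ...and does not mutate the input frame used elsewhere in the pipeline.
    assert "tenure_years" not in df.columns

Run a test like this with pytest as an early step of your CI pipeline, so a broken transformation fails the build before it can ever reach training or serving.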
Still, code quality checks extend beyond unit testing. Your team stands to benefit from using linters and formatters to enforce a consistent code style across your machine-learning project. This way, you eliminate bugs before they reach production, detect code smells (dead code, duplicate code, etc.), and speed up code reviews. All of this is a boost for your CI process.
It’s good practice to include this code quality check as the first step of a pipeline triggered by a pull request. You can see an example of this in the MLOps with AzureML template project. If you’d like to embrace linters as a team, here’s a great article to get you started – Linters aren’t in your way. They’re on your side.
Set up experiment tracking in your MLOps system
Feature engineering, model architecture, and hyperparameter search all keep evolving. ML teams always aim to deliver the best possible system, given the current state of technology and the evolving patterns in the data.
On one hand, this means staying on top of the latest ideas and baselines. It also means experimenting with these ideas to see if they improve the performance of your machine-learning system.
Experimenting may involve trying out different combinations of code (preprocessing, training, and evaluation methods), data, and hyperparameters. Each unique combination produces metrics that you need to compare to your other experiments. Additionally, changes in the conditions (the environment) the experiment is run in may change the metrics you obtain.
It can quickly become tedious to recall which change offered which benefit and what actually worked. Using a modern tool (Neptune is a great one!) to track your experiments improves your productivity when you try out new processes, plus it makes your work reproducible.
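As a rough illustration, logging an experiment with the Neptune Python client could look something like the sketch below (the project name, parameters, and metric values are placeholders, and the exact API may differ between client versions, so check the docs):

import neptune

# Placeholder project; credentials typically come from the NEPTUNE_API_TOKEN environment variable.
run = neptune.init_run(project="my-workspace/churn-prediction", tags=["baseline", "xgboost"])

# Log the configuration that defines this experiment...
run["parameters"] = {"learning_rate": 0.1, "max_depth": 6, "n_estimators": 300}

# ...and the metrics it produced, so runs can be compared side by side later.
run["test/auc"] = 0.87
run["test/accuracy"] = 0.91

run.stop()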
Want to get started with experiment tracking with Neptune? Read this article – ML Experiment Tracking: What It Is, Why It Matters, and How to Implement It.
Set up data validation or DQ pipelines
In production, data can create a variety of issues. If the statistical properties of serving data differ from those of the training data, the training data or the sampling process was faulty. Data drift might cause statistical properties to change across successive batches of data. The data might have unexpected features, some features may arrive in the wrong format, or, like the example in Eric Breck et al.’s paper, a feature may be erroneously pinned to a specific value!
Serving data becomes training data eventually, so detecting errors in data is crucial to the long-term performance of ML models. Detecting errors as soon as they happen will let your team investigate and take appropriate action.
Pandera is a data validation library that helps you do this, as well as other complex statistical validations like hypothesis testing. Here’s an example of a data schema defined using Pandera.
import pandera as pa
from azureml.core import Run

run = Run.get_context(allow_offline=True)

if run.id.startswith("OfflineRun"):
    import os

    from azureml.core.dataset import Dataset
    from azureml.core.workspace import Workspace
    from dotenv import load_dotenv

    load_dotenv()
    ws = Workspace.from_config(path=os.getenv("AML_CONFIG_PATH"))
    liko_data = Dataset.get_by_name(ws, "liko_data")
else:
    liko_data = run.input_datasets["liko_data"]

df = liko_data.to_pandas_dataframe()

# ---------------------------------
# Include code to prepare data here
# ---------------------------------

liko_data_schema = pa.DataFrameSchema(
    {
        "Id": pa.Column(pa.Int, nullable=False),
        "AccountNo": pa.Column(pa.Bool, nullable=False),
        "BVN": pa.Column(pa.Bool, nullable=True, required=False),
        "IdentificationType": pa.Column(
            pa.String,
            checks=pa.Check.isin(["NIN", "Passport", "Driver's license"]),
        ),
        "Nationality": pa.Column(
            pa.String,
            checks=pa.Check.isin(["NG", "GH", "UG", "SA"]),
        ),
        "DateOfBirth": pa.Column(
            pa.DateTime,
            nullable=True,
            checks=pa.Check.less_than_or_equal_to("2000-01-01"),
        ),
        ".*_Risk": pa.Column(
            pa.Float,
            coerce=True,
            regex=True,
        ),
    },
    ordered=True,
    strict=True,
)

run.log_table("liko_data_schema", liko_data_schema)
run.parent.log_table("liko_data_schema", liko_data_schema)

# -----------------------------------------------
# Include code to save dataframe to output folder
# -----------------------------------------------
This schema ensures that:
- ‘Id’ is an integer and never null.
- ‘BVN’ is a boolean and may be absent from some data.
- ‘IdentificationType’ is one of the identification types listed.
- ‘DateOfBirth’ is either null or no later than ‘2000-01-01’.
- Columns whose names match the ‘.*_Risk’ pattern contain data that is coercible to the float dtype.
- New data has its columns in the same order as defined in this schema. This can matter, for example, when working with the XGBoost API, which may throw an error for mismatched column order.
- No column that isn’t defined in this schema can be passed as part of serving data (strict=True).
This simple schema builds a lot of data validation functionality into the project. The defined schema can then be applied in downstream steps as follows.
liko_data_schema.validate(data_sample)
TensorFlow also offers a comprehensive data validation API, TensorFlow Data Validation (TFDV), documented here.
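For comparison, here’s a rough sketch of the typical TFDV workflow, assuming the tensorflow_data_validation package is installed (the toy dataframes below are placeholders for your real training and serving data):

import pandas as pd
import tensorflow_data_validation as tfdv

# Toy frames standing in for your real training and serving data.
train_df = pd.DataFrame({"tenure_months": [1, 12, 24], "monthly_charges": [29.9, 56.0, 80.5]})
serving_df = pd.DataFrame({"tenure_months": [3, 40], "monthly_charges": [35.0, None]})

# Describe the training data and infer a baseline schema from it.
train_stats = tfdv.generate_statistics_from_dataframe(train_df)
schema = tfdv.infer_schema(train_stats)

# Compare serving data against the baseline schema and surface any anomalies.
serving_stats = tfdv.generate_statistics_from_dataframe(serving_df)
anomalies = tfdv.validate_statistics(serving_stats, schema)
print(anomalies)  # in a notebook, TFDV also ships display helpers for anomalies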
Enable model validation across segments
Reusing models is different from reusing software. You need to tune models to fit each new scenario. To do this, you need the training pipeline. Models also decay over time and need to be retrained in order to remain useful.
Experiment tracking can help us handle the versioning and reproducibility of models, but validating models before promoting them into production is also important.
You can validate offline or online. Offline validation means producing metrics (e.g., accuracy, precision, normalized root mean squared error) on a test dataset to evaluate the model’s fitness for the business objectives on historical data. These metrics are compared against the existing production or baseline models before the promotion decision is made.
Efficient experiment tracking and metadata management give you pointers to all of these models, so you can roll back or promote seamlessly. With online validation through A/B testing, as explored in this article, you then confirm that the model performs adequately on live data.
Aside from these, you should also validate the performance of the model on various segments of data to ensure that they meet requirements. The industry is increasingly noticing the bias that machine learning systems can learn from data. A popular example is the Twitter image-cropping feature, which was demonstrated to perform inadequately for some segments of users. Validating the performance of your model for different segments of users can help your team detect and correct this type of error.
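Here’s a minimal sketch of segment-level validation, assuming a pandas DataFrame of hold-out predictions with hypothetical label, score, and segment columns:

import pandas as pd
from sklearn.metrics import roc_auc_score


def evaluate_by_segment(df: pd.DataFrame, segment_col: str) -> pd.DataFrame:
    # Compute AUC per segment so that weak segments are visible instead of averaged away.
    rows = []
    for segment, group in df.groupby(segment_col):
        rows.append({
            "segment": segment,
            "n": len(group),
            "auc": roc_auc_score(group["label"], group["score"]),
        })
    return pd.DataFrame(rows).sort_values("auc")


# Hypothetical hold-out predictions: one row per customer.
holdout = pd.DataFrame({
    "label": [0, 1, 0, 1, 1, 0, 1, 0],
    "score": [0.2, 0.9, 0.1, 0.4, 0.8, 0.3, 0.7, 0.6],
    "region": ["NG", "NG", "NG", "NG", "GH", "GH", "GH", "GH"],
})

print(evaluate_by_segment(holdout, "region"))

A simple rule such as “block promotion if any sufficiently large segment falls below the agreed threshold” turns this report into an automated validation gate.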
Keep resource utilization in check: remember that your experiments cost money
During training and in use after deployment, models require system resources — CPU, GPU, I/O, and memory. Understanding the requirements of your system during the different phases can help your team optimize the cost of your experiments and maximize your budget.
This is an area of frequent concern: companies care about their bottom line and want to get the most value out of the resources they pay for. Cloud service providers realize this too. Sireesha Muppala et al. share considerations for reducing training costs with Amazon SageMaker Debugger in their article. Microsoft Azure also lets engineers determine the resource requirements of their model prior to deployment, using the SDK.
This profiling tests the model with a provided dataset and reports recommendations for resource requirements. So, it’s important that the provided dataset is representative of what might be served when the model goes into production.
Profiling models also offer other advantages outside of cost. Sub-optimal resources may slow down training jobs or introduce latency into the operation of the model in production. These are bottlenecks that machine learning teams must identify and fix quickly.
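The managed profilers above do the heavy lifting for you. If you just want a quick, platform-agnostic view of what a training job consumes, a lightweight sketch with psutil might look like this (the sampling interval, sample count, and what you log are up to you):

import psutil


def log_resource_usage(interval_seconds: float = 5.0, samples: int = 3) -> None:
    # Print coarse CPU and memory usage a few times, e.g. while a training job is running.
    process = psutil.Process()
    for _ in range(samples):
        cpu_percent = psutil.cpu_percent(interval=interval_seconds)  # system-wide CPU utilization
        rss_mb = process.memory_info().rss / 1024 ** 2               # this process's resident memory
        print(f"cpu={cpu_percent:.1f}% rss={rss_mb:.0f}MB")


if __name__ == "__main__":
    log_resource_usage()

In practice, you would log these numbers to your experiment tracker alongside your metrics, so resource cost becomes part of every experiment comparison.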
Monitor predictive service performance
So far, the practices listed above can help you continuously deliver a robust machine learning system. In operation, there are other metrics that determine the performance of your deployed model, independent of training/serving data and model type. These metrics are arguably as important as the familiar project metrics (such as RMSE, AUC-ROC, etc.) that evaluate a model’s performance in relation to business objectives.
Users might need the output of machine learning models in real-time to ensure that decisions can be made quickly. Here, it’s vital to monitor operational metrics such as:
- Latency: measured in milliseconds. Are users guaranteed a seamless experience?
- Scalability: measured in Queries Per Second (QPS). How much traffic can your service handle at the expected latency?
- Service Update: How much downtime (service unavailability) is introduced during the update of your service’s underlying models?
For example, when fashion companies run advertising campaigns, the poor service performance of an ML recommendation system can impact conversion rates. Customers might get frustrated with service delays and move on without making a purchase. This translates to business losses.
Apache Bench is a benchmarking tool from the Apache project that lets you measure these crucial metrics and make the right provisions for your organization’s needs. It’s important to measure these metrics from the different geographical regions your service covers. Austin Gunter’s Measuring Latency with Apache Benchmark and this tutorial are also great introductions to this useful tool.
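If you’d rather start with a quick script before setting up a dedicated benchmarking tool, here’s a minimal sketch that estimates latency percentiles and rough single-client throughput against a hypothetical prediction endpoint (the URL and payload are placeholders):

import statistics
import time

import requests

ENDPOINT = "https://example.com/predict"   # placeholder prediction endpoint
PAYLOAD = {"features": [0.1, 0.2, 0.3]}    # placeholder request body


def measure_latency(n_requests: int = 100) -> None:
    latencies_ms = []
    start = time.perf_counter()
    for _ in range(n_requests):
        t0 = time.perf_counter()
        requests.post(ENDPOINT, json=PAYLOAD, timeout=10)
        latencies_ms.append((time.perf_counter() - t0) * 1000)
    elapsed = time.perf_counter() - start

    quantiles = statistics.quantiles(latencies_ms, n=100)
    # Sequential requests approximate single-client throughput, not peak QPS under load.
    print(f"p50={quantiles[49]:.1f}ms p95={quantiles[94]:.1f}ms approx_qps={n_requests / elapsed:.1f}")


if __name__ == "__main__":
    measure_latency()

For proper load testing across concurrency levels and regions, a dedicated tool like Apache Bench remains the better option.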
Choice of ML platforms
MLOps platforms can be difficult to compare. Still, your choice here can make or break your machine learning project. Your choice should be informed by:
- The team you have: their level of experience, and whether they are subject-matter experts or technical experts.
- Whether your project uses traditional machine learning or deep learning.
- The sort of data you will be working with.
- Your business objectives and budget.
- Technical requirements, such as how involved your model monitoring needs to be.
- The platform’s features and how they might evolve in the long run.
Several comparisons of ML platforms exist online to guide your choice, like the Top 12 On-Prem Tracking Tools in Machine Learning. Neptune is one of the platforms discussed. It makes collaboration easy and helps teams manage and monitor long-running experiments, whether deployed on-prem or used through the web UI. You can check out its main concepts here.
Open communication lines are important

Implementing and maintaining a machine learning system long-term means collaboration between a variety of professionals: teams of data engineers, data scientists, machine learning engineers, data visualization specialists, DevOps engineers, and software developers. UX designers and Product Managers can also affect how the product that serves your system interacts with users. Managers and Business owners have expectations that control how the performance of teams is evaluated and appreciated, while compliance professionals ensure that operations are in line with company policy and regulatory requirements.
If your machine learning system is going to keep achieving business objectives amidst evolving user and data patterns and expectations, the teams involved in its creation, operation, and monitoring must communicate effectively. Sriram Narayan explores how such multidisciplinary teams can adopt an outcome orientation in their setup and approach to business objectives in Agile IT Organization Design. Be sure to add it to your weekend reads.
Score your ML system periodically
If you know all the practices above, it’s clear that you (and your team) are committed to instituting the best MLOps practices in your organization. You deserve some applause!
Scoring your machine learning system is both a great starting point for your endeavor and for continuous evaluation as your project ages. Thankfully, such a scoring system exists. Eric Breck et al. presented a comprehensive scoring system in their paper – What’s your ML Test Score? A rubric for ML production systems. The scoring system covers features and data, model development, infrastructure as well as monitoring.

MLOps Checklist
Having a clear requirements document is a good MLOps practice. This article by Timothy Wolodzko is a good starting point for drawing up an initial set of requirements.
Once the foundation is clear, here are some things to go through as a sanity check.

Have you shortlisted a model registry?
A model registry enables data scientists to track their work for reproducible research, ML engineers to pick up the right version for deployment, and organizations to audit for accuracy, compliance, and governance.
Deduplicate effort with a feature store
A feature store lets data scientists reuse complex features across multiple models and lets ML engineers deploy models without rewriting data-transformation logic.
Test code and validate assumptions
Code should have good unit test coverage, and the tests should be extended with every version update.
Set up a CI/CD/CT pipeline to automate deployments and retraining
Integrating CI/CD into ML deployments ensures your builds are automated, so releasing a model becomes more of a business decision than a technical effort.
Monitor for drift
An ideal MLOps process includes a monitoring system that captures model and data drift, reducing the risk of performance degradation over time (see the sketch after this checklist).
Architect for reliability
Use serverless or containerization technologies for reliable uptime and on-demand scalability.
Evaluate models continually and iterate
Last but not least, continuously evaluate models by investigating anomalies in their performance and predictions, and use what you learn to improve the model.
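To make the drift check above concrete, here’s a minimal sketch that compares a feature’s training and serving distributions with a two-sample Kolmogorov-Smirnov test (the significance threshold and the way you source the two samples are assumptions to adapt to your setup):

import numpy as np
from scipy.stats import ks_2samp


def detect_drift(train_values: np.ndarray, serving_values: np.ndarray, alpha: float = 0.01) -> bool:
    # Return True if the two samples are unlikely to come from the same distribution.
    statistic, p_value = ks_2samp(train_values, serving_values)
    print(f"KS statistic={statistic:.3f}, p-value={p_value:.4f}")
    return p_value < alpha


# Illustrative data: serving values shifted relative to training values.
rng = np.random.default_rng(42)
train_tenure = rng.normal(loc=24, scale=6, size=5_000)
serving_tenure = rng.normal(loc=30, scale=6, size=1_000)

if detect_drift(train_tenure, serving_tenure):
    print("Feature drift detected: trigger an alert or a retraining pipeline.")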
Conclusion
And that’s it! The 10 practices you should definitely consider implementing are:
- Naming conventions
- Code quality checks
- Experiment — and track your experiments!
- Data validation
- Model validation across segments
- Resource utilization: remember that your experiments cost money
- Monitor predictive service performance
- Think carefully about your choice of ML platforms
- Open communication lines are important
- Score your ML system periodically
Try them out, and you’ll definitely see some improvement in your work on ML systems.
MLOps best practices FAQ
What are the best practices for MLOps?
- Use an Experiment Tracker to track all aspects of your machine-learning experiments.
- Use a model registry to manage the versions of your models
- Automate the deployment of models to production.
- Use a version control system for your code and data.
- Document your code and experiments.
- Test your models thoroughly.
- Pay attention to the security and scalability of your deployed models.
- Monitor models for performance and drift.
- Continuously improve your MLOps process with CI/CD
What are the best practices for deploying machine learning models?
- Containerize your ML model (using technologies like Docker).
- Design for scalability and performance.
- Implement monitoring and logging for performance tracking.
- Maintain version control for model versions and enable rollbacks.
- Implement automated testing to validate model functionality.
- Ensure security and privacy measures are in place.
- Use CI/CD pipelines for automated deployment.
- Document the deployment process and share knowledge.
- Establish a feedback loop for model iteration and improvement.
- Adhere to governance and compliance standards.
Keep in mind there is no one-size-fits-all approach. While some of these may not be required in your case, it’s better to plan ahead. Here are some articles that cover good deployment practices in detail and are worth checking out:
https://towardsdatascience.com/ml-model-deployment-strategies-72044b3c1410
https://www.sigmoid.com/blogs/5-best-practices-for-putting-ml-models-into-production/
What are the best practices for tracking experiments with MLflow?
- Naming your runs is an underrated best practice; you may not realize how valuable it is until late in a project.
- Use MLflow’s model staging functionality to imitate production environments and catch bugs that could arise in production before the model actually gets there.
- Use Python scripts instead of notebooks to track experiments in MLflow. Notebooks are hard to track and are better suited to experimentation and visualization than to production environments or tracking.
(Screenshots: MLflow run metadata is poorly captured when runs are tracked from an IPyKernel session, and much cleaner when tracked from a Python script.)
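As a rough sketch of the naming and script-based tracking points above, a run launched from a Python script could be tracked like this (the run name, parameters, and metric value are illustrative):

import mlflow

# A descriptive run name makes this experiment easy to find in the MLflow UI later.
with mlflow.start_run(run_name="churn-xgb-baseline"):
    mlflow.set_tag("stage", "experiment")
    mlflow.log_param("max_depth", 6)
    mlflow.log_param("learning_rate", 0.1)
    # ...train and evaluate the model here...
    mlflow.log_metric("auc", 0.87)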
You can read more about such best practices in these articles:
https://www.asigmo.com/post/mlflow-best-practices-and-lessons-learned
https://censius.ai/blogs/mlflow-best-practices#blogpost-toc-3
- Reserve storage, on-prem or in the cloud, for pipeline parameters and artifacts, as their volume grows quickly as experimentation progresses.