“…developing and deploying ML systems is relatively fast and cheap, but maintaining them over time is difficult and expensive.” – D. Sculley et al., “Hidden Technical Debt in Machine Learning Systems,” NIPS 2015
Every data scientist can relate to this quote. Perhaps you have encountered it in your search to solve a problem in one of the many moving parts of your machine learning system: data, model, or code.
Hacking together a solution usually means incurring technical debt, which grows as your system ages and/or grows in complexity. Worse, you could lose time, waste compute resources and cause production issues.
MLOps can be daunting. Thousands of courses are available to help engineers improve their machine learning skills. While it’s relatively easy to develop a model to achieve business objectives (item classification or predicting a continuous variable) and deploy it to production, operating that model in production comes with a myriad of issues.
Model performance may degrade in production for reasons such as data drift, or you might need to change the preprocessing technique. This means new models need to be shipped into production regularly to address performance decline or to improve model fairness.
This calls for continuous training and continuous monitoring in addition to the DevOps practices of continuous integration and continuous delivery. So, in this article, we’re going to explore some of the best practices engineers can follow to consistently deliver the machine learning systems their organizations need.
Naming conventions
Naming conventions aren’t new. For example, Python’s recommendations for naming conventions are included in PEP 8: Style Guide for Python Code. As machine learning systems grow, so does the number of variables.
So, if you establish a clear naming convention for your project, engineers will understand the roles of different variables, and conform to this convention as the project grows in complexity.

This practice helps mitigate the challenge posed by the Changing Anything Changes Everything (CACE) principle. It also helps team members become familiar with your project quickly. Here’s an example from a project that builds Azure Machine Learning pipelines.
from azureml.pipeline.core import PipelineData
from azureml.pipeline.steps import PythonScriptStep

# blob_datastore, aml_compute, aml_runconfig, the cleansed datasets
# (cleansed_client_data, cleansed_transactions_data) and their intermediate names
# (intermediate_data_name_client, intermediate_data_name_transactions)
# are defined earlier in the project.

intermediate_data_name_merge = "merged_ibroka_data"
merged_ibroka_data = (PipelineData(intermediate_data_name_merge, datastore=blob_datastore)
                      .as_dataset()
                      .parse_parquet_files()
                      .register(name=intermediate_data_name_merge, create_new_version=True))

mergeDataStep = PythonScriptStep(
    name="Merge iBroka Data",
    script_name="merge.py",
    arguments=[
        merged_ibroka_data,
        "--input_client_data", intermediate_data_name_client,
        "--input_transactions_data", intermediate_data_name_transactions,
    ],
    inputs=[cleansed_client_data.as_named_input(intermediate_data_name_client),
            cleansed_transactions_data.as_named_input(intermediate_data_name_transactions)],
    outputs=[merged_ibroka_data],
    compute_target=aml_compute,
    runconfig=aml_runconfig,
    source_directory="scripts/",
    allow_reuse=True,
)
print("mergeDataStep created")

intermediate_data_name_featurize = "featurized_ibroka_data"
featurized_ibroka_data = (PipelineData(intermediate_data_name_featurize, datastore=blob_datastore)
                          .as_dataset()
                          .parse_parquet_files()
                          .register(name=intermediate_data_name_featurize, create_new_version=True))

featurizeDataStep = PythonScriptStep(
    name="Featurize iBroka Data",
    script_name="featurize.py",
    arguments=[
        featurized_ibroka_data,
        "--input_merged_data", intermediate_data_name_merge,
    ],
    inputs=[merged_ibroka_data.as_named_input(intermediate_data_name_merge)],
    outputs=[featurized_ibroka_data],
    compute_target=aml_compute,
    runconfig=aml_runconfig,
    source_directory="scripts/",
    allow_reuse=True,
)
print("featurizeDataStep created")
Here, the intermediate outputs of the two steps of the pipeline are named intermediate_data_name_merge and intermediate_data_name_featurize. They follow an easily recognizable naming convention.
If another such variable, say intermediate_data_name_clean, were encountered in another part of the project, this naming convention would make it easy to understand what role it plays in the larger project.
Code quality checks
Alexander Van Tol’s article on code quality puts forward three widely accepted markers of high-quality code:
- It does what it is supposed to do
- It does not contain defects or problems
- It is easy to read, maintain and extend
These three identifiers are especially important for machine learning systems because of the CACE principle.
Frequently, real-world data fed into training pipelines doesn’t explicitly contain the outcome variable. As an example, think of an SQL database containing subscription transactions. There may not be a column that says whether a particular subscription was renewed or not. However, it’s easy to look through subsequent transactions and see whether said subscription was discontinued upon expiry.
This computation of the outcome variable may happen in one step of the training pipeline. If there’s any issue with the function that performs this computation, the model will be fitted on the wrong training data, and won’t do well in production. Code quality checks (unit testing, in this case) keep crucial functions like this doing what they’re supposed to.
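As a minimal sketch of such a check, here is what a unit test for an outcome-variable computation could look like (the compute_renewal_label helper, the column names, and the sample data are all hypothetical):

import pandas as pd


def compute_renewal_label(transactions: pd.DataFrame) -> pd.Series:
    # Hypothetical helper: a transaction counts as "renewed" (1) if a later
    # transaction exists for the same subscription, otherwise 0.
    last_dates = transactions.groupby("subscription_id")["transaction_date"].transform("max")
    return (transactions["transaction_date"] < last_dates).astype(int)


def test_compute_renewal_label():
    transactions = pd.DataFrame({
        "subscription_id": [1, 1, 2],
        "transaction_date": pd.to_datetime(["2021-01-01", "2021-02-01", "2021-01-15"]),
    })
    # Subscription 1 was renewed once (its first transaction has a successor);
    # its latest transaction and subscription 2's only transaction were not renewed.
    assert compute_renewal_label(transactions).tolist() == [1, 0, 0]

Run as part of your CI pipeline (with pytest, for example), a test like this catches regressions in the labelling logic before a mislabelled dataset ever reaches training.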
Still, code quality checks extend past unit testing. Your team stands to benefit from using linters and formatters to enforce a particular code style on your machine learning project. This way you eliminate bugs before they reach production, detect code smells (dead code, duplicate code etc.), and speed up the code review. This is a boost for your CI process.
It’s good practice to include this code quality check as the first step of a pipeline triggered by a pull request. You can see an example of this in the MLOps with AzureML template project. If you’d like to embrace linters as a team, here’s a great article to get you started – Linters aren’t in your way. They’re on your side.
Experiment — and track your experiments!
Feature engineering, model architecture, and hyperparameter search all keep evolving. ML teams always aim to deliver the best possible system given the current state of technology and the evolving patterns in the data.
On one hand, this means staying on top of the latest ideas and baselines; on the other, it means experimenting with those ideas to see if they improve the performance of your machine learning system.
Experimenting may involve trying out different combinations of code (preprocessing, training and evaluation methods), data, and hyperparameters. Each unique combination produces metrics that you need to compare to your other experiments. Additionally, changes in the conditions (the environment) the experiment is run in may change the metrics you obtain.
It can quickly become tedious to recall which change offered which benefit, and what actually worked. Using a modern tool (Neptune is a great one!) to track your experiments improves your productivity when you try out new processes, and it makes your work reproducible.
Want to get started with experiment tracking with Neptune? Read this article – ML Experiment Tracking: What It Is, Why It Matters, and How to Implement It.
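For instance, a minimal sketch of experiment tracking with Neptune could look like the snippet below (the project name, parameters, and metric values are hypothetical, and the snippet assumes the neptune client is installed and NEPTUNE_API_TOKEN is set in your environment):

import neptune

# Hypothetical project; the API token is read from the NEPTUNE_API_TOKEN environment variable.
run = neptune.init_run(project="my-workspace/ibroka-churn")

# Log the exact combination of code, data and hyperparameters used in this experiment
run["parameters"] = {"learning_rate": 0.01, "max_depth": 6, "preprocessing": "v2"}
run["data/train_dataset_version"] = "merged_ibroka_data:12"

# Log metrics as they are produced so experiments can be compared later
for val_auc in [0.71, 0.74, 0.78]:
    run["metrics/val_auc"].append(val_auc)

run.stop()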
Data validation
In production, data can create a variety of issues. The statistical properties of serving data may drift away from those of the training data, or the training data (or the sampling process that produced it) may have been faulty in the first place. The data might contain unexpected features, some features may be passed in the wrong format or, as in the example in Eric Breck et al.’s paper, a feature may be erroneously pinned to a specific value!
Serving data becomes training data eventually, so detecting errors in data is crucial to the long-term performance of ML models. Detecting errors as soon as they happen will let your team investigate and take appropriate action.
Pandera is a data validation library that helps you do this, and it also supports more complex statistical validation like hypothesis testing. Here’s an example of a data schema defined using Pandera.
import pandera as pa
from azureml.core import Run

run = Run.get_context(allow_offline=True)

if run.id.startswith("OfflineRun"):
    import os
    from azureml.core.dataset import Dataset
    from azureml.core.workspace import Workspace
    from dotenv import load_dotenv

    load_dotenv()
    ws = Workspace.from_config(path=os.getenv("AML_CONFIG_PATH"))
    liko_data = Dataset.get_by_name(ws, "liko_data")
else:
    liko_data = run.input_datasets["liko_data"]

df = liko_data.to_pandas_dataframe()

# ---------------------------------
# Include code to prepare data here
# ---------------------------------

liko_data_schema = pa.DataFrameSchema({
    "Id": pa.Column(pa.Int, nullable=False),
    "AccountNo": pa.Column(pa.String, nullable=False),
    "BVN": pa.Column(pa.Bool, nullable=True, required=False),
    "IdentificationType": pa.Column(pa.String, checks=pa.Check.isin([
        "NIN", "Passport", "Driver's license"
    ])),
    "Nationality": pa.Column(pa.String, pa.Check.isin([
        "NG", "GH", "UG", "SA"
    ])),
    "DateOfBirth": pa.Column(
        pa.DateTime,
        nullable=True,
        checks=pa.Check.less_than_or_equal_to('2000-01-01')
    ),
    ".*_Risk": pa.Column(   # regex key: applies to every column whose name contains "_Risk"
        pa.Float,
        coerce=True,
        regex=True
    )
}, ordered=True, strict=True)

run.log_table("liko_data_schema", liko_data_schema)
run.parent.log_table("liko_data_schema", liko_data_schema)

# -----------------------------------------------
# Include code to save dataframe to output folder
# -----------------------------------------------
This schema ensures that:
- ‘Id’ is an integer and never null
- ‘BVN’ is a boolean which may be absent in some data
- ‘IdentificationType’ is one of the three options listed
- ‘DateOfBirth’ is either null or less than ‘2000-01-01’
- Columns containing the string “_Risk” contain data that is coercible to float dtype.
- New data has columns in the same order as defined in this schema. This may be important, for example, when working with the XGBoost API which may throw an error for mismatched column order.
- Because the schema is strict, no column other than those defined in it can be passed as part of serving data.
This simple schema builds a lot of data validation functionality into the project. The defined schema can then be applied in downstream steps as follows.
liko_data_schema.validate(data_sample)
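If you would rather collect every violation in a batch instead of failing on the first one, pandera’s lazy validation can report them all at once. A minimal sketch, assuming data_sample is the batch being checked:

import pandera as pa

try:
    validated_sample = liko_data_schema.validate(data_sample, lazy=True)
except pa.errors.SchemaErrors as err:
    # failure_cases is a dataframe listing each failed check, the column involved
    # and the offending values, which is handy for logging and debugging.
    print(err.failure_cases)
    raise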
TensorFlow also offers a comprehensive data validation API, documented here.
Model validation across segments
Reusing models is different from reusing software. You need to tune models to fit each new scenario. To do this, you need the training pipeline. Models also decay over time, and need to be retrained in order to remain useful.
Experiment tracking can help us handle the versioning and reproducibility of models, but validating models before promoting them into production is also important.
You can validate offline or online. Offline validation involves producing metrics (e.g. accuracy, precision, normalized root mean squared error, etc.) on a test dataset to evaluate the model’s fitness for the business objectives using historical data. These metrics are compared to those of the existing production/baseline models before the promotion decision is made.
Proper experiment tracking and metadata management gives you pointers to all of these models, and you can do a rollback or promotion seamlessly. With online validation through A/B testing, as explored in this article, you then establish the adequate performance of the model on live data.
Aside from these, you should also validate the performance of the model on various segments of data to ensure that they meet requirements. The industry is increasingly noticing the bias that machine learning systems can learn from data. A popular example is the Twitter image-cropping feature, which was demonstrated to perform inadequately for some segments of users. Validating the performance of your model for different segments of users can help your team detect and correct this type of error.
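As a minimal sketch of segment-level validation (the metric, column names, and threshold below are hypothetical), you can compute your evaluation metric per segment and block promotion if any segment falls below an agreed floor:

import pandas as pd
from sklearn.metrics import roc_auc_score


def validate_across_segments(eval_df: pd.DataFrame, segment_col: str, min_auc: float = 0.75) -> None:
    # eval_df holds y_true, y_score and a segment column (e.g. "Nationality").
    # Assumes every segment contains examples of both classes.
    for segment, group in eval_df.groupby(segment_col):
        auc = roc_auc_score(group["y_true"], group["y_score"])
        print(f"{segment_col}={segment}: AUC={auc:.3f} (n={len(group)})")
        if auc < min_auc:
            raise ValueError(
                f"Model underperforms for segment {segment!r}: AUC {auc:.3f} < {min_auc}"
            )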
Resource utilization: remember that your experiments cost money
During training and in use after deployment, models require system resources — CPU, GPU, I/O and memory. Understanding the requirements of your system during the different phases can help your team optimize the cost of your experiments and maximize your budget.
This is an area of frequent concern. Companies care about profit, so they want to make the most of their resources to deliver value. Cloud service providers realize this too. Sireesha Muppala et al. share considerations for reducing training costs with Amazon SageMaker Debugger in their article. Microsoft Azure also allows engineers to determine the resource requirements of their model prior to deployment using the SDK.
This profiling tests the model with a provided dataset, and reports recommendations for resource requirements. So, it’s important that the provided dataset is representative of what might be served when the model goes into production.
Profiling models also offers advantages beyond cost. Sub-optimal resources may slow down training jobs or introduce latency into the operation of the model in production. These are bottlenecks that machine learning teams must identify and fix quickly.
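As a minimal sketch of profiling with the Azure Machine Learning SDK (assuming a workspace, a registered model, a scoring script, an environment, and a representative tabular dataset already exist; all names below are hypothetical):

from azureml.core import Dataset, Workspace
from azureml.core.model import InferenceConfig, Model

ws = Workspace.from_config()

model = Model(ws, name="ibroka-churn-model")                    # hypothetical registered model
inference_config = InferenceConfig(entry_script="score.py",     # scoring script in scripts/
                                   source_directory="scripts/",
                                   environment=scoring_environment)  # assumed AzureML Environment

# A small, representative sample of what the service will receive in production
input_dataset = Dataset.get_by_name(ws, "liko_data_sample")

profile = Model.profile(ws,
                        profile_name="ibroka-churn-profile",
                        models=[model],
                        inference_config=inference_config,
                        input_dataset=input_dataset)
profile.wait_for_completion(show_output=True)
print(profile.get_details())    # recommended CPU and memory for deployment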
Monitor predictive service performance
So far, the practices listed above can help you continuously deliver a robust machine learning system. In operation, there are other metrics that determine the performance of your deployed model, independent of training/serving data and model type. These metrics are arguably as important as the familiar project metrics (such as RMSE, AUC-ROC etc.) that evaluate a model’s performance in relation to business objectives.
Users might need the output of machine learning models in real-time to ensure that decisions can be made quickly. Here, it’s vital to monitor operational metrics such as:
- Latency: measured in milliseconds. Are users guaranteed a seamless experience?
- Scalability: measured in Queries Per Second (QPS). How much traffic can your service handle at the expected latency?
- Service Update: How much downtime (service unavailability) is introduced during the update of your service’s underlying models?
For example, when fashion companies run advertising campaigns, poor service performance of an ML recommendation system can impact the conversion rates. Customers might get frustrated with service delay and move on without making a purchase. This translates to business losses.
Apache Bench is a tool from the Apache project that lets you measure these crucial metrics and make the right provisions for your organization’s needs. It’s important to measure these metrics across the different geographical locations your service covers. Austin Gunter’s Measuring Latency with Apache Benchmark and this tutorial are also great introductions to this useful tool.
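If you want a quick latency estimate straight from Python before reaching for ab, a rough sketch against a hypothetical scoring endpoint might look like this (not a substitute for a proper load test):

import statistics
import time

import requests

ENDPOINT = "https://my-service.example.com/score"    # hypothetical scoring endpoint
payload = {"Id": 1, "AccountNo": "0123456789"}        # hypothetical request body

latencies_ms = []
for _ in range(100):
    start = time.perf_counter()
    requests.post(ENDPOINT, json=payload, timeout=5)
    latencies_ms.append((time.perf_counter() - start) * 1000)

latencies_ms.sort()
p95_index = int(0.95 * len(latencies_ms))
print(f"p50={statistics.median(latencies_ms):.1f} ms, p95={latencies_ms[p95_index]:.1f} ms")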
Think carefully about your choice of ML platforms
MLOps platforms can be difficult to compare. Still, your choice here can make or break your machine learning project. Your choice should be informed by:
- The team you have: their level of experience, and whether they are subject matter experts or technical experts.
- Whether your project uses traditional machine learning or deep learning.
- The sort of data you will be working with.
- Your business objectives and budget.
- Technical requirements, such as how involved your model monitoring needs to be.
- The platform’s features and how they might evolve in the long run.
Several comparisons of ML platforms exist online to guide your choice, like Top 12 On-Prem Tracking Tools in Machine Learning. Neptune is one of the platforms discussed. It makes collaboration easy and helps teams manage and monitor long-running experiments, whether deployed on-prem or used through the web UI. You can check out its main concepts here.
Open communication lines are important
Implementing and maintaining a machine learning system long-term means collaboration between a variety of professionals: teams of data engineers, data scientists, machine learning engineers, data visualization specialists, DevOps engineers, and software developers. UX designers and product managers can also affect how the product that serves your system interacts with users. Managers and business owners have expectations that control how the performance of teams is evaluated and appreciated, while compliance professionals ensure that operations are in line with company policy and regulatory requirements.
If your machine learning system is going to keep achieving business objectives amidst evolving user and data patterns and expectations, then the teams involved in its creation, operation, and monitoring must communicate effectively. Sriram Narayan explores how such multidisciplinary teams can adopt an outcome orientation in their setup and approach to business objectives in Agile IT Organization Design. Be sure to add it to your weekend reads!
Score your ML system periodically
If you know all the practices above, it’s clear that you (and your team) are committed to instituting the best MLOps practices in your organization. You deserve some applause!
Scoring your machine learning system is both a great starting point for your endeavour and a useful tool for continuous evaluation as your project ages. Thankfully, such a scoring system exists. Eric Breck et al. presented a comprehensive scoring system in their paper – What’s your ML Test Score? A rubric for ML production systems. The scoring system covers features and data, model development, infrastructure, as well as monitoring.
Conclusion
And that’s it! The 10 practices you should definitely consider implementing are:
- Naming conventions
- Code quality checks
- Experiment — and track your experiments!
- Data validation
- Model validation across segments
- Resource utilization: remember that your experiments cost money
- Monitor predictive service performance
- Think carefully about your choice of ML platforms
- Open communication lines are important
- Score your ML system periodically
Try them out, and you’ll definitely see some improvement in your work on ML systems.