Machine learning is on the rise. With that, new issues keep popping up, and ML developers along with tech companies keep building new tools to take care of these issues.
If we look at ML in a very basic way, we can say that it is conceptually software with a bit of added intelligence, but unlike traditional software, ML is experimental in nature. Compared to traditional software development, it has some new components in the mix: robust data, model architecture, model code, hyperparameters, and features, just to name a few. So, naturally, the tools and development cycles are different, too. Software has DevOps, machine learning has MLOps.
If it sounds unfamiliar, here’s a short overview of DevOps and MLOps:
DevOps is a set of practices for developing, testing, deploying, and operating large-scale software systems. With DevOps, development cycles became shorter, deployment velocity increased, and system releases became auditable and dependable.
MLOps is a set of practices for collaboration and communication between data scientists and operations professionals. Applying these practices increases end-quality, simplifies the management process, and automates the deployment of machine learning and deep learning models in large-scale production environments. It makes it easier to align models with business needs and regulatory requirements.
The key phases of MLOps are:
- Data gathering
- Data analysis
- Data transformation/preparation
- Model development
- Model training
- Model validation
- Model serving
- Model monitoring
- Model re-training
We’re going to do a deep dive into this process, so grab a cup of your favorite drink and let’s go!
What is Machine Learning Model Management?
Model management is a part of MLOps. ML models should be consistent, and meet all business requirements at scale. To make this happen, a logical, easy-to-follow policy for model management is essential. ML model management is responsible for development, training, versioning and deployment of ML models.
Note: Versioning also includes data, so we can track which dataset, or subset of the dataset, we used to train a particular version of the model.
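One lightweight way to make dataset versioning concrete is to fingerprint the data itself. Here is a minimal sketch (my own illustration, not any particular tool's API) that hashes a set of data files into a short, stable version string you can log next to each trained model:

```python
import hashlib
from pathlib import Path

def dataset_version(files):
    """Fingerprint a dataset by hashing file contents in a stable (sorted) order.

    Any change to any file produces a different version string, so a model
    logged with this ID can always be traced back to the exact data it saw.
    """
    digest = hashlib.sha256()
    for path in sorted(files):
        digest.update(Path(path).read_bytes())
    return digest.hexdigest()[:12]
```

Real data version control tools do much more (diffs, remotes, pipelines), but even this tiny fingerprint is enough to answer "which data trained this model?".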
When researchers work on novel ML models, or apply them to a new domain, they run countless experiments (model training & testing) with different model architectures, optimizers, loss functions, parameters, hyperparameters, and data. They use these experiments to get to the model configuration that generalizes best, or has the best performance-to-accuracy trade-off on the dataset.
But, without a way to track model performance and configurations in different experiments, all hell can (and will) break loose, because we won’t be able to compare and choose the best solution. Even if it’s just one researcher experimenting independently, keeping track of all experiments and results is hard.
That’s why we do model management. It lets us, our teams and our organizations:
- Proactively address common business concerns (such as regulatory compliance);
- Enable reproducible experiments by tracking metrics, losses, code, data and model versioning;
- Package and deliver models in repeatable configurations to support reusability.
Why does Machine Learning Model Management matter?
As I mentioned previously, model management is a fundamental part of any ML pipeline (MLOps). It makes it easier to manage the ML life-cycle, from creation, configuration, experimentation, and experiment tracking, all the way to model deployment.
Now, let’s go a little bit deeper, by making a clear distinction between different parts of ML Model Management. It is important to notice that within ML Model management we manage two things:
- Models: Here we take care of model packaging, model lineage, model deployment and deployment strategies (A/B testing, etc.), monitoring, and model retraining (which happens when the deployed model's performance drops below a set threshold).
- Experiments: Here we take care of logging training metrics, loss, images, text, or any other metadata you might have, as well as code, data, and pipeline versioning.
Without model management, data science teams would have a very hard time creating, tracking, comparing, recreating, and deploying models.
The alternative to model management is ad-hoc practices, which lead researchers to create ML projects that are not repeatable, and that are unsustainable, unscalable, and disorganized.
Moreover, research conducted by Amy X. Zhang (MIT) et al. on how data science workers collaborate shows that teams collaborate extensively when leveraging ML to extract insights from data, rather than individual data scientists working alone. To collaborate effectively, they employ the best collaborative practices (i.e. documentation, code versioning, and so on) and tools, with the former being highly dependent on the latter.
MLOps facilitates collaboration, but most of today’s understanding of data science collaboration focuses only on the perspective of the data scientist, and on building tools to support globally dispersed and asynchronous collaboration among data scientists, such as version control of code. The technical collaborations afforded by such tools only scratch the surface of the many ways collaboration may happen within a data science team, such as:
- When stakeholders discuss the framing of an initial problem before any code is written or data collected
- Commenting on experiments
- Taking over someone else’s notebook or code as a baseline to build upon
- Researchers and Data Scientists training, evaluating, and tagging models so that an ML Engineer knows that a model should be reviewed (i.e. A/B tested) and promoted to production (model deployment)
- Having a shared repository where business stakeholders can review production models (A.K.A Model Registry)
What is the extent of collaboration on data science teams?
Rates of Collaboration: Among the five data science roles surveyed in the study, three reported collaboration at rates of 95% or higher. As you can see, these are the core roles in an ML team.
The study also shows that Researchers, Data Scientists, and ML Engineers collaborate extensively and play a key role throughout the development, training, evaluation (i.e. accuracy, performance, bias), versioning, and deployment of ML models (ML Model Management).
Not convinced yet? Here are six more reasons why model management matters:
- Allows for a single source of truth;
- Allows for versioning of the code, data, and model artifacts for benchmarking and reproducibility;
- It’s easier to debug and mitigate problems (i.e. overfitting, underfitting, performance and/or bias issues), making the ML solution easily traceable and compliant with regulations;
- You can do faster, better research and development;
- Teams become efficient and have a clear sense of direction.
- ML model management can facilitate intra-team and/or inter-team collaboration around code, data, and documentation through the use of various best practices and tools (JupyterLab, Colab, Neptune.ai, MLflow, Sagemaker, etc.).
ML Model Management components
Before we continue, here is a glossary of the common components of an ML model management workflow:
- Data Versioning: Version control systems help developers manage changes to source code. Data version control is a set of tools and processes that adapts version control to the data world, managing changes to models in relation to datasets, and vice-versa.
- Code Versioning/Notebook checkpointing: It is used to manage changes to the model’s source code.
- Experiment Tracker: It is used for collecting, organizing, and tracking model training/validation information/performance across multiple runs with different configurations (lr, epochs, optimizers, loss, batch size and so on) and datasets (train/val splits and transforms).
- Model Registry: Simply a centralized tracking system for trained, staged, and deployed ML models.
- Model Monitoring: It is used to track the model’s inference performance and identify any signs of serving skew, which occurs when data changes cause the deployed model’s performance to degrade below the score/accuracy it displayed in the training environment.
Now that we know the different components of model management and what they do, let’s look into some of the best practices.
Best practices for Machine Learning Model Management
The following is a list of ML model management best practices:
- Keep the first model simple and get the infrastructure right
- Starting with an interpretable model makes debugging easier
- Capture the Training Objective in a Metric that is Easy to Measure and Understand
- Actively Remove or Archive Features That are Not Used
- Peer Review Training Scripts
- Enable Parallel Training Experiments
- Automate Hyper-Parameter Optimisation
- Continuously Measure Model Quality and Performance
- Use Versioning for Data, Model, Configurations and Training Scripts
- Plan to launch and iterate
- Automate Model Deployment
- Continuously Monitor the Behaviour of Deployed Models
- Enable Automatic Rollbacks for Production Models
- When performance plateaus, look for qualitatively new sources of information to add rather than refining existing signals
- Enable Shadow Deployment
- Keep ensembles simple
- Log Production Predictions with the Model’s Version, Code Version and Input Data
- Human Analysis of the System & Training-Serving Skew
ML Model Management vs Experiment Tracking
Experiment tracking is a part of model management, so it’s also a part of the larger MLOps approach. Experiment tracking is about collecting, organizing, and tracking model training/validation information across multiple runs with different configurations (hyperparameters, model size, data splits, parameters, etc).
As mentioned earlier, ML/DL is experimental in nature, and we use experiment tracking tools for benchmarking different models.
Experiment tracking tools have 3 main features:
- Logging: log experiment metadata (metrics, loss, configurations, images and so on);
- Version Control: track both data and model versions, which is very useful in a production environment and can help with debugging and future improvements;
- Dashboard: visualize all logged and versioned data, use visual components (graphs) to compare performance and rank different experiments.
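To make the logging feature above concrete, here is a toy, file-based run logger of my own (a hypothetical sketch, not the API of Neptune, MLflow, or any real tracker) that captures the essentials: a run ID, parameters, and step-wise metrics, written to one JSON file per run:

```python
import json
import time
import uuid
from pathlib import Path

class RunLogger:
    """Toy experiment tracker: one JSON file per run in a local directory."""

    def __init__(self, root="runs"):
        self.root = Path(root)
        self.root.mkdir(exist_ok=True)
        self.run = {
            "id": uuid.uuid4().hex[:8],   # unique run identifier
            "start": time.time(),
            "params": {},                 # configuration (lr, epochs, ...)
            "metrics": [],                # step-wise metrics and losses
        }

    def log_params(self, **params):
        self.run["params"].update(params)

    def log_metric(self, name, value, step):
        self.run["metrics"].append({"name": name, "value": value, "step": step})

    def close(self):
        """Persist the run so it can be queried and compared later."""
        path = self.root / f'{self.run["id"]}.json'
        path.write_text(json.dumps(self.run, indent=2))
        return path
```

A real tracker adds the version control and dashboard layers on top of exactly this kind of logged metadata.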
How to implement ML Model Management
Before we move on, let me tell you a short story.
Last year I had a lot of problems with some of my customers because I didn’t track my experiments:
- I couldn’t compare different experiments effectively and did everything from memory, so projects got delayed.
- I relied heavily on ensembling to try to patch the flaws of the individual models which only partially worked but mainly led nowhere.
- Not logging the results of experiments also created problems long term, where I couldn’t recall the performance of previous versions of the model.
- Deploying the right model was tricky because it was never clear which one was the best, which code, transformations and data was used.
- Reproducibility was impossible.
- CI/CD and CT were impossible to implement with such artisanal Model Management.
I did some research, found out about ML model management, and decided to try an actual experiment tracking tool to speed up my process. Now, I don’t even start a project without my favorite experiment tracking tool, Neptune.
Note: During the creation of this article, Neptune.ai’s flagship product has evolved past a simple experiment tracking tool and become something much bigger; more on this in the tools section.
I keep using it in both production and research, for the custom ML model projects I develop for my customers, and in my final-year CSE degree project.
There are many other tools out there, some of which are full-blown platforms for managing the whole ML lifecycle, including experimentation, reproducibility, deployment, and a central model registry. We’ll discuss these tools in a bit.
So, after using an experiment tracking tool coupled with a model lifecycle platform (in my case, MLflow) on projects with different scales and needs, I found 4 ways of implementing ML model management:
- Logging only
- Logging + Model and Data version control
- Logging + Code, Model and Data version control
- Logging + Code, Model and Data version control + Model deployment and monitoring
I call this ad-hoc research model management. At this level, you’re just using an experiment tracking tool for logging. Great for beginners starting out with ML, or advanced researchers doing rapid prototyping to prove if a hypothesis is worth pursuing.
This level allows individuals, teams, and organizations to record and query their experiments:
- Metrics (accuracy, IoU, Bleu score and so on)
- Loss (MSE, BCE, CE and so on)
- Config (parameters, hyperparameters)
- Model performance results from training and testing
- Ad-hoc data science
- Research- and rapid prototype-driven
- No data versioning
- No model versioning
- No notebook checkpoint
- No CI/CD Pipeline
- Lack of Reproducibility
Usually, we data scientists enjoy running multiple experiments to test different ideas, code, model configurations, and datasets. At this level, this is quite challenging.
- First, you don’t follow any DS project management methodology that would give you a clear direction. Without standardised methodologies for managing data science projects, you will often rely on ad hoc practices that are not repeatable, not sustainable, and disorganized.
- Second, datasets are constantly being updated, so even though you log the metrics, loss and configuration, you don’t know which version of the dataset was used to train a specific model.
- Third, code also might change with each experiment run, so despite saving all the model configuration, you might not know which code was used in which experiment.
- Fourth, even if you save the model’s weights, you might not know which model was trained using a specific configuration and dataset.
All of these challenges make it impossible to reproduce the results of any particular experiment. In order to address the challenges of this level, a good start is to add versioning to our models and data – some experiment tools do this out-of-the-box. This way we can make partial reproducibility possible.
I call this partial model management. Generally used for well-structured teams doing rapid prototyping. At this level, besides experiment tracking, you’re also storing the model and its metadata (configuration), as well as the dataset or data split used to train it, in a central repository that will be used as a single source of truth.
- Has data versioning
- Has model versioning
- Experiments are partially reproducible
- Ad-hoc data science
- Research and rapid prototype-driven
- No CI/CD pipeline
- Lack of reproducibility
- No notebook checkpointing
This level is good for testing ideas quickly without fully committing to any of them. It might work great in a research setting, where the goal is to just try out interesting ideas, and compare the experiment across different individuals, teams or companies. We’re not yet thinking about shipping them to production.
Although we can reproduce the experiment from the model metadata and dataset used to train it, at this level we still haven’t fully solved reproducibility. We just partially solved it. In order to go full circle, we need one more component – notebook checkpointing, so that we can track code changes.
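A cheap way to get that code-tracking component is to record the current git commit with every run. The helper below is an illustrative sketch (my own, not any tracker's built-in API); it shells out to git and degrades gracefully outside a repository:

```python
import subprocess

def code_version():
    """Return the current git commit hash (plus a dirty flag for uncommitted
    changes) to store alongside each experiment run.

    Falls back to 'unknown' when git is missing or we're not in a repo.
    """
    try:
        sha = subprocess.check_output(
            ["git", "rev-parse", "HEAD"],
            text=True, stderr=subprocess.DEVNULL).strip()
        dirty = bool(subprocess.check_output(
            ["git", "status", "--porcelain"],
            text=True, stderr=subprocess.DEVNULL).strip())
        return sha + ("-dirty" if dirty else "")
    except (subprocess.CalledProcessError, FileNotFoundError):
        return "unknown"
```

Logging `code_version()` with the model and data versions is what closes the reproducibility loop: given a run, you can check out the exact code that produced it.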
I call this semi-complete model management. It’s great for individuals, teams and companies who want to not only quickly test their hypothesis, but also deploy their models to a production environment.
This level allows individuals and organizations to keep a full history of experiments by storing and versioning their notebooks/code, data and model, besides just logging metadata. This takes us full circle, making reproducibility a reality and easy to achieve regardless of the ML/DL frameworks or toolset used. At this level, you usually also apply standardised methodologies for managing data science projects.
- Has data versioning
- Has model versioning
- Has notebook checkpointing
- Experiments are fully reproducible
- Coupled with a DS project management approach
- No CI/CD pipeline
You have automated everything at this level, except one thing: model deployment. This creates stress. Every time you have a new trained model ready for deployment, you have to manually deploy it. In order to complete the ML model management pipeline, you need to integrate CI/CD.
I call this end-to-end model management. At this level, you have a completely automated pipeline, from model development, versioning, to deployment. This level offers a production-grade setup, and is great for individuals, teams and organizations looking for a complete, automated workflow. Once you set it up, you don’t have to do ops work anymore. You can focus on tweaking and improving the model and data sources.
- Has data versioning
- Has model versioning
- Has notebook checkpointing
- Experiments are fully reproducible
- Coupled with a DS project management approach
- CI/CD pipeline
- No CT pipeline
There is only one thing missing at this point: a way to continuously monitor deployed models. Also known as a CT (continuous training) pipeline, it’s used to monitor a deployed model, and automatically retrain and serve a new model if the currently deployed model’s performance drops below a set threshold. Take a computer vision model, like ResNet, in a production environment: adding CT would be as simple as monitoring and logging the model’s inference inputs, predictions, and performance metrics.
To add this functionality to the mix, you can re-use the same code from Level-0 or Level-1 for logging metadata during training, and use it for inference.
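As a rough illustration of the monitoring half of CT (my own sketch, with a hypothetical threshold and window size, not any vendor's API): track accuracy over a sliding window of labelled predictions and flag the model for retraining once it dips below the threshold.

```python
from collections import deque

class DriftMonitor:
    """Sliding-window accuracy monitor for a deployed model.

    Flags retraining when accuracy over the last `window` labelled
    predictions drops below `threshold`.
    """

    def __init__(self, threshold=0.9, window=100):
        self.threshold = threshold
        self.hits = deque(maxlen=window)  # 1 = correct, 0 = wrong

    def record(self, prediction, label):
        """Call once per production prediction whose true label is known."""
        self.hits.append(int(prediction == label))

    def needs_retraining(self):
        if len(self.hits) < self.hits.maxlen:
            return False  # not enough evidence yet
        return sum(self.hits) / len(self.hits) < self.threshold
```

In a full CT pipeline, `needs_retraining()` returning `True` would kick off the automated training job, rather than paging a human.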
Tools like Neptune and MLflow let you install their software locally, so you can add this capability to your deployment server. Neptune is more robust here, and offers a second option: a lightweight web version of their software for both individuals and teams, so there’s no need to install and configure anything. Just create a new project on their dashboard, add a few lines to your deployment code, and it’s done.
Building vs using existing ML Model Registry tools
What is Machine Learning (ML) model registry?
An ML model registry is simply a centralized tracking system for trained, staged and deployed ML models. It also tracks who created the model, as well as the data used to train it. It does this by using a database to store model lineage, versioning, metadata and configuration.
It’s relatively easy to build your own simple model registry. You can do it with a few native or cloud services, like an AWS S3 bucket and a database (PostgreSQL, MongoDB…), and by writing a simple Python API to make it easy to update the database records when changes or updates occur.
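To give a feel for how small the core of such a DIY registry can be, here is a minimal sketch using SQLite instead of a full RDBMS (table schema, column names, and S3-style URIs are all my own illustrative choices):

```python
import sqlite3

class ModelRegistry:
    """Minimal model registry: tracks model name, auto-incremented version,
    artifact location (e.g. an S3 key), dataset version, and lifecycle stage.
    """

    def __init__(self, db=":memory:"):
        self.conn = sqlite3.connect(db)
        self.conn.execute("""CREATE TABLE IF NOT EXISTS models (
            name TEXT, version INTEGER, artifact_uri TEXT,
            dataset_version TEXT, stage TEXT,
            PRIMARY KEY (name, version))""")

    def register(self, name, artifact_uri, dataset_version):
        """Store a new model version in the 'staging' stage."""
        cur = self.conn.execute(
            "SELECT COALESCE(MAX(version), 0) + 1 FROM models WHERE name=?",
            (name,))
        version = cur.fetchone()[0]
        self.conn.execute(
            "INSERT INTO models VALUES (?, ?, ?, ?, 'staging')",
            (name, version, artifact_uri, dataset_version))
        self.conn.commit()
        return version

    def promote(self, name, version, stage="production"):
        """Move a registered model version to another stage."""
        self.conn.execute(
            "UPDATE models SET stage=? WHERE name=? AND version=?",
            (stage, name, version))
        self.conn.commit()
```

This is exactly the kind of thing that is easy to prototype but expensive to harden, which is the point of the build-vs-buy discussion below.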
Although a model registry is relatively easy to build, does that mean you should build one? Is it really worth your time, money, and resources?
To answer these questions, let’s first look at the reasons why you might want to build your own ML model registry:
- Privacy: Your data can’t leave your premises.
- Curiosity: Like me, you enjoy building things.
- Business: You run or work for a company that builds ML tools, and you want to add it to an existing product, or as a new service for customers.
- Cost: Existing tools are too expensive for your budget.
- Performance: Existing tools don’t meet your performance requirements.
All valid reasons, except maybe cost, because most existing tools are open-source or freemium.
If your concern is performance, some tools offer great performance because they offer dedicated cloud server instances, with very little setup on your part (like Neptune and Comet).
Now, if your concern is privacy, most tools also offer an on-prem version of their software, which you can download and install in your organisation’s server to get full control over the data coming in and out. This way you can comply with laws and regulations, and keep your data safe.
In my honest opinion, there is a common misconception when it comes to build vs buy. It's something that more mature teams/devs usually understand right off the bat, but the ML community at large still doesn’t really get.
The cost of hosting, maintaining, documenting, fixing, updating and adjusting the open-source software is usually orders of magnitude larger than the cost of vendor tools.
The thing is, it is usually relatively easy to build a simple, non-scalable, undocumented system for yourself…
… but going from this to a system that your entire team can work on very quickly becomes awfully expensive.
Also, when you decide to build it yourself (not even open-source it), you will end up with someone needing to build and maintain it, and ML engineer and DevOps salaries are not cheap.
Generally, there is a good rule of thumb: if the system (like an ML model registry) is not your core business, and it usually isn’t, then you should focus on your core business (for example, building models for autonomous cars) and hire or buy a solution for the parts where you don't build your competitive advantage.
Think of it this way: would you go and build Gmail just because you can?
Or a mail-sending system like Mailchimp?
Or a CMS like WordPress?
Some companies do, even though it is not their business. And it is usually a big mistake, as you end up focusing on building shovels rather than digging for gold :).
It makes the most sense to build your own model registry if it’s just for fun or if you run a company that builds MLOps tools like MLflow or Neptune.
In many other cases, it’s just not a reasonable investment.
Companies have invested billions of dollars to create great, free and/or premium tools. Most of these tools you can easily extend to fit your own use case, saving your time, money, resources and headaches.
Now, let’s take a detailed look at some of the most popular tools.
Tools for Machine Learning Model Management
Keep in mind, I have my personal preference when it comes to the tools described below, but I tried to be as objective as possible.
MLflow is an open-source platform for managing the whole machine learning lifecycle (MLOps). Experimentation, reproducibility, deployment, central model registry, it does it all. MLflow is suitable for individuals and for teams of any size.
The tool is library-agnostic. You can use it with any machine learning library, and any programming language.
Launched in 2018, MLflow quickly became the industry standard because of its easy integration with major ML frameworks, tools, and libraries such as Tensorflow, Pytorch, Scikit-learn, Kubernetes and Sagemaker, just to name a few. It has a big community of users and contributors.
MLflow has four main functions that help track and organize experiments and models:
- MLflow Tracking – an API and UI for logging parameters, code versions, metrics, and artifacts when running machine learning code, and for later visualizing and comparing the results;
- MLflow Projects – packaging ML code in a reusable, reproducible form to share with other data scientists or transfer to production;
- MLflow Models – managing and deploying models from different ML libraries to a variety of model serving and inference platforms;
- MLflow Model Registry – central model store to collaboratively manage the full lifecycle of an MLflow model, including model versioning, stage transitions, and annotations.
- Experiment tracking
- Easily integrates with other tools and libraries
- Intuitive UI
- Big community and great support
- Free managed service option (MLflow Community edition) with preconfigured ML environments that includes: Pytorch, TF keras and other libraries.
- The MLflow Community edition is great for, and limited to, individuals, while the paid version is for teams.
- Paid managed service options are available through cloud providers, with pre-configured compute and SQL storage servers; this service is billed per second.
- Lack of data pipeline versioning
- MLflow models could be improved
- MLflow Community experiment comparison UI could be simplified
- MLflow Community is slow to load and cannot be used for production
- MLflow Community has a single cluster limited to 15GB and no worker nodes
- MLflow Community edition does not include the ability to run and reproduce MLflow Projects, and its scalability and uptime guarantees are limited.
Neptune is now a metadata store for MLOps, built for research and production teams that run a lot of experiments.
It gives you a central place to log, store, display, organize, compare, and query all metadata generated during the machine learning lifecycle.
Thousands of ML engineers and researchers use Neptune for experiment tracking and model registry both as individuals and inside teams at large organizations.
It’s very flexible software that works with many of the other frameworks and libraries.
Neptune has all the tools you need for team collaboration and project supervision.
With Neptune you can replace folder structures, spreadsheets, and naming conventions with a single source of truth where all your model building metadata lives.
- Data versioning
- Model registry
- Experiment Tracking
- You can install the server version on your own hardware, or use it as a service through the client libraries
- Provides user and organization management with organizations, projects, and user roles
- Fast and beautiful UI with a lot of capabilities to organize runs in groups, save custom dashboard views, and share them with the team
- You can use a hosted app to avoid all the hassle with maintaining yet another tool (or have it deployed on your on-prem infrastructure)
- Your team can track experiments executed in scripts (Python, R, other) and notebooks (local, Google Colab, AWS SageMaker)
- Extensive experiment tracking, visualization and comparison capabilities;
- Easily integrates with other tools
- Offers a free managed service option with 100 GB storage, unlimited experiments, private and public projects
- Free managed service option which is great for, and limited to, individuals, with paid options for small and big teams
- The free managed service is ready for production, and you can reproduce experiments
- Lack of data pipeline versioning
- It would be great to have more advanced API functionality.
- The UI for creating graphs with multiple lines could be more flexible.
Amazon Sagemaker is an end-to-end MLOps platform, with a suite of tools to gather, prepare, and version data, as well as build, train, version, deploy and monitor ML models.
I use this platform for some parts of my ML pipeline, like training big models, or running performance-demanding tasks (image or text processing). From my experience, Sagemaker is far from easy for beginners and has a steep learning curve, but once you gain some experience it becomes amazingly useful and helpful, like a Swiss-army Jedi knife.
- Pay for what you use
- Provides centralised control
- Provides managed Jupyter notebooks
- Provides great freedom and customisation (you can integrate other tools like MLflow, Neptune and so on)
- Offers an easy and straightforward way to build training and test samples
- Uses conda for package management
- Great hardware options
- All training, testing, and models are stored on S3, so it’s very easy to access
- Easy API deployment
- Data pipeline versioning
- Unclear and lengthy documentation
- Experiment tracking logs could be managed better
- No model comparison UI
- Steep learning curve
- Lots of boilerplate code
Azure Machine Learning is a cloud MLOps platform from Microsoft. Like Sagemaker, it’s an end-to-end tool for the complete ML lifecycle.
- Pay for what you use
- It lets you create reusable software environments for training and deploying models
- It offers notifications and alerts on events in the ML lifecycle
- Great UI and user-friendliness
- Extensive experiment tracking and visualization capabilities
- Great performance
- Connectivity: it’s super easy to embed R or Python code in Azure ML; if you want to do more advanced stuff, or use a model that’s not yet available on Azure ML, you can simply copy and paste the R/Python code
- Costly for individuals
- More models could be available
- Difficult to integrate the data for creating the model
- Hard to integrate with other tools
- You always need a good internet connection to use it.
Machine Learning Model Management is a fundamental part of the MLOps workflow. It lets us take a model from the development phase to production, making every experiment and/or model version reproducible.
Finally, to recap, there are 4 levels of ML model management:
- Level-0, ad-hoc research model management
- Level-1, partial model management
- Level-2, semi-complete model management
- Level-3, complete (end-to-end) model management
At each level, you will be faced with different challenges. The best practices of ML model management are centered around the components described above.
As far as tools go, we have a plethora to choose from, but in this article I described a few popular ones:
- MLflow
- Neptune
- Amazon Sagemaker
- Azure ML
I hope this helps you choose the right tool.
With that, thank you for reading this article, and stay tuned for more. As always, I created a long list of references for you to explore, have fun!