5 Must-Do Error Analyses Before You Put Your Model in Production

The deep learning era took off in 2012, when Alex Krizhevsky and his collaborators created a convolutional neural network that improved image classification accuracy by more than 10 percentage points. This dramatic success soon spread to other research domains, and businesses, both conglomerates and startups, hoped to apply this cutting-edge technology to their own products: banks now use ML models to detect fraud, and autonomous vehicles use sensor readings to make decisions automatically.

However, the rapid shift from research to production leaves gaps that are often ignored yet crucial to consider. To showcase their product lines faster, many companies judge their machine learning models solely on lab results, or even on plain accuracy, which can hide large performance gaps or unknown biases.

This article looks beyond the surface accuracy of machine learning (ML) models at realistic scenarios that one should consider before putting them into production. Specifically, while the training accuracy of a particular model may seem convincing, it is important to analyze whether the performance will hold up under a different data distribution. We provide five must-do analyses to make sure your model performs as expected when it goes live.

Error analysis: preliminary knowledge

Before exploring why each error analysis is distinct and must be done, we first have to understand the core concepts of ML models in order to recognize their inherent constraints.

In the conventional engineering setting, we hope to design mathematical systems that map inputs to their designated outputs. Such mathematical modelling is not always possible, especially when the system is unknown or too complicated to derive. This is where machine learning (particularly supervised learning) plays an important role. Instead of designing the model by hand, we learn it from a set of known inputs and outputs.

To train and evaluate an ML model, a typical setup is to separate the available dataset into training and validation sets, where only the training set is seen while the model is fitted and the validation set is used purely for evaluation. We usually treat the performance on the validation set as the benchmark for how well our model is going to generalize in real-world settings. Thus, it is very important to understand how the validation dataset is constructed and how the distribution might shift when the model is put into production, as all the error analyses revolve around this concept.

Error analysis 1: Size of training and validation dataset

As mentioned in the previous section, we judge how well the model performs solely on the training and validation datasets. This leads to the first and most important error analysis that everyone must do before production: determine whether the training and validation datasets are large enough. We can illustrate the importance of this with a dog and cat image recognition model.

Suppose we hope to create an ML model that can determine whether the animal inside an image is a dog or a cat. To train the model, we need to collect a set of images for training and validation. Now consider a case where there are 50 breeds of dogs, and yet we only have images of 40 of them for training and validation. Our model might perform extremely well on both training and validation, but we would never know how well it generalizes to the 10 unseen breeds, which are valid inputs once the model is put into production.

The case might sound too extreme, leading one to question whether it would actually happen in reality. However, a lot of real-world data follows distributions that are much more complicated and beyond our knowledge, so a model might never be exposed to a significant feature without us even noticing! The following describes the issues that can arise when the training or validation set is too small.

  • When the training set is too small: we can think of the training set as a small sample of the entire real-world population. If we want an ML model to be robust, this sample should cover all the possibilities that real-world data might contain. If it does not, the model learns only the narrow distribution it was given and generalizes poorly at test time.
  • When the validation set is too small: here the problem is not a lack of generalization but a lack of coverage, since not all possibilities are tested. This can give a false impression of the performance in either direction: results look very good if the validation set happens to contain only distributions the model has learned, and very bad if it happens to contain mostly ones it has not.
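To make the effect of a small validation set concrete, here is a minimal sketch that uses the standard normal-approximation interval for a proportion. The function name `accuracy_interval` and the example numbers are illustrative assumptions, not part of any particular library.

```python
import math

def accuracy_interval(accuracy, n_samples, z=1.96):
    """Normal-approximation confidence interval for a measured accuracy.

    z=1.96 corresponds to roughly 95% confidence.
    """
    std_err = math.sqrt(accuracy * (1.0 - accuracy) / n_samples)
    low = max(0.0, accuracy - z * std_err)
    high = min(1.0, accuracy + z * std_err)
    return low, high

# The same measured accuracy is far less trustworthy on a small validation set.
for n in (50, 500, 5000):
    low, high = accuracy_interval(0.90, n)
    print(f"n={n:5d}: 90% accuracy is plausibly between {low:.3f} and {high:.3f}")
```

The interval only accounts for sampling noise. It says nothing about cases that are entirely absent from the validation set, such as the unseen breeds in the example above.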

What exactly counts as “too small” is a difficult question, as the answer depends on numerous unknown factors such as how difficult the task is and how complex the data distribution is. As a rule of thumb, however, a model trained on an image dataset with fewer than a few thousand images, or on a regression dataset with fewer than a few thousand entries, usually does not generalize well. If your ML model is based on deep learning (that is, on neural networks), you will typically need even more data because of the very large number of parameters such models have.

Solution

The solution to this is straightforward: increase the size of your dataset. Consider the realistic scenarios your model is going to be applied to, collect more data under the same settings, and add it to both the training and validation datasets. While seemingly simple, this is what numerous companies (e.g., Google, Facebook) have constantly been trying to accomplish with their immense customer databases: they collect all kinds of data (from images to search history) because they are well aware of how much a larger dataset benefits machine learning.

Afterward, consider the balance between your training and validation sets. An 80/20 split is a common default, but depending on how big your dataset is, you could shrink the validation set to give the model more training data, or enlarge it to get a more reliable estimate of how well your model is performing.
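A shuffled 80/20 split can be sketched in a few lines of plain Python. The helper name `train_val_split` and the placeholder data are assumptions made for illustration; libraries such as scikit-learn offer equivalent utilities.

```python
import random

def train_val_split(samples, val_fraction=0.2, seed=0):
    """Shuffle and split samples into (train, validation) lists."""
    if not 0.0 < val_fraction < 1.0:
        raise ValueError("val_fraction must be between 0 and 1")
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, round(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]

data = list(range(1000))            # placeholder for 1000 labelled samples
train, val = train_val_split(data)  # 80/20 by default
print(len(train), len(val))         # 800 200
```

Shuffling before splitting matters when the data was collected in some order (for example, all dogs first), since otherwise the validation set may not resemble the training set at all.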

Error analysis 2: Balance of a dataset and accuracy per class

The second error analysis digs into the content of the dataset to find the balance of labels. We can go back to the abovementioned dog/cat classification task to illustrate this issue. 

Consider a dataset where, out of 1000 images, 990 are dogs and the remaining 10 are cats. If a model learns to classify everything as a dog (which makes it basically useless as a classifier), it would still obtain 99% accuracy on the entire dataset. Without looking closely at how the errors are distributed, many would naively assume that the model is well trained and ready for production.

This problem can be easily caught by looking at the number of samples per class. In the optimal scenario, we would want each class to have roughly the same number of samples. If that is not possible, then we should at least make sure that each class has a sufficient number of samples (a few hundred).
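The dog/cat example can be reproduced directly. The sketch below counts the samples per class and compares overall accuracy with per-class accuracy for a model that always predicts “dog”; the helper `per_class_accuracy` is a hypothetical name chosen for this illustration.

```python
from collections import Counter

def per_class_accuracy(y_true, y_pred):
    """Fraction of correctly classified samples for each true class."""
    total = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: correct[label] / total[label] for label in total}

# The 990 dogs / 10 cats dataset from the text, with a model that always says "dog".
y_true = ["dog"] * 990 + ["cat"] * 10
y_pred = ["dog"] * 1000

overall = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(Counter(y_true))                     # class counts reveal the imbalance
print(overall)                             # 0.99
print(per_class_accuracy(y_true, y_pred))  # {'dog': 1.0, 'cat': 0.0}
```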

However, even when the dataset is completely balanced, the network could still perform better on certain classes and worse on others, or, in a binary classification task, be better at detecting positives than negatives. To be aware of this, we must not take the overall accuracy at face value but also look at the per-class accuracy and the true/false positives and negatives through different metrics. Below are some of the metrics that can be adopted for further analysis beyond accuracy.

  • F1 score: for binary classification, we can compute the F1 score, F1 = 2 × (precision × recall) / (precision + recall), which summarizes the precision and recall of a model in a single number. Note that we can also evaluate precision and recall individually. Depending on the task at hand, we may want to maximize one of them at the expense of the other. For example, if your goal is to find a set of target customers for your product, you may want high recall so that you don’t miss a potential customer, even though precision suffers when you also reach customers that aren’t exactly your target.
Example of a confusion matrix on the Iris dataset, showing the percentage of predictions for each label | Source
  • A confusion matrix: (figure above) shows, for every true class, the percentage of its samples predicted as each of the classes. Combined with color coding, this matrix becomes a great visualization for understanding which classes are more error-prone and what those errors look like.
  • ROC curve: for binary classification, we can vary the decision threshold that separates positive from negative predictions and plot the true positive rate against the false positive rate at each threshold. The resulting curve gives us a better understanding of which threshold works best.
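The metrics in this list can be computed from the four confusion counts. The following is a minimal sketch in plain Python under the assumption that labels are 0/1 with 1 as the positive class; the function names and the scores are made up for illustration, and `roc_points` assumes that both classes are present.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 is the positive class."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, fp, fn, tn

def precision_recall_f1(y_true, y_pred):
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def roc_points(y_true, scores):
    """(false positive rate, true positive rate) at each distinct score threshold."""
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        y_pred = [1 if s >= threshold else 0 for s in scores]
        tp, fp, fn, tn = confusion_counts(y_true, y_pred)
        points.append((fp / (fp + tn), tp / (tp + fn)))
    return points

y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.2, 0.1]   # made-up model scores
y_pred = [1 if s >= 0.5 else 0 for s in scores]      # default threshold of 0.5
print(confusion_counts(y_true, y_pred))
print(precision_recall_f1(y_true, y_pred))
print(roc_points(y_true, scores))
```

Lowering the threshold moves along the ROC curve toward higher recall and more false positives, which is the precision/recall trade-off described in the F1 bullet.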

Solution

If you encounter a class that has dramatically less data or performs significantly worse than the others, it is sensible to collect more data for this particular class. Tricks such as data augmentation can also be used as a substitute if real-world data is not obtainable (data augmentation is covered in a later section). In addition, we could counter the imbalance by sampling the data with weights, either derived automatically from the class frequencies or assigned manually to each sample.
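Weighted sampling can be sketched with the standard library alone. Here each sample is weighted by the inverse of its class size, which is one common choice rather than the only one; `inverse_frequency_weights` is a hypothetical helper name.

```python
import random
from collections import Counter

def inverse_frequency_weights(labels):
    """Weight each sample by 1 / (size of its class) so classes are drawn equally often."""
    counts = Counter(labels)
    return [1.0 / counts[label] for label in labels]

labels = ["dog"] * 990 + ["cat"] * 10
weights = inverse_frequency_weights(labels)

rng = random.Random(0)
# Draw one epoch's worth of sample indices, with replacement, using the weights.
indices = rng.choices(range(len(labels)), weights=weights, k=len(labels))
print(Counter(labels[i] for i in indices))  # roughly balanced between dog and cat
```

Note that the 10 cat images are now repeated many times per epoch, so this balances the classes without adding new information. It works best combined with augmentation or additional data collection.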

Error analysis 3: Fine-grained misclassification errors

Even when all the numbers you can compute show good results, it is still necessary to perform some qualitative analysis of the fine-grained classification results to understand what exactly is causing the remaining errors.

Oftentimes, some errors are very apparent and yet not noticeable from the class information alone. For example, simply by viewing all the misclassified cats in our dog/cat problem, we may notice that every white cat is misclassified, even though the cat category as a whole performs very well. This error analysis becomes crucial because white cats may make up a large part of real-world cases (again a problem of distribution difference between validation and real-world testing). However, since the labels themselves do not reflect such a problem at all, the only way to investigate it is qualitative analysis with a human in the loop.
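Once a human reviewer has tagged the validation samples with an extra attribute, the error rate per attribute value is easy to compute. The sketch below assumes hypothetical records with a hand-annotated `color` field; neither the field name nor the numbers come from a real dataset.

```python
from collections import defaultdict

def error_rate_by_group(records, group_key):
    """Error rate per value of a metadata field, e.g. fur color."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for rec in records:
        group = rec[group_key]
        totals[group] += 1
        errors[group] += rec["label"] != rec["prediction"]
    return {group: errors[group] / totals[group] for group in totals}

# Hypothetical validation records for the cat class, tagged by a reviewer.
records = (
    [{"label": "cat", "prediction": "cat", "color": "black"}] * 45
    + [{"label": "cat", "prediction": "cat", "color": "tabby"}] * 50
    + [{"label": "cat", "prediction": "dog", "color": "white"}] * 5
)
cat_accuracy = sum(r["label"] == r["prediction"] for r in records) / len(records)
print(cat_accuracy)                           # 0.95
print(error_rate_by_group(records, "color"))  # {'black': 0.0, 'tabby': 0.0, 'white': 1.0}
```

The class-level accuracy of 95% looks healthy, while the breakdown shows that one sub-group fails completely.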

Solution

The solution to fine-grained misclassifications is very similar to the solution for imbalanced datasets: the most direct fix is to add more data. However, to target a particular fine-grained class, we should add samples only to the sub-class that performed poorly.

Error analysis 4: Investigate the level of overfitting

Overfitting happens when the model learns patterns or data distributions that occur only in the training set. Relying on these patterns degrades the model’s performance once it is moved to testing.

Overfitting may occur in two ways, one more common and one less seen. The following depicts the two problems and their respective solutions.

Overfitting to Training Set

This happens when the network learns something from the training set that does not apply to the validation set. The problem is easy to observe by plotting the training and validation loss after every epoch. If the training loss continues to drop after the validation loss has plateaued or started to rise (leading to a widening performance gap), then overfitting is happening.
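The same check can be done numerically instead of by eye. In this sketch the loss values are invented for illustration, and the two helper functions are hypothetical names rather than part of a training framework.

```python
def generalization_gaps(train_losses, val_losses):
    """Validation loss minus training loss for each epoch."""
    return [v - t for t, v in zip(train_losses, val_losses)]

def best_validation_epoch(val_losses):
    """Index of the epoch with the lowest validation loss."""
    return min(range(len(val_losses)), key=val_losses.__getitem__)

# Made-up loss curves: training keeps improving, validation turns around at epoch 3.
train_losses = [1.00, 0.70, 0.50, 0.40, 0.30, 0.22, 0.15]
val_losses   = [1.05, 0.80, 0.65, 0.60, 0.62, 0.66, 0.72]

gaps = generalization_gaps(train_losses, val_losses)
best = best_validation_epoch(val_losses)
# After the best epoch, does the gap keep widening from one epoch to the next?
widening = all(a < b for a, b in zip(gaps[best:], gaps[best + 1:]))
print(best, widening)   # 3 True
```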

Solution

There are multiple solutions to training set overfitting. Below we list six viable ones.

  • Early stopping: we track the training and validation loss after each epoch or after a certain number of iterations. If the validation loss stops decreasing for a certain number of epochs, we stop training completely. This prevents the model from learning unwanted patterns in the training set.
  • Reducing the complexity of the ML model: when a model is too complex (has too many parameters), it is likely to learn spurious patterns, especially when the task at hand is actually simple (e.g., if we are learning a function y = wx but have 100 parameters {w1, w2, …, w100}, then it is very likely that w2 to w100 learn the noise in the data instead of the true function). If you are training a neural network, this means decreasing the depth and width of the layers. If you are training decision trees or forests, it means limiting the depth or the number of trees.
  • Adding L1/L2 regularisation: these penalties prevent individual parameters from becoming too large, which is often what happens when a model overfits. Specifically, L1 regularisation penalizes the absolute value of each weight and tends to drive some weights to exactly zero, whereas L2 regularisation penalizes the squared value of each weight, which punishes large weights most strongly and pushes all weights toward values near 0. In essence, both discourage the model from relying too heavily on any single factor for the final prediction.
Comparison of a neural network with and without dropout | Source: the original dropout paper, published in JMLR
  • Adding dropout: for a neural network, we can randomly drop certain neurons during training and ignore their outputs. This makes the model less sensitive to slight changes in the input data, thus preventing overfitting.
Visualization of a data augmentation technique called “mixup” | Source: Author
  • Data augmentation: adding more training data makes it less likely that a certain pattern exists in the training set but not in the validation set. Data augmentation is the technique of synthesizing data from the currently available entries and using it as additional training data. For example, we can crop, rotate, and shift images and treat them as “new” images to add to training and make the model more robust. More recent and more elaborate augmentation techniques include mixup, cutout, and cutmix.
  • Ensemble learning: if computational resources are available, one could train multiple models and combine their predictions to make the final prediction. It is less likely that all models overfit to the same patterns, so the combined predictions are smoother, which further prevents overfitting. In practice, there are multiple ways to ensemble models, from simple bagging and boosting to more complicated deep model combinations, and the right choice depends on the scenario and the models used for the case at hand.
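Two of the techniques above are small enough to sketch in plain Python: early stopping with a patience counter, and the L1/L2 penalty terms that are added to the training loss. The class and function names, the `patience` and `min_delta` parameters, and the loss values are assumptions made for illustration; deep learning frameworks ship their own versions of these.

```python
class EarlyStopping:
    """Signal a stop once validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.bad_epochs = 0

    def should_stop(self, val_loss):
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

def l1_penalty(weights, strength):
    """L1 term: strength times the sum of absolute weight values."""
    return strength * sum(abs(w) for w in weights)

def l2_penalty(weights, strength):
    """L2 term: strength times the sum of squared weight values."""
    return strength * sum(w * w for w in weights)

# Made-up validation losses; a real loop would compute these after each epoch.
val_losses = [0.80, 0.65, 0.60, 0.62, 0.66, 0.72, 0.75]
stopper = EarlyStopping(patience=2)
for epoch, loss in enumerate(val_losses):
    if stopper.should_stop(loss):
        print(f"stopping at epoch {epoch}, best validation loss {stopper.best_loss}")
        break
```

In a real training loop you would also save the model weights whenever `best_loss` improves, so that the checkpoint from the best epoch can be restored after stopping.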

Overfitting to validation set

This problem is less common, as only the training data is used to fit the model. However, if you perform early stopping based on the validation set, there is still a chance that you are selecting the best model for that particular set and not for the true population.

Solution

One way to prevent this is to validate on multiple splits through methods such as cross-validation. Cross-validation divides the entire dataset into multiple segments; in each round, one segment is used for validation and the remaining ones for training. In addition, we should make sure that the validation set is big enough, for instance by checking that the accuracy stays roughly the same when we enlarge it.
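The fold construction can be sketched as an index generator. `k_fold_indices` is a hypothetical helper written for this illustration, and it assumes the samples have already been shuffled.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, val_indices) for each of k contiguous folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, val
        start += size

for train_idx, val_idx in k_fold_indices(10, 5):
    print(val_idx)   # each sample appears in exactly one validation fold
```

Training one model per fold and averaging the validation scores gives an estimate that depends less on any single lucky or unlucky split.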

Error analysis 5: Mimicking the scenario where the model will be applied

Finally, after all the detailed analyses you could do with your training and validation set, it is time to mimic the scenario of how your model will be applied and test for realistic results.

We can once again illustrate this through the dog/cat classification problem. Suppose you are creating an app for iPhone users to classify dogs and cats, and you trained and carefully analyzed your model on a collection of dog and cat images obtained online. For this final part, you would want to mimic the realistic scenario by testing on photos taken with different iPhone cameras. In other words, we want to make sure that the gap between the distribution you trained and validated your model on and the distribution of the realistic test set is small.

Solution

As this is the final stage of the error analysis, and assuming everything else suggests that your model has great potential, you could put your model into production while adding an online learning mechanism to continue fine-tuning it.

Online learning is a mechanism where you continually improve the model on the fly by adding newly observed data to training. As more data is added to fine-tune the model, the accuracy could increase, since the model fits more closely to the distribution you are ultimately targeting. Numerous applications (e.g., Siri, facial recognition) adopt this scheme, which is how they slowly adapt to your look and voice as you continue to use your phone.
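The core idea of online learning, updating the model one new sample at a time, can be shown with a deliberately tiny example. The one-feature linear model, its learning rate, and the simulated data stream below are all assumptions chosen to keep the sketch short; production systems use far larger models and need safeguards against learning from bad data.

```python
class OnlineLinearModel:
    """One-feature linear model y = w * x + b, updated one sample at a time with SGD."""

    def __init__(self, learning_rate=0.1):
        self.w = 0.0
        self.b = 0.0
        self.learning_rate = learning_rate

    def predict(self, x):
        return self.w * x + self.b

    def update(self, x, y):
        """Take one gradient step on the squared error for a single new sample."""
        error = self.predict(x) - y
        self.w -= self.learning_rate * error * x
        self.b -= self.learning_rate * error

model = OnlineLinearModel()
# Simulated stream of production samples following y = 2x + 1.
for step in range(5000):
    x = (step % 10) / 10.0
    model.update(x, 2.0 * x + 1.0)
print(round(model.w, 2), round(model.b, 2))
```

The model starts out knowing nothing and drifts toward the relationship present in the stream, which is the behavior you want when the production distribution differs from the training data.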

Failure cases

With only the theoretical backing of the error analyses above, one may wonder about their practicality and necessity: will a machine learning model really go wrong without these fixes? To answer this, we can dig into two classic cases of AI tool failure: the bias of Amazon’s AI recruitment tool in 2014 and of Facebook advertisements in 2019.

Amazon AI recruitment

To expedite the recruitment process, Amazon began working on an AI-driven recruiting tool to review and filter resumes in 2014. Yet a year later, in 2015, they discovered a massive gender bias problem, especially for software developer and other technical roles, even though gender was not one of the input features.

How did this happen? 

The answer lies in how the model’s dataset was collected in the first place: it used resumes from the previous 10 years, which were predominantly from men. In fact, top US tech companies have a male-to-female headcount ratio of around 3:1 for technical roles.

Tech industry is dominated by men | Source

Because of this, the model was given the false impression that male candidates are preferred. It learned to favor wording associated with men and penalized resumes containing phrases such as “women’s chess club captain”.

How can we avoid this?

We can trace this case back to error analyses 2 and 5: the dataset was strongly imbalanced (the male vs. female ratio), and nobody mimicked what the real evaluation scenario would be. Thus, to avoid these critical errors before production, one must consider the balance of the dataset and, where needed, add sample weights or augmentations for particular classes.

Facebook advertisement bias

Similarly, Facebook launched a machine learning-based recommendation system for advertisements, and yet its recommendations ended up excluding women from seeing certain advertisements and men from seeing others, because the system generalized from the overall population and its associated stereotypes, as suggested by research from the University of Southern California.

How can we avoid this?

Again, such a problem would be much less likely to occur with a detailed and rigorous error analysis: balance the dataset, find fine-grained errors through extensive testing, and mimic scenarios that might happen in the real world by running the system with a set of test users before rolling it out at scale.

Endnote

And there you have it! Five different error analyses you must do before you can say that your model is fully ready to be put into production.

Remember that being careful at the testing stage could save you a lot of effort later!

