MLOps Blog

How to Improve ML Model Performance [Best Practices From Ex-Amazon AI Researcher]

12 min
14th May, 2024

Machine learning and deep learning models are everywhere around us in modern organizations. The number of AI use cases has increased exponentially with the rapid development of new algorithms, cheaper computing resources, and greater data availability. Every industry has appropriate machine learning and deep learning applications, from banking to healthcare to education, manufacturing, construction, and beyond. 

Model improvement is one of the biggest challenges in all these ML and DL projects across industries. Since Amazon launched its product recommendation system, it has undergone tremendous upgrades. For example, in the early 2000s, the team moved product recommendations from user-based to item-to-item collaborative filtering. This switch to a different algorithmic approach dramatically improved the quality of recommendations and was vital to Amazon's rise to the e-commerce powerhouse we see today.

Another way that Amazon constantly improves their product recommendation system is to collect more data about products and customer behavior. This data comprises past purchases, browsing history, and search queries. The more data Amazon collects, the more extensive and diverse the examples they can train their machine learning models on. This leads to models learning more robust and generalized patterns, reduces overfitting, and improves their generalizability – their ability to make accurate predictions on unseen data.

In this article, I will show you a range of techniques to optimize the task performance of machine learning models that I’ve used while working on AI at Amazon. We’ll start by discussing how to uncover if your model needs improving and which measures are likely to yield the biggest performance gain. Then, we’ll dive into the details of the different approaches. While most of them will apply to almost any kind of machine learning or deep learning, where I can, I’ll also share my experience with specific use cases.

As a bonus, at the end of the article, you’ll find a step-by-step guide summarizing everything you can use as a blueprint for your model improvement projects.

How to increase the accuracy of your machine learning model

There’s no one-size-fits-all strategy for improving machine learning and deep learning models. It all comes down to the business problem, the available data, and the type of algorithm. Ultimately, practice and experience working on a wide variety of models lead to better intuition about the best approaches in a given scenario.

But in any case, the first step is to identify shortcomings and potential areas of improvement.

Before diving into the analysis, getting a basic idea of the ML problem is invaluable. In the table below, I’ve collected the questions I like to ask myself before starting work on a model. I’ve also added some example answers.

| Question | Examples |
| --- | --- |
| What is the business use case? | Product recommendations, Price estimation, Personalizing marketing messages, Detecting defects in materials |
| What type of ML problem are we dealing with? | — |
| What are the business metrics? | Net promoter score (NPS), Click-through rate (CTR), Monthly recurring revenue (MRR), Lifetime value (LTV) |
| What are technical metrics we can use to assess the model's performance? | F1-score, Latency, Accuracy, Precision |
| What kind of data is being used? | Structured (numerical, categorical, time-series), unstructured (images, text, audio, video), semi-structured (invoices, forms, report cards), graph data |
| What is the machine learning problem formulation? | Regression, classification (binary, multi-class, or multi-label), clustering, similarity |
| What are the relevant ML or DL model types? | Random forest, XGBoost, CNNs, DNNs, Transformers |
| Are there any additional constraints or potential issues? | Explainability, fairness, bias, privacy |

When setting out to optimize a model, it’s paramount that you clearly identify what you’re trying to improve. Otherwise, you risk losing focus and getting nowhere at all.

For example, consider the case of a machine learning model for identifying fraudulent orders. That’s a sensitive task, as mistakenly turning away a customer hurts your business, just as erroneously shipping expensive goods to a fraudster would. Thus, it’s understandable that stakeholders want to ensure the model’s predictions are sound and it’s clear what suspicious features the model identified.

In this instance, it’s evident that optimizing the F1 score or augmenting the input dataset will not get you closer to the goal. Both will make your model better, but not in the required way. But it’s easy to lose sight of that once you get going.

So, let’s take a step back and ask:

How do you know if your ML model needs improvement?

The first step in improving machine learning models is to carefully review the underlying hypotheses for the model in the business context and evaluate the current model's performance.

It’s essential to understand how the model fails to perform adequately. Are there too many false positives? Is the accuracy too low? Are there specific kinds of objects the model has a hard time spotting in images?

You should also crunch some numbers and figure out the impact an improved model will have. Is increasing the precision by a few percentage points sufficient? Or would you need to beat the state-of-the-art to even see a marginal improvement?

Figuring out the answers in collaboration with the business stakeholders will enable you to better estimate the required effort and help you set realistic expectations. (And sometimes, it turns out that it’s not the machine learning model that needs improvement after all).

Aside

neptune.ai interactive dashboards help ML teams to collaborate and share experiment results with stakeholders across the company.

Here's an example of how Neptune helped the ML team at ReSpo.Vision save time by sharing results in a common environment.

"I like the dashboards because we need several metrics, so you code the dashboard once, have those styles, and easily see them on one screen. Then, any other person can view the same thing, so that's pretty nice." – Łukasz Grad, Chief Data Scientist at ReSpo.Vision

Review initial hypotheses about the dataset and the choice of algorithm

Whether you’re returning to a model for the first time or it has been in place for many years, it’s always good to challenge the basic underlying assumptions. Does the algorithm fit the type of data? Is it a typical choice for the machine learning problem, or does it seem odd to you? Can you look at a training data sample and make an educated guess on how a model might be able to predict its associated label?

We’ll revisit these questions in a lot more detail in the following sections, of course. At this point, just follow your intuition, dig into the model’s documentation, and look at what the internet (or we on this blog!) say about today’s best practices.

Once you have a good understanding, it’s time to move on to a more systematic assessment.

Is the model overfitting or underfitting?

The easiest way to detect overfitting and underfitting is to plot the model prediction error as a function of model complexity or the number of training epochs or samples.

Both overfitting (high variance and low bias) and underfitting (high bias and low variance) will show up as a characteristic difference between the training and test error curves.

If the model is overfitting, it can likely be improved by:

  • using more training data or increasing sample diversity,
  • reducing model complexity,
  • applying regularization methods, including Ridge and Lasso regularization,
  • in the case of neural networks, adding dropout layers,
  • and early stopping.

If the model is underfitting, you can address this by:

  • making the model more complex (increasing the number of parameters fitted during training),
  • adding more input features,
  • training the model for more epochs.
ML model performance: overfitting vs. underfitting | Modified based on: source
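To make this concrete, here's a minimal sketch of how you might compute such error curves with scikit-learn's learning_curve utility. It assumes a generic tabular classification problem with features X and labels y already loaded; the random forest estimator and the log-loss scoring metric are placeholders you'd swap for your own setup.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

# Assumption: X (features) and y (labels) are already loaded as arrays.
train_sizes, train_scores, val_scores = learning_curve(
    RandomForestClassifier(n_estimators=200, random_state=42),
    X,
    y,
    cv=5,
    scoring="neg_log_loss",
    train_sizes=np.linspace(0.1, 1.0, 5),
)

# Convert the negated scores back into errors for easier reading.
train_error = -train_scores.mean(axis=1)
val_error = -val_scores.mean(axis=1)

for size, tr, va in zip(train_sizes, train_error, val_error):
    print(f"{size:>6.0f} samples: train error {tr:.3f}, validation error {va:.3f}")
```

A large, persistent gap between the two curves (low training error, high validation error) points to overfitting, while high errors that stay close together point to underfitting.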

What kind of errors is the model making? 

A classification problem can typically be analyzed using a confusion matrix. It shows the proportion of type 1 (false positive) and type 2 (false negative) errors.

Figure 2 shows a confusion matrix for a binary classification problem. In this example, we have 15 true positives, 12 false positives, 118 true negatives, and 47 false negatives.

| n = 192 | Model predicts: 0 | Model predicts: 1 |
| --- | --- | --- |
| True value: 0 | 118 true negatives | 12 false positives |
| True value: 1 | 47 false negatives | 15 true positives |

Figure 2. Confusion matrix for a binary classification problem. The model has low precision and extremely poor recall, identifying just 15 of 62 positives!

Based on this data, we can compute several standard metrics: 

  • Precision = 15/(15+12) = 15/27 = 0.56
  • Recall = 15/(15+47) = 15/62 = 0.24
  • F1-score = 2 * 0.56 * 0.24 / (0.56 + 0.24) = 0.34

Here, both precision (the number of true positives divided by the total number of predicted positives) and recall (the number of true positives divided by the number of true positives and false negatives) are on the lower side.
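For reference, here's how these metrics can be reproduced with scikit-learn. The variables y_true and y_pred stand in for your test-set labels and the model's predictions; they are placeholders, not data from the example above.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Assumption: y_true and y_pred hold the test-set labels and model predictions.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TN={tn}  FP={fp}  FN={fn}  TP={tp}")

print(f"Precision: {precision_score(y_true, y_pred):.2f}")
print(f"Recall:    {recall_score(y_true, y_pred):.2f}")
print(f"F1-score:  {f1_score(y_true, y_pred):.2f}")
```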

Depending on the business use case and domain, either improving recall or focusing on precision makes more sense.

For example, it is crucial to have a high recall when diagnosing a life-threatening disease, such as cancer. High recall ensures that as many patients with cancer as possible – the true positives – are identified. In this situation, increasing the number of healthy individuals mistakenly classified as having cancer – the false positives – is likely acceptable.

But in the case of video recommendations on YouTube, false positives are detrimental. YouTube’s primary goal is to suggest videos a user will love – and to avoid recommending videos a user will dislike at all costs. However, it’s perfectly acceptable if not all videos a user would enjoy are presented to them. In other words, false negatives are not a major concern. For this kind of problem, precision is the right target metric.

Figure 3. Normalized confusion matrix for a multi-class ML classification problem | Source

Figure 3 shows a confusion matrix for a multi-class classification problem. Common examples of this kind of ML problem include predicting a 1-to-5 star rating based on customer reviews and identifying the language a text is written in.

At first glance, we see that the model is confusing classes 1 to 5 with class 0. This suggests that there’s a systematic error in the model, most likely to do with class 0. The underlying reason could be a highly imbalanced training dataset. If the model mostly sees samples of class 0, its best guess is to always predict 0.

We also see that the model has difficulties with some neighboring classes: For a sample with the true label 4, it predicts 3 and 4 with equal probability. It exhibits similar behavior for samples of classes 2 and 3. A good next step would be to check the training samples for potential annotation errors and examine the similarity between the samples for the classes the model confuses.

How to improve ML model performance using hyperparameter optimization

After initial analysis and evaluation of the model’s accuracy and visualization of key metrics to diagnose potential errors, you should see if you can extract additional performance from the current model by retraining it with a different set of hyperparameters.

The assumption underlying a trained machine learning or deep learning model is that its weights and biases correspond to a local minimum of the loss function found during optimization. Ideally, the optimization algorithm would find the global minimum, which corresponds to the optimal model weights.

However, training of machine learning models is often a stochastic process. The outcome varies depending on several parameters, including how the model’s parameters or weights are initialized, the learning rate schedule, the number of training epochs, and applied regularization methods.

Hyperparameter tuning involves training separate versions of the models, each with a different combination of hyperparameters.

Typically, for smaller machine learning models, it's a quick process that helps identify the model with the highest accuracy. For more complex models, including deep neural networks, running several iterations of the same model on different combinations of hyperparameter values may not be feasible. In such cases, it's prudent to limit the range and choice of individual hyperparameter values based on prior knowledge to find the best possible model.

Let’s discuss some common hyperparameter tuning methods:

Grid search is a common hyperparameter optimization method. It attempts to find the optimal set of hyperparameters by evaluating all possible combinations.

It’s most useful when the range of hyperparameters can be restricted in advance, either based on previous experiments or insights from the literature. For instance, if you have identified six key hyperparameters and five possible values for each hyperparameter, this gives rise to 5**6 = 15,625 different combinations. If it takes you a minute to train and evaluate a model, we’re talking about almost 11 days of non-stop model training.

Aside from being computationally expensive, grid search suffers from only sampling a few select locations in the high-dimensional hyperparameter space. As shown in Figure 4, it can never find optimal hyperparameter combinations if they lie outside of the pre-defined range.
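As a rough sketch, this is what a restricted grid search could look like with scikit-learn's GridSearchCV. The estimator, the parameter grid, and the F1 scoring metric are illustrative assumptions; X_train and y_train are assumed to be prepared already.

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Assumption: X_train and y_train are already prepared.
param_grid = {
    "n_estimators": [100, 300, 500],
    "max_depth": [2, 3, 5],
    "learning_rate": [0.01, 0.1],
}  # 3 * 3 * 2 = 18 combinations, each evaluated with 5-fold cross-validation

grid = GridSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_grid=param_grid,
    scoring="f1",
    cv=5,
    n_jobs=-1,
)
grid.fit(X_train, y_train)

print("Best F1:", grid.best_score_)
print("Best hyperparameters:", grid.best_params_)
```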

Random search aims to alleviate these drawbacks.

Unlike grid search, which systematically explores every combination of hyperparameters within predefined ranges, random search randomly samples hyperparameters from these predefined ranges or distributions. This creates more diverse hyperparameter combinations and increases the chance of finding an optimal minimum, as shown in Figure 4.

It excels when dealing with large or complex hyperparameter spaces, as it doesn’t require evaluating every possible combination.

Despite its advantages, there is no guarantee that random search finds the absolute best set of hyperparameters. In contrast to grid search, which systematically scans the defined hyperparameter subspace, random search might also leave large regions unexplored.

Figure 4. How to improve ML model performance using random search: a comparison of grid search and random search | Source
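A minimal random search sketch with scikit-learn's RandomizedSearchCV might look like this; the sampling distributions and the n_iter budget are illustrative assumptions, not recommendations.

```python
from scipy.stats import loguniform, randint
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

# Assumption: X_train and y_train are already prepared.
param_distributions = {
    "n_estimators": randint(100, 1000),
    "max_depth": randint(2, 8),
    "learning_rate": loguniform(1e-3, 3e-1),
}

search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=50,  # number of randomly sampled hyperparameter combinations
    scoring="f1",
    cv=5,
    random_state=42,
    n_jobs=-1,
)
search.fit(X_train, y_train)

print("Best F1:", search.best_score_)
print("Best hyperparameters:", search.best_params_)
```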

Bayesian search is a sophisticated hyperparameter optimization method that works by iteratively improving the best-known set of hyperparameters.

Let’s say we’re optimizing a model’s accuracy. Bayesian search will evaluate different hyperparameter configurations and infer a probabilistic model of the mapping from the hyperparameter space to accuracy. In other words, it tries to predict the model’s accuracy based on the hyperparameter combination. Based on the currently best-known estimate for this relationship, Bayesian search selects the next hyperparameter combinations to try.

Bayesian search often yields more optimal solutions than random search, as shown in Figure 5. Compared to grid and random search, Bayesian optimization better balances exploring new regions of the hyperparameter space and exploiting the acquired knowledge regarding a hyperparameter’s influence on the machine learning model’s performance.

I recommend the paper Practical Bayesian Optimization of Machine Learning Algorithms for a thorough introduction.

Figure 5. How to improve ML model performance using Bayesian search: a comparison of random search and Bayesian search | Source
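One way to run a Bayesian-style search in practice is with a library like Optuna, whose default TPE sampler builds a probabilistic model of the hyperparameter-to-score relationship. Below is a minimal sketch; the search space, the F1 objective, and the 50-trial budget are assumptions for illustration.

```python
import optuna
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Assumption: X_train and y_train are already prepared.
def objective(trial):
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 100, 1000),
        "max_depth": trial.suggest_int("max_depth", 2, 8),
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 3e-1, log=True),
    }
    model = GradientBoostingClassifier(random_state=42, **params)
    return cross_val_score(model, X_train, y_train, cv=5, scoring="f1").mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)

print("Best F1:", study.best_value)
print("Best hyperparameters:", study.best_params)
```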

How to improve ML model performance using AutoML baselines

Cloud platforms like AWS, Microsoft Azure, and Google Cloud Platform offer fully managed AutoML tools. Under the hood, they use some combination of the hyperparameter optimization methods we just discussed – and some secret sauce added by the companies’ ML experts.

Even if you’re not using AutoML in your project, these services can help you in your model improvement projects. Specifically, they can help you answer whether a model can be improved further, given the currently available training data.

Instead of setting up hyperparameter optimization yourself, you can task the AutoML solution with creating a model for you. For our purposes, it usually makes sense not to restrict the choice of algorithm or training duration but to aim for the best model possible. (Yes, that might take a while and be pricey, but compare that to spending days or weeks hand-rolling a hyperparameter optimization setup just to discover your problem was the data all along.)

What you'll end up with is often a pretty strong baseline. If the AutoML model is not significantly better than yours, and perhaps even makes the same errors or struggles with the same kinds of samples, that's a strong hint that you should look at your data instead of the algorithm or hyperparameters.

If the AutoML’s model is better than yours, inspecting what solution the AutoML process arrived at might inform your model-building going forward. In one of my projects, for example, I switched out the input encoders after seeing how the AutoML’s model dealt with the high-cardinality categorical features in the data.

How to improve ML model performance through feature engineering

Typical machine learning models use numerous input features. They greatly influence how your model will turn out:

  1. Too few features make the model less robust to variations in the data. It may struggle to generalize well to unseen data or adapt to changes in the data distribution.
  2. Too many features can lead to overfitting and worse generalization to unseen data. The model may also get plagued by the curse of dimensionality.
  3. Having highly correlated features can lead to overfitting. They can also degrade the results of explainability approaches, such as feature importance estimations.
  4. Having little to no correlation between features and the target variable can lead to decreased predictive power, as there is no meaningful relationship the model could capture. If only a tiny fraction of the features fall into that group, they might just add a bit of noise, but too many, and your model might not even learn at all.

The process of getting the input features just right is called feature engineering.

Engineering new features to decrease entropy

Devising new features requires domain expertise and a good grasp of the model’s underlying algorithm. The goal is to craft features from the data that capture aspects of the complex nonlinear function that the machine learning model is learning to approximate during training. 

For example, to improve the accuracy of a loan default prediction, one possible new feature is the loan-to-value ratio (LTV):

LTV = (loan amount)/(property value)

It measures the borrower’s equity in the property. A high LTV means that the borrower has a small amount of equity in the property, which makes them more likely to default on the loan. Using the LTV adds domain knowledge to the input feature set. Using the newly calculated LTV as an input feature instead of loan amount and property value decreases collinearity and the overall entropy of the system, helping the model to learn better underlying patterns.
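As a minimal pandas sketch (with made-up numbers), engineering the LTV feature could look like this:

```python
import pandas as pd

# Hypothetical loan data for illustration.
loans = pd.DataFrame({
    "loan_amount": [240_000, 150_000, 420_000],
    "property_value": [300_000, 310_000, 450_000],
})

# Engineer the loan-to-value ratio and drop the raw columns so the model
# sees one domain-informed feature instead of two highly correlated ones.
loans["ltv"] = loans["loan_amount"] / loans["property_value"]
features = loans.drop(columns=["loan_amount", "property_value"])
print(features)
```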

Feature selection using SHAP values

Feature selection via programmatic approaches can help remove correlated, redundant, or uninformative features that don't contribute to a model's performance. Iteratively building and evaluating models with a progressively larger feature set (forward selection), or starting from the full feature set and removing one feature at a time (backward elimination), helps identify robust features. For example, scikit-learn provides well-documented implementations of several standard approaches.

A more advanced method is to utilize SHAP values to determine how a feature contributes to the final prediction. Let’s look at an example to understand this better.

Features ordered from the highest to the lowest effect on the prediction | Modified based on: source

In the image above, certain features contribute more to the model’s prediction than others. Removing features with small contributions can make the model less complex, easier to reason about, and more robust.
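Here's a minimal sketch of how SHAP values can be computed for a tree-based model with the shap library; the random forest and the X_train/y_train data it is fit on are placeholders, and exact output shapes can vary between shap versions.

```python
import shap
from sklearn.ensemble import RandomForestClassifier

# Assumption: X_train and y_train form a tabular training set (e.g., a pandas DataFrame).
model = RandomForestClassifier(n_estimators=300, random_state=42).fit(X_train, y_train)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_train)

# The summary plot orders features by their average impact on the prediction;
# features with near-zero SHAP values are candidates for removal.
shap.summary_plot(shap_values, X_train)
```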

How to improve ML model performance by improving data

So far, we’ve talked about improving the performance of machine learning models from the algorithmic perspective by analyzing metrics and feature engineering. Let’s look at it from the data side.

Data plays a huge role in making or breaking the model. The age-old saying “Garbage in, Garbage out”  applies in this case as well.

We will examine several frequently used methods, including data quality checks and refinement, active learning, and synthetic data.

Improving ML models through ensuring and refining input data quality

Data quality checks are essential whenever you train machine learning models, particularly in the case of tabular or otherwise structured data. At the very least, you should routinely…

  1. Check for consistent data schemas: Your database table schemas and any intermediate representations of the data across your preprocessing pipeline should match. Thanks to a range of mature open-source tools, this is not as intricate or tedious as it might sound: Great Expectations is a fantastic tool to implement checks for assumptions about data, while Pydantic is an easy-to-adopt library to define typed data models in Python.
  2. Coalesce missing values: If your model cannot deal with unknown values, removing or imputing any missing values in your data is the way to go. You can use techniques like scikit-learn's SimpleImputer, which replaces missing values with a feature's mean, median, or most frequent value. In case you're loading data from a database table, you can utilize SQL functions like COALESCE(), IFNULL(), and NVL() to fill or impute missing values.
  3. Remove outliers: Training models on outliers might result in skewed results and compromised evaluation metrics. Hence, detecting and either removing or standardizing those anomalies is necessary for consistent statistics. You can use boxplots to identify outliers visually, remove all data points that differ from the mean by several standard deviations, or apply Tukey's method.
  4. Eliminate duplicates: Duplicate data leads to flawed evaluation metrics because the same correct or wrong prediction is counted multiple times. Having the same data point in both the training and test set is a case of data leakage, which generally inflates evaluation metrics. (A minimal sketch combining these checks follows below.)
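Below is a minimal sketch combining these checks with pandas and scikit-learn. It assumes the raw training data lives in a DataFrame called df and that all numeric columns should be treated the same way; you'd adapt both to your own schema.

```python
import pandas as pd
from sklearn.impute import SimpleImputer

# Assumption: df is a pandas DataFrame holding the raw tabular training data.
df = df.drop_duplicates()

# Impute missing numeric values with the column median.
numeric_cols = df.select_dtypes(include="number").columns
df[numeric_cols] = SimpleImputer(strategy="median").fit_transform(df[numeric_cols])

# Remove outliers with Tukey's method: drop rows with values outside 1.5 * IQR.
q1 = df[numeric_cols].quantile(0.25)
q3 = df[numeric_cols].quantile(0.75)
iqr = q3 - q1
outlier_mask = ((df[numeric_cols] < q1 - 1.5 * iqr) | (df[numeric_cols] > q3 + 1.5 * iqr)).any(axis=1)
df = df[~outlier_mask]
```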

But what if you have too little data?

Improving ML model performance with active learning

Imagine your model performs well overall but struggles in specific cases, returning wrong predictions with low confidence levels.

A typical reason is that there aren’t enough training samples for the model to pick up the signals in the data. That’s often the case in image classification when each input image consists mostly of meaningless background.

Another typical case is that the relationship between the input features and the target variable is too complex for the problematic samples. For example, in a credit default prediction case, the problem might be an edge case where a single feature of otherwise little relevance becomes the decisive factor.

The active learning process consists of four phases: Identifying samples from the data the current model is uncertain about (Query), labeling these samples (Label), adding them to the training data set (Enrich), and training the next model (Train) | Modified based on: source

Active learning is the process of iteratively selecting the most informative samples from a large pool of unlabeled data and adding them to the training set. The key idea is to strategically choose the data points the model is uncertain about for labeling. This creates precisely the training samples the model needs to improve its performance.
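The query step is typically driven by an uncertainty measure derived from the model's predicted probabilities. Here's a minimal sketch of least-confidence sampling; the trained model, the unlabeled pool X_pool, and the batch size of 100 are assumptions for illustration.

```python
import numpy as np

# Assumption: model is a trained classifier with predict_proba,
# and X_pool is a NumPy array of unlabeled samples.
probabilities = model.predict_proba(X_pool)

# Least-confidence sampling: the lower the top-class probability,
# the more uncertain the model is about a sample.
uncertainty = 1.0 - probabilities.max(axis=1)
query_indices = np.argsort(uncertainty)[-100:]  # the 100 most uncertain samples

# These samples would be sent to annotators (Label), added to the
# training set (Enrich), and used to train the next model (Train).
X_query = X_pool[query_indices]
```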

How to improve ML model performance using data augmentation

The lack of training data is a common bottleneck for developing and improving supervised machine learning models. If even obtaining unlabeled samples is costly or time-consuming, data augmentation is an approach worth considering.

Instead of adding entirely new samples to the training data set, data augmentation techniques generate new samples by modifying existing ones. This allows you to increase the size of your training data set in a scalable fashion. However, it will not increase the diversity of samples significantly.

Which data augmentation technique is right for your problem depends on the kind of data. For example, you can generate new image samples by altering the brightness, hue, and orientation or cropping them. Remember that you have to ensure the original sample’s label is still true for the new sample.
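For images, a minimal augmentation pipeline could be built with torchvision's transforms. The specific transforms and their parameters below are illustrative; pick only those that preserve your labels (a horizontal flip is fine for cats vs. dogs, but not for distinguishing "b" from "d").

```python
from torchvision import transforms

# Assumption: training images are PIL images fed through this transform pipeline.
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomRotation(degrees=15),
    transforms.ColorJitter(brightness=0.2, hue=0.05),
    transforms.RandomResizedCrop(size=224, scale=(0.8, 1.0)),
    transforms.ToTensor(),
])
```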

If data augmentation is impossible or does not lead to the required improvements, you can consider generating completely new samples.

Improving data coverage by generating synthetic samples 

In some cases, you may face a scarcity of data. If acquiring more data is infeasible, manually or synthetically generating data is an avenue worth exploring.

A simple way to create synthetic data is to sample from a probability distribution whose summary statistics mimic that of the original data set. In the case of tabular data, computing a Kernel Density Estimate is a computationally efficient way to obtain a probability distribution for generating realistic new data points.
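A minimal sketch of this idea with scikit-learn's KernelDensity is shown below; the bandwidth and the number of synthetic rows are assumptions you'd tune to your data.

```python
from sklearn.neighbors import KernelDensity

# Assumption: X is a NumPy array of numeric tabular training data.
kde = KernelDensity(kernel="gaussian", bandwidth=0.5).fit(X)

# Draw 1,000 synthetic rows whose distribution mimics the original data.
X_synthetic = kde.sample(n_samples=1000, random_state=42)
```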

Generally, when people talk about synthetic data today, all eyes are on generative models. Thanks to the rapid advances in the space of transformer-based models, it’s now possible to generate synthetic images, text, and even audio and video. For example, you could generate synthetic MRI images to augment a medical image dataset, helping you overcome the challenge of obtaining large amounts of real medical images.

You can also leverage methods based on weak supervision, semi-supervised learning, student-teacher learning, and self-supervised learning to generate training data with noisy labels. These methods are based on the premise that augmenting gold standard labeled data with unlabeled or noisily labeled data leads to significantly improved model performance. For a well-known example of such an advanced method, have a look at the foundational Snorkel paper.

However, no matter which of these approaches you use, you need to remember that synthetic data is generated by computer simulations. It reflects real-world data mathematically or statistically and can be just as good as, or even better than, collected data for training a machine learning model. However, if the simulation does not truly reflect the real world in some aspect, your model will likely struggle with that aspect as well.

Improving deep learning performance by using pre-trained models

Instead of training a deep learning model on a large data set yourself, in certain cases, you can save valuable time and effort by building your models on pre-trained models. There are various sources for freely available state-of-the-art baseline models. For instance, you can access Llama 2 directly from Hugging Face's model hub.

The main advantage of using pre-trained models is that someone else has already taken care of the basic task performance. A ResNet-50 model pre-trained on ImageNet will already have learned to detect structures and objects in images. All that’s left to do on your end is to train the model on your specific computer vision task, a process fittingly called “finetuning.”
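As a minimal PyTorch sketch (assuming a recent torchvision version), fine-tuning a pre-trained ResNet-50 for a hypothetical 5-class task could start like this:

```python
import torch.nn as nn
from torchvision import models

# Load a ResNet-50 with weights pre-trained on ImageNet.
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)

# Freeze the pre-trained backbone so only the new head is trained initially.
for param in model.parameters():
    param.requires_grad = False

# Replace the final classification layer to match the (hypothetical) 5-class task.
model.fc = nn.Linear(model.fc.in_features, 5)
```

From here, you would train the new head on your data and optionally unfreeze deeper layers for further fine-tuning.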

However, an important caveat is that such pre-trained models often carry problems with them. For example, most large language models (LLMs) are pre-trained on vast amounts of text data scraped from the internet. So don’t be surprised if they parrot back phrases they’ve learned from some obscure Reddit discussion or echo questionable political takes.

ML model improvement checklist

To summarize, here’s the model improvement checklist I follow in all my projects:

ML model performance improvement checklist | Source: Author
  1. Check if what your model sees in production is what it sees during training/evaluation
    That’s such a common source of underperforming models that I routinely check this even before I familiarize myself with the model. This is straightforward if you have proper logging and ML model monitoring in place – and discovering that a data mismatch is the root of the problem is a great argument to push for establishing these invaluable practices.
  2. Understand what you are improving
    Are you trying to improve an evaluation metric or the model’s explainability? Both will require you to work on different aspects of the model.
  3. Review the choice of algorithms
    Once you’re clear on what you’re trying to improve, look at the models you’re using. Ask actionable questions like “Is it the best algorithm for the task?” For example, tree-based models like random forests or XGBoost work well if you care about explainability and want to trace back the roots of the prediction.
  4. Have you tried hyperparameter optimization?
    If you’re happy with your algorithm, try tuning your hyperparameters with a grid or random search. Try and milk the most from the current training data set, perhaps even employing an AutoML tool. This will also give you an idea of how good the training data is.
  5. If you’re still not where you need to be, add more quality data
    If you’ve exhausted tweaking the algorithmic side, it’s time to look at the data. Make sure that your data is clean and well-structured, and generate new samples through active learning or generative models.

Conclusion

Improving machine learning models is both a skill and an art. But it’s not magic: If you systematically analyze and address deficiencies in your model, you are almost guaranteed to see at least some improvement. Over time, you’ll develop an intuition and grow your machine learning toolbox. But don’t expect never to be puzzled. Personally, I’ve found that each new machine-learning project comes with surprising insights.

In this article, I have reviewed a set of methods focused on models, their hyperparameters, and the underlying data to improve and update models to attain the required performance levels. Now, it’s up to you to choose what’s best for your particular project.
