How to Deal With Imbalanced Classification and Regression Data

Data imbalance is predominant and inherent in the real world. Data often demonstrates skewed distributions with a long tail. However, most of the machine learning algorithms currently in use were designed around the assumption of a uniform distribution over each target category (classification). 

On the other hand, we must not forget that many tasks involve continuous and even infinite target values (regression), where hard boundaries between classes do not exist (e.g. age prediction, depth estimation, and so on).

Data imbalance | Source: Author

In this article, I’m going to walk you through how to deal with imbalanced data in classification and regression tasks as well as talk about the performance measures you can use for each task in such a setting.

There are 3 main approaches to learning from imbalanced data:

  1. Data approach
  2. Algorithm approach
  3. Hybrid (ensemble) approach

Imbalanced classification data

SMOTE for regression | Source

Imbalanced classification is a well explored and understood topic.

In real-life applications, we often face uneven data representations in which the minority class is usually the more important one, and hence we require methods to improve its recognition rates. This issue poses a serious challenge to predictive modeling because learning algorithms will be biased towards the majority class.

Important day-to-day tasks in our lives such as preventing malicious attacks, detecting life-threatening diseases, or handling rare cases in monitoring systems face extreme class imbalance with ratios ranging from 1:1000 up to 1:5000 and one must design intelligent systems that can adjust and overcome such extreme bias.

How to handle an imbalanced dataset – data approach

How would you handle an imbalanced dataset? | Source

The data approach concentrates on modifying the training set to make it suitable for a standard learning algorithm. This is done by balancing the class distribution of the dataset, in one of two ways:

  • Oversampling 
  • Undersampling 

1. Oversampling

Oversampling | Source

In this approach, we synthesize new examples from the minority class. 

There are several methods available to oversample a dataset used in a typical classification problem. But the most common data augmentation technique is known as Synthetic Minority Oversampling Technique or SMOTE for short. 

Scatter plot of the class distribution before and after SMOTE | Source

As the name suggests, SMOTE creates “synthetic” examples rather than over-sampling with replacement. Specifically, SMOTE works as follows. It starts by randomly selecting a minority class example and finding its k nearest minority class neighbors. One of those neighbors is then chosen at random, and a synthetic example is created at a randomly selected point on the line segment that connects the two examples in feature space.
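The interpolation step can be sketched in a few lines of dependency-free Python. This is only an illustration of the idea, not the reference implementation: the function name `smote_sample`, the toy data, and the choice of Euclidean distance are assumptions made for the example. In practice you would more likely reach for a library implementation such as the `SMOTE` class in imbalanced-learn.

```python
import math
import random


def smote_sample(minority, k=5, rng=None):
    """Create one synthetic point from a list of minority-class feature vectors."""
    rng = rng or random.Random()
    # 1. Pick a random minority example as the seed.
    seed = rng.choice(minority)
    # 2. Find its k nearest minority neighbours (Euclidean distance, seed excluded).
    others = [p for p in minority if p is not seed]
    neighbours = sorted(others, key=lambda p: math.dist(seed, p))[:k]
    # 3. Pick one of those neighbours at random.
    neighbour = rng.choice(neighbours)
    # 4. Place the synthetic point at a random position on the segment seed -> neighbour.
    gap = rng.random()
    return [s + gap * (n - s) for s, n in zip(seed, neighbour)]


# Toy minority class with two features (hypothetical values).
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
rng = random.Random(0)
synthetic = [smote_sample(minority, k=2, rng=rng) for _ in range(3)]
```

Because every synthetic point lies between two real minority examples, the new points stay inside the region already occupied by the minority class.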

SMOTE | Source

When added to the training set, the synthetic minority examples created by SMOTE balance the class distribution and cause the classifier to create larger and less specific decision regions for the minority class. This helps the classifier generalize better, rather than learning the smaller and more specific regions that lead to overfitting.

Decision boundaries | Source

This approach is inspired by data augmentation techniques that proved successful in handwritten character recognition where operations like rotation and skew were natural ways to perturb the training data.

Now, let’s take a look at the performance of SMOTE.

Confusion matrix of classifiers trained on synthetic examples and tested on the imbalanced test set | Source

From the confusion matrix we can notice a few things:

  • The classifiers trained on synthetic examples generalize well.
  • The classifiers identify the minority class well (True Negatives).
  • They have fewer False Positives compared to undersampling.
Advantages 
  • It reduces the overfitting caused by random oversampling, because synthetic examples are generated rather than copies of existing examples.
  • No loss of information.
  • It’s simple.
Disadvantages 
  • While generating synthetic examples, SMOTE does not take into consideration neighboring examples that can be from other classes. This can increase the overlapping of classes and can introduce additional noise.
  • SMOTE is not very practical for high-dimensional data.

2. Undersampling

Undersampling | Source

In this approach, we reduce the number of samples from the majority class to match the number of samples in the minority class.

Scatter plot of the class distribution before and after applying NearMiss-2 | Source

This can be done in a couple of ways:

  1. Random sampler: It is the easiest and fastest way to balance the data by randomly selecting a few samples from the majority class.
  2. NearMiss: Adds some common-sense rules for selecting samples by implementing 3 different heuristics, but in this article, we will only focus on one.
    • NearMiss-2: keeps the majority class examples with the minimum average distance to the three furthest minority class examples.
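The random sampler from the list above can be sketched as follows. This is a minimal, dependency-free illustration; the function name `random_undersample` and the toy data are assumptions for the example. The imbalanced-learn library offers ready-made `RandomUnderSampler` and `NearMiss` classes if you prefer not to write this yourself.

```python
import random
from collections import Counter


def random_undersample(X, y, rng=None):
    """Randomly keep as many examples of each class as the smallest class has."""
    rng = rng or random.Random()
    counts = Counter(y)
    n_min = min(counts.values())
    keep = []
    for label in counts:
        idx = [i for i, lab in enumerate(y) if lab == label]
        # Sample without replacement so that no example is duplicated.
        keep.extend(rng.sample(idx, n_min))
    keep.sort()
    return [X[i] for i in keep], [y[i] for i in keep]


# Toy dataset: 10 majority examples (label 0) and 2 minority examples (label 1).
X = [[i] for i in range(12)]
y = [0] * 10 + [1] * 2
X_res, y_res = random_undersample(X, y, rng=random.Random(42))
balanced_counts = Counter(y_res)
```

Note that eight of the ten majority examples are thrown away here, which is exactly the loss of information discussed in the disadvantages of undersampling.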
Confusion matrix of classifiers trained on undersampled examples and tested on the imbalanced test set | Source

From the confusion matrix we can notice a few things:

  • Undersampling performs poorly compared to oversampling when it comes to identifying the majority class (True Positive). But besides that, it identifies the minority class better than oversampling and has fewer False Negatives.
Advantages
  • Data scientists can balance the dataset and reduce the risk of their analysis or machine learning algorithm skewing toward the majority. Without resampling, they might run into what is known as the accuracy paradox: a classification model reports 90% accuracy, but on closer inspection the correct predictions fall almost entirely within the majority class.
  • Fewer storage requirements and better run times for analyses. Less data means you or your business needs less storage and time to gain valuable insights. 
Disadvantages
  •  Removing enough majority examples to make the majority class the same or similar size to the minority class results in a significant loss of data.
  • The sample of the majority class chosen could be biased, meaning, it might not accurately represent the real world, and the result of the analysis may be inaccurate. Therefore, it can cause the classifier to perform poorly on real unseen data.

Because of these disadvantages, some scientists might prefer oversampling. It doesn’t lead to any loss of information, and in some cases, may perform better than undersampling. But oversampling isn’t perfect either. Because oversampling often involves replicating minority events, it can lead to overfitting.

“The combination of SMOTE and under-sampling performs better than plain under-sampling.”

SMOTE: Synthetic Minority Over-sampling Technique, 2002

To balance these issues, certain scenarios might require a combination of both over and undersampling to obtain the most lifelike dataset and accurate results. 

How to handle imbalanced data – algorithm approach

Algorithm approach – best models for imbalanced classification | Source

This approach concentrates on modifying existing models to alleviate their bias towards the majority groups. This requires good insight into the modified learning algorithm and precise identification of reasons for its failure in learning the representations of skewed distributions. 

The most popular techniques are cost-sensitive approaches (weighted learners). Here, the given model is modified to incorporate varying penalties for each considered group of examples. In other words, we assign a higher weight to the minority class in the cost function, which penalizes the model more for misclassifying the minority class, while reducing the weight of the majority class. This causes the model to pay more attention to the underrepresented class, boosting its importance during the learning process. Focal loss is a related idea: instead of fixed class weights, it down-weights the examples the model already classifies confidently.
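To make the weighting concrete, here is a small sketch of a class-weighted log loss. The helper names are hypothetical, and the weighting rule used (number of samples divided by number of classes times the class count) is one common heuristic, assumed here for illustration rather than taken from a specific model.

```python
import math
from collections import Counter


def balanced_class_weights(y):
    """Weight each class inversely to its frequency: n / (n_classes * count)."""
    counts = Counter(y)
    n, k = len(y), len(counts)
    return {c: n / (k * cnt) for c, cnt in counts.items()}


def weighted_log_loss(y_true, p_pos, weights):
    """Binary cross-entropy where each example is scaled by its class weight."""
    total = 0.0
    for t, p in zip(y_true, p_pos):
        p_t = p if t == 1 else 1.0 - p  # probability assigned to the true class
        total += -weights[t] * math.log(p_t)
    return total / len(y_true)


# 9 majority examples (label 0) and 1 minority example (label 1).
y = [0] * 9 + [1]
weights = balanced_class_weights(y)

# A model that always predicts a low probability for the minority class.
p_pos = [0.1] * 10
plain = weighted_log_loss(y, p_pos, {0: 1.0, 1: 1.0})
weighted = weighted_log_loss(y, p_pos, weights)
```

With these weights the single missed minority example dominates the loss, so the model can no longer look good by ignoring it.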

Another interesting algorithm-level solution is to apply one-class learning or one-class classification (OCC for short), which focuses on a single target group and creates a data description of it. This way we eliminate bias towards any group, as we concentrate only on a single set of objects.

OCC can be useful in imbalanced classification problems because it provides techniques for outlier and anomaly detection. It does this by fitting the model on the majority class data (also known as positive examples) and predicting whether new data belongs to the majority class or to the minority class (also known as negative examples), meaning it’s an outlier/anomaly.

OCC problems are usually practical classification tasks where majority class data is easily available but minority class data is hard, expensive, or even impossible to gather, e.g. monitoring the operation of an engine, fraudulent transactions, intrusion detection for computer systems, and so on.

How to deal with imbalanced data – hybrid approach

Hybrid approach | Source

Hybridization is an approach that exploits the strengths of individual components. When it comes to dealing with imbalanced classification data, some works have proposed hybridizing sampling and cost-sensitive learning, in other words, combining data-level and algorithm-level approaches. This idea of two-stage training, which merges data-level solutions with algorithm-level solutions (e.g. a classifier ensemble) to produce robust and efficient learners, is highly popular.

Example scheme of the hybrid approach | Source

It works by applying a data-level approach first. As you remember the data level approach works by modifying the training set to balance the class distribution between the majority class and the minority by using either oversampling or undersampling. 

Then the pre-processed data with a balanced class distribution is used to train a classifier ensemble, in other words, a collection of multiple classifiers from which a new classifier is derived that typically performs better than any constituent classifier. The result is a robust and efficient learner that inherits the strong points of both data-level and algorithm-level approaches while reducing their weaknesses.

Confusion matrix of hybrid classifiers trained and tested on the imbalanced test set | Source

From the confusion matrix we can notice a few things:

  • The hybrid classifiers perform better than undersampling when it comes to identifying the majority class.
  • They are almost as good as both undersampling and oversampling when it comes to identifying the minority class.

Basically, the hybrid approach takes the best of both worlds!

Performance measures for imbalanced classification

In this section, we review the common performance measures used and their effectiveness when addressing imbalanced classification data.

  • Confusion matrix
  • ROC and AUC
  • Precision and recall
  • F-score

1. Confusion matrix

For binary classification problems, the confusion matrix defines the base for performance measures. Most performance metrics are derived from the confusion matrix, e.g. accuracy, misclassification rate, precision, and recall.

Confusion matrix | Source

However, accuracy is not appropriate when the data is imbalanced, because the model can achieve high accuracy just by predicting the majority class correctly while performing poorly on the minority class, which in most cases is the class we care about the most.
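A tiny example shows how misleading accuracy can be. The labels below are made up for illustration: 95 majority examples, 5 minority examples, and a "model" that always predicts the majority class.

```python
# Hypothetical labels: 0 = majority class, 1 = minority class.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100  # a model that never predicts the minority class

# Share of all predictions that are correct.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Share of minority examples that the model actually finds.
minority_found = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
minority_recall = minority_found / sum(t == 1 for t in y_true)
```

The model scores 95% accuracy while finding none of the minority examples, which is why the metrics in the rest of this section look at each class separately.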

2. ROC and AUC imbalanced data

ROC and AUC imbalanced data | Source 

To accommodate the minority class, the Receiver Operating Characteristic (ROC) curve is proposed as a measure over a range of tradeoffs between the True Positive (TP) Rate and False Positive (FP) Rate. The Area Under the Curve (AUC) is a commonly used performance metric that summarizes the ROC curve in a single score. Moreover, AUC is not biased towards the model’s performance on either the majority or minority class, which makes this measure more appropriate when dealing with imbalanced data.
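One way to build intuition for AUC is its ranking interpretation: it equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it that way. The function name and the scores are made up for illustration, and label 1 is assumed to be the positive class.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Four negatives and two positives with hypothetical model scores.
y_true = [0, 0, 0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.7, 0.9]
auc = roc_auc(y_true, scores)  # 7 of the 8 pairs are ranked correctly
```

Because the score is computed over pairs of one positive and one negative example, the raw class ratio does not enter the formula directly.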

3. Precision and recall

From the confusion matrix, we can also derive precision and recall performance metrics.

Precision and recall | Source

Precision is well suited to class imbalance because its calculation doesn’t include the number of True Negatives, so a large number of correctly classified negatives does not inflate it.

One drawback of precision and recall is that there is a trade-off between the two: when we try to increase the number of TP for the minority class (higher recall), the number of FP can also increase (lower precision).

4. F-score

To balance recall and precision, i.e., to improve recall while keeping precision high, the F-score is proposed as the harmonic mean of precision and recall.

F-score | Source

Since the F-score weights precision and recall equally and balances both concerns, it is less likely to be biased towards the majority or minority class. [2]
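For reference, precision, recall, and the F-score (here the F1 variant, which weights both equally) can be computed directly from confusion-matrix counts. The function name and the counts below are made up for illustration.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    # Harmonic mean: it is low whenever either precision or recall is low.
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts for the minority (positive) class.
precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=8)
# precision = 8 / 10, recall = 8 / 16, f1 = 2 * 0.8 * 0.5 / 1.3
```

Note that True Negatives never appear in any of the three formulas, which is why these metrics are not inflated by a large majority class.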

Check this experiment with the 3 imbalance classification approaches code examples in the Colab notebook I prepared for you.

Imbalanced regression data

Imbalanced regression data | Source

Regression over imbalanced data is not well explored. Yet many important real-life applications in areas like the economy, crisis management, fault diagnosis, or meteorology require us to apply regression over imbalanced data, which means predicting rare and extreme continuous target values from input data.

Because dealing with imbalanced data has been studied mostly in the context of classification tasks, there are few mature or suitable strategies to address it in the context of regression.

Let’s first look at the typical approaches adopted from Imbalanced Classification then we will look into some of the best Imbalanced Regression techniques currently being used.

Approaches adopted from imbalanced classification

Data approach

Adopted from Imbalanced classification | Author: Prince Canuma

When it comes to data approaches for imbalanced regression, we have two techniques that were heavily inspired by imbalanced classification:

  • SMOTER
  • SMOGN
1. SMOTER

SMOTER is an adaptation for regression of the well-known SMOTE algorithm.

It works by defining frequent (majority) and rare (minority) regions using the original label density and then applying random undersampling to the majority region and oversampling to the minority region, where the user has to pre-determine the percentage of over and undersampling to be carried out by the SMOTER algorithm.

When it comes to oversampling the minority regions, SMOTER generates new synthetic examples using an interpolation strategy that combines the inputs and targets of different examples. Precisely, this interpolation is carried out using two rare cases, where one is a seed case and the other is randomly selected from the k-nearest neighbors of the seed. The features of the two cases are interpolated, and the new target variable is determined as a weighted average of the target variables of the two rare cases used.

Why do we have to average the target variables, you might ask? In the original SMOTE algorithm this was trivial, because all rare cases have the same class (the target minority class). In regression the answer is not so trivial, because the pair of examples used to generate a new synthetic case will generally not have the same target variable value.
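The generation of a single synthetic case can be sketched as follows. This is a simplified illustration under stated assumptions: the function name `smoter_sample` is hypothetical, the neighbor search is omitted (the seed and neighbor are passed in directly), and the target weights are taken to be inversely related to the distance between the new point and each of its two parents.

```python
import math
import random


def smoter_sample(seed, neighbour, rng=None):
    """Build one synthetic (features, target) case from two rare cases."""
    rng = rng or random.Random()
    (x_s, y_s), (x_n, y_n) = seed, neighbour
    # Interpolate the features at a random position between the two cases.
    gap = rng.random()
    x_new = [a + gap * (b - a) for a, b in zip(x_s, x_n)]
    # Distance of the new point to each parent.
    d_s = math.dist(x_new, x_s)
    d_n = math.dist(x_new, x_n)
    if d_s + d_n == 0:
        y_new = (y_s + y_n) / 2  # identical parents: plain average
    else:
        # The closer parent contributes more to the new target value.
        y_new = (d_n * y_s + d_s * y_n) / (d_s + d_n)
    return x_new, y_new


# Two hypothetical rare cases: (features, target).
seed = ([0.0, 0.0], 10.0)
neighbour = ([1.0, 0.0], 20.0)
x_new, y_new = smoter_sample(seed, neighbour, rng=random.Random(1))
```

In this toy setup a synthetic point 30% of the way from the seed to the neighbor gets a target of 13.0, so the target moves smoothly along with the features.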

2. SMOGN

SMOGN takes after SMOTER, but it adds a Gaussian noise strategy to the oversampling phase alongside the interpolation SMOTER already uses.

The key idea of the SMOGN algorithm is to combine the SMOTER and Gaussian noise strategies for generating synthetic examples. This limits the risks that SMOTER can incur, since the most distant examples are not used in the interpolation process and the more conservative strategy of introducing Gaussian noise is used instead, while still producing more diverse examples than Gaussian noise alone. It works by generating new synthetic examples with SMOTER only when the seed example and the selected k-nearest neighbor are close enough, and by using Gaussian noise when the two examples are more distant.
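The switch between the two strategies can be sketched like this. It is a rough illustration only: the function name, the `max_dist` threshold, and the `noise_scale` value are hypothetical parameters chosen for the example, not the values used by the SMOGN authors.

```python
import math
import random


def smogn_sample(seed, neighbour, max_dist, noise_scale=0.02, rng=None):
    """Generate one synthetic case, choosing the strategy by distance."""
    rng = rng or random.Random()
    (x_s, y_s), (x_n, y_n) = seed, neighbour
    if math.dist(x_s, x_n) < max_dist:
        # Close enough: SMOTER-style interpolation of features and target.
        gap = rng.random()
        x_new = [a + gap * (b - a) for a, b in zip(x_s, x_n)]
        y_new = y_s + gap * (y_n - y_s)
        return x_new, y_new, "interpolation"
    # Too far apart: perturb the seed with small Gaussian noise instead.
    x_new = [a + rng.gauss(0.0, noise_scale) for a in x_s]
    y_new = y_s + rng.gauss(0.0, noise_scale)
    return x_new, y_new, "gaussian_noise"


seed = ([0.0, 0.0], 10.0)
near = ([0.1, 0.0], 11.0)
far = ([5.0, 5.0], 40.0)
rng = random.Random(0)
_, _, strategy_near = smogn_sample(seed, near, max_dist=0.5, rng=rng)
x_far, y_far, strategy_far = smogn_sample(seed, far, max_dist=0.5, rng=rng)
```

With the distant neighbor, the synthetic case stays right next to the seed instead of landing somewhere on the long segment between two dissimilar examples.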

Algorithm approach

Algorithm approach | Source: Author

Like in imbalanced classification, this approach includes adjusting the loss function to compensate for region imbalance (re-weighting), as well as other relevant learning paradigms such as transfer learning, metric learning, two-stage training, and meta-learning [4]. Here we will focus on two loss-based techniques:

  • Error-aware loss
  • Cost-sensitive re-weighting 
1. Error-aware loss

It is the regression version of the Focal Loss for classification, called Focal-R. Focal loss is a dynamically weighted cross-entropy loss, where the weighting factor (1 − pt)^γ decays to zero as confidence in the correct class increases.

The focal loss down-weights easy examples with a weighting factor of (1 − pt)^γ | Source

Focal-R replaces the weighting factor with a continuous function that maps the absolute error (L1 distance) into values in the range of 0 to 1.

Precisely, Focal-R loss based on L1 distance can be written as:

Focal-R loss based on L1 distance | Source

Where e_i is the L1 error for the i-th sample, σ(·) is the sigmoid function, and β, γ are hyper-parameters.
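A minimal sketch of Focal-R, assuming the loss is the mean over samples of σ(|β·e_i|)^γ · e_i, as the definitions above suggest. The function name and the default values for `beta` and `gamma` are hypothetical choices for the example. Also note that a plain sigmoid of a non-negative input only covers the upper half of the 0 to 1 range, so a rescaled version may be preferable in practice; that variant is not shown here.

```python
import math


def focal_r_l1(preds, targets, beta=0.2, gamma=1.0):
    """Focal-R style loss: L1 error scaled by sigmoid(|beta * error|) ** gamma."""
    total = 0.0
    for p, t in zip(preds, targets):
        e = abs(p - t)  # L1 error of this sample
        # Larger errors get a weight closer to 1, small errors are down-weighted.
        weight = (1.0 / (1.0 + math.exp(-abs(beta * e)))) ** gamma
        total += weight * e
    return total / len(preds)


perfect = focal_r_l1([1.0, 2.0], [1.0, 2.0])        # no error, loss is 0
off_by_two = focal_r_l1([3.0], [1.0], beta=1.0)     # sigmoid(2) * 2
```

Raising `gamma` makes the down-weighting of small errors more aggressive, mirroring the role γ plays in the classification focal loss.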

2. Cost-sensitive re-weighting

Since the target space can be divided into finite bins, classic re-weighting schemes can be directly plugged in, such as inverse-frequency weighting (INV) and its square-root variant (SQINV), both of which are based on the label distribution.
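Both schemes can be sketched in a few lines. The function name, the bin width of 10, and the normalization (weights rescaled to average 1) are assumptions made for this example.

```python
import math
from collections import Counter


def reweight(targets, bin_width=10, sqrt=False):
    """Per-sample weights from binned label frequencies (INV or SQINV)."""
    bins = [int(t // bin_width) for t in targets]
    counts = Counter(bins)
    # INV: 1 / count, SQINV: 1 / sqrt(count) of the sample's bin.
    raw = [1.0 / (math.sqrt(counts[b]) if sqrt else counts[b]) for b in bins]
    # Rescale so that the weights average to 1 (an assumption of this sketch).
    scale = len(raw) / sum(raw)
    return [w * scale for w in raw]


# Hypothetical age labels: nine people in their twenties, one aged 85.
ages = [25, 27, 23, 29, 21, 26, 24, 28, 22, 85]
inv = reweight(ages)
sqinv = reweight(ages, sqrt=True)
```

Here the rare sample is weighted nine times more than a common one under INV but only three times more under SQINV, which is why the square-root variant is the gentler of the two.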

Hybrid approach

Hybrid approach | Source

Like the hybrid approach for imbalanced classification, the imbalanced regression hybrid approach combines data-level and algorithm-level approaches in order to produce robust and efficient learners.

An example of this approach is the Bagging-based ensemble.

Bagging-based ensemble

This algorithm incorporates data pre-processing strategies for addressing imbalanced domains in regression tasks.

Precisely, a paper entitled “REBAGG: REsampled BAGGing for Imbalanced Regression” proposes an algorithm that obtains diversity on the generated models while simultaneously biasing them towards the least represented and more important cases.

It has two main steps:

  1. Build a number of models using pre-processed samples of the training set.
  2. Use the trained models to obtain predictions on unseen data by applying an averaging strategy (basically averaging models’ predictions to obtain the final predictions).

Regarding the first step, the authors developed four main types of resampling methods to apply to the original training set: balance, balance.SMT, variation, and variation.SMT. The key distinguishing features of these methods relate to:

i) the ratio between the number of minority and majority examples used in the new sample; and,

ii) how new minority examples are obtained.

On the resampling methods labeled with the prefix “balance”, the new modified training set will have the same number of minority and majority examples. On the other hand, for resampling methods with the prefix “variation”, the ratio of minority to majority examples in the new training set will vary.

When the resampling method has no suffix appended, then the new synthetic examples for minority region are obtained by using exact copies of randomly selected minority examples. And when the suffix “SMT” is appended the new synthetic examples for the minority region are obtained using the SMOTER algorithm.
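The two steps can be sketched with the “balance” resampling method and a deliberately simple base learner. Everything here is an illustrative assumption rather than the authors’ implementation: the function names, the one-feature nearest-neighbor regressor, the sample size per region, and drawing the majority cases with replacement.

```python
import random


def balanced_sample(rare, common, n_per_region, rng):
    """'balance' resampling: the same number of rare and common cases.

    Rare cases are exact copies drawn with replacement (no SMT suffix).
    """
    return ([rng.choice(rare) for _ in range(n_per_region)] +
            [rng.choice(common) for _ in range(n_per_region)])


def nearest_neighbour_predict(sample, x):
    """Toy base model: return the target of the closest training case."""
    return min(sample, key=lambda case: abs(case[0] - x))[1]


def rebagg_predict(rare, common, x, n_models=10, n_per_region=5, seed=0):
    rng = random.Random(seed)
    preds = []
    for _ in range(n_models):
        # Step 1: train each model on its own resampled training set.
        sample = balanced_sample(rare, common, n_per_region, rng)
        preds.append(nearest_neighbour_predict(sample, x))
    # Step 2: average the predictions of all models.
    return sum(preds) / len(preds)


# Hypothetical (feature, target) cases: many common ones, two rare extremes.
common = [(float(x), 2.0 * x) for x in range(1, 9)]
rare = [(9.0, 50.0), (9.5, 55.0)]
prediction = rebagg_predict(rare, common, x=9.2)
```

Because every resampled training set contains rare cases, each model in the ensemble gets to see the extreme region, and the averaged prediction for a query near that region reflects it.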

Deep Imbalanced Regression (DIR)

The methods adopted from imbalanced classification work; however, there are several drawbacks to using them alone.

Allow me to make a case!

Figure 1. Comparison on the test error distribution (bottom) using the same training label distribution (top) on two different datasets | Source

The above datasets have intrinsically different label spaces: (a) CIFAR-100 exhibits a categorical label space where the target is a class index, while (b) IMDB-WIKI exhibits a continuous label space where the target is age.

As you can see, the label density distribution is the same for both, but the error distributions are very different. The error distribution for IMDB-WIKI is much smoother and doesn’t correlate well with the label density distribution. This affects how imbalanced learning methods work because, directly or indirectly, they operate by compensating for the imbalance in the empirical label density distribution. That works well for imbalanced classification but not for continuous labels. Instead, you have to find a way to smooth the label distribution.

Label distribution smoothing (LDS) for imbalanced data density estimation

Figure 2. Label distribution smoothing (LDS) convolves a symmetric kernel with the empirical label density to estimate the effective label density distribution that accounts for the continuity of labels | Source

From figure 2 above we can see that in the continuous space empirical label distribution does not match the real label density distribution. Why is this? Because of the dependence between data samples at nearby labels, in this case, we are talking about images of close age.

LDS uses kernel density estimation to learn the effective imbalance in datasets that corresponds to continuous targets. Precisely, LDS convolves a symmetric kernel with the empirical density distribution to extract a kernel-smoothed version that accounts for the overlap in the information of data samples of nearby labels.

Note: Gaussian and Laplacian kernels are examples of symmetric kernels.

The symmetric kernel characterizes the similarity between target values y’ and y w.r.t their distance in the target space. 
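The convolution at the heart of LDS can be sketched as follows. This is a simplified, dependency-free illustration: the function names, the kernel size, the value of sigma, and the toy histogram are all assumptions made for the example.

```python
import math


def gaussian_kernel(half_width, sigma):
    """Symmetric Gaussian kernel over label bins, normalised to sum to 1."""
    k = [math.exp(-(i * i) / (2 * sigma * sigma))
         for i in range(-half_width, half_width + 1)]
    total = sum(k)
    return [v / total for v in k]


def lds(empirical, half_width=2, sigma=1.0):
    """Convolve the empirical label histogram with a symmetric kernel."""
    kernel = gaussian_kernel(half_width, sigma)
    n = len(empirical)
    smoothed = []
    for i in range(n):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = i + j - half_width
            if 0 <= idx < n:  # bins outside the label range contribute nothing
                acc += w * empirical[idx]
        smoothed.append(acc)
    return smoothed


# Hypothetical histogram: number of training samples per label bin.
empirical = [0, 0, 50, 0, 0, 10, 10, 10, 10, 10]
effective = lds(empirical)
```

Bins with no samples of their own but with a well-populated neighbor end up with a non-zero effective density, reflecting that the model can still learn something about them from nearby labels.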

Figure 2 shows that LDS captures the real imbalance that affects regression. By applying LDS we get a label density distribution that correlates well with the error distribution (a correlation of -0.83).

Once you have the effective label density, you can then use the adapted techniques for addressing imbalanced classification that we talked about earlier (e.g. the cost-sensitive re-weighting method).

Feature distribution smoothing (FDS)

Top: Cosine similarity of the feature means at a particular age w.r.t its value at the anchor age. Bottom: Cosine similarity of the feature variance at a particular age w.r.t its value at the anchor age. The color of the background refers to data density in a particular target range | Source

The above figure displays the feature statistics similarity for age 30 (the anchor). You can notice right away that the bins surrounding the anchor are highly similar to it, especially the closest ones. But examining the figure further, you will notice a problem in regions with very few data samples (e.g. ages 0-6). Due to data imbalance, their mean and variance show an unjustifiably high similarity to age 30.

The creators of the Feature distribution smoothing (FDS) algorithm were inspired by these observations. FDS performs distribution smoothing on the feature space, or in other words, transfers feature statistics between nearby target bins. This calibrates the potentially biased estimates of the feature distribution, especially for underrepresented target values in the training data.

Feature distribution smoothing (FDS) | Source

And one great thing about FDS is that you can integrate it into deep neural networks by inserting a feature calibration layer after the final feature map.
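The smoothing and calibration idea can be sketched for a single feature dimension. This is a heavily simplified illustration under several assumptions: the function names, the three-tap kernel, the renormalization at the edges, and the toy statistics are hypothetical, and details of a real calibration layer such as running statistics updated during training are left out.

```python
import math


def smooth(values, kernel):
    """Kernel-smooth per-bin statistics across neighbouring target bins."""
    half = len(kernel) // 2
    n = len(values)
    out = []
    for i in range(n):
        num = den = 0.0
        for j, w in enumerate(kernel):
            idx = i + j - half
            if 0 <= idx < n:
                num += w * values[idx]
                den += w
        out.append(num / den)  # renormalise at the edges of the label range
    return out


def calibrate(z, mean, var, smoothed_mean, smoothed_var, eps=1e-8):
    """Re-standardise a feature with its bin's smoothed statistics."""
    return math.sqrt((smoothed_var + eps) / (var + eps)) * (z - mean) + smoothed_mean


# Per-bin mean and variance of one feature; bin 2 has very few samples,
# so its estimates are assumed to be unreliable.
bin_means = [1.0, 1.1, 5.0, 1.3, 1.4]
bin_vars = [0.5, 0.5, 0.01, 0.5, 0.5]
kernel = [0.25, 0.5, 0.25]

s_means = smooth(bin_means, kernel)
s_vars = smooth(bin_vars, kernel)
z_cal = calibrate(5.2, bin_means[2], bin_vars[2], s_means[2], s_vars[2])
```

After smoothing, the statistics of the sparse bin are pulled towards those of its neighbors, and features from that bin are shifted and rescaled accordingly.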

Benchmarking 

Benchmarking results on STS-B-DIR | Source

Results reported on the Semantic Textual Similarity Benchmark (STS-B-DIR) dataset using various algorithms.

The authors show that coupling LDS and FDS with other existing methods for regression over imbalanced data significantly improves performance [4].

Performance measures for imbalanced regression

When it comes to evaluation metrics for this kind of problem, you can use the common metrics for regression such as MAE, MSE, Pearson correlation, and Geometric Mean (GM) error, alongside the techniques we explored in this section.
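These metrics are straightforward to compute by hand. In the sketch below, GM is taken to be the geometric mean of the absolute errors, and the function name, the small `eps` guard against log(0), and the toy values are assumptions for the example.

```python
import math


def regression_metrics(y_true, y_pred, eps=1e-10):
    """MAE, MSE, geometric mean of absolute errors, and Pearson correlation."""
    errors = [abs(t - p) for t, p in zip(y_true, y_pred)]
    n = len(errors)
    mae = sum(errors) / n
    mse = sum(e * e for e in errors) / n
    # Geometric mean via logs; eps avoids log(0) for perfect predictions.
    gm = math.exp(sum(math.log(e + eps) for e in errors) / n)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    pearson = cov / math.sqrt(var_t * var_p)
    return {"mae": mae, "mse": mse, "gm": gm, "pearson": pearson}


# Hypothetical targets and predictions; the last (rare, extreme) one is far off.
metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [2.0, 3.0, 4.0, 8.0])
```

Compared with MSE, the geometric mean is far less dominated by the single large error, so reporting both gives a fuller picture of how errors are spread across samples. On imbalanced data it also helps to report these metrics separately for frequent and rare target regions.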

Crucial open issues to address when developing novel methods for Imbalanced regression

  • Development of cost-sensitive regression solutions that can adapt the cost to the degree of importance assigned to rare observations. To allow for more flexibility in predicting rare events of differing importance it would be rather interesting to investigate the possibility of adapting the cost not only to the minority group but to each individual observation.
  • Methods that will allow distinguishing between minority and noisy samples must be proposed.
  • Development of better ensemble learning methods as in classification may offer a significant improvement in both robustness to skewed distributions and predictive power.

Conclusion

Canonical ML algorithms assume that the number of objects in considered classes is roughly similar. However, in many real-life problems that we can apply ML to, the distribution of examples is skewed since the events that we care the most about and want to predict happen rarely and for the most part, we collect data points of normal events which represent the normal state and majority group. This poses a difficulty for learning algorithms, as they will be biased towards the majority group.

But in this article, you learned about the different approaches to learning from imbalanced classification and regression data. 

Thank you for reading! And as always I have a well-researched reference section that you can use to dive deeper into what you read below as well as a colab notebook.

References

  1. https://link.springer.com/content/pdf/10.1007/s13748-016-0094-0.pdf
  2. https://arxiv.org/abs/2104.02240
  3. https://arxiv.org/pdf/1106.1813.pdf
  4. Deep Imbalanced regression
    1. https://towardsdatascience.com/strategies-and-tactics-for-regression-on-imbalanced-data-61eeb0921fca
    2. https://arxiv.org/pdf/2102.09554.pdf
  5. https://imbalanced-learn.org
  6. https://www.analyticsvidhya.com/blog/2020/10/improve-class-imbalance-class-weights/
  7. https://machinelearningmastery.com/one-class-classification-algorithms/
  8. https://dataaspirant.com/handle-imbalanced-data-machine-learning/
  9. https://imbalanced-learn.org/stable/auto_examples/ensemble/plot_comparison_ensemble_classifier.html
  10. https://imbalanced-learn.org/stable/ensemble.html
  11. https://imbalanced-learn.org/stable/over_sampling.html#from-random-over-sampling-to-smote-and-adasyn
  12. https://www.fromthegenesis.com/smote-synthetic-minority-oversampling-technique/
  13. https://www.datacamp.com/community/tutorials/diving-deep-imbalanced-data
  14. https://www.reddit.com/r/datascience/comments/92az1l/how_to_handle_imbalanced_classification_problem/e34e64k
  15. https://www.sciencedirect.com/topics/engineering/decisions-region
  16. https://www.amazon.com/dp/1118074629/ref=as_li_ss_tl?&linkCode=sl1&tag=inspiredalgor-20&linkId=615e87a9105582e292ad2b7e2c7ea339&language=en_US
  17. Hybrid Classifiers: Methods of Data, Knowledge, and Classifier Combination. In: Studies in Computational Intelligence, vol. 519. Springer, Berlin (2014)
  18. https://www.coursera.org/learn/ml-regression
  19. https://researchcommons.waikato.ac.nz/bitstream/handle/10289/8518/smoteR.pdf
  20. https://arxiv.org/abs/1708.02002v2
  21. https://www.mastersindatascience.org/learning/statistics-data-science/undersampling/
