Data imbalance is predominant and inherent in the real world. Data often demonstrates skewed distributions with a long tail. However, most of the machine learning algorithms currently in use were designed around the assumption of a uniform distribution over each target category (classification).
On the other hand, we must not forget that many tasks involve continuous targets that can take infinitely many values (regression), where hard boundaries between classes do not exist (e.g. age prediction, depth estimation, and so on).
In this article, I’m going to walk you through how to deal with imbalanced data in classification and regression tasks as well as talk about the performance measures you can use for each task in such a setting.
There are 3 main approaches to learning from imbalanced data:
- 1 Data approach
- 2 Algorithm approach
- 3 Hybrid (ensemble) approach
Imbalanced classification data
Imbalanced classification is a well-explored and well-understood topic.
In real-life applications, we often have uneven data representations in which the minority class is usually the more important one, so we need methods that improve its recognition rate. This poses a serious challenge to predictive modeling because learning algorithms will be biased towards the majority class.
Important day-to-day tasks in our lives such as preventing malicious attacks, detecting life-threatening diseases, or handling rare cases in monitoring systems face extreme class imbalance with ratios ranging from 1:1000 up to 1:5000 and one must design intelligent systems that can adjust and overcome such extreme bias.
How to handle an imbalanced dataset – data approach
The data approach concentrates on modifying the training set to make it suitable for a standard learning algorithm. This is done by balancing the class distribution of the dataset in one of two ways: oversampling or undersampling.
In oversampling, we synthesize new examples from the minority class.
There are several methods available to oversample a dataset used in a typical classification problem. But the most common data augmentation technique is known as Synthetic Minority Oversampling Technique or SMOTE for short.
As the name suggests, SMOTE creates “synthetic” examples rather than over-sampling with replacement. Specifically, SMOTE works the following way. It starts by randomly selecting a minority class example and finding its k nearest minority class neighbors. It then picks one of those neighbors at random and creates a synthetic example at a randomly selected point on the line segment that connects the two examples in feature space.
When added to the training set, the synthetic minority examples created by SMOTE balance the class distribution. This causes the classifier to create larger and less specific decision regions for the minority class, rather than smaller and more specific ones, which helps it generalize better and mitigates overfitting.
This approach is inspired by data augmentation techniques that proved successful in handwritten character recognition where operations like rotation and skew were natural ways to perturb the training data.
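The SMOTE steps described above can be sketched in a few lines. This is a minimal standard-library illustration, not the reference implementation: the function name `smote_sample` and its parameters are made up for this example, distances are plain Euclidean, and the minority set is assumed to contain at least two points.

```python
import math
import random

def smote_sample(minority, k=5, n_new=10, seed=0):
    """Create n_new synthetic points from a list of minority feature vectors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        # 1. Pick a random minority example as the starting point.
        base = rng.choice(minority)
        # 2. Find its k nearest minority neighbors (excluding itself).
        others = [p for p in minority if p is not base]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        # 3. Pick one of those neighbors at random.
        neighbor = rng.choice(neighbors)
        # 4. Interpolate at a random position on the segment base -> neighbor.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic

minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_sample(minority, k=2, n_new=5)
```

Because every synthetic point lies on a segment between two existing minority points, the new examples stay inside the region the minority class already occupies.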
Now, let’s take a look at the performance of SMOTE.
From the confusion matrix we can notice a few things:
- The classifiers trained on synthetic examples generalize well.
- The classifiers identify the minority class well (True Negatives).
- They have fewer False Positives compared to undersampling.
Advantages of SMOTE:
- It reduces the overfitting caused by random oversampling, as synthetic examples are generated rather than copies of existing examples.
- No loss of information.
- It’s simple.
Disadvantages of SMOTE:
- While generating synthetic examples, SMOTE does not take into consideration neighboring examples that can be from other classes. This can increase the overlapping of classes and can introduce additional noise.
- SMOTE is not very practical for high-dimensional data.
In undersampling, we reduce the number of samples from the majority class to match the number of samples in the minority class.
This can be done in a couple of ways:
- Random sampler: the easiest and fastest way to balance the data, by randomly selecting a few samples from the majority class.
- NearMiss: adds some common-sense rules for choosing which samples to keep. It implements 3 different heuristics, but in this article we will only focus on one.
- NearMiss-2: keeps the majority class examples with the minimum average distance to the three furthest minority class examples.
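To make the two undersampling strategies concrete, here is a small standard-library sketch. The function names are hypothetical, distances are Euclidean, and `nearmiss2` follows the NearMiss-2 rule as described above (smallest average distance to the three furthest minority examples); treat it as an illustration rather than a drop-in replacement for a library implementation.

```python
import math
import random

def random_undersample(majority, n_keep, seed=0):
    """Keep n_keep majority examples chosen uniformly at random."""
    rng = random.Random(seed)
    return rng.sample(majority, n_keep)

def nearmiss2(majority, minority, n_keep, n_far=3):
    """Keep the majority examples closest, on average, to their n_far
    furthest minority examples."""
    def avg_dist_to_furthest(point):
        dists = sorted((math.dist(point, m) for m in minority), reverse=True)
        furthest = dists[:n_far]
        return sum(furthest) / len(furthest)
    return sorted(majority, key=avg_dist_to_furthest)[:n_keep]

majority = [[0.0, 0.0], [10.0, 10.0], [1.0, 1.0], [5.0, 5.0]]
minority = [[0.0, 1.0], [1.0, 0.0], [2.0, 2.0]]
kept_random = random_undersample(majority, n_keep=3)
kept_nearmiss = nearmiss2(majority, minority, n_keep=2)
```

In this toy data the two majority points that sit among the minority examples are the ones NearMiss-2 keeps, while the random sampler ignores geometry entirely.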
From the confusion matrix we can notice a few things:
- Undersampling performs poorly compared to oversampling when it comes to identifying the majority class (True Positive). But besides that, it identifies the minority class better than oversampling and has fewer False Negatives.
Advantages of undersampling:
- Data scientists can balance the dataset and reduce the risk of their analysis or machine learning algorithm skewing toward the majority. Without resampling, they might run into what is known as the accuracy paradox: a classification model reports 90% accuracy, but on closer inspection the correct predictions fall almost entirely within the majority class.
- Fewer storage requirements and better run times for analyses. Less data means you or your business needs less storage and time to gain valuable insights.
Disadvantages of undersampling:
- Removing enough majority examples to make the majority class the same or similar size to the minority class results in a significant loss of data.
- The sample of the majority class chosen could be biased, meaning it might not accurately represent the real world, and the result of the analysis may be inaccurate. Therefore, it can cause the classifier to perform poorly on real unseen data.
Because of these disadvantages, some scientists might prefer oversampling. It doesn’t lead to any loss of information, and in some cases, may perform better than undersampling. But oversampling isn’t perfect either. Because oversampling often involves replicating minority events, it can lead to overfitting.
“The combination of SMOTE and under-sampling performs better than plain under-sampling.”
To balance these issues, certain scenarios might require a combination of both over and undersampling to obtain the most lifelike dataset and accurate results.
How to handle imbalanced data – algorithm approach
This approach concentrates on modifying existing models to alleviate their bias towards the majority groups. This requires good insight into the modified learning algorithm and precise identification of reasons for its failure in learning the representations of skewed distributions.
The most popular techniques are cost-sensitive approaches (weighted learners). Here, the given model is modified to incorporate varying penalties for each considered group of examples. In other words, we assign a higher weight to the minority class in our cost function, which penalizes the model more for misclassifying the minority class, while at the same time reducing the weight of the majority class. This causes the model to pay more attention to the underrepresented class, boosting its importance during the learning process. Focal loss is a related technique that additionally down-weights easy, well-classified examples.
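As a rough illustration of cost-sensitive weighting, the sketch below computes inverse-frequency class weights and plugs them into a binary cross-entropy. The helper names are invented for this example, and the weighting rule `n_samples / (n_classes * class_count)` is one common heuristic, assumed here rather than taken from the article.

```python
import math

def class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    n, k = len(labels), len(counts)
    return {c: n / (k * cnt) for c, cnt in counts.items()}

def weighted_log_loss(labels, probs, weights, eps=1e-12):
    """Weighted binary cross-entropy; probs[i] is the predicted P(y = 1)."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)  # avoid log(0)
        loss = -math.log(p) if y == 1 else -math.log(1 - p)
        total += weights[y] * loss
    return total / sum(weights[y] for y in labels)

labels = [0] * 9 + [1]          # 9 majority examples, 1 minority example
probs = [0.1] * 10              # a model that always leans towards class 0
weights = class_weights(labels)
plain = weighted_log_loss(labels, probs, {0: 1.0, 1: 1.0})
weighted = weighted_log_loss(labels, probs, weights)
```

With the weights applied, the single missed minority example dominates the loss, so a model that ignores the minority class is penalized much more heavily than under the unweighted loss.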
Another interesting algorithm-level solution is to apply one-class learning or one-class classification (OCC for short), which focuses on the target group and creates a data description of it. This way we eliminate bias towards any group, as we concentrate only on a single set of objects.
OCC can be useful in imbalanced classification problems because it provides techniques for outlier and anomaly detection. It does this by fitting the model on the majority class data (also known as positive examples) and predicting whether new data belong to the majority class or to the minority class (also known as negative examples), meaning they are outliers/anomalies.
OCC problems are usually practical classification tasks where majority class data is easily available but minority class data is hard, expensive, or even impossible to gather, e.g. engine operation, fraudulent transactions, intrusion detection for computer systems, and so on.
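The idea of fitting only on the majority class can be shown with a deliberately simple stand-in for a real one-class model: describe the "normal" data by per-feature mean and standard deviation, then flag anything that falls too many standard deviations away. The function names and the threshold of 3 are assumptions for illustration; practical OCC methods build much richer data descriptions.

```python
import statistics

def fit_one_class(normal_points):
    """Describe the majority (normal) class by per-feature mean and std."""
    columns = list(zip(*normal_points))
    means = [statistics.fmean(c) for c in columns]
    # Fall back to 1.0 for constant features to avoid dividing by zero.
    stdevs = [statistics.pstdev(c) or 1.0 for c in columns]
    return means, stdevs

def is_outlier(point, model, threshold=3.0):
    """Flag a point if any feature lies more than `threshold` stds from the mean."""
    means, stdevs = model
    return any(abs(x - m) / s > threshold for x, m, s in zip(point, means, stdevs))

normal = [[1.0, 10.0], [1.2, 9.8], [0.9, 10.1], [1.1, 10.0], [0.8, 10.1]]
model = fit_one_class(normal)
```

Note that no minority examples are needed at training time, which is exactly why this family of methods suits problems where anomalies are hard to collect.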
How to deal with imbalanced data – hybrid approach
Hybridization is an approach that exploits the strengths of individual components. When it comes to dealing with imbalanced classification data, some works have proposed hybridizing sampling and cost-sensitive learning, in other words, combining data-level and algorithm-level approaches. This idea of two-stage training, which merges data-level solutions with algorithm-level solutions (e.g. a classifier ensemble) to produce robust and efficient learners, is highly popular.
It works by applying a data-level approach first. As you remember the data level approach works by modifying the training set to balance the class distribution between the majority class and the minority by using either oversampling or undersampling.
Then the pre-processed data with a balanced class distribution is used to train a classifier ensemble, in other words, a collection of multiple classifiers from which a new classifier is derived that performs better than any constituent classifier. The result is a robust and efficient learner that inherits the strong points of both data-level and algorithm-level approaches while reducing their weaknesses.
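A minimal sketch of this two-stage idea: each ensemble member is trained on a balanced subsample (data-level step), and the members vote on the final label (algorithm-level step). The nearest-centroid base classifier and all function names are choices made for this example only; any base learner could take its place, and the code assumes the majority class is at least as large as the minority class.

```python
import math
import random

def fit_centroids(points, labels):
    """Base learner: store the mean feature vector of each class."""
    sums, counts = {}, {}
    for p, y in zip(points, labels):
        acc = sums.setdefault(y, [0.0] * len(p))
        for i, v in enumerate(p):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}

def predict_centroid(model, point):
    return min(model, key=lambda y: math.dist(model[y], point))

def fit_balanced_bagging(points, labels, minority_label, n_models=5, seed=0):
    rng = random.Random(seed)
    minority = [p for p, y in zip(points, labels) if y == minority_label]
    majority = [(p, y) for p, y in zip(points, labels) if y != minority_label]
    models = []
    for _ in range(n_models):
        # Data-level step: undersample the majority down to the minority size.
        sample = rng.sample(majority, len(minority))
        xs = minority + [p for p, _ in sample]
        ys = [minority_label] * len(minority) + [y for _, y in sample]
        # Algorithm-level step: train one ensemble member on the balanced set.
        models.append(fit_centroids(xs, ys))
    return models

def predict_ensemble(models, point):
    """Majority vote over the ensemble members."""
    votes = [predict_centroid(m, point) for m in models]
    return max(set(votes), key=votes.count)

points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.5, 0.5], [0.2, 0.8],
          [5.0, 5.0], [5.0, 6.0]]
labels = [0, 0, 0, 0, 0, 0, 1, 1]
ensemble = fit_balanced_bagging(points, labels, minority_label=1)
```

Because each member sees a different majority subsample, the ensemble as a whole still uses most of the majority data even though every individual training set is balanced.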
From the confusion matrix we can notice a few things:
- The hybrid classifiers perform better than undersampling when it comes to identifying the majority class.
- They are almost as good as both undersampling and oversampling when it comes to identifying the minority class.
In short, the hybrid approach takes the best of both worlds!
Performance measures for imbalanced classification
In this section, we review the common performance measures used and their effectiveness when addressing imbalanced classification data.
- Confusion matrix
- ROC and AUC
- Precision recall
1. Confusion matrix
For binary classification problems, the confusion matrix defines the base for performance measures. Most of the performance metrics are derived from the confusion matrix, e.g. accuracy, misclassification rate, precision, and recall.
However, accuracy is not appropriate when the data is imbalanced, because the model can achieve high accuracy just by predicting the majority class correctly while performing poorly on the minority class, which in most cases is the class we care about the most.
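A tiny worked example makes the problem with accuracy tangible. The data below is made up: 990 majority examples, 10 minority examples, and a "model" that always predicts the majority class.

```python
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall_for(y_true, y_pred, label):
    """Fraction of examples of class `label` that were predicted as `label`."""
    predicted = [p for t, p in zip(y_true, y_pred) if t == label]
    return sum(p == label for p in predicted) / len(predicted)

y_true = [0] * 990 + [1] * 10   # 1 is the minority class
y_pred = [0] * 1000             # always predict the majority class

overall = accuracy(y_true, y_pred)
minority_recall = recall_for(y_true, y_pred, label=1)
majority_recall = recall_for(y_true, y_pred, label=0)
```

Here `overall` is 0.99 even though `minority_recall` is 0.0: the model never finds a single minority example, yet accuracy alone would suggest it is nearly perfect.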
2. ROC and AUC
To accommodate the minority class, the Receiver Operating Characteristic (ROC) curve is proposed as a measure over a range of tradeoffs between the True Positive (TP) Rate and the False Positive (FP) Rate. Area Under the Curve (AUC) is a commonly used performance metric that summarizes the ROC curve in a single score. Moreover, AUC is not biased towards the model’s performance on either the majority or minority class, which makes this measure more appropriate when dealing with imbalanced data.
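One way to see what AUC measures is to compute it directly as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one (ties count as half). The function below is a small sketch of that pairwise definition; it is quadratic in the number of examples, so it is meant for intuition rather than large datasets.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

In this example three of the four positive/negative pairs are ordered correctly, so `auc` is 0.75. The score depends only on how positives rank relative to negatives, not on how many examples each class has.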
3. Precision and recall
From the confusion matrix, we can also derive precision and recall performance metrics.
Precision and recall are well suited to class imbalance when the minority class is treated as the positive class, because neither includes the number of True Negatives in its calculation, so a large majority class cannot inflate them.
One drawback of precision and recall is the trade-off between the two: when we try to increase TP for the minority class (higher recall), the number of FP can also increase (lower precision).
To balance recall and precision, i.e. to improve recall while keeping precision high, the F-score is proposed as the harmonic mean of precision and recall.
Since the F-score weights precision and recall equally and balances both concerns, it is less likely to be biased towards the majority or minority class.
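The three metrics can be computed from confusion-matrix counts as in the sketch below. The `beta` parameter is an addition for this example: `beta=1` gives the equally weighted F-score discussed above, while other values weight recall more or less heavily.

```python
def precision_recall_f(tp, fp, fn, beta=1.0):
    """Return (precision, recall, F-beta) from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    b2 = beta ** 2
    f_score = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f_score

# 8 minority examples found, 2 false alarms, 8 minority examples missed.
precision, recall, f1 = precision_recall_f(tp=8, fp=2, fn=8)
```

With these counts precision is 0.8 and recall is 0.5; the harmonic mean pulls the F-score towards the weaker of the two, so a model cannot score well by excelling at only one of them.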
You can find this experiment, with code examples for the 3 imbalanced classification approaches, in the Colab notebook I prepared for you.
Imbalanced regression data
Regression over imbalanced data is not well explored. Yet many important real-life applications in areas like the economy, crisis management, fault diagnosis, or meteorology require us to apply regression over imbalanced data, which means predicting rare and extreme continuous target values from input data.
Because imbalanced data has been studied mostly in the context of classification tasks, there are few mature or suitable strategies to address it in the context of regression.
Let’s first look at the typical approaches adopted from imbalanced classification, then we will look into some of the best imbalanced regression techniques currently being used.
Approaches adopted from imbalanced classification
When it comes to data approaches for imbalanced regression, we have two techniques that were heavily inspired by imbalanced classification: SMOTER and SMOGN.
SMOTER is an adaptation for regression of the well-known SMOTE algorithm.
It works by defining frequent (majority) and rare (minority) regions using the original label density and then applying random undersampling to the majority region and oversampling to the minority region, where the user has to pre-determine the percentage of over and undersampling to be carried out by the SMOTER algorithm.
When it comes to oversampling the minority regions, SMOTER generates new synthetic examples using an interpolation strategy that combines both the inputs and the targets of different examples. Precisely, this interpolation is carried out using two rare cases, where one is a seed case and the other is randomly selected from the k-nearest neighbors of the seed. The features of the two cases are interpolated, and the new target variable is determined as a weighted average of the target variables of the two rare cases used.
Why do we have to average the target variables, you might ask? In the original SMOTE algorithm this was a trivial question, because all rare cases have the same class (the target minority class). In regression the answer is not so trivial, because when a pair of examples is used to generate a new synthetic case, they will not have the same target variable value.
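The interpolation step can be sketched as follows. The function name is hypothetical, and the weighting rule is an assumption based on my reading of the SMOTER description: each parent's target is weighted by how close the synthetic point lies to that parent, so the nearer parent has more influence.

```python
import math
import random

def smoter_synthesize(seed_x, seed_y, neighbor_x, neighbor_y, rng):
    """Create one synthetic (features, target) pair from two rare cases."""
    # Interpolate the features at a random position between the two cases.
    gap = rng.random()
    new_x = [a + gap * (b - a) for a, b in zip(seed_x, neighbor_x)]
    # Weight each parent's target by the distance to the *other* parent,
    # so the closer parent contributes more.
    d_seed = math.dist(new_x, seed_x)
    d_neighbor = math.dist(new_x, neighbor_x)
    if d_seed + d_neighbor == 0:
        new_y = (seed_y + neighbor_y) / 2  # identical feature vectors
    else:
        new_y = (d_neighbor * seed_y + d_seed * neighbor_y) / (d_seed + d_neighbor)
    return new_x, new_y

rng = random.Random(0)
new_x, new_y = smoter_synthesize([0.0, 0.0], 10.0, [2.0, 0.0], 20.0, rng)
```

In this example the synthetic target always falls between 10 and 20 and moves towards 20 as the synthetic point moves towards the neighbor.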
SMOGN takes after SMOTER, but it adds Gaussian noise as a second way of generating examples in the oversampling phase, alongside the interpolation SMOTER already uses.
The key idea of the SMOGN algorithm is to combine the SMOTER and Gaussian noise strategies for generating synthetic examples. This limits the risks that SMOTER can incur when it interpolates between distant examples, by falling back on the more conservative strategy of introducing Gaussian noise in those cases, while still producing more diverse examples than noise alone. It works by generating new synthetic examples with SMOTER only when the seed example and the selected k-nearest neighbor are close enough, and by using Gaussian noise when the two examples are more distant.
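The decision between the two generation strategies can be sketched like this. Several details are assumptions for illustration: the "close enough" threshold is taken as half the median distance from the seed to its k neighbors (my understanding of the SMOGN paper), and the noise is drawn with a fixed `noise_scale` rather than a scale derived from the data.

```python
import math
import random
import statistics

def smogn_synthesize(seed, neighbors, rng, noise_scale=0.05):
    """seed and each neighbor are (features, target) pairs; neighbors are the
    seed's k nearest rare cases. Returns (new_x, new_y, method_used)."""
    seed_x, seed_y = seed
    dists = [math.dist(seed_x, nx) for nx, _ in neighbors]
    threshold = statistics.median(dists) / 2  # assumed "close enough" cut-off
    idx = rng.randrange(len(neighbors))
    (nb_x, nb_y), dist = neighbors[idx], dists[idx]
    if dist < threshold:
        # Neighbor is close: interpolate features and target (SMOTER-style).
        gap = rng.random()
        new_x = [a + gap * (b - a) for a, b in zip(seed_x, nb_x)]
        new_y = seed_y + gap * (nb_y - seed_y)
        return new_x, new_y, "interpolation"
    # Neighbor is distant: perturb the seed with Gaussian noise instead.
    new_x = [a + rng.gauss(0, noise_scale) for a in seed_x]
    new_y = seed_y + rng.gauss(0, noise_scale)
    return new_x, new_y, "gaussian_noise"

seed_case = ([0.0, 0.0], 1.0)
neighbors = [([0.1, 0.0], 1.1), ([4.0, 0.0], 5.0), ([6.0, 0.0], 7.0)]
example = smogn_synthesize(seed_case, neighbors, random.Random(0))
```

Only the first neighbor is within the threshold here, so interpolation is used when it is drawn, and the seed is jittered with noise otherwise.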
Like in imbalanced classification, the algorithm approach for regression includes adjusting the loss function to compensate for region imbalance (re-weighting), as well as other relevant learning paradigms such as transfer learning, metric learning, two-stage training, and meta-learning. But we will focus on two re-weighting techniques:
- Error-aware loss
- Cost-sensitive re-weighting
1. Error-aware loss
It is the regression version of the Focal loss for classification, called Focal-R. Focal loss is a dynamically scaled cross-entropy loss, where the scaling factor decays to zero as confidence in the correct class increases.
Focal-R replaces the scaling factor with a continuous function that maps the absolute error (L1 distance) into values in the range of 0 to 1.
Precisely, Focal-R loss based on L1 distance can be written as:
$$\frac{1}{n}\sum_{i=1}^{n} \sigma(|\beta e_i|)^{\gamma} \, e_i$$
where $e_i$ is the L1 error for the i-th sample, σ(·) is the Sigmoid function, and β, γ are hyper-parameters.
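A minimal sketch of such a loss, assuming it takes the form "mean over samples of σ(|β·e_i|)^γ · e_i" with the symbols defined above. The function name, the default values of `beta` and `gamma`, and the optional `rescale` flag are assumptions for this example. The flag exists because the Sigmoid of a non-negative number lies in [0.5, 1), so mapping to the full 0 to 1 range needs a rescaling such as 2σ − 1.

```python
import math

def focal_r_l1(preds, targets, beta=0.2, gamma=1.0, rescale=False):
    """Error-aware L1 loss: each absolute error is scaled by a factor that
    grows with the error, so hard samples dominate the average."""
    total = 0.0
    for p, t in zip(preds, targets):
        e = abs(p - t)                                   # L1 error
        factor = 1.0 / (1.0 + math.exp(-abs(beta * e)))  # sigmoid, in [0.5, 1)
        if rescale:
            factor = 2.0 * factor - 1.0                  # optional: map to [0, 1)
        total += (factor ** gamma) * e
    return total / len(preds)

easy = focal_r_l1([1.1], [1.0], beta=1.0)
hard = focal_r_l1([3.0], [1.0], beta=1.0)
```

Samples with small errors are scaled down more than samples with large errors, which is the regression analogue of focal loss concentrating on hard examples.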
2. Cost-sensitive re-weighting
Since the target space can be divided into finite bins, classic re-weighting schemes can be directly plugged in, such as inverse-frequency weighting (INV) and its square-root variant (SQINV), both of which are based on the label distribution.
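Both schemes can be sketched with a histogram over target bins. The function name, the fixed `bin_width`, and the normalization (weights averaging to 1) are choices made for this example rather than part of any specific implementation.

```python
import math
from collections import Counter

def bin_weights(targets, bin_width=1.0, scheme="inv"):
    """Per-sample weights from the binned label distribution.
    scheme="inv": 1 / count, scheme="sqrt_inv": 1 / sqrt(count)."""
    bins = [math.floor(t / bin_width) for t in targets]
    counts = Counter(bins)
    if scheme == "inv":
        raw = [1.0 / counts[b] for b in bins]
    elif scheme == "sqrt_inv":
        raw = [1.0 / math.sqrt(counts[b]) for b in bins]
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    scale = len(raw) / sum(raw)  # normalize so the weights average to 1
    return [w * scale for w in raw]

ages = [20, 20, 20, 20, 70]      # four samples in a common bin, one rare sample
inv = bin_weights(ages, bin_width=10, scheme="inv")
sqrt_inv = bin_weights(ages, bin_width=10, scheme="sqrt_inv")
```

The rare sample gets 4 times the weight of a common one under INV but only 2 times under SQINV, which is why the square-root variant is the gentler of the two.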
The hybrid approach for imbalanced regression takes after the hybrid approach for imbalanced classification: it also combines data-level and algorithm-level approaches in order to produce robust and efficient learners.
An example of this approach is the Bagging-based ensemble.
This algorithm incorporates data pre-processing strategies for addressing imbalanced domains in regression tasks.
Precisely, a paper entitled “REBAGG: REsampled BAGGing for Imbalanced Regression” proposes an algorithm that obtains diversity on the generated models while simultaneously biasing them towards the least represented and more important cases.
It has two main steps:
- Build a number of models using pre-processed samples of the training set.
- Use the trained models to obtain predictions on unseen data by applying an averaging strategy (basically averaging models’ predictions to obtain the final predictions).
Regarding the first step, the authors developed four main types of resampling methods to apply to the original training set: balance, balance.SMT, variation, and variation.SMT. The key distinguishing features of these methods relate to:
i) the ratio between the number of minority and majority examples used in the new sample; and,
ii) how new minority examples are obtained.
On the resampling methods labeled with the prefix “balance”, the new modified training set will have the same number of minority and majority examples. On the other hand, for resampling methods with the prefix “variation”, the ratio of minority to majority examples in the new training set will vary.
When the resampling method has no suffix appended, the new examples for the minority region are exact copies of randomly selected minority examples. When the suffix “SMT” is appended, the new examples for the minority region are synthetic and are obtained using the SMOTER algorithm.
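The two REBAGG steps can be sketched as below for the plain "balance" method (copies of rare cases, no SMOTER). Everything here is a simplified assumption: rare cases are identified with a user-supplied `is_rare` function on the target, each resampled set has about the size of the original, and the base learner is a one-dimensional 1-nearest-neighbor regressor chosen only to keep the example short.

```python
import random
import statistics

def balance_resample(data, is_rare, rng):
    """'balance': equal numbers of rare and normal cases, drawn with
    replacement so rare cases appear as exact copies."""
    rare = [d for d in data if is_rare(d[1])]
    normal = [d for d in data if not is_rare(d[1])]
    half = len(data) // 2
    return rng.choices(rare, k=half) + rng.choices(normal, k=half)

def predict_1nn(sample, x):
    """Base learner: return the target of the closest stored example."""
    return min(sample, key=lambda d: abs(d[0] - x))[1]

def rebagg_fit(data, is_rare, n_models=10, seed=0):
    # Step 1: build one model per resampled training set.
    rng = random.Random(seed)
    return [balance_resample(data, is_rare, rng) for _ in range(n_models)]

def rebagg_predict(models, x):
    # Step 2: average the predictions of all models.
    return statistics.fmean([predict_1nn(m, x) for m in models])

data = [(float(x), 1.0) for x in range(9)] + [(9.0, 100.0), (10.0, 110.0)]
models = rebagg_fit(data, is_rare=lambda y: y > 50)
```

Each resampled set over-represents the two rare cases, so every model in the ensemble has seen the extreme targets many times even though they make up a small part of the original data.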
Deep Imbalanced Regression (DIR)
The methods adopted from imbalanced classification work; however, there are several drawbacks to using them alone.
Allow me to make a case!
The above datasets have intrinsically different label spaces: (a) CIFAR-100 exhibits a categorical label space where the target is a class index, while (b) IMDB-WIKI exhibits a continuous label space where the target is age.
As you can see, the label density distribution is the same for both, but the error distribution is very different. The error distribution for IMDB-WIKI is much smoother and doesn’t correlate well with the label density distribution. This affects how imbalanced learning methods work because, directly or indirectly, they operate by compensating for the imbalance in the empirical label density distribution. That works well for imbalanced classification but not for continuous labels. Instead, you have to find a way to smooth the label distribution.
Label distribution smoothing (LDS) for imbalanced data density estimation
From figure 2 above we can see that in the continuous space the empirical label distribution does not match the real label density distribution. Why is this? Because of the dependence between data samples at nearby labels. In this case, we are talking about images of people of similar ages.
LDS uses kernel density estimation to learn the effective imbalance in datasets that correspond to continuous targets. Precisely, LDS convolves a symmetric kernel with the empirical density distribution to extract a kernel-smoothed version that accounts for the overlap in the information of data samples of nearby labels.
Note: Gaussian and Laplacian kernels are examples of symmetric kernels.
The symmetric kernel characterizes the similarity between target values y’ and y w.r.t their distance in the target space.
Figure 2 at the beginning of this section shows that LDS captures the real imbalance that affects regression. By applying LDS we get a label density distribution that correlates well with error distribution (-0.83).
Once you have the effective label density, you can then use the adapted techniques for addressing imbalanced classification that we talked about earlier (e.g. the cost-sensitive re-weighting method).
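The smoothing step can be sketched as a discrete convolution of a Gaussian kernel with the per-bin label counts, followed by inverse-density weights. The function names, the kernel size, and the zero padding at the edges are assumptions made for this illustration.

```python
import math

def gaussian_kernel(half_width, sigma):
    """Symmetric kernel over offsets -half_width..half_width, summing to 1."""
    raw = [math.exp(-(i * i) / (2 * sigma * sigma))
           for i in range(-half_width, half_width + 1)]
    total = sum(raw)
    return [v / total for v in raw]

def smooth_label_density(counts, half_width=2, sigma=1.0):
    """Convolve the empirical per-bin counts with the kernel (zero-padded)."""
    kernel = gaussian_kernel(half_width, sigma)
    smoothed = []
    for b in range(len(counts)):
        acc = 0.0
        for offset, k in zip(range(-half_width, half_width + 1), kernel):
            j = b + offset
            if 0 <= j < len(counts):
                acc += k * counts[j]
        smoothed.append(acc)
    return smoothed

def lds_weights(smoothed, eps=1e-8):
    """Re-weighting by the inverse of the effective (smoothed) density."""
    return [1.0 / max(d, eps) for d in smoothed]

counts = [0, 0, 10, 0, 0]                 # all samples fall in the middle bin
effective = smooth_label_density(counts)
weights = lds_weights(effective)
```

After smoothing, the empty bins next to the populated one receive a non-zero effective density, reflecting that samples with nearby labels carry information about them.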
Feature distribution smoothing (FDS)
The above figure displays the feature statistics similarity for age 30 (the anchor). You can notice right away that the bins surrounding the anchor are highly similar to it, especially the closest ones. But examining the figure further, you will notice that there is a problem with regions with very few data samples (e.g. ages 0-6 years). Due to data imbalance, their mean and variance show an unjustifiably high similarity to age 30.
Inspired by these observations, the creators of Feature distribution smoothing (FDS) proposed an algorithm that performs distribution smoothing on the feature space, or in other words, transfers feature statistics between nearby target bins. This calibrates the potentially biased estimates of the feature distribution, especially for underrepresented target values in the training data.
And one great thing about FDS is that you can integrate it into deep neural networks by inserting a feature calibration layer after the final feature map.
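A one-dimensional sketch of the calibration idea: smooth the per-bin feature mean and variance across neighboring target bins, then re-standardize each feature with the smoothed statistics. The helper names, the renormalization of the kernel at the edges, and the scalar (single-feature) form are simplifications assumed for this example; a real calibration layer would operate on full feature vectors inside the network.

```python
import math

def smooth_stats(values, kernel):
    """Kernel-weighted average of a per-bin statistic over neighboring bins."""
    half = len(kernel) // 2
    out = []
    for b in range(len(values)):
        acc = norm = 0.0
        for offset, k in zip(range(-half, half + 1), kernel):
            j = b + offset
            if 0 <= j < len(values):
                acc += k * values[j]
                norm += k
        out.append(acc / norm)  # renormalize at the edges (assumption)
    return out

def calibrate(z, b, means, variances, smoothed_means, smoothed_vars, eps=1e-8):
    """Whiten feature z with bin b's own statistics, then re-color it with
    the smoothed statistics of that bin."""
    scale = math.sqrt((smoothed_vars[b] + eps) / (variances[b] + eps))
    return scale * (z - means[b]) + smoothed_means[b]

kernel = [0.25, 0.5, 0.25]
means = [1.0, 5.0, 3.0]          # bin 1 has a suspicious mean (few samples)
variances = [1.0, 1.0, 1.0]
smoothed_means = smooth_stats(means, kernel)
smoothed_vars = smooth_stats(variances, kernel)
calibrated = calibrate(5.0, 1, means, variances, smoothed_means, smoothed_vars)
```

The feature from the sparsely populated bin is pulled towards the statistics of its neighbors, which is the "transfer of feature statistics between nearby target bins" described above.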
Results reported on the Semantic Textual Similarity Benchmark (STS-B-DIR) dataset using various algorithms.
The authors show that coupling LDS and FDS with other existing methods for regression over imbalanced data significantly improves performance.
Performance measures for imbalanced regression
When it comes to evaluation metrics for this kind of problem, you can use the common regression metrics such as MAE, MSE, Pearson correlation, and Geometric Mean (GM) error alongside the techniques we explored in this section.
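For reference, these metrics can be computed as in the sketch below. GM is taken here as the geometric mean of the absolute errors, with a small `eps` added to avoid taking the logarithm of zero; both the interpretation and the `eps` value are assumptions for this example.

```python
import math

def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def pearson(y_true, y_pred):
    n = len(y_true)
    mt, mp = sum(y_true) / n, sum(y_pred) / n
    cov = sum((t - mt) * (p - mp) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mt) ** 2 for t in y_true)
    var_p = sum((p - mp) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)

def gm_error(y_true, y_pred, eps=1e-10):
    """Geometric mean of absolute errors, computed in log space."""
    logs = [math.log(abs(t - p) + eps) for t, p in zip(y_true, y_pred)]
    return math.exp(sum(logs) / len(logs))

y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [2.0, 2.0, 5.0, 4.0]
scores = {"mae": mae(y_true, y_pred), "mse": mse(y_true, y_pred)}
```

On imbalanced targets it can help to report these metrics separately for frequent and rare target regions, since an overall average is dominated by the frequent region.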
Crucial open issues to address when developing novel methods for Imbalanced regression
- Development of cost-sensitive regression solutions that can adapt the cost to the degree of importance assigned to rare observations. To allow for more flexibility in predicting rare events of differing importance it would be rather interesting to investigate the possibility of adapting the cost not only to the minority group but to each individual observation.
- Methods that will allow distinguishing between minority and noisy samples must be proposed.
- Development of better ensemble learning methods as in classification may offer a significant improvement in both robustness to skewed distributions and predictive power.
Canonical ML algorithms assume that the number of objects in considered classes is roughly similar. However, in many real-life problems that we can apply ML to, the distribution of examples is skewed since the events that we care the most about and want to predict happen rarely and for the most part, we collect data points of normal events which represent the normal state and majority group. This poses a difficulty for learning algorithms, as they will be biased towards the majority group.
But in this article, you learned about the different approaches to learning from imbalanced classification and regression data.
Thank you for reading! As always, below you will find a well-researched reference section that you can use to dive deeper into what you read, as well as a Colab notebook.
- Deep Imbalanced regression
- Hybrid Classifiers—Methods of Data, Knowledge, and Classifier Combination. In: Studies in Computational Intelligence, vol. 519. Springer, Berlin (2014)