If machine learning solutions were cars, their fuel would be data. Simply viewed, ML models are statistical equations that need values and variables to operate, and the data is the biggest contributor to ML success.
Today, sources of data are ample, and the amount of available data keeps growing exponentially. This allows us to wrangle, bend, and choose the right set of data from a heap of unnecessary noise.
But what happens when you have less data and it seems it’s not enough to train the model? Well, can a car run many miles on low fuel? The answer is both yes and no, depending on the type of model and the quality of the fuel. If the answer is “yes”, the next question is: how is the car so energy efficient?
Likewise, how do you build ML solutions that are data-efficient?
To understand that, let’s start by discussing the causes of insufficient data.
Causes of insufficient data
- Exclusive data type
Even though the growth in data generated every day is exponential, some exclusive types of data still have limited sources. For example, rare disease data or accident stats from fully self-driven cars (new tech). For such data, the options to expand are limited, and the existing data is all there is to try and pump out value.
- Exclusive client data
Even if the data isn’t exclusive in nature, the client may be. For example, a client’s financial data can be limited if it’s a new organization.
- Corrupted or malfunctioning data source
If the source of the data or the devices recording the data is faulty, it might lead to missing values. For example, if you have a dataset that checks the vitality of a particular machine, the information may be limited if some sensors failed midway in the machine’s lifecycle and weren’t fixed due to a high cost-to-usefulness ratio.
- Imbalanced data
The most common issue with real-life datasets is the imbalance between different classes of data. You can have a huge dataset with a million positive records, but if the dataset has only a hundred negative samples, the model will have a hard time learning the difference between the two classes. Examples of minority classes in imbalanced data include fraud cases in bank data (large debit transactions or transactions at odd times), security anomalies, failure instances in predictive maintenance data, and positive cases in data for rare diseases.
These are broad examples of data shortages, but there can be many more reasons for not having enough data for your model. How can we solve this issue when it happens?
Bias-variance trade-off
When you have very little data, the training data is a limited resource. To avoid derailing the whole modeling process, you need to consider the fundamental concepts of bias and variance carefully, so that the error between actual and predicted values stays as low as possible.
Bias comes from the simplifying assumptions a machine learning model makes to make the learning process easier. For example, a linear regression model assumes that there is a linear relationship between the dependent and independent variables.
High-bias models are the ones that make strong assumptions about the data. Examples are linear models like linear regression, logistic regression, and linear discriminant analysis.
Low Bias models make fewer assumptions about the data and are better at learning new types of relationships between the target and independent variables. For example, non-linear models like decision trees, random forest, and K-NN algorithms.
Variance is the measure of how much the estimated target variable will change with a change in training data. In more simple terms, it measures the stability of the trained models. If there’s a lot of change, it means that the model isn’t learning the patterns, but instead learning the noise in the training data.
High variance models are sensitive to changes in data and can show major differences across different training sets. This is why they need to be trained and tested in iterations with multiple folds of data. For data with low volume, the folds don’t have to be exclusive and can even contain data points from other folds. Examples of high variance models are decision trees and random forests.
Low variance models are less sensitive to changes in training data, mostly because of fixed assumptions about the data. Examples are linear regression and linear discriminant analysis.
The ultimate goal, as is clear from the definitions of bias and variance, is to reduce both to a minimum: a state of high stability with few assumptions.
Bias and variance are inversely related: lowering one tends to raise the other, so the practical goal is the point where their combined contribution to the error is smallest. For real data, bias and variance can’t be calculated exactly, since we don’t know the true target function. However, selecting and tuning the model with these concepts in mind can reduce the test error (usually measured as the mean squared error between actual and estimated values) a lot.
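For squared-error loss, the trade-off can be written as the standard decomposition of the expected test error. The symbols are introduced here for illustration and are not from the article: $f$ is the unknown target function, $\hat{f}$ is the model fitted on a randomly drawn training set, and $\sigma^2$ is the variance of the noise in $y = f(x) + \varepsilon$.

```latex
\mathbb{E}\left[\big(y - \hat{f}(x)\big)^2\right]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\left[\big(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\big)^2\right]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible error}}
```

The expectations are taken over training sets and noise. Because $f$ is unknown, the first two terms can only be estimated indirectly, for example by watching how the validation error changes as model flexibility changes.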
Overfitting and underfitting
With respect to any volume of data, there’s the risk of overfitting and underfitting if the models aren’t handled carefully. However, when the quantity of data is small, the more likely scenario is overfitting. Developers may want to maximize the available data, and in doing so, mistakenly make the model learn the noise and deviations as well.
Overfitting is when the model is trained to stick too closely to the training data. On a high level, instead of considering the training data to be an approximation, the model considers it to be absolute. Therefore, when a model is overfitting on a set of training data, it fails to perform on new and unseen sets of data.
Overfitting is possible mostly with non-linear and flexible models that can closely follow data with high variance. For example, the trees in a random forest can branch out numerous times to accommodate a wide range of values in the data.
Underfitting is when the model doesn’t learn well from existing data. This can happen for many reasons, but with low-volume data it mostly happens when the data has high variance and the chosen model has high bias. For example, using linear regression on data where the features have no meaningful linear correlation with the output variable.
How to avoid overfitting on small datasets
As discussed above, the primary issue with small volumes of data is overfitting. Let’s look into how this issue can be resolved.
- Low model complexity
Models with low complexity are less flexible and less accommodative to high-variance data. A model with high complexity will try to note the smallest deviation of data and tend to learn the noise as well. Models with low complexity can include linear models like logistic regression and linear regression.
However, if the data isn’t linear in nature, these models won’t perform well. In such cases, non-linear models like decision trees and random forests are recommended. To restrict the complexity of these tree-based algorithms, the parameters can be fine-tuned. For example, after limiting the max_depth parameter, the tree grows only until a certain level and stops recording further data deviations.
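To make the effect of a depth limit concrete, here is a minimal sketch of a one-feature regression tree written from scratch. It is not the implementation of any particular library; the function names and the toy data are hypothetical, and `max_depth` only mirrors the parameter name mentioned above.

```python
from statistics import mean

def sse(values):
    """Sum of squared errors of the values around their mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)

def build_tree(xs, ys, max_depth, depth=0):
    """Grow a 1-D regression tree, stopping once max_depth is reached."""
    if depth >= max_depth or len(set(xs)) < 2:
        return {"leaf": mean(ys)}
    cuts = sorted(set(xs))
    best_cost, best_threshold = None, None
    # Try a split halfway between each pair of neighbouring x values.
    for lo, hi in zip(cuts, cuts[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        cost = sse(left) + sse(right)
        if best_cost is None or cost < best_cost:
            best_cost, best_threshold = cost, threshold
    left = [(x, y) for x, y in zip(xs, ys) if x <= best_threshold]
    right = [(x, y) for x, y in zip(xs, ys) if x > best_threshold]
    return {
        "threshold": best_threshold,
        "left": build_tree([x for x, _ in left], [y for _, y in left], max_depth, depth + 1),
        "right": build_tree([x for x, _ in right], [y for _, y in right], max_depth, depth + 1),
    }

def predict(tree, x):
    """Walk from the root to a leaf and return the leaf value."""
    while "leaf" not in tree:
        tree = tree["left"] if x <= tree["threshold"] else tree["right"]
    return tree["leaf"]

def count_leaves(tree):
    if "leaf" in tree:
        return 1
    return count_leaves(tree["left"]) + count_leaves(tree["right"])

def training_mse(tree, xs, ys):
    return mean((predict(tree, x) - y) ** 2 for x, y in zip(xs, ys))

# Hypothetical toy data: a step pattern plus small wiggles standing in for noise.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]

shallow = build_tree(xs, ys, max_depth=1)  # captures only the step
deep = build_tree(xs, ys, max_depth=3)     # one leaf per point: memorizes the wiggles
```

On this toy data the shallow tree keeps two leaves and averages out the wiggles, while the deeper tree reproduces every training value exactly, which is the behavior a depth limit is meant to prevent on small, noisy datasets.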
- Outlier removal
Outliers influence the trajectory of learning significantly if you don’t treat them properly. When dataset size is already low and the model has little to learn from, coming across outliers can sway the target function incorrectly, especially if the model has high variance. Outlier removal is not recommended when the outliers indicate key events like a fraud or hack.
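The article does not name a specific outlier rule, so the sketch below uses the common 1.5 × IQR rule as an assumed stand-in. The readings are hypothetical, and flagged values are kept in a separate list rather than discarded, so they can still be reviewed in case they mark a key event.

```python
from statistics import quantiles

def iqr_bounds(values, k=1.5):
    """Return (low, high) limits using the k * IQR rule."""
    q1, _, q3 = quantiles(values, n=4)  # quartiles of the sample
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical sensor readings with one suspicious spike.
readings = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 25.0, 10.1]

low, high = iqr_bounds(readings)
kept = [r for r in readings if low <= r <= high]
flagged = [r for r in readings if not (low <= r <= high)]  # review before dropping
```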
- Optimal feature engineering
Here, feature engineering broadly covers both feature selection and feature creation. With less data, the ability to create multiple useful features from the limited raw columns can be a real game-changer. I once came across a project that had only four raw columns (three of them dates), and the team was able to design around 50 new features from those four columns!
Raw columns are simplistic data and can deliver greater value when intelligently interpreted. For example, when predicting the price of a share from stock data, the raw number of trades is often less informative than, say, the rate of change in the number of trades or its differences over time.
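As a small illustration of the stock example, the sketch below derives three features from a single raw column of trade counts. The numbers and function names are made up for illustration.

```python
# Hypothetical raw column: number of trades per day.
daily_trades = [1200, 1500, 1350, 1800, 1710]

def differences(series):
    """Day-over-day change in the raw value."""
    return [curr - prev for prev, curr in zip(series, series[1:])]

def rate_of_change(series):
    """Day-over-day change relative to the previous day."""
    return [(curr - prev) / prev for prev, curr in zip(series, series[1:])]

def rolling_mean(series, window):
    """Average over a sliding window, one value per full window."""
    return [
        sum(series[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(series))
    ]

diff_feature = differences(daily_trades)     # [300, -150, 450, -90]
roc_feature = rate_of_change(daily_trades)   # first value is 0.25
trend_feature = rolling_mean(daily_trades, window=3)
```

Each derived list is shorter than the raw column, so in practice the first rows need to be dropped or padded before the features are joined back to the dataset.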
One case where feature engineering doesn’t have to be extensive, even when the number of raw columns is small, is when a neural network is being used. This is because neural nets learn and select features by themselves.
- Ensemble learning
Overfitting results from high variance, and ensemble learning is a way to reduce the variance of the model without compromising its complexity. It’s a unique way to manage the bias-variance trade-off. By definition, ensemble learning combines several models and aggregates their individual predictions into one result. This means that you can use a combination of high-bias and high-variance models to get the best of both worlds!
Ensemble learning comes in three major variants: bagging, boosting, and stacking. If you’re interested, there’s a fourth type of ensemble called cascading, which deals with highly imbalanced data. The variants differ in how the individual models are trained and combined.
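Of the three variants, bagging is the simplest to sketch: train the same flexible model on several bootstrap resamples and average the predictions. The code below is a minimal illustration under assumed names and toy data, with a 1-nearest-neighbour regressor standing in for a high-variance base model.

```python
import random
from statistics import mean

def fit_1nn(xs, ys):
    """1-nearest-neighbour regressor: flexible, but sensitive to each point."""
    points = list(zip(xs, ys))

    def predict(x):
        return min(points, key=lambda p: abs(p[0] - x))[1]

    return predict

def bagged_predictor(xs, ys, n_models=25, seed=0):
    """Average the predictions of base models fitted on bootstrap resamples."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        models.append(fit_1nn([xs[i] for i in idx], [ys[i] for i in idx]))

    def predict(x):
        return mean([m(x) for m in models])

    return predict

# Hypothetical toy data.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0, 6.8, 8.1]

single = fit_1nn(xs, ys)
bagged = bagged_predictor(xs, ys)
```

A single base model copies whichever training point is closest, while the bagged predictor smooths over the resamples. Boosting and stacking combine models differently and are not shown here.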
- Cross-validation
The issue with overfitting is that when the training data changes, the performance on test data changes drastically. Ideally, with different sets of training data, the performance should remain consistent, indicating that the model has been able to pick up the underlying data patterns by filtering out the noise.
This is where cross-validation comes into the picture to check the consistency of the trained model. As the name suggests, cross-validation is a validation technique that iteratively tests the model on different folds of validation data.
The most commonly used technique is k-fold cross-validation. First, we divide the training dataset into k folds. For a given hyperparameter setting, each of the k folds takes turns being the hold-out validation set; then, a model is trained on the rest of the k – 1 folds and measured on the held-out fold.
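The procedure can be sketched in a few lines. This is a minimal illustration, not a replacement for a library implementation: the function names are hypothetical, the "model" is a mean-only baseline, and the data is not shuffled, which you would normally do first.

```python
from statistics import mean

def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs; each sample is held out exactly once."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train_idx, val_idx
        start += size

def cross_validate(ys, k):
    """Score a mean-only baseline on each held-out fold (mean squared error)."""
    scores = []
    for train_idx, val_idx in k_fold_indices(len(ys), k):
        prediction = mean([ys[i] for i in train_idx])  # "training" on k - 1 folds
        scores.append(mean([(ys[i] - prediction) ** 2 for i in val_idx]))
    return scores

# Hypothetical target values.
ys = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
scores = cross_validate(ys, k=3)  # one score per fold
```

A large spread between the fold scores is the signal to look for: it suggests the model's performance depends heavily on which samples it happened to see.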
- Transfer Learning
As discussed under the causes of insufficient data, some data types have restricted sources, and therefore restricted volume. For example, if someone wants to train a model to predict the replies to customer care calls, they need to train the model on the limited records of past conversations.
This is where transfer learning can be very useful. Since the conversation records of the organization’s customer care unit are limited, there’s not enough data for the models to learn the sense of grammar or interpret new words that new customers might type or say. Therefore, models that have already been trained on a large corpus (collection of text) can be integrated with the new model. This is how learnings from pre-existing models are “transferred” to new solutions. This is just one high-level example. The range of concepts and applications of transfer learning is huge.
How to expand the dataset
Working efficiently on less data with minimal overfitting is commendable, but there are definitive ways to expand the size of the dataset in most cases.
- Synthetic Data Augmentation
When data is generated from existing data through various techniques like approximation, it’s called synthetic data augmentation. The synthetic data is similar to the original data, yet provides enough variance for the machine learning model to learn the data trends.
- SMOTE
SMOTE, or Synthetic Minority Oversampling Technique, helps reduce overfitting by overcoming data imbalance. Another way to remove data imbalance is to undersample the majority class, but that leads to a loss of data, which is not favorable, especially when the volume of data is already small.
Unlike random oversampling techniques, SMOTE focuses on the minority class and creates new data points similar to it through interpolation between neighboring points. SMOTE uses the K-Nearest Neighbors algorithm to identify the neighbors.
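The interpolation step can be sketched as follows. This is a simplified illustration of the idea rather than a full SMOTE implementation: base points are picked at random, the function name and data are hypothetical, and at least two minority points are assumed.

```python
import math
import random

def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself).
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic

# Hypothetical minority-class samples with two features each.
minority = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.3), (1.1, 1.1), (1.4, 1.2)]
new_points = smote(minority, n_new=4)
```

Every synthetic point lies on a line segment between two existing minority points, so the new samples stay inside the region the minority class already occupies.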
- Borderline SMOTE
One clear disadvantage of SMOTE is that it tends to create ambiguous data points near the borderline of two different classes. This happens when, say, X13 and X12 (refer to the above diagram) are from class A while X14 and X11 are from class B. This is a case when the classification of point X1 is complicated instead of eased.
Borderline SMOTE, therefore, considers the difficult points near the borderline that get misclassified and generates more samples only in this section to make the distinction clearer through volume.
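One common way to pick out those borderline points is to look at each minority point's nearest neighbors across both classes: a point counts as borderline when at least half, but not all, of its neighbors belong to the majority class. The sketch below shows only that selection step, with hypothetical names and data. Synthetic samples would then be interpolated from the selected points only.

```python
import math

def danger_points(minority, majority, m=3):
    """Return minority points whose m nearest neighbours are mostly, but not all, majority."""
    labelled = [(p, 0) for p in minority] + [(p, 1) for p in majority]
    danger = []
    for p in minority:
        others = [(q, is_major) for q, is_major in labelled if q is not p]
        nearest = sorted(others, key=lambda item: math.dist(p, item[0]))[:m]
        n_majority = sum(is_major for _, is_major in nearest)
        # All-majority neighbourhoods are treated as noise and skipped.
        if m / 2 <= n_majority < m:
            danger.append(p)
    return danger

# Hypothetical data: a tight minority cluster plus two minority points
# sitting right next to the majority class.
minority = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (0.95, 0.0), (1.0, 0.0)]
majority = [(1.1, 0.0), (1.2, 0.0), (1.1, 0.1), (1.2, 0.1), (3.0, 3.0)]

borderline = danger_points(minority, majority)
```

Here only the two minority points next to the majority class are selected. The four points in the tight cluster are surrounded by their own class and are left alone.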
- ADASYN
ADASYN, or Adaptive Synthetic Sampling Approach, is another form of SMOTE and also oversamples the minority class. However, unlike both SMOTE and borderline SMOTE, ADASYN considers the density of the data distribution and accordingly decides how many data points need to be generated in a particular region.
Therefore, where the density of minority points is low in feature space (that is, where majority points dominate the neighborhood), ADASYN will generate more samples, and vice versa. In the accompanying diagram, for every new point being created, ADASYN considers the relative densities of majority and minority classes within an area of 8 neighbors. The number of data points created around a minority point is proportional to the share of majority-class points among its neighbors.
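The allocation step can be sketched as below: each minority point gets a ratio (majority neighbors divided by k), the ratios are normalized, and the total number of new samples is shared out accordingly. Names and data are hypothetical, `beta` is assumed to control how far the classes are balanced (1.0 meaning fully), and the interpolation itself is omitted.

```python
import math

def adasyn_allocation(minority, majority, k=2, beta=1.0):
    """Return how many synthetic samples to create around each minority point."""
    total_new = (len(majority) - len(minority)) * beta
    labelled = [(p, 0) for p in minority] + [(p, 1) for p in majority]
    ratios = []
    for p in minority:
        others = [(q, is_major) for q, is_major in labelled if q is not p]
        nearest = sorted(others, key=lambda item: math.dist(p, item[0]))[:k]
        # Share of majority-class points among the k nearest neighbours.
        ratios.append(sum(is_major for _, is_major in nearest) / k)
    total_ratio = sum(ratios)
    if total_ratio == 0:
        return [0] * len(minority)  # no minority point has majority neighbours
    return [round(r / total_ratio * total_new) for r in ratios]

# Hypothetical points on a line (second feature fixed at 0 for readability).
minority = [(0.0, 0.0), (0.2, 0.0), (0.4, 0.0), (1.0, 0.0)]
majority = [(0.55, 0.0), (0.9, 0.0), (1.1, 0.0), (1.3, 0.0),
            (1.5, 0.0), (1.7, 0.0), (1.9, 0.0), (2.1, 0.0)]

allocation = adasyn_allocation(minority, majority)  # [0, 0, 1, 3]
```

The two minority points deep inside their own cluster receive nothing, the point at the edge of the cluster receives one new sample, and the isolated point surrounded by majority samples receives three.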
- GANs
GANs (Generative Adversarial Networks) are a fascinating invention in the field of AI. Many news articles about machines achieving splendid human-like tasks are about the work of GANs.
In the real world, you might have noticed that living beings often grow stronger in the face of adversity or competition. A seed that beats its siblings and neighboring seeds grows into a strong and healthy tree, increasing the chances of stronger descendants as the generations proceed. Similarly, a boxing contender estimates the quality of his/her fellow contenders to prepare accordingly. The stronger the contenders, the higher the quality of preparation.
Almost mimicking the real world, GANs follow suit and go by the architecture of adversarial training. Adversarial training is nothing but “learning by comparison”. The samples available are studied such that other instances of the samples from the same category can be created by the GAN. The objective of the GAN is to create these samples with the least flaws so that they’re indistinguishable from the original dataset. Example: A GAN might create a new Vincent van Gogh style painting, such that it looks like it’s been created by the painter himself (GANs are not so advanced yet, but this is the underlying concept).
There are two counterparts in a GAN: the generator and the discriminator. The work of the generator is to produce outputs that are indistinguishable from the originals. The work of the discriminator is to judge whether a generated piece is up to the mark and can be classified on par with the already present samples. If it isn’t, the cycle is repeated until the generator produces a sample good enough that the discriminator fails to differentiate it from the original sample set.
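This tug-of-war is usually written as a two-player minimax game. The notation below is the standard textbook form and is not taken from the article: $G$ is the generator, $D$ is the discriminator's estimated probability that a sample is real, $p_{\text{data}}$ is the distribution of the original samples, and $p_z$ is the distribution of the random noise fed to the generator.

```latex
\min_{G} \max_{D} \; V(D, G)
  = \mathbb{E}_{x \sim p_{\text{data}}}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z))\big)\big]
```

The discriminator tries to make both terms large by scoring real samples near 1 and generated ones near 0, while the generator tries to make the second term small by producing samples that the discriminator scores as real.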
- Variational Autoencoder
Variational autoencoders are deep neural networks that rely on probability to create new samples, especially for imbalanced data. A brilliant example used by Jeremy Jordan conveys the function of variational autoencoders (VAE) through simplified intuition.
In the example, pictures of faces need to be generated such that the new faces are indistinguishable from the original sample set. So, for every new image that the VAE encounters, it creates a probability distribution of every feature of that image. For example, the probability distribution of the person smiling, or the probability distribution of the hair being straight, etc.
Once a set of such distributions is available, the VAE samples randomly from the distributions (random feature value selected from the distribution) to reconstruct the image, therefore creating multiple variations of the same image within the probability range.
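The sampling step can be illustrated with a toy sketch. It assumes that each feature's distribution is Gaussian and that the encoder outputs a mean and a log-variance per feature. The numbers below are invented, and the decoder that would turn each sample back into an image is not shown.

```python
import math
import random

def sample_latent(means, log_vars, rng):
    """Draw one value per feature: z = mu + sigma * eps, with eps ~ N(0, 1)."""
    return [
        mu + math.exp(0.5 * log_var) * rng.gauss(0.0, 1.0)
        for mu, log_var in zip(means, log_vars)
    ]

# Hypothetical encoder output for one face image.
# Features (assumed for illustration): [smile, hair straightness]
means = [0.8, -0.3]
log_vars = [-2.0, -1.0]

rng = random.Random(0)
variations = [sample_latent(means, log_vars, rng) for _ in range(5)]
```

Each entry in `variations` is a slightly different set of feature values centered on the encoder's means, which is what lets a decoder produce several plausible variants of the same face.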
- Data Pooling
Data pooling means collecting and adding relevant data to the existing dataset. This is a sensitive process and needs to be carefully approached:
- Understand data deficiencies
To pool relevant data, you need to understand the type of deficiency in the existing data. For example, if someone has a vitamin deficiency, the best approach is to find out which vitamin is deficient and, once identified, take the supplement for that particular vitamin only.
Similarly, identifying the type of deficiency in the data is a crucial step before venturing out to find the missing pieces. Does the data have missing values, or is it imbalanced and has a shortage of minority data points? Does the data lack some important feature or need more rows/data points? Such questions and more have to be considered before searching for more data that can be integrated into the existing set.
- Understanding statistical deficiencies
Once the problem of concrete deficiencies like missing data and insufficient data points or features is resolved, one more step is needed to identify the deficiency in the data patterns.
For example, if the height of a class of 10 students is collected, it’s not enough to create a distribution. Given the size of the data, even after oversampling, it’s possible that the data might not fit any distribution because, say, the class happened to have a majority of tall students. However, we already have a lot of open-source height data and as observed, they tend to have a Gaussian distribution.
Therefore, to be able to fit the sample set to the appropriate distribution, you need to identify a similar data source and sample carefully with the least bias.
- Identifying data sources
Like in the above height example, identifying similar data sources can be instrumental in expanding the dataset without causing much harm to the integrity of the sample set. Most data isn’t exclusive, and most data is useful and has an innate purpose of providing insights.
- Understanding business limitations and expanding accordingly
One limitation to expansion through external data sources is that the external data might not comply with the business needs. Consider two clients whose datasets look similar in terms of distribution and features, but can’t be integrated because completely different processes generated the data. Keeping such datasets separate limits future disruptions caused by changing business processes. It also means that there has to be logical reasoning behind why two datasets are being integrated.
- Understanding compliance limitations
Data pooling can also raise security issues. Therefore, when dealing with real-life solutions that go into production, developers need to be careful about things like where the data comes from, or if the existing data is confidential. For example, an organization may have 10 exclusive clients with similar data, but if each client has made the organization sign a confidentiality agreement, no data can be interchanged for optimal model building.
Some non-technical, business-oriented ways can also help the data teams to collect more data for expanding existing datasets.
- Creating free applications to collect data
Creating a useful application can help bring in a lot of data, especially if the application is free. For example, an application that records the sleeping patterns of users to help them sustain a better sleep cycle can be used by the organization to understand the underlying causes of poor sleep patterns.
- Running a survey
If you’ve been on free quizzing sites for fun, you must have come across optional data surveys. Even the quiz data, if appropriately framed, can be used to record and expand on the existing data. Often, huge research projects use such data from multiple sites to generate a huge corpus to study mass patterns. Other ways to send surveys are through emails or active advertisements on targeted channels.
- Purchasing data
Some popular sites and channels collect and save a lot of data, which they then make available to organizations in need. A popular example of this is Facebook, which tracks user behavior and lets advertisers target users whose behavior matches a particular consumer pattern. One thing to take care of here is to exchange data only with credible purchasers or sellers.
Every other day, there are more advanced techniques (GANs being a fairly recent one) that are replacing the older techniques, and there’s no end to the possibilities and methods of data expansion or optimization of existing data.
We’ve explored several techniques, and I recommend that you go even deeper and learn all you can about the techniques you like the most.
Note: I used self-citation from GANs in Healthcare