
Overfitting vs Underfitting in Machine Learning: Everything You Need to Know

4th August, 2023

We live in a world where data dictates a lot of our activity. Some say data is the new fuel. Data doesn’t only tell us about the past. If we model it carefully, with accurate methods, then we can find patterns and correlations to predict stock markets, generate protein sequences, explore biological structures like viruses, and much more. 

Modelling such vast amounts of data manually is a tedious job. Instead, we turn to machine learning algorithms that can extract information from data for us. ML and DL algorithms are in big demand because they can solve complex problems that are impractical or even impossible to tackle manually, learning the distributions, patterns, and correlations that uncover the knowledge within data. 

Algorithms do this by exploring a dataset and creating an approximate model over that data distribution, such that when we feed new and unseen data it will produce good results. 

One of the problems in machine learning is that we want our algorithm to perform well with training data (i.e. the data that is fed into the algorithm for modelling) as well as new data. This is known as generalization. We want our algorithm to be good at generalization.

However, there are still a number of concerns: 

  1. It can be quite challenging to understand what an algorithm is really doing. 
  2. Let’s say we’ve produced a model, but it fails to yield good results – it’s challenging to understand what went wrong and how to fix it. 

A general problem in machine learning is when the algorithm performs well on the training dataset, but fails to perform on the testing data or the new data, and it’s unable to model the given distribution. 

Why is that?

Well, there can be a number of reasons regarding the performance of the algorithm. The factors that determine the performance of a machine learning algorithm are its ability to:

  1. Make the training error as small as possible.
  2. Make the gap between the training and testing error small. – (Deep Learning Book, Ian Goodfellow; 2016)

These two factors correspond to the two central challenges in machine learning: underfitting and overfitting. Underfitting is when the training error is high. Overfitting is when the testing error is high compared to the training error, or the gap between the two is large. 

While it’s challenging to understand the workings of big, complex ML and DL models, we can start by understanding the workings of small and simple ones, and work our way up to more complexity. 

Note: The code for all topics in this article is in this Colab notebook.

Model basics

A machine learning algorithm, or deep learning algorithm, is a mathematical model that uses mathematical concepts to recognize or learn a certain type of pattern or correlation from a dataset. What do we mean by learning? 

Let’s consider the equation y = mx + b, where m and b are the parameters, and x is the input. 

In machine learning, learning refers to an optimization process in which the model's parameters are updated at every training iteration, guided by a loss function, so that the equation fits the distribution as closely as possible. See the image below. 
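To make this concrete, here's a minimal sketch (assuming a mean squared error loss and plain NumPy) of what that optimization looks like for y = mx + b, using gradient descent:

import numpy as np

# Toy data drawn from y = 2x + 1, plus some noise
rng = np.random.RandomState(0)
x = rng.rand(50)
y = 2 * x + 1 + rng.randn(50) * 0.1

m, b = 0.0, 0.0   # initial parameters
lr = 0.1          # learning rate

for _ in range(1000):
    error = (m * x + b) - y            # prediction minus target
    m -= lr * 2 * np.mean(error * x)   # gradient of the MSE loss w.r.t. m
    b -= lr * 2 * np.mean(error)       # gradient of the MSE loss w.r.t. b

print(m, b)   # should end up close to 2 and 1

Each iteration nudges m and b in the direction that reduces the loss, which is exactly what "learning" means here.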

Some models try to find a relationship between the input and output through a function, while others try to group similar data together, or find relatable patterns. The first is supervised learning, and the second is known as unsupervised learning.

In both supervised and unsupervised learning, the model needs to learn the important features, which it can leverage to produce good results. A couple of factors can help it to achieve that: capacity and architecture. 

Model capacity and architecture

Conceptually, capacity represents the range of functions (linear or nonlinear) that a machine learning algorithm can choose from when searching for an optimal solution. The performance of the machine learning algorithm depends on its capacity. A good rule of thumb is that a model's capacity should be proportional to the complexity of its task and the amount of training data available. 

Machine learning models with low capacity are of little use when it comes to solving complex tasks: they tend to underfit. Likewise, models with higher capacity than needed are likely to overfit.

Essentially, model capacity represents a measure by which we can estimate whether the model is prone to underfit or overfit.

Let’s understand this with some code:

import numpy as np
import matplotlib.pyplot as plt

First, we’ll create a dataset:

X = np.sort(np.random.rand(100))

Then we'll define a true function. In supervised learning, the true function is the underlying function that maps inputs to outputs. It will help us evaluate how closely our machine learning algorithm models the distribution. 

true_f = lambda X: np.cos(3.5 * np.pi * X)

Then we’ll define a y, which is an output of the true function:

y = true_f(X) + np.random.randn(100) * 0.1

When we plot X against y, our data distribution will look something like this:

Data distribution

We want our machine learning model to be as close as possible to the true function. 


Now let’s define our machine learning model:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

We'll use the polynomial degree as a proxy for capacity. A model with degree 1 has low capacity compared to a model with degree 15. 

degrees = [1, 15]

In this loop, we'll fit the model with the two degrees, 1 and 15. This will show how capacity determines whether we get a model that yields good results. 

plt.figure(figsize=(15, 10))
for i in range(len(degrees)):
    ax = plt.subplot(1, len(degrees), i + 1)
    plt.setp(ax, xticks=(), yticks=())

    polynomial_features = PolynomialFeatures(degree=degrees[i],
                                             include_bias=False)
    linear_regression = LinearRegression()

    # Creating a structure for the operation
    pipeline = Pipeline([("polynomial_features", polynomial_features),
                         ("linear_regression", linear_regression)])

    pipeline.fit(X[:, np.newaxis], y)

    # Testing
    X_test = np.linspace(0, 1, 100)
    yhat = pipeline.predict(X_test[:, np.newaxis])
    plt.plot(X_test, yhat, label="Model")
    plt.plot(X_test, true_f(X_test), label="True function")
    plt.scatter(X, y, label="Samples")
    plt.xlabel("x")
    plt.ylabel("y")
    plt.xlim((0, 1))
    plt.ylim((-2, 2))
    plt.legend(loc="best")
    plt.title("Degree %d" % degrees[i])
plt.show()
Model capacity

From the above image, we can see that capacity plays an important role in building a good machine learning model. The degree-1 model on the left underfits, while the degree-15 model on the right fits the distribution closely. 

Keep in mind that if the capacity of the model is higher than the complexity of the data, then the model can overfit.

Overfitting model

Architecture is a combination of multiple levels of nonlinear functions that model a given data distribution. Different models have different architectures depending on the data you're working with, especially when you're developing a deep learning model, or working with a non-parametric model like random forest or XGBoost. 

Overfitting or underfitting can happen when these architectures are unable to learn or capture patterns. 

Datasets 

In a typical machine learning scenario, we start with an initial dataset that we split into training and testing datasets. Conventionally, we use 80% of the dataset to train the model, and save 20% to test it. 
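With scikit-learn, for example, that split is a one-liner (a sketch, assuming the X and y arrays from the earlier example):

from sklearn.model_selection import train_test_split

# Hold out 20% of the data for the final evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X[:, np.newaxis], y, test_size=0.2, random_state=42)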

During the training phase, our model will certainly deviate from the training data. This deviation is often referred to as the training error; similarly, the deviation produced during the test phase is referred to as the test error. Our job as machine learning practitioners is to:

  1.  Reduce the training error – this addresses underfitting. 
  2.  Reduce the gap between the training and test error – this addresses overfitting.

Keeping that in mind, we take into consideration two concepts that are really important when it comes to datasets:

  1. Variance and bias
  2. Data-leakage

Let’s explore them in detail.

Variance and bias

Variance and bias are some of those concepts that most of us forget to consider. What do they mean?

Bias usually refers to the rigidity of the model. Consider a dataset with a nonlinear property (something we saw in the beginning). In order to capture the pattern, we need a machine learning algorithm that's flexible enough to capture the nonlinear property. If we apply a linear equation instead, we say that the machine learning model has high bias and low variance. In simple words, high-bias models are too rigid to capture the complex nature of the data.

Let’s define a nonlinear function that captures the true features or representation of the data, and a simple linear model. 

non_linear_func = lambda X: np.cos(3.5 * np.pi * X)
simple_model = 2 ** X   # a rigid curve that cannot follow the oscillation
Overfitting underfitting bias

As you can see from the image above, the simple model is:

  1. Too rigid and too simple to capture the nonlinear representation.
  2. High bias and low variance.
  3. Producing a high error on the training set – underfitting.

Variance, on the other hand, is when the algorithm, during training, tries to model the distribution so precisely that it captures the position of each and every data point. As a result, the model becomes too flexible and too complex. In such cases, the model has high variance and low bias. 

Overfitting underfitting variance

As you can see from the image above, the complex model has:

  1. Tendency to capture noise from the dataset.
  2. High variance and low bias.
  3. High error on the test set – overfitting.

When it comes to overfitting, the decision tree algorithm is particularly prone to it. 

What is data-leakage and does it affect model performance?

One of the problems that leads to overfitting is data leakage. 

It happens when information from the training set is transferred into the testing set. During the final evaluation, the model then performs deceptively well on the testing set. This overestimate of performance is very misleading, because there's a good chance the model will perform poorly once deployed. 

In order to avoid data leakage, it’s better to separate the training dataset and testing dataset before doing any feature engineering. This way, the information is not transferred to the testing dataset, and the true performance of the model can be measured during the testing. 
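As a sketch, here's the safe order of operations, with a standard scaler standing in for any fitted feature-engineering step:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Split first...
X_train, X_test, y_train, y_test = train_test_split(
    X[:, np.newaxis], y, test_size=0.2, random_state=42)

# ...then fit the preprocessing on the training set only
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # statistics learned from training data
X_test_scaled = scaler.transform(X_test)        # the same statistics reused, nothing leaks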


How to overcome overfitting and underfitting in your ML model?

We'll discuss seven ways to avoid overfitting and underfitting:

  1. Introduce a validation set,
  2. Variance-bias tradeoff,
  3. Cross-validation,
  4. Hyperparameter tuning,
  5. Ensemble methods,
  6. Regularization,
  7. Early stopping.

Validation set

A validation dataset is used to provide an unbiased evaluation of a model trained on the training dataset. It's helpful when iterating on the model's architecture, or when tuning its hyperparameters. In both situations it improves model performance before deployment, and helps make sure that the model generalizes well to testing data. 

It's important to remember that training and validation data are used during training to check for overfitting and underfitting. Test data is used to make sure that the model is generalizing well. 

A good practice is to first create a training set and a testing set from the entire dataset, and then create a training set and a validation set from the previously separated training set. 

Validation dataset
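A minimal sketch of that two-stage split (taking 0.25 of the 80% training split leaves a 60/20/20 train/validation/test division):

from sklearn.model_selection import train_test_split

# First carve out the test set...
X_train, X_test, y_train, y_test = train_test_split(
    X[:, np.newaxis], y, test_size=0.2, random_state=42)

# ...then carve a validation set out of the remaining training data
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=42)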

Variance-bias tradeoff

We already know what variance and bias are, and they play an important role in building a good machine learning model, or even a deep learning model that yields good results. Essentially, we can't have both extremely low bias and extremely low variance at the same time; reducing one tends to increase the other, so we settle for limited quantities of both. 

“One way to improve an overfitting model is to feed it more training data until the validation error reaches the training error” – (Hands-On Machine Learning with Scikit-Learn and TensorFlow,  Aurélien Géron; 2019)

The variance-bias tradeoff is basically about finding the sweet spot between bias and variance. We know that bias reflects the model's rigidity towards the data, whereas variance reflects the model's sensitivity to the complexity of the data. High bias results in a rigid model. As we increase the capacity, the model gains flexibility and sheds rigidity. Essentially, we're transforming an underfitted model into a statistically good fit by increasing the capacity. 

A good practice is to check both the training error and the validation error, because the expected error decomposes into bias and variance terms (plus irreducible noise). If both errors are low and close to each other, then the model has a good fit.  
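More precisely, for a squared-error loss the expected test error decomposes as:

$$\mathbb{E}\big[(y - \hat{f}(x))^2\big] = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2} + \underbrace{\operatorname{Var}\big[\hat{f}(x)\big]}_{\text{variance}} + \underbrace{\sigma^2}_{\text{irreducible noise}}$$

so only a model with both low bias and low variance can achieve a low test error.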

Variance-bias tradeoff

The image on the left shows high bias and underfitting, the center image shows a good fit, and the image on the right shows high variance and overfitting. 

Cross-validation

Cross-validation helps us avoid overfitting by evaluating ML models on multiple validation datasets during training. It's done by dividing the training data into subsets. So how does cross-validation help a model?

Well, we want the model to learn the important features and patterns during the training. Cross-validation splits the data such that the validation data is representative of both training and the data from the real-world scenario. This helps the model generalize well and yield good results.

Overfitting underfitting cross-validation
Source: Approaching (Almost) Any Machine Learning Problem (2020), Abhishek Thakur

Let’s see an example. 

First, we'll build four decision trees, each with a different max depth. Next, we'll train them:

from sklearn.tree import DecisionTreeRegressor

tree_reg1 = DecisionTreeRegressor(random_state=42, max_depth=2)
tree_reg2 = DecisionTreeRegressor(random_state=42, max_depth=3)
tree_reg3 = DecisionTreeRegressor(random_state=42, max_depth=5)
tree_reg4 = DecisionTreeRegressor(random_state=42, max_depth=12)

tree_reg1.fit(X_train, y_train)
tree_reg2.fit(X_train, y_train)
tree_reg3.fit(X_train, y_train)
tree_reg4.fit(X_train, y_train)

Now, let's see the accuracy of all four models on both the training and testing datasets:

The training and testing scores of model 1: 0.7732058844148597 and 0.770360248173112,
The training and testing scores of model 2: 0.8523996532650688 and 0.8476275950133408,
The training and testing scores of model 3: 0.8964495771468475 and 0.8907512124389504,
The training and testing scores of model 4: 0.9254890162488267 and 0.8895815575629907

Two things we can observe:

  1. As the max depth increases, the training accuracy increases.
  2. As the max depth increases, the difference between the training and the testing accuracy also increases – overfitting. 

In order to fix that, we'll use k-fold cross-validation to create subsets from the training set. k-fold cross-validation splits the dataset into k folds, then uses one of the k folds as a validation set, and the remaining k-1 folds as a training set. 

This process is repeated k times, such that each of the k folds is used once as the validation set. The scores obtained from these k rounds of training and validation are then averaged to obtain the final score. 

from sklearn.model_selection import KFold

# A 5-fold splitter (matching the five folds reported below)
fold = KFold(n_splits=5)

for index, (train, test) in enumerate(fold.split(X_train, y_train)):
    X_train_folds = X_train[train]
    y_train_folds = y_train[train]

    X_test_folds = X_train[test]
    y_test_folds = y_train[test]

    tree_reg1.fit(X_train_folds, y_train_folds)
    tree_reg2.fit(X_train_folds, y_train_folds)
    tree_reg3.fit(X_train_folds, y_train_folds)
    tree_reg4.fit(X_train_folds, y_train_folds)

The purpose of k-fold is to help the model generalize well on testing data. 

Fold 1
Accuracy Comparison on model 1 :  0.7664370565884211 0.7801300087611103
Accuracy Comparison on model 2 :  0.8485031490397249 0.8586081582213081
Accuracy Comparison on model 3 :  0.8950440772346971 0.9007301852045746
Accuracy Comparison on model 4 :  0.9268552462895857 0.8978944174232537

Fold 2
Accuracy Comparison on model 1 :  0.7671249512433342 0.7687678014811595
Accuracy Comparison on model 2 :  0.8497676129534959 0.8515991797911563
Accuracy Comparison on model 3 :  0.8970919853597747 0.8931467178250443
Accuracy Comparison on model 4 :  0.9283195789947759 0.8911095249603449

Fold 3
Accuracy Comparison on model 1 :  0.7735518731532391 0.7684962516765577
Accuracy Comparison on model 2 :  0.8535462998699248 0.8470155912448611
Accuracy Comparison on model 3 :  0.8969106026960184 0.8898887269256492
Accuracy Comparison on model 4 :  0.9288963915724866 0.8884304629263801

Fold 4
Accuracy Comparison on model 1 :  0.7738322196681096 0.7753768159905526
Accuracy Comparison on model 2 :  0.8536239983149718 0.8512559589603865
Accuracy Comparison on model 3 :  0.8968186364805686 0.8931328656292392
Accuracy Comparison on model 4 :  0.9280796541851367 0.891684128138715

Fold 5
Accuracy Comparison on model 1 :  0.7733590419579685 0.750509982451151
Accuracy Comparison on model 2 :  0.8518211510105747 0.8362310647486868
Accuracy Comparison on model 3 :  0.8977214861124465 0.8890623271523825
Accuracy Comparison on model 4 :  0.9290267746532016 0.8859361597163452

As we can see, overfitting is reduced to some extent. So far, so good. 

But we have to keep in mind that cross-validation is only as good as the subsets it creates for the training and evaluation phases. It gives us a starting point, but we have to learn other methods to optimize our models as well. 
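For the common case, you don't even have to write the fold loop yourself. scikit-learn's cross_val_score does the same job in one call (a sketch, run on the deepest tree):

from sklearn.model_selection import cross_val_score

# Five validation scores, one per fold
scores = cross_val_score(tree_reg4, X_train, y_train, cv=5)
print(scores.mean(), scores.std())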

Let’s discover other methods to further reduce overfitting, while increasing model performance. 


Hyperparameter tuning

When creating a (good) ML or DL model, you'll have to carefully decide what architecture will work best for your application – in this case, the data distribution. This will mostly be a trial-and-error process until we find the architecture that models the distribution accurately. For that, we need to explore a range of possibilities.

While looking for the optimal model architecture, we can ask the machine learning algorithm to perform a search and give us a good combination of parameters or architecture for our data distribution. The parameters that define the model architecture are referred to as hyperparameters, and the process of searching for them is referred to as hyperparameter tuning.

Some of the questions that you might ask while designing a model are:

  1. What should be the degree of polynomial features for the linear model?
  2. What should be the maximum depth allowed for the decision tree or random forest?
  3. How many trees should I include in my random forest?
  4. What should be the learning rate for gradient descent? 

In this example, we’ll see how we can use grid-search to find the best parameter for our random forest algorithm to reduce overfitting:

from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV

from sklearn.metrics import r2_score

def performance_metric(y_true, y_predict):
   """ Calculates and returns the performance score between
       true and predicted values based on the metric chosen. """

   score = r2_score(y_true,y_predict)

   return score

To run a grid search, we need to define all the parameters we want to vary in a dictionary:

from sklearn.ensemble import RandomForestRegressor

params = {'max_depth': np.arange(1, 11),
          'n_estimators': [10, 50, 100, 200],
          'oob_score': [False, True],
          'max_features': ['auto', 'sqrt']}

scoring_fnc = make_scorer(performance_metric)
regressor_tree = RandomForestRegressor(random_state=42, n_jobs=-1)
grid = GridSearchCV(estimator=regressor_tree, param_grid=params, cv=5, scoring=scoring_fnc)

Let’s fit the model and check the accuracy on both training and testing dataset:

grid.fit(X_train, y_train)
grid.score(X_train, y_train), grid.score(X_test, y_test)
 (0.9597848908613165, 0.9481551892881535)

As you can see, grid search helps you find parameters that reduce overfitting by a significant amount. 

To see which combination of parameters or architecture grid search found for the data distribution, just run grid.best_estimator_.
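Typical usage looks like this (the printed parameter values below are illustrative, not actual output):

best_model = grid.best_estimator_  # the model refitted with the best parameters
print(grid.best_params_)           # e.g. {'max_depth': 9, 'max_features': 'sqrt', ...}
y_pred = best_model.predict(X_test)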


Ensemble methods

Ensemble methods combine a group of predictive models to get an average prediction. It’s a very successful method, not only because it reduces overfitting, but also because it can solve extremely complex problems. A good example is HydraNet, used by Tesla for its self-driving cars. 

Since the car uses eight different cameras to get an overall view of the surroundings, it needs an algorithm which can model all of that input. HydraNet achieves it with an ensemble CNN architecture.

So far, we've seen how to avoid overfitting with techniques like introducing a validation dataset, cross-validation, and hyperparameter tuning via grid search. But the testing accuracy of our decision trees is still below 90%, even though overfitting has been reduced significantly. 

Let’s use ensemble methods to increase the accuracy, while maintaining a good statistical fit over the data distribution. We’ll also use all the techniques we’ve learned so far.

This time, we’ll use a random forest algorithm. Random forest is a combination of decision trees, where the parameter n_estimators defines the number of trees we want in our model. By default, it’s set to 100. 

from sklearn.ensemble import RandomForestRegressor

regressor_for = RandomForestRegressor(random_state=42, n_jobs=-1)
params = {'max_depth': np.arange(1, 6),
          'max_features': ['auto', 'sqrt', 'log2'],
          'n_estimators': [10, 100, 200]}

scoring_fnc = make_scorer(performance_metric)
grid = GridSearchCV(estimator=regressor_for,param_grid=params,cv=5,scoring=scoring_fnc)

grid.fit(X_train, y_train)
print('The training and testing scores of model 1: {} and {}'.format(grid.score(X_train, y_train), grid.score(X_test, y_test)))

After modeling the data with the random forest, we get training and testing scores of 0.9098966600896439 and 0.8979734271215161. Not only did this increase the accuracy of the model, it also kept the gap between the training and testing scores very small – good enough. 


Regularization

So far in our example we've worked mostly with non-parametric models. These models don't assume a fixed number of parameters before seeing the data, which gives them an advantage over parametric models – a larger capacity. 

“Parametric models learn a function described by a parameter vector whose size is finite and fixed before any data is observed” – Deep Learning Book, Ian Goodfellow.

Both parametric and non-parametric models can overfit, and they can be regularized. 

Regularization means constraining a complex model to make it simpler and less flexible. This helps avoid overfitting.

Note: Regularization is also known as shrinkage.

Let’s see an example. 

First, we'll fit a ridge regression algorithm to our data distribution: 

Ridge Regression algorithm

A ridge regression is a regularised version of linear regression:

from sklearn.linear_model import Ridge, LinearRegression

A simple relationship in linear regression looks like this:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p$$

where $y$ is the predicted output, the $x_j$ are the input variables, and the $\beta_j$ are the coefficients, or parameters.

During the training phase, the loss function is minimized by updating the coefficients through an optimization process. This loss function is known as the residual sum of squares, or RSS:

$$\text{RSS} = \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2$$

The RSS adjusts the coefficients based on your training data. If there's noise in the training data, the estimated coefficients won't generalize well to unseen data. Regularization helps here: when coefficients deviate too far from the true function, it shrinks them, regularizing the learned estimates towards zero.

$$\text{RSS} + \lambda \sum_{j=1}^{p}\beta_j^2$$

In the equation above, the RSS is augmented with a regularization term, where $\lambda \geq 0$ is the regularization strength. 

Overfitting Vs Underfitting regularization

The blue line represents a model where the regularization strength is 0. Notice that when the regularization strength is greater than 0, the model starts to reduce its capacity, or flexibility, resulting in reduced overfitting. 
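Here's a minimal sketch of how such a plot could be produced, fitting the same degree-15 polynomial model with increasing regularization strength (alpha is scikit-learn's name for $\lambda$; alpha=0 is equivalent to plain linear regression):

from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

for alpha in [0, 0.001, 0.1, 1]:
    ridge_model = Pipeline([
        ("polynomial_features", PolynomialFeatures(degree=15, include_bias=False)),
        ("ridge", Ridge(alpha=alpha)),
    ])
    ridge_model.fit(X[:, np.newaxis], y)
    print(alpha, ridge_model.score(X[:, np.newaxis], y))  # training R^2 drops as alpha grows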

Early stopping

In iterative algorithms, like deep neural networks, or even shallow iterative algorithms such as a stochastic gradient descent regressor, you can regularize the model by stopping training early – as soon as the validation error reaches its minimum. This is early stopping. 

Early stopping
Source: Hands-On Machine Learning with Scikit-Learn, Keras and TensorFlow, Aurélien Géron, 2019
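Here's a minimal sketch of early stopping with scikit-learn's SGDRegressor, trained one epoch at a time with partial_fit (assuming the training and validation splits from earlier; in practice you'd also snapshot the best model):

import numpy as np
from sklearn.linear_model import SGDRegressor

sgd = SGDRegressor(random_state=42)

best_val_error = float("inf")
best_epoch = 0

for epoch in range(1000):
    sgd.partial_fit(X_train, y_train)   # one pass over the training data
    val_error = np.mean((sgd.predict(X_val) - y_val) ** 2)
    if val_error < best_val_error:      # validation error still falling
        best_val_error = val_error
        best_epoch = epoch

print(best_epoch, best_val_error)

scikit-learn also offers this behaviour out of the box via SGDRegressor(early_stopping=True, n_iter_no_change=5), which holds out a validation fraction internally.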

Conclusion

In this article, we explored what overfitting and underfitting are. Let’s summarize:

  • Overfitting is when:
    • Learning algorithm models training data well, but fails to model testing data.
    • Model complexity is higher than data complexity.
    • Data has too much noise or variance.
  • Underfitting is when:
    • Learning algorithm is unable to model training data.
    • Model is simple and rigid, making it unable to capture data points.
    • Model has high bias.
  • Overfitting can be identified when:
    • Training accuracy is higher than the testing accuracy.
    • Training error is lower than the testing error.
  • Underfitting can be identified when the accuracy of the model is very low, even on the training data.
Overfitting Vs Underfitting
Source: Deep Learning Book, Ian Goodfellow

To resolve underfitting, increase the capacity of the model. This will increase the complexity and lower the bias.

To resolve overfitting, decrease the capacity of the model. This will increase bias, making the model less complex. Use techniques like:

  1. Introduce a validation set,
  2. Variance-bias tradeoff,
  3. Cross-validation,
  4. Hyperparameter tuning,
  5. Ensemble methods,
  6. Regularization,
  7. Early stopping.

The notebook for this article is provided in this link. Feel free to experiment with it!
