Cross-Validation in Machine Learning: How to Do It Right

Posted October 6, 2020

In machine learning (ML), generalization usually refers to the ability of an algorithm to be effective across a range of inputs. It means that the ML model does not suffer performance degradation on new inputs drawn from the same distribution as the training data.

For human beings generalization is the most natural thing possible. We can classify on the fly. For example, we would definitely recognize a dog even if we had never seen that breed before. Nevertheless, it might be quite a challenge for an ML model. That’s why checking the algorithm’s ability to generalize is an important task that requires a lot of attention when building the model.

To do that, we use Cross-Validation (CV).

In this article we will cover:

  • What is Cross-Validation: definition, purpose of use and techniques
  • Different CV techniques: hold-out, k-folds, Leave-one-out, Leave-p-out, Stratified k-folds, Repeated k-folds, Nested k-folds, Complete CV
  • How to use these techniques: sklearn
  • Cross-Validation in Machine Learning: sklearn, CatBoost
  • Cross-Validation in Deep Learning: Keras, PyTorch, MxNet
  • Best practices and tips: time series, medical and financial data, images

What is Cross-Validation

Cross-validation is a technique for evaluating a machine learning model and testing its performance. CV is commonly used in applied ML tasks. It helps to compare and select an appropriate model for the specific predictive modeling problem.

CV is easy to understand, easy to implement, and it tends to have a lower bias than other methods used to estimate the model’s performance. All this makes cross-validation a powerful tool for selecting the best model for a specific task.

There are a lot of different techniques that may be used to cross-validate a model. Still, all of them have a similar algorithm:

  1. Divide the dataset into two parts: one for training, the other for testing
  2. Train the model on the training set
  3. Validate the model on the test set
  4. Repeat steps 1-3 several times. How many times depends on the CV method that you are using

As you may know, there are plenty of CV techniques. Some of them are commonly used, others work only in theory. Let’s look at the cross-validation methods that will be covered in this article:

  • Hold-out
  • K-folds
  • Leave-one-out
  • Leave-p-out
  • Stratified K-folds
  • Repeated K-folds
  • Nested K-folds
  • Complete 

Hold-out

Hold-out cross-validation is the simplest and most common technique. You might not know it by that name, but you certainly use it every day.

The algorithm of hold-out technique:

  1. Divide the dataset into two parts: the training set and the test set. Usually, 80% of the dataset goes to the training set and 20% to the test set, but you may choose any split that suits your task better
  2. Train the model on the training set
  3. Validate on the test set
  4. Save the result of the validation
[Figure: Hold-out cross-validation scheme]

That’s it. 

We usually use the hold-out method on large datasets, as it requires training the model only once.

It is really easy to implement hold-out. For example, you may do it using sklearn.model_selection.train_test_split.

import numpy as np
from sklearn.model_selection import train_test_split

# toy data: 5 samples with 2 features each
X, y = np.arange(10).reshape((5, 2)), range(5)

# hold out 20% of the samples as the test set
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2,
                                                    random_state=111)

Still, hold-out has a major disadvantage.

Consider, for example, a dataset that is not completely even distribution-wise. After the split we may end up in a rough spot: the training set may not be representative of the test set. The two sets may differ a lot, and one of them may turn out to be much easier or harder than the other.

Moreover, the fact that we test our model only once might be a bottleneck for this method. For the reasons mentioned above, the result obtained with the hold-out technique may be considered inaccurate.

k-Fold

k-Fold CV is a technique that minimizes the disadvantages of the hold-out method. k-Fold introduces a new way of splitting the dataset which helps to overcome the “test only once” bottleneck.

The algorithm of k-Fold technique:

  1. Pick a number of folds – k. Usually, k is 5 or 10 but you can choose any number which is less than the dataset’s length.
  2. Split the dataset into k equal (if possible) parts (they are called folds)
  3. Choose k – 1 folds which will be the training set. The remaining fold will be the test set
  4. Train the model on the training set. On each iteration of cross-validation, you must train a new model independently of the model trained on the previous iteration
  5. Validate on the test set
  6. Save the result of the validation
  7. Repeat steps 3 – 6 k times. Each time use the remaining  fold as the test set. In the end, you should have validated the model on every fold that you have.
  8. To get the final score average the results that you got on step 6.
[Figure: k-Fold cross-validation scheme]

To perform k-Fold cross-validation you can use sklearn.model_selection.KFold.

import numpy as np
from sklearn.model_selection import KFold

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([1, 2, 3, 4])
kf = KFold(n_splits=2)

for train_index, test_index in kf.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

In general, it is always better to use the k-Fold technique instead of hold-out. In a head-to-head comparison, k-Fold gives a more stable and trustworthy result since training and testing are performed on several different parts of the dataset. We can make the overall score even more robust if we increase the number of folds to test the model on many different sub-datasets.

Still, the k-Fold method has a disadvantage. Increasing k results in training more models, and the training process might be really expensive and time-consuming.

Leave-one-out

Leave-one-out cross-validation (LOOCV) is an extreme case of k-Fold CV. Imagine that k is equal to n, where n is the number of samples in the dataset. Such a k-Fold case is equivalent to the Leave-one-out technique.


The algorithm of LOOCV technique:

  1. Choose one sample from the dataset which will be the test set
  2. The remaining n – 1 samples will be the training set
  3. Train the model on the training set. On each iteration, a new model must be trained
  4. Validate on the test set
  5. Save the result of the validation
  6. Repeat steps 1 – 5 n times as for n samples we have n different training and test sets
  7. To get the final score average the results that you got on step 5.
[Figure: Leave-one-out cross-validation scheme]

For LOOCV, sklearn also has a built-in method. It can be found in the model_selection module – sklearn.model_selection.LeaveOneOut.

import numpy as np
from sklearn.model_selection import LeaveOneOut

X = np.array([[1, 2], [3, 4]])
y = np.array([1, 2])
loo = LeaveOneOut()

for train_index, test_index in loo.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

The greatest advantage of Leave-one-out cross-validation is that it doesn’t waste much data. We use only one sample from the whole dataset as the test set, whereas the rest is the training set. But compared with k-Fold CV, LOOCV requires building n models instead of k, and n (the number of samples in the dataset) is usually much higher than k. This makes LOOCV far more computationally expensive than k-Fold: cross-validating a model with LOOCV may take plenty of time.

Thus, the Data Science community has a general rule, based on empirical evidence and various studies, which suggests that 5- or 10-fold cross-validation should be preferred over LOOCV.

Leave-p-out

Leave-p-out cross-validation (LpOC) is similar to Leave-one-out CV as it creates all the possible training and test sets by using p samples as the test set. Everything mentioned about LOOCV also applies to LpOC.

Still, it is worth mentioning that, unlike LOOCV and k-Fold, test sets will overlap for LpOC if p is greater than 1.

The algorithm of LpOC technique:

  1. Choose p samples from the dataset which will be the test set
  2. The remaining n – p samples will be the training set
  3. Train the model on the training set. On each iteration, a new model must be trained
  4. Validate on the test set
  5. Save the result of the validation
  6. Repeat steps 1 – 5 C(n, p) times (once for every possible choice of p test samples out of n)
  7. To get the final score average the results that you got on step 5

You can perform Leave-p-out CV using sklearn – sklearn.model_selection.LeavePOut.

import numpy as np
from sklearn.model_selection import LeavePOut

X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
y = np.array([1, 2, 3, 4])
lpo = LeavePOut(2)

for train_index, test_index in lpo.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

LpOC has all the disadvantages of LOOCV, but, nevertheless, it’s just as robust.

Stratified k-Fold

Sometimes we may face a large imbalance of the target value in the dataset. For example, in a dataset concerning wristwatch prices, there might be a larger number of wristwatches with a high price. In the case of classification, in a cats-and-dogs dataset there might be a large shift towards the dog class.

Stratified k-Fold is a variation of the standard k-Fold CV technique which is designed to be effective in such cases of target imbalance. 

It works as follows. Stratified k-Fold splits the dataset into k folds such that each fold contains approximately the same percentage of samples of each target class as the complete set. In the case of regression, Stratified k-Fold makes sure that the mean target value is approximately equal in all the folds.

The algorithm of Stratified k-Fold technique:

  1. Pick a number of folds – k
  2. Split the dataset into k folds. Each fold must contain approximately the same percentage of samples of each target class as the complete set 
  3. Choose k – 1 folds which will be the training set. The remaining fold will be the test set
  4. Train the model on the training set. On each iteration a new model must be trained
  5. Validate on the test set
  6. Save the result of the validation
  7. Repeat steps 3 – 6 k times. Each time use the remaining  fold as the test set. In the end, you should have validated the model on every fold that you have.
  8. To get the final score average the results that you got on step 6.

As you may have noticed, the algorithm for the Stratified k-Fold technique is similar to the standard k-Fold. You don’t need to code anything extra; the method will do everything necessary for you.

Stratified k-Fold also has a built-in method in sklearn – sklearn.model_selection.StratifiedKFold.

import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([0, 0, 1, 1])
skf = StratifiedKFold(n_splits=2)

for train_index, test_index in skf.split(X, y):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

Everything mentioned above about k-Fold CV also holds for the Stratified k-Fold technique. When choosing between different CV methods, make sure you are using the proper one. For example, you might think that your model performs badly simply because you are using k-Fold CV to validate a model which was trained on a dataset with a class imbalance. To avoid that, you should always do a proper exploratory data analysis on your data.

Repeated k-Fold

Repeated k-Fold cross-validation, or Repeated random sub-sampling CV, is probably the most robust of all the CV techniques in this article. It is a variation of k-Fold, but in the case of Repeated k-Fold, k is not the number of folds. It is the number of times we will train the model.

The general idea is that on every iteration we will randomly select samples from all over the dataset as our test set. For example, if we decide that 20% of the dataset will be our test set, 20% of the samples will be randomly selected and the remaining 80% will become the training set.

The algorithm of Repeated k-Fold technique:

  1. Pick k – a number of times the model will be trained
  2. Pick a number of samples which will be the test set
  3. Split the dataset
  4. Train on the training set. On each iteration of cross-validation, a new model must be trained
  5. Validate on the test set
  6. Save the result of the validation
  7. Repeat steps 3-6 k times
  8. To get the final score average the results that you got on step 6.
[Figure: Repeated k-Fold scheme]

Repeated k-Fold has clear advantages over standard k-Fold CV. Firstly, the proportion of train/test split is not dependent on the number of iterations. Secondly, we can even set unique proportions for every iteration. Thirdly, random selection of samples from the dataset makes Repeated k-Fold even more robust to selection bias.

Still, there are some disadvantages. k-Fold CV guarantees that the model will be tested on all samples, whereas Repeated k-Fold is based on randomization which means that some samples may never be selected to be in the test set at all. At the same time, some samples might be selected multiple times.
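
Incidentally, the randomized splitting described above (a freshly sampled test subset on every iteration) is also available in sklearn as ShuffleSplit. A minimal sketch, where the 20% test size and the number of iterations are just illustrative values:

import numpy as np
from sklearn.model_selection import ShuffleSplit

X = np.arange(20).reshape((10, 2))
y = np.arange(10)

# 5 iterations; on each one a fresh random 20% of the samples is the test set
ss = ShuffleSplit(n_splits=5, test_size=0.2, random_state=42)
for train_index, test_index in ss.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)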

Sklearn will help you to implement Repeated k-Fold CV. Just use sklearn.model_selection.RepeatedKFold. In the sklearn implementation of this technique you must set the number of folds that you want to have (n_splits) and the number of times the split will be performed (n_repeats). It guarantees that you will have different folds on each iteration.

import numpy as np
from sklearn.model_selection import RepeatedKFold

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([0, 0, 1, 1])
# 2 folds repeated 2 times with different randomization -> 4 train/test splits
rkf = RepeatedKFold(n_splits=2, n_repeats=2, random_state=42)

for train_index, test_index in rkf.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

Nested k-Fold

Unlike the other CV techniques, which are designed to evaluate the quality of an algorithm, Nested k-Fold CV is the most popular way to tune the hyperparameters of an algorithm while still getting an honest estimate of its out-of-sample performance.


Imagine that we have a parameter p which usually depends on the base algorithm that we are cross-validating. For example, for Logistic Regression it might be the penalty parameter which is used to specify the norm used in the penalization.

The algorithm of Nested k-Fold technique:

  1. Pick k – a number of folds, for example, 10 – let’s assume that we’ve picked this number
  2. Pick a parameter p. Let’s assume that our algorithm is Logistic Regression and p is the penalty parameter p = {‘l1’, ‘l2’, ‘elasticnet’, ‘none’}
  3. Divide the dataset into 10 folds and reserve one of them for test
  4. Reserve one of the training folds for validation
  5. For each value of p train on the 8 remaining training folds and evaluate on the validation fold. You now have 4 measurements
  6. Repeat steps 4-5 9 times. Rotate which training fold is the validation fold. You now have 9 * 4 measurements
  7. Choose the p that minimizes the average validation error over the 9 validation folds. Use that p to evaluate on the test set
  8. Repeat 10 times from step 2, using each fold in turn as the test fold
  9. Save the mean and standard deviation of the evaluation measure over the 10 test folds
  10. The best algorithm (or parameter setting) is the one with the best average out-of-sample performance across the 10 test folds

This technique is computationally expensive because throughout steps 1 – 10 plenty of models should be trained and evaluated. Still, Nested k-Fold CV is commonly used and might be really effective across multiple ML tasks.

Unfortunately, there is no single built-in method in sklearn that would perform Nested k-Fold CV for you. This is the moment when you should either Google someone’s implementation or code it yourself.
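
One common way to assemble the procedure above from sklearn building blocks is to nest a GridSearchCV (the inner loop that picks the parameter) inside cross_val_score (the outer loop that estimates performance). A minimal sketch; the dataset, the parameter grid (restricted to penalties the liblinear solver supports) and the fold counts are only illustrative:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score

X, y = make_classification(n_samples=200, random_state=42)

# Inner loop: tune the penalty parameter with a 9-fold grid search
param_grid = {"penalty": ["l1", "l2"], "C": [0.1, 1.0, 10.0]}
inner_cv = GridSearchCV(LogisticRegression(solver="liblinear"),
                        param_grid, cv=9)

# Outer loop: estimate out-of-sample performance with 10-fold CV
outer_scores = cross_val_score(inner_cv, X, y, cv=10)
print(outer_scores.mean(), outer_scores.std())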

Complete Cross-Validation

Complete CV is the least used CV technique. The general idea is that we choose a number k – the length of the training set and validate on every possible split containing k samples in the training set.

The number of those splits can be calculated as C(n, k), the number of ways to choose k training samples out of n, where n is the length of the dataset. If k is higher than 2, we will have to train our model plenty of times which, as we have already figured out, is an expensive procedure time- and computation-wise.

This is why Complete CV is used either in theoretical researches or if there is an effective formula that will help to minimize the calculations.

The algorithm of Complete cross-validation:

  1. Pick a number k – length of the training set
  2. Split the dataset
  3. Train on the training set
  4. Validate on the test set
  5. Save the result of the validation
  6. Repeat steps 2 – 5 C(n, k) times
  7. To get the final score average the results that you got on step 5
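
There is no dedicated sklearn helper for Complete CV, but for a tiny dataset you can sketch it yourself by iterating over all C(n, k) combinations of training indices. A minimal illustrative sketch; the toy data and the choice of classifier are just placeholders:

import numpy as np
from itertools import combinations
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# toy data: n = 6 samples, k = 4 samples in every training set
X = np.arange(12).reshape((6, 2))
y = np.array([0, 1, 0, 1, 0, 1])
n, k = len(X), 4

scores = []
for train_idx in combinations(range(n), k):        # C(n, k) possible splits
    train_idx = list(train_idx)
    test_idx = [i for i in range(n) if i not in train_idx]
    model = DecisionTreeClassifier().fit(X[train_idx], y[train_idx])
    scores.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))

print(np.mean(scores))  # final score: average over all C(n, k) splits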

Cross-Validation in Machine Learning

Most of the cross-validation techniques mentioned above are widely used in ML. It’s important to remember that using the proper CV technique may save you a lot of time and help to select the best model for the task.

It means that, firstly, it’s better to always cross-validate the model and, secondly, you should choose a relevant CV method. Thus, knowing the benefits and disadvantages of cross-validation techniques is vital.

It’s worth mentioning that if you want to cross-validate a model you should always check its manual, because some ML algorithms, for example CatBoost, have their own built-in CV methods. You may find them relevant for your ML task and use them instead of sklearn’s built-in methods.
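
For instance, CatBoost ships a cv utility that runs k-fold CV with the library’s own training loop. A minimal sketch, assuming the catboost package is installed; the dataset and parameter values are illustrative, so check the CatBoost docs for the exact options:

from catboost import Pool, cv
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, random_state=42)

# 5-fold CV driven entirely by CatBoost's own training loop
params = {"iterations": 100, "loss_function": "Logloss", "verbose": False}
cv_results = cv(Pool(X, y), params, fold_count=5)

print(cv_results.head())  # per-iteration train/test metric statistics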

In general, as you may have noticed, many CV techniques have sklearn built-in methods. I strongly recommend using them, as these methods will save you plenty of time for more complicated tasks.
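
In particular, when all you need is a cross-validated score, sklearn’s cross_val_score wraps the whole split–train–evaluate loop into one call. A minimal sketch; the estimator and the fold count are just an example:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# 5-fold CV: fits 5 models and returns the 5 validation scores
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores.mean(), scores.std())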

Cross-Validation in Deep Learning

Cross-validation in Deep Learning (DL) might be a little tricky because most of the CV techniques require training the model at least a couple of times. 

In deep learning you would normally be tempted to avoid CV because of the cost associated with training k different models. Instead of doing k-Fold or another CV technique, you might use a random subset of your training data as a hold-out for validation purposes.

For example, the Keras deep learning library allows you to pass one of two parameters to the fit function that performs training.

  1. validation_split: percentage of the data that should be held out for validation
  2. validation_data: a tuple of (X, y) which should be used for validation. This parameter overrides the validation_split parameter which means you can use only one of these parameters at once.
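
A minimal sketch of both options, assuming TensorFlow’s Keras API; the toy data, architecture, and epoch count are placeholders:

import numpy as np
from tensorflow import keras

X, y = np.random.rand(1000, 20), np.random.rand(1000)

model = keras.Sequential([keras.layers.Dense(64, activation="relu"),
                          keras.layers.Dense(1)])
model.compile(optimizer="adam", loss="mse")

# Option 1: hold out the last 20% of the training data for validation
model.fit(X, y, epochs=5, validation_split=0.2, verbose=0)

# Option 2: pass an explicit validation set (overrides validation_split)
X_val, y_val = np.random.rand(200, 20), np.random.rand(200)
model.fit(X, y, epochs=5, validation_data=(X_val, y_val), verbose=0)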

The same approach is used in the official tutorials of other DL frameworks such as PyTorch and MxNet. They also suggest splitting the dataset into three parts: training, validation, and testing.

  1. Training – a part of the dataset to train on
  2. Validation – a part of the dataset to validate on while training
  3. Testing – a part of the dataset for final validation of the model
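
In PyTorch, for instance, such a three-way split can be produced with torch.utils.data.random_split. A minimal sketch with a toy dataset and hypothetical split sizes:

import torch
from torch.utils.data import TensorDataset, random_split

# toy dataset of 1,000 samples with 20 features each
dataset = TensorDataset(torch.rand(1000, 20), torch.rand(1000))

# 70% training, 15% validation, 15% testing
train_set, val_set, test_set = random_split(
    dataset, [700, 150, 150], generator=torch.Generator().manual_seed(42))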

Still, you can use cross-validation in DL tasks if the dataset is tiny (contains hundreds of samples). In this case, training a complex model might not be worthwhile anyway, so make sure you don’t complicate the task further.

Best practices and tips

It’s worth mentioning that sometimes performing cross-validation might be a little tricky. 

For example, it’s quite easy to make a logical mistake when splitting the dataset which may lead to an untrustworthy CV result. 

Below are some tips to keep in mind when cross-validating a model:

  1. Be logical when splitting the data (does the splitting method make sense)
  2. Use the proper CV method (is this method viable for my use-case)
  3. When working with time series, don’t validate on the past (see the first tip)
  4. When working with medical or financial data, remember to split by person. Avoid having data for one person in both the training and the test set, as it may be considered a data leak
  5. When cropping patches from larger images, remember to split by the large image ID (a sketch covering tips 3-5 follows this list)
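
A minimal sketch of two sklearn helpers that implement these splitting rules: TimeSeriesSplit keeps every test fold strictly after its training data, and GroupKFold keeps all samples sharing a group (a person, or a source image ID) in a single fold. The toy arrays and group ids are placeholders:

import numpy as np
from sklearn.model_selection import GroupKFold, TimeSeriesSplit

X = np.arange(20).reshape((10, 2))
y = np.arange(10)

# Time series: each test fold comes strictly after its training fold
tscv = TimeSeriesSplit(n_splits=3)
for train_index, test_index in tscv.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)

# Grouped data: samples sharing a group id never end up in both sets
groups = np.array([0, 0, 1, 1, 2, 2, 3, 3, 4, 4])  # e.g. patient or image id
gkf = GroupKFold(n_splits=5)
for train_index, test_index in gkf.split(X, y, groups=groups):
    print("TRAIN:", train_index, "TEST:", test_index)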

Of course, tips differ from task to task and it’s almost impossible to cover all of them. That’s why performing a solid exploratory data analysis before starting to cross-validate a model is always the best practice.

Final thoughts

Cross-validation is a powerful tool. Every Data Scientist should be familiar with it. In real life, you can’t finish the project without cross-validating a model.

In my opinion, the best CV techniques are Nested k-Fold and standard k-Fold. Personally, I used them in a Fraud Detection task. Nested k-Fold, as well as GridSearchCV, helped me to tune the parameters of my model. k-Fold, on the other hand, was used to evaluate my model’s performance.

In this article, we have figured out what cross-validation is, which CV techniques exist in the wild, and how to implement them. In the future, ML algorithms will definitely perform even better than today. Still, cross-validation will always be needed to back your results up.

Hopefully, with this information, you will have no problems setting up the CV for your next machine learning project!

Resources

  1. https://www.geeksforgeeks.org/cross-validation-machine-learning/
  2. https://machinelearningmastery.com/k-fold-cross-validation/
  3. https://towardsdatascience.com/cross-validation-in-machine-learning-72924a69872f
  4. https://towardsdatascience.com/why-and-how-to-do-cross-validation-for-machine-learning-d5bd7e60c189
  5. https://scikit-learn.org/stable/modules/cross_validation.html  