The Ultimate Guide to Evaluation and Selection of Models in Machine Learning
To properly evaluate your machine learning models and select the best one, you need a good validation strategy and solid evaluation metrics picked for your problem.
A good validation (evaluation) strategy is basically how you split your data to estimate future test performance. It could be as simple as a train-test split or a complex stratified k-fold strategy.
Once you know that you can estimate the future model performance, you need to choose a metric that fits your problem. If you understand the classification and regression metrics, then most other complex metrics (in object detection, for example) are relatively easy to grasp.
When you nail those two, you are good.
In this article, I will talk about:
 Choosing a good evaluation method (resampling, cross-validation, etc.)
 Popular (and less known) classification and regression metrics
 And bias / variance tradeoffs in machine learning.
So let’s get to it.
NOTE FROM THE PRODUCT TEAM
You can use neptune.ai to compare experiments and models based on metrics, parameters, learning curves, prediction images, dataset versions, and more.
It makes model evaluation and selection way easier.
For more:
– See the experiment comparison docs
– Learn about our experiment tracking product
– Check out videos, projects, and other resources about comparing experiments and models
Just to make sure we are on the same page, let’s get the definitions out of the way.
What is model evaluation?
Model evaluation is a process of assessing the model’s performance on a chosen evaluation setup. It is done by calculating quantitative performance metrics like F1 score or RMSE or assessing the results qualitatively by the subject matter experts. The machine learning evaluation metrics you choose should reflect the business metrics you want to optimize with the machine learning solution.
What is model selection?
Model selection is the process of choosing the best ML model for a given task. It is done by comparing various model candidates on chosen evaluation metrics calculated on a designed evaluation schema. Choosing the correct evaluation schema, whether a simple train-test split or a complex cross-validation strategy, is the crucial first step of building any machine learning solution.
How to evaluate machine learning models and select the best one?
We’ll dive into this deeper, but let me give you a quick step-by-step:
Step 1: Choose a proper validation strategy. I can’t stress this enough: without a reliable way to validate your model performance, no amount of hyperparameter tuning and state-of-the-art models will help you.
Step 2: Choose the right evaluation metric. Figure out the business case behind your model and try to use the machine learning metric that correlates with that. Typically no one metric is ideal for the problem.
So calculate multiple metrics and make your decisions based on that. Sometimes you need to combine classic ML metrics with a subject matter expert evaluation. And that is ok.
Step 3: Keep track of your experiment results. Whether you use a spreadsheet or a dedicated experiment tracker, make sure to log all the important metrics, learning curves, dataset versions, and configurations. You will thank yourself later.
Step 4: Compare experiments and pick a winner. Regardless of the metrics and validation strategy you choose, at the end of the day, you want to find the best model. No model is truly the best, but some are good enough.
So make sure to understand what is good enough for your problem, and once you hit that, move on to other parts of the project, like model deployment or pipeline orchestration.
Model selection in machine learning (choosing model validation strategy)
Resampling methods
Resampling methods, as the name suggests, are simple techniques of rearranging data samples to inspect if the model performs well on data samples that it has not been trained on. In other words, resampling helps us understand if the model will generalize well.
Random Split
Random Splits are used to randomly sample a percentage of data into training, testing, and preferably validation sets. The advantage of this method is that there is a good chance that the original population is well represented in all the three sets. In more formal terms, random splitting will prevent a biased sampling of data.
It is important to understand the role of the validation set in model selection. In this article’s terminology, the validation set is a second test set, and one might ask: why have two test sets?
During feature selection and model tuning, the test set is used for model evaluation. This means that the model parameters and the feature set are selected such that they give an optimal result on the test set, so the score on that set becomes optimistically biased. The validation set, which contains completely unseen data points (not used during tuning or feature selection), is therefore used for the final evaluation. Note that many texts swap the two names: they tune on the “validation” set and reserve the “test” set for the final check.
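A three-way random split can be sketched in a few lines of plain Python. This is only an illustration: the function name `random_split`, the 60/20/20 proportions, and the fixed seed are assumptions made for the example, not something prescribed above.

```python
import random

def random_split(n_samples, tune_frac=0.2, final_frac=0.2, seed=0):
    """Shuffle the indices 0..n_samples-1 and cut them into three disjoint sets."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)      # random order, reproducible via seed
    n_tune = round(n_samples * tune_frac)
    n_final = round(n_samples * final_frac)
    tune_idx = indices[:n_tune]                   # used while tuning and selecting features
    final_idx = indices[n_tune:n_tune + n_final]  # touched once, for the final evaluation
    train_idx = indices[n_tune + n_final:]        # everything else is used for fitting
    return train_idx, tune_idx, final_idx

train_idx, tune_idx, final_idx = random_split(100)
```

Whichever of the two held-out sets you call "validation" and which one "test", the important part is that one of them is never looked at while tuning.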
Time-Based Split
There are some types of data where random splits are not possible. For example, if we have to train a model for weather forecasting, we cannot randomly divide the data into training and testing sets. This will jumble up the seasonal pattern! Such data is often referred to as time series.
In such cases, a time-wise split is used. The training set can have data for the last three years and 10 months of the present year. The last two months can be reserved for the testing or validation set.
There is also a concept of window sets, where the model is trained up to a particular date and tested on the dates that follow, iteratively, such that the training window keeps growing by one day (consequently, the test set also shrinks by a day). The advantage of this method is that it stabilizes the model and prevents overfitting when the test set is very small (say, 3 to 7 days).
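Here is a minimal sketch of one window-set variant, assuming the data points are already sorted by time. In this variant the training window grows while a test window of fixed size slides forward; the name `expanding_window_splits` and its parameters are hypothetical.

```python
def expanding_window_splits(n_points, initial_train, test_size=1):
    """Yield (train_indices, test_indices) pairs in chronological order."""
    train_end = initial_train
    while train_end + test_size <= n_points:
        train_idx = list(range(0, train_end))                      # everything up to the cut-off
        test_idx = list(range(train_end, train_end + test_size))   # the period right after it
        yield train_idx, test_idx
        train_end += test_size                                     # move the cut-off forward

# 10 daily observations, start with 7 days of training data, test on 1 day at a time.
splits = list(expanding_window_splits(n_points=10, initial_train=7, test_size=1))
```

Every test index is later than every training index, so the model is never evaluated on data from its own past.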
However, the drawback of time-series data is that the events or data points are not mutually independent. One event might affect every data input that follows.
For instance, a change in the governing party might considerably change the population statistics for the years to follow. Or the infamous coronavirus pandemic is going to have a massive impact on economic data for the next few years.
A machine learning model trained on past data struggles in such a case because the data points before and after the event differ substantially.
K-Fold Cross-Validation
The cross-validation technique works by randomly shuffling the dataset and then splitting it into k groups. Each group in turn is held out as the test set while all the other groups are combined into the training set. The model is trained on the training groups, tested on the held-out group, and the process is repeated for all k groups.
Thus, by the end of the process, one has k different results on k different test groups. These k scores are usually averaged into a single estimate for each candidate model, and the candidate with the best average score is selected.
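The index bookkeeping behind k-fold can be sketched as follows. The helper name `k_fold_indices` is made up for this example; in practice you would normally use a library implementation.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)          # shuffle once, up front
    groups = [indices[i::k] for i in range(k)]    # k roughly equal groups
    for i in range(k):
        test_idx = groups[i]                      # group i is held out
        train_idx = [j for g, group in enumerate(groups) if g != i for j in group]
        yield train_idx, test_idx

splits = list(k_fold_indices(n_samples=10, k=5))
```

Each data point lands in the test set exactly once, which is what makes the averaged score use all of the data.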
Stratified K-Fold
The process for stratified K-Fold is similar to that of K-Fold cross-validation with one single point of difference: unlike in k-fold cross-validation, the values of the target variable are taken into consideration in stratified k-fold.
If, for instance, the target variable is a categorical variable with 2 classes, then stratified k-fold ensures that each test fold contains the two classes in roughly the same ratio as the training set.
This makes the model evaluation more accurate and the model training less biased.
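One simple way to get this behavior is to deal the members of each class into the folds round-robin. The sketch below assumes exactly that strategy; the function name `stratified_fold_assignment` and the toy labels are hypothetical.

```python
import random
from collections import defaultdict

def stratified_fold_assignment(labels, k, seed=0):
    """Return a list with the fold number (0..k-1) assigned to every sample."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)               # group sample indices by class
    rng = random.Random(seed)
    fold_of = [None] * len(labels)
    for idxs in by_class.values():
        rng.shuffle(idxs)
        for position, idx in enumerate(idxs):
            fold_of[idx] = position % k           # deal class members round-robin
    return fold_of

# Toy target: 10 positives and 40 negatives (a 1:4 class ratio).
labels = [1] * 10 + [0] * 40
fold_of = stratified_fold_assignment(labels, k=5)
positives_per_fold = [
    sum(1 for i, f in enumerate(fold_of) if f == fold and labels[i] == 1)
    for fold in range(5)
]
```

With these toy labels every fold ends up with 2 positives and 8 negatives, the same 1:4 ratio as the full dataset.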
Bootstrap
Bootstrap is one of the most powerful ways to obtain a stabilized model. It is close to the random splitting technique since it follows the concept of random sampling.
The first step is to select a sample size (which is usually equal to the size of the original dataset). Thereafter, a data point is randomly selected from the original dataset and added to the bootstrap sample. After the addition, the data point is put back into the original dataset. This process is repeated N times, where N is the sample size.
Therefore, it is a resampling technique that creates the bootstrap sample by sampling data points from the original dataset with replacement. This means that the bootstrap sample can contain multiple instances of the same data point.
The model is trained on the bootstrap sample and then evaluated on all those data points that did not make it to the bootstrapped sample. These are called the out-of-bag samples.
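Sampling with replacement and collecting the left-over points takes only a few lines. This is a sketch with a hypothetical helper name, `bootstrap_split`:

```python
import random

def bootstrap_split(n_samples, seed=0):
    """Draw n_samples indices with replacement; return them plus the unused indices."""
    rng = random.Random(seed)
    sample_idx = [rng.randrange(n_samples) for _ in range(n_samples)]  # duplicates allowed
    chosen = set(sample_idx)
    oob_idx = [i for i in range(n_samples) if i not in chosen]         # never drawn
    return sample_idx, oob_idx

sample_idx, oob_idx = bootstrap_split(1000)
```

For large datasets roughly 1/e, or about 37%, of the points end up out of the bag on average, so there is usually plenty of data left to evaluate on.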
NOTE FROM THE PRODUCT TEAM
Comparing and evaluating cross-validated results can get tricky.
Healthcare startup Theta Tech AI uses Neptune experiment tracker for that.
“Grouping by validation set is super important to us, and many other people would benefit from using the grouping features with validation.” – Dr. Robert Toth, Founder of Theta Tech AI
Probabilistic measures
Probabilistic Measures do not just take into account the model performance but also the model complexity. Model complexity is the measure of the model’s ability to capture the variance in the data.
For example, a highly biased model like the linear regression algorithm is less complex and on the other hand, a neural network is very high on complexity.
Another important point to note here is that the model performance taken into account in probabilistic measures is calculated from the training set only. A hold-out test set is typically not required.
A notable disadvantage, however, is that probabilistic measures do not consider the uncertainty of the models and tend to favor simpler models over complex ones.
Akaike Information Criterion (AIC)
It is common knowledge that no model is completely accurate. There is always some information loss, which can be measured using the KL information metric. Kullback-Leibler or KL divergence is a measure of the difference between two probability distributions.
A statistician, Hirotugu Akaike, took into consideration the relationship between KL Information and Maximum Likelihood (in maximum likelihood, one wishes to maximize the conditional probability of observing a data point X, given the parameters and a specified probability distribution) and developed the concept of Information Criterion (or IC). Therefore, Akaike’s IC or AIC is the measure of information loss. This is how the discrepancy between two different models is captured and the model with the least information loss is suggested as the model of choice.
 K = number of independent variables or predictors
 L = maximum likelihood of the model
 N = number of data points in the training set (especially helpful in case of small datasets)
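With the symbols listed above, AIC can be written as follows. Treat this as a sketch: the division by N is one normalization convention, assumed here because N appears in the symbol list, while many references use the unnormalized form 2K - 2 ln(L).

```latex
\mathrm{AIC} = \frac{2K - 2\ln(L)}{N}
```

The first term grows with the number of predictors and the second shrinks as the fit improves, so a lower AIC is better.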
The limitation of AIC is that it is not very good with generalizing models as it tends to select complex models that lose less training information.
Bayesian Information Criterion (BIC)
BIC was derived from the Bayesian probability concept and is suited for models that are trained under the maximum likelihood estimation.
 K = number of independent variables
 L = maximum likelihood
 N = number of samples/data points in the training set
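Using the same symbols, the standard form of BIC is:

```latex
\mathrm{BIC} = K\ln(N) - 2\ln(L)
```

Each extra parameter costs ln(N) here instead of a constant 2, so for any dataset with 8 or more points (ln(N) > 2) BIC punishes complexity harder than the unnormalized AIC does.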
BIC penalizes the model for its complexity and is preferably used when the size of the dataset is not very small (otherwise it tends to settle on very simple models).
Minimum Description Length (MDL)
MDL is derived from the Information theory which deals with quantities such as entropy that measure the average number of bits required to represent an event from a probability distribution or a random variable.
MDL or the minimum description length is the minimum number of such bits required to represent the model.
 h = model
 D = predictions made by the model
 L(h) = number of bits required to represent the model
 L(D | h) = number of bits required to represent the predictions from the model
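Putting the two code lengths together gives the usual two-part form of MDL, sketched here with the symbols defined above:

```latex
\mathrm{MDL} = L(h) + L(D \mid h)
```

The model that minimizes this sum is preferred: a more complex model pays more bits in L(h) and has to earn them back through a shorter L(D | h).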
Structural Risk Minimization (SRM)
Machine learning models face the inevitable problem of defining a generalized theory from a set of finite data. This leads to cases of overfitting where the model gets biased to the training data which is its primary learning source. SRM tries to balance out the model’s complexity against its success at fitting on the data.
How to evaluate ML models (choosing performance metrics)
Models can be evaluated using multiple metrics. However, the right choice of an evaluation metric is crucial and often depends upon the problem that is being solved. A clear understanding of a wide range of metrics can help the evaluator to chance upon an appropriate match of the problem statement and a metric.
Classification metrics
For every classification model prediction, a matrix called the confusion matrix can be constructed which demonstrates the number of test cases correctly and incorrectly classified.
It looks something like this (considering 1 Positive and 0 Negative are the target classes):

              Actual 0               Actual 1
Predicted 0   True Negatives (TN)    False Negatives (FN)
Predicted 1   False Positives (FP)   True Positives (TP)
 TN: Number of negative cases correctly classified
 TP: Number of positive cases correctly classified
 FN: Number of positive cases incorrectly classified as negative
 FP: Number of negative cases incorrectly classified as positive
Accuracy
Accuracy is the simplest metric and can be defined as the number of test cases correctly classified divided by the total number of test cases.
It can be applied to most generic problems but is not very useful when it comes to unbalanced datasets.
For instance, if we are detecting frauds in bank data, the ratio of fraud to non-fraud cases can be 1:99. In such cases, if accuracy is used, the model will turn out to be 99% accurate by predicting all test cases as non-fraud. The 99% accurate model will be completely useless.
If a model is poorly trained such that it predicts all the 1000 (say) data points as non-frauds, it will be missing out on the 10 fraud data points. If accuracy is measured, it will show that the model correctly predicts 990 data points and thus, it will have an accuracy of (990/1000)*100 = 99%!
This is why accuracy can be a misleading indicator of the model’s health on imbalanced data.
Therefore, for such a case, a metric is required that can focus on the ten fraud data points which were completely missed by the model.
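The fraud example is easy to reproduce. The sketch below builds the 1000 labels, lets a "model" predict non-fraud for everything, and counts the confusion matrix cells; the helper name `confusion_counts` is hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels (1 = positive, 0 = negative)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

y_true = [1] * 10 + [0] * 990     # 10 frauds among 1000 transactions
y_pred = [0] * 1000               # a "model" that always predicts non-fraud

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)     # share of all cases classified correctly
frauds_caught = tp / (tp + fn)         # share of the actual frauds that were found
```

Accuracy comes out at 0.99 while the share of frauds caught is 0, which is exactly the gap a better metric has to expose.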
Precision
Precision is the metric used to identify the correctness of positive classifications.
Intuitively, it is the ratio of correct positive classifications to the total number of predicted positive classifications, i.e. TP / (TP + FP). The greater the fraction, the higher the precision, which means the better the ability of the model to correctly classify the positive class.
In the problem of predictive maintenance (where one must predict in advance when a machine needs to be repaired), precision comes into play. The cost of maintenance is usually high and thus, incorrect predictions can lead to a loss for the company. In such cases, the ability of the model to correctly classify the positive class and to lower the number of false positives is paramount!
Recall
Recall is the fraction of actual positive cases that the model correctly identifies, i.e. TP / (TP + FN).
Going back to the fraud problem, the recall value will be very useful in fraud cases because a high recall value will indicate that a lot of fraud cases were identified out of the total number of frauds.
F1 Score
F1 score is the harmonic mean of Recall and Precision and therefore, balances out the strengths of each.
It is useful in cases where both recall and precision can be valuable – like in the identification of plane parts that might require repairing. Here, precision will be required to save on the company’s cost (because plane parts are extremely expensive) and recall will be required to ensure that the machinery is stable and not a threat to human lives.
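All three metrics follow directly from the confusion matrix counts. Here is a small sketch; the function name and the example counts are made up for illustration.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from confusion matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0   # correct share of predicted positives
    recall = tp / (tp + fn) if (tp + fn) else 0.0      # found share of actual positives
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    return precision, recall, f1

# Hypothetical counts: 8 true positives, 2 false positives, 8 false negatives.
precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=8)
```

Here precision is 0.8 and recall is 0.5, but F1 is about 0.615 rather than the arithmetic mean of 0.65: the harmonic mean is pulled toward the weaker of the two.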
AUC-ROC
The ROC curve is a plot of the true positive rate (recall) against the false positive rate (FP / (FP + TN)). AUC-ROC stands for Area Under the Receiver Operating Characteristic curve, and the higher the area, the better the model performance.
If the curve is somewhere near the 50% diagonal line, it suggests that the model randomly predicts the output variable.
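A handy way to think about the area: it equals the probability that a randomly chosen positive case gets a higher score than a randomly chosen negative one (ties count as half). The sketch below computes it that way, by brute force over all pairs. That is fine for illustration but slow for large datasets, and the function name `roc_auc` is just a label for this example.

```python
def roc_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked in the right order."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0       # positive ranked above negative
            elif p == n:
                wins += 0.5       # a tie counts as half
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(y_true, scores)
```

Three of the four positive-negative pairs are ordered correctly, so the result is 0.75. A model that scores at random would hover around 0.5.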
Related article
F1 Score vs ROC AUC vs Accuracy vs PR AUC: Which Evaluation Metric Should You Choose?
Log Loss
Log loss is a very effective classification metric and is equivalent to -1 * log(likelihood function), where the likelihood function suggests how likely the model thinks the observed set of outcomes was.
Since the likelihood function provides very small values, a better way to interpret them is to take the log of the values; the negative sign reverses the order of the metric such that a lower loss score suggests a better model.
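For binary labels this boils down to a few lines. The sketch below averages the negative log-likelihood over the samples, which is the common normalized variant, and clips probabilities to avoid log(0); the clipping constant `eps` is an assumption of this example.

```python
import math

def log_loss(y_true, probs, eps=1e-15):
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    total = 0.0
    for t, p in zip(y_true, probs):
        p = min(max(p, eps), 1 - eps)                        # keep log() finite
        total += t * math.log(p) + (1 - t) * math.log(1 - p)
    return -total / len(y_true)

confident_right = log_loss([1, 0], [0.9, 0.1])   # high probability on the true class
confident_wrong = log_loss([1, 0], [0.1, 0.9])   # high probability on the wrong class
```

The confident and correct predictions score about 0.105, the confident and wrong ones about 2.303, which shows how heavily log loss punishes misplaced confidence.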
Gain and Lift Charts
Gain and lift charts are tools that evaluate model performance just like the confusion matrix but with a subtle, yet significant difference. The confusion matrix determines the performance of the model on the whole population or the entire test set, whereas the gain and lift charts evaluate the model on portions of the whole population. Therefore, we have a score (y-axis) for every % of the population (x-axis).
Lift charts measure the improvement that a model brings in compared to random predictions. The improvement is referred to as the ‘lift’.
KS Chart
The KS chart or Kolmogorov-Smirnov chart determines the degree of separation between two distributions: the positive class distribution and the negative class distribution. The higher the difference, the better the model is at separating the positive and negative cases.
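The number usually read off such a chart is the KS statistic: the largest gap between the cumulative score distributions of the two classes. Here is a minimal sketch, assuming the model outputs one score per case; the function name `ks_statistic` is hypothetical.

```python
def ks_statistic(y_true, scores):
    """Largest gap between the cumulative score distributions of the two classes."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    largest_gap = 0.0
    for threshold in sorted(set(scores)):
        cdf_pos = sum(1 for s in pos if s <= threshold) / len(pos)
        cdf_neg = sum(1 for s in neg if s <= threshold) / len(neg)
        largest_gap = max(largest_gap, abs(cdf_pos - cdf_neg))
    return largest_gap

separated = ks_statistic([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9])    # classes do not overlap
overlapping = ks_statistic([0, 1, 0, 1], [0.1, 0.2, 0.8, 0.9])  # classes are interleaved
```

The cleanly separated scores give 1.0, the interleaved ones only 0.5.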
Regression metrics
Regression models provide a continuous output variable, unlike classification models that have discrete output variables. Therefore, the metrics for assessing the regression models are accordingly designed.
Mean Squared Error or MSE
MSE is a simple metric that calculates the difference between the actual value and the predicted value (error), squares it and then provides the mean of all the errors.
MSE is very sensitive to outliers and will show a very high error value even if a few outliers are present in the otherwise well-fitted model predictions.
Root Mean Squared Error or RMSE
RMSE is the root of MSE and is beneficial because it helps to bring down the scale of the errors closer to the actual values, making it more interpretable.
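Both metrics are short enough to write out by hand. The values below are made up to show the outlier effect:

```python
import math

def mse(actual, predicted):
    """Mean of the squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Square root of MSE, expressed in the same units as the target."""
    return math.sqrt(mse(actual, predicted))

actual = [10.0, 12.0, 14.0, 16.0]
close_fit = [11.0, 11.0, 15.0, 15.0]      # every prediction is off by 1
one_outlier = [10.0, 12.0, 14.0, 26.0]    # three perfect predictions, one off by 10

mse_close = mse(actual, close_fit)
mse_outlier = mse(actual, one_outlier)
rmse_outlier = rmse(actual, one_outlier)
```

A single error of 10 pushes MSE from 1.0 to 25.0, even though three of the four predictions are perfect. RMSE brings that back to 5.0, which is on the same scale as the target.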
NOTE FROM THE PRODUCT TEAM
It is often a good idea to log more metrics than you need for every model training run.
But then you have to visualize and compare them. That is where tools for experiment tracking help.
You can use neptune.ai to visualize, organize, and compare all your model metrics in one place.
For more:
– See 2-min product walkthrough
– See the experiment comparison docs
– Learn about our experiment tracking product
– Check out videos, projects, and other resources about comparing experiments and models
Mean Absolute Error or MAE
MAE is the mean of the absolute error values (actuals – predictions).
If one wants to ignore the outlier values to a certain degree, MAE is the choice since it reduces the penalty of the outliers significantly with the removal of the square terms.
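To see the difference, here is MAE on a small made-up example with one large error, next to the mean squared error of the same predictions:

```python
def mae(actual, predicted):
    """Mean of the absolute differences between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

actual = [10.0, 12.0, 14.0, 16.0]
one_outlier = [10.0, 12.0, 14.0, 26.0]    # three perfect predictions, one off by 10

mae_outlier = mae(actual, one_outlier)
squared_outlier = sum((a - p) ** 2 for a, p in zip(actual, one_outlier)) / len(actual)
```

The one bad prediction contributes 10 to the MAE sum but 100 to the squared-error sum, so MAE ends up at 2.5 where the mean squared error is 25.0.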
Root Mean Squared Log Error or RMSLE
In RMSLE, the same equation as that of RMSE is followed, except that the log function is applied to the actual and predicted values before the difference is taken.
x is the actual value and y is the predicted value. This helps to scale down the effect of the outliers by downplaying the higher error rates with the log function. Also, because a difference of logs is the log of a ratio, RMSLE captures a relative error rather than an absolute one.
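With x as the actual value and y as the predicted value, the usual definition looks like this. The symbol N for the number of data points and the +1 inside the logs (the common convention that keeps log(0) out of the picture) are assumptions of this sketch.

```latex
\mathrm{RMSLE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\bigl(\log(x_i + 1) - \log(y_i + 1)\bigr)^2}
```

Since log(x + 1) - log(y + 1) = log((x + 1)/(y + 1)), predicting 110 for an actual 100 costs about as much as predicting 1100 for an actual 1000.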
R-Squared
R-squared helps to identify the proportion of variance of the target variable that can be captured with the help of the independent variables or predictors.
R-squared, however, has a serious problem. Say a new unrelated feature is added to a model with an assigned weight of w. If the model finds absolutely no correlation between the new predictor and the target variable, w is 0. However, there is almost always a small correlation due to randomness, which adds a small nonzero weight (w ≠ 0), and a new loss minimum is achieved due to overfitting.
This is why R-squared, measured on the training data, never decreases when a new feature is added. Thus, its inability to decrease in value when new features are added limits its ability to identify whether the model did better with fewer features.
Adjusted R-Squared
Adjusted R-squared addresses this problem of R-squared: it penalizes the score as more features are added, so it can decrease when a feature is added.
In the usual formula, 1 - (1 - R²)(n - 1)/(n - k - 1) for n data points and k features, the denominator n - k - 1 is the magic element: it shrinks as the number of features increases, which inflates the penalty term. Therefore, a significant increase in R² is required to increase the overall value.
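Both scores can be sketched in a few lines using the standard definitions. The sample size, feature counts, and the R² value of 0.80 below are hypothetical.

```python
def r_squared(actual, predicted):
    """1 minus the ratio of residual variation to total variation."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

def adjusted_r_squared(r2, n_samples, n_features):
    """Penalize R-squared for the number of features used."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

# Same R-squared of 0.80 on 50 data points, reached with 3 versus 10 features.
r2 = 0.80
adj_3 = adjusted_r_squared(r2, n_samples=50, n_features=3)
adj_10 = adjusted_r_squared(r2, n_samples=50, n_features=10)
```

The same raw score is worth about 0.787 with 3 features but only about 0.749 with 10, so the extra features have to pay for themselves.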
Clustering metrics
Clustering algorithms predict groups of data points and hence, distance-based metrics are most effective.
Dunn Index
Dunn Index focuses on identifying clusters that have low variance (among all members in the cluster) and are compact. The mean values of the different clusters also need to be far apart.
 δ(Xi, Xj) is the inter-cluster distance, i.e. the distance between clusters Xi and Xj
 ∆(Xk) is the intra-cluster distance of cluster Xk, i.e. the distance within the cluster Xk
However, the disadvantage of Dunn index is that with a higher number of clusters and more dimensions, the computation cost increases.
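In terms of the inter-cluster and intra-cluster distances, the Dunn index for m clusters is usually written as below (m is a symbol introduced here for the number of clusters):

```latex
DI = \frac{\min_{1 \le i < j \le m} \delta(X_i, X_j)}{\max_{1 \le k \le m} \Delta(X_k)}
```

A higher value is better: the closest pair of clusters is still far apart relative to the widest cluster. Evaluating the minimum over all cluster pairs is also where the computation cost comes from.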
Silhouette Coefficient
Silhouette Coefficient tracks how close every point in one cluster is to the points in the other clusters, in the range of -1 to +1:
 Higher Silhouette values (closer to +1) indicate that the sample points from two different clusters are far away.
 0 indicates that the points are close to the decision boundary
 and values closer to -1 suggest that the points have been incorrectly assigned to the cluster.
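For a single point, the standard silhouette value is (b - a) / max(a, b), where a is the mean distance to the other members of its own cluster and b is the mean distance to the nearest other cluster. The sketch below uses one-dimensional toy data and hypothetical helper names.

```python
def mean_distance(point, others):
    """Mean absolute distance from a point to a list of 1-D points."""
    return sum(abs(point - o) for o in others) / len(others)

def silhouette_value(point, own_cluster_rest, other_clusters):
    """Silhouette of one point; own_cluster_rest excludes the point itself."""
    a = mean_distance(point, own_cluster_rest)                         # cohesion
    b = min(mean_distance(point, cluster) for cluster in other_clusters)  # separation
    return (b - a) / max(a, b)

# A point sitting among its own cluster members, far from the other cluster.
well_placed = silhouette_value(1.0, [1.5, 2.0], [[10.0, 11.0]])
# A point assigned to the far-away cluster although it sits inside the other one.
misplaced = silhouette_value(10.5, [1.0, 2.0], [[10.0, 11.0]])
```

The well-placed point scores about 0.92 and the misplaced one about -0.94. The silhouette coefficient of a clustering is the mean of these values over all points.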
Elbow method
The elbow method is used to determine the number of clusters in a dataset by plotting the number of clusters on the x-axis against the percentage of variance explained on the y-axis. The point on the x-axis where the curve suddenly bends (the elbow) is considered to suggest the optimal number of clusters.
Tradeoffs in ML model selection
Bias vs variance
On a high level, Machine Learning is the union of statistics and computation. The crux of machine learning revolves around the concept of algorithms or models, which are in fact, statistical estimations on steroids.
However, any given model has several limitations depending on the data distribution. None of them can be entirely accurate since they are just estimations (even if on steroids). These limitations are popularly known by the name of bias and variance.
A model with high bias will oversimplify by not paying much attention to the training points (e.g.: in Linear Regression, irrespective of data distribution, the model will always assume a linear relationship).
Bias occurs when a model is strictly ruled by assumptions – like the linear regression model assumes that the relationship of the output variable with the independent variables is a straight line. This leads to underfitting when the actual values are nonlinearly related to the independent variables.
A model with high variance will restrict itself to the training data by not generalizing for test points that it hasn’t seen before (e.g.: Random Forest with max_depth = None).
Variance is high when a model focuses on the training set too much and learns the variations very closely, compromising on generalization. This leads to overfitting.
The issue arises when the limitations are subtle, like when we have to choose between a random forest algorithm and a gradient boosting algorithm or between two variations of the same decision tree algorithm. Both will tend to have high variance and low bias.
An optimal model is one that has the lowest bias and variance, and since these two attributes tend to be inversely related, the only way to achieve this is through a tradeoff between the two. Therefore, the model selection should be such that the bias and variance intersect like in the image below.
This can be achieved by iteratively tuning the hyperparameters of the model in use (Hyperparameters are the input parameters that are fed to the model functions). After every iteration, the model evaluation must take place with the use of a suitable metric.
Learning curves
The best way to track the progress of model training or buildup is to use learning curves. These curves help to identify the optimal points in a set of hyperparameter combinations and assist massively in the model selection and model evaluation process.
Typically, a learning curve is a way to track the learning or improvement in model performance on the y-axis and the time or experience on the x-axis.
The two most popular learning curves are:
 Training Learning Curve: It plots the evaluation metric score over time during a training process and thus helps to track the learning or progress of the model during training.
 Validation Learning Curve: In this curve, the evaluation metric score is plotted against time on the validation set.
Sometimes it might so happen that the training curve shows an improvement but the validation curve shows stunted performance.
This is indicative of the fact that the model is overfitting and needs to be reverted to the previous iterations. In other words, the validation learning curve identifies how well the model is generalizing.
Therefore, there is a tradeoff between the training learning curve and the validation learning curve and the model selection technique must rely upon the point where both the curves intersect and are at their lowest.
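Reverting to an earlier iteration can be automated by watching the validation curve. Here is a minimal sketch of that idea, often called early stopping, assuming a lower-is-better metric; the loss values and the `patience` parameter are made up for illustration.

```python
def best_epoch(val_losses, patience=2):
    """Return the epoch with the lowest validation loss, stopping the scan
    once the loss has failed to improve for `patience` epochs in a row."""
    best_idx, best_loss = 0, val_losses[0]
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_idx, best_loss = epoch, loss      # new best: remember it
        elif epoch - best_idx >= patience:
            break                                  # no improvement for too long
    return best_idx

# Hypothetical curves: training loss keeps falling, validation loss turns around.
train_losses = [0.90, 0.60, 0.45, 0.35, 0.28, 0.22]
val_losses = [0.95, 0.70, 0.58, 0.55, 0.60, 0.66]
stop_at = best_epoch(val_losses)
```

The training curve alone would suggest continuing to train, but the validation curve bottoms out at epoch 3 (counting from 0), so that is the checkpoint to keep.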
NOTE FROM THE PRODUCT TEAM
Monitoring learning curves and comparing them between model training runs is one of the core functionalities of experiment trackers.
You can use neptune.ai to see your learning curves live and stop unpromising runs before they finish.
For more:
– See 2-min product walkthrough
– See the docs about monitoring model training runs
– Learn about our experiment tracking product
– Check out videos, projects, and other resources about visualizing learning curves
What is next
Evaluating ML models and selecting the best-performing one is one of the main activities you do in pre-production.
Hopefully, with this article, you’ve learned how to properly set up a model validation strategy and then how to choose a metric for your problem.
You are ready to run a bunch of experiments and see what works.
With that comes another problem of keeping track of experiment parameters, datasets used, configs, and results.
And figuring out how to visualize and compare all of those models and results.
For that, you may want to check out:
 “ML Experiment Tracking: What It Is, Why It Matters, and How to Implement It”
 “15 Best Tools for ML Experiment Tracking and Management”
 “Visualizing Machine Learning Models: Guide and Tools”
 “How to Compare Machine Learning Models and Algorithms”
Other resources
Cross-validation and evaluation strategies from Kaggle competitions:
 Image Classification: Tips and Tricks From 13 Kaggle Competitions (+ Tons of References)
 Binary Classification: Tips and Tricks From 10 Kaggle Competitions
 Tabular Data Binary Classification: All Tips and Tricks from 5 Kaggle Competitions
 Text Classification: All Tips and Tricks from 5 Kaggle Competitions
 Image Segmentation: Tips and Tricks from 39 Kaggle Competitions
Evaluation metrics and visualization:
 Recommender Systems: Machine Learning Metrics and Business Metrics
 How to Track Machine Learning Model Metrics in Your Projects
 The Best Tools to Visualize Metrics and Hyperparameters of Machine Learning Experiments
 24 Evaluation Metrics for Binary Classification (And When to Use Them)
 How to Do Data Exploration for Image Segmentation and Object Detection (Things I Had to Learn the Hard Way)
Experiment tracking videos and real-world case studies:
 Selecting the best computer vision models at Brainly
 How to Use CI/CD to Automate the RL Evaluation Pipeline
 Comparing CI/CD pipeline runs at Continuum Industries
 How to Compare Images Between Runs
 Scaling ML research at AILS Labs
 Visualizing hyperparameter optimization studies at Theta Tech AI
 How to Monitor Model Training Runs Live