Machine learning has expanded rapidly in the last few years. Instead of simple, one-directional, or linear ML pipelines, data scientists and developers today run multiple parallel experiments, which can get overwhelming even for large teams. Each experiment is expected to be recorded in an immutable and reproducible format, which results in endless logs with invaluable details.
We need to narrow down the techniques by thoroughly comparing machine learning models across parallel experiments. A well-planned approach is necessary to understand how to choose the right combination of algorithms for the data at hand.
So, in this article, we’re going to explore how to approach comparing ML models and algorithms.
Comparing ML models is part of the broader process of tracking ML experiments.
Other than that, experiment tracking is about storing all the important data and metadata, debugging model training, and, generally, analyzing the results of experiments.
On the blog, you can find a whole piece on what experiment tracking is, written by Jakub Czakon, one of the co-founders of neptune.ai (which is actually an experiment tracking tool and the company behind this blog).
There’s also an article about the 15 best experiment tracking tools. Yes, neptune.ai is the first one on the list, but you can expect impartial content, including an extensive table comparing the features of all 15 tools.
The challenge of model selection
Each model or any machine learning algorithm has several features that process the data in different ways. Often the data that is fed to these algorithms is also different depending on previous experiment stages. But, since machine learning teams and developers usually record their experiments, there’s ample data available for comparison.
The challenge is to understand which parameters, data, and metadata must be considered to arrive at the final choice. It’s the classic paradox of having an overwhelming amount of details with no clarity.
Even more challenging, we need to understand if a parameter with a high value, say a higher metric score, actually means the model is better than one with a lower score, or if it’s only caused by statistical bias or misdirected metric design.
Comparing machine learning algorithms: why do we do it?
Comparing machine learning algorithms is important in itself, but there are some not-so-obvious benefits of comparing various experiments effectively. Let’s take a look at the goals of comparison:
 Better performance
The primary objective of model comparison and selection is definitely better performance of the machine learning software/solution. The objective is to narrow down on the best algorithms that suit both the data and the business requirements.
 Longer lifetime
High performance can be short-lived if the chosen model is tightly coupled with the training data and fails to interpret unseen data. So, it’s also important to find the model that understands underlying data patterns so that the predictions are long-lasting and the need for retraining is minimal.
 Easier retraining
When models are evaluated and prepared for comparison, minute details and metadata get recorded, which comes in handy during retraining. For example, if a developer can clearly retrace the reasons behind choosing a model, the causes of model failure will be easier to spot and retraining can start just as quickly.
 Speedy production
With the model details available at hand, it’s easy to narrow down on models that can offer high processing speed and use memory resources optimally. Also, during production, several parameters are required to configure the machine learning solutions. Having production-level data can be useful for easily aligning with the production engineers. Moreover, knowing the resource demands of different algorithms makes it easier to check their compliance and feasibility with respect to the organization’s allocated assets.
The team at ReSpo.Vision summed up really well what the goals of efficient comparison methods are and what benefits they bring:
Neptune made it much easier to compare the models and select the best one over the last couple of months, especially since we’ve been working on this player and team separation model in an unsupervised way, during a match, to split the players into two separate teams. Łukasz Grad, Chief Data Scientist at ReSpo.Vision
If we can choose the best performing model, then we can save time because we would need fewer integrations to ensure high data quality. Customers are much happier because they receive higher quality data, enabling them to perform more detailed match analytics. (…) If we know which models will be the best and how to choose the best parameters for them to run many pipelines, then we will just run fewer pipelines. This, in turn, will cause the compute time to be shorter, and then we save money by not running unnecessary pipelines that will deliver suboptimal results. Wojtek Rosiński, Chief Technology Officer at ReSpo.Vision
Parameters of machine learning algorithms and how to compare them
Let’s dive right into analyzing and understanding how to compare the different characteristics of algorithms that can be used to sort and choose the best machine learning models. I divided the comparable parameters into two high-level categories:
 development-based,
 and production-based parameters.
Development-based parameters
Statistical tests
On a fundamental level, machine learning models are statistical equations that run at great speed on multiple data points to arrive at a conclusion. Therefore, conducting statistical tests on the algorithms is critical to set them right and also to understand if the model’s equation is the right fit for the dataset at hand. Here’s a handful of popular statistical tests that can be used to set the grounds for comparison:
 Null hypothesis testing: Null hypothesis testing is used to determine if the differences in two data samples or metric performances are statistically significant, or more or less equal and caused only by noise or coincidence.
 ANOVA: Analysis of variance is similar to linear discriminant analysis, except that it uses one or more categorical features and one continuous target. It provides a statistical test of whether the means of the different groups are equal.
 Chi-Square: It’s a statistical test that can be used on groups of categorical features to evaluate the likelihood of association or correlation with the help of frequency distributions.
 Student’s t-test: It compares the means of different samples from normal distributions when the standard deviation is unknown, to determine if the differences are statistically significant.
 Ten-fold cross-validation: 10-fold cross-validation compares the performance of each algorithm on the same data splits, configured with the same random seed so that every algorithm is tested on identical folds. Next, a hypothesis test like Student’s paired t-test should be applied to check whether the differences in metrics between the two models are statistically significant.
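The last item can be sketched in a few lines of plain Python. This is a minimal illustration, not a full evaluation pipeline: the helper names (kfold_indices, paired_t_statistic) are hypothetical, the per-fold accuracies are made up for the example, and the critical value 2.262 assumes a two-sided test at the 5% level with 9 degrees of freedom (10 folds).

```python
import math
import random
import statistics


def kfold_indices(n_samples, k=10, seed=42):
    """Split range(n_samples) into k shuffled folds; the seed fixes the split."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def paired_t_statistic(scores_a, scores_b):
    """t-statistic of the per-fold score differences (paired Student's t-test)."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean_diff = statistics.mean(diffs)
    std_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    return mean_diff / (std_diff / math.sqrt(len(diffs)))


# Both models must be scored on the same folds, so the folds are built once.
folds = kfold_indices(n_samples=200, k=10, seed=42)

# Hypothetical per-fold accuracies, made up purely for illustration.
model_a = [0.81, 0.79, 0.84, 0.80, 0.83, 0.78, 0.82, 0.80, 0.85, 0.79]
model_b = [0.78, 0.77, 0.80, 0.79, 0.80, 0.77, 0.79, 0.78, 0.81, 0.78]

t_stat = paired_t_statistic(model_a, model_b)
T_CRITICAL = 2.262  # two-sided, alpha = 0.05, 9 degrees of freedom
print(f"t = {t_stat:.2f}, significant at 5%: {abs(t_stat) > T_CRITICAL}")
```

In a real comparison, each entry of model_a and model_b would come from training and scoring the models on the corresponding fold.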
Model features and objectives
To choose the best machine learning model for a given dataset, it’s essential to consider the features or parameters of the model. The parameters and model objectives help to gauge the model’s flexibility, assumptions, and learning style.
For example, if two linear regression models are compared, one model might minimize the mean squared error, whereas another might minimize the mean absolute error through its objective function. To understand which model is a better fit, we need to decide whether the outliers in the data should influence the results. The mean squared error penalizes large errors more heavily, so the fit is pulled toward outliers, while the mean absolute error is more robust to them. If the outliers are noise that shouldn’t drive the predictions, the second model, with the mean absolute error as its objective function, will be the right choice. If large deviations must be taken seriously, the mean squared error is the better fit.
Similarly for classification, if two models (for example, decision tree and random forest) are considered, then the primary basis for comparison will be the degree of generalization that the model can achieve. A decision tree model with just one tree will have a limited ability to reduce variance through the max_depth parameter, whereas random forest will have an extended ability to bring generalization via both max_depth and n_estimators parameters.
There are several other behavioural features of the model that can be taken into account, like the type of assumptions made by the model, parametricity, speed, learning styles (tree-based vs non-tree-based), and more.
You can use parallel coordinates to see how different model parameters affect the metrics. Here’s what it looks like in Neptune:
Learning curves
Learning curves can help in determining if a model is on the correct learning trajectory toward a good bias-variance tradeoff. They also provide a basis for comparing different machine learning models: a model with stable learning curves across both training and validation sets is likely to perform well over a longer period on unseen data.
Bias is the error introduced by the simplifying assumptions a machine learning model makes to make the learning process easier. Variance is the measure of how much the estimated target variable will change with a change in training data. The ultimate goal is to reduce both bias and variance to a minimum: a state of high stability with few assumptions.
Bias and variance are inversely related: reducing one typically increases the other, so the aim is to find the point where their combined error is lowest. One way to check whether a model has reached a good tradeoff is to see if its performance on the training and testing datasets is similar.
The best way to track the progress of model training is to use learning curves. These curves help to identify the optimal combinations of hyperparameters and assist massively in model selection and model evaluation. Typically, a learning curve tracks the learning or improvement in model performance on the y-axis and the time or experience on the x-axis.
The two most popular learning curves are:
 Training learning curve – It effectively plots the evaluation metric score over time during a training process, thus helping to track the learning or progress of the model during training.
 Validation learning curve – In this curve, the evaluation metric score is plotted against time on the validation set.
Sometimes the training curve shows an improvement while the validation curve stalls or gets worse. This indicates that the model is overfitting and should be reverted to an earlier iteration. In other words, the validation learning curve shows how well the model is generalizing.
Therefore, there’s a tradeoff between the training learning curve and the validation learning curve, and model selection should rely on the point where the validation curve is at its best while the gap to the training curve is still small.
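One simple way to act on these curves is to pick the epoch with the lowest validation loss and look at how far it sits from the training loss at that point. The sketch below assumes the curves are loss values (lower is better); the function names and the numbers are hypothetical and only illustrate the idea.

```python
def best_epoch(val_losses):
    """Index of the epoch with the lowest validation loss."""
    return min(range(len(val_losses)), key=val_losses.__getitem__)


def generalization_gaps(train_losses, val_losses):
    """Validation loss minus training loss for every epoch."""
    return [v - t for t, v in zip(train_losses, val_losses)]


# Hypothetical loss values per epoch, made up for illustration.
# Training loss keeps falling, validation loss turns upward after epoch 3.
train_losses = [0.90, 0.60, 0.45, 0.35, 0.28, 0.22, 0.18]
val_losses = [0.95, 0.68, 0.55, 0.50, 0.52, 0.57, 0.63]

epoch = best_epoch(val_losses)
gaps = generalization_gaps(train_losses, val_losses)
print(f"best epoch: {epoch}, gap at that epoch: {gaps[epoch]:.2f}")
print(f"gap at the last epoch: {gaps[-1]:.2f}")  # a widening gap hints at overfitting
```

When comparing two models, the same two numbers (best validation loss and the gap at that epoch) give a compact summary of each learning curve.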
Here’s an example of comparing learning curves in Neptune using the charts view:
Loss functions and metrics
Often there’s confusion between loss functions and metric functions. Loss functions are used for model optimization or model tuning, whereas metric functions are used for model evaluation and selection. However, for regression, where accuracy can’t be calculated, the same functions are often used both as the metric for evaluating performance and as the loss for optimization.
Loss functions are passed as arguments to the models so that the models can be tuned to minimize them. The loss function imposes a high penalty when the model makes an incorrect judgement.
1. Loss functions and metrics for regression:
 Mean Square Error: measures the average of the squares of the errors or deviations, that is, the differences between the estimated and true values. Because the errors are squared, it puts a higher weight on large errors, which makes it sensitive to outliers.
 Mean Absolute Error: It’s the average of the absolute differences between the estimated and true values. It puts less weight on outlier errors than the mean squared error.
 Smooth Absolute Error: It’s the square of the difference between the estimated and true values for predictions lying close to the real value, and the absolute difference for the outliers (points far off from the predicted values). Essentially, it’s a combination of MSE and MAE.
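To see how differently these three losses react to a single large miss, here is a small sketch in plain Python. The smooth absolute error is written in its common Huber form; the threshold delta=1.0 and the sample values are assumptions chosen for illustration.

```python
def mse(y_true, y_pred):
    """Mean squared error: large errors are squared, so they dominate."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean absolute error: every error counts in proportion to its size."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def smooth_absolute_error(y_true, y_pred, delta=1.0):
    """Huber-style loss: quadratic for small errors, linear beyond delta."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        err = abs(t - p)
        if err <= delta:
            total += 0.5 * err ** 2
        else:
            total += delta * (err - 0.5 * delta)
    return total / len(y_true)


# Made-up values: three good predictions and one large miss (error of 10).
y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.5, 5.0, 2.0, 17.0]

print("MSE:   ", mse(y_true, y_pred))                    # 25.0625
print("MAE:   ", mae(y_true, y_pred))                    # 2.625
print("Smooth:", smooth_absolute_error(y_true, y_pred))  # 2.40625
```

The single outlier accounts for almost all of the MSE, while MAE and the smooth variant grow only linearly with it.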
2. Loss functions for classification:
 0-1 loss function: This is like counting the number of misclassified samples. There’s nothing more to it. It can easily be determined from a confusion matrix, which shows the number of misclassifications and correct classifications. It’s designed to penalize misclassifications and to assign the smallest loss to the solution that has the greatest number of correct classifications.
 Hinge loss function (L2 regularized): The hinge loss is used for maximum-margin classification, most notably for support vector machines (SVMs). It assigns zero loss to points that are correctly classified and lie outside the margin, and a loss that grows with the distance for points inside the margin or on the wrong side of the decision boundary. The squared hinge variant squares this distance.
 Logistic Loss: This function displays a similar convergence rate to the hinge loss function, and since it’s differentiable everywhere (unlike hinge loss), gradient descent methods can be used. However, the logistic loss function doesn’t assign zero penalty to any point. Instead, functions that correctly classify points with high confidence are penalized less. This structure makes the logistic loss function sensitive to outliers in the data.
 Cross entropy/log loss: Measures the performance of a classification model whose output is a probability value between 0 and 1. Cross-entropy loss increases as the predicted probability diverges from the actual label.
There are several other loss functions that can be used to optimize machine learning models. The ones mentioned above are the primary ones that essentially built the foundation for the development of the model design.
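The classification losses above can be written down in a few lines each. This is a sketch under stated assumptions: labels are encoded as -1/+1 for the 0-1, hinge, and logistic losses and as 0/1 for cross-entropy, the scores are raw decision values, the eps clipping is a common numerical safeguard, and all sample values are made up.

```python
import math


def zero_one_loss(y_true, y_pred):
    """Number of misclassified samples."""
    return sum(t != p for t, p in zip(y_true, y_pred))


def hinge_loss(y_true, scores):
    """Mean hinge loss; zero for points beyond the margin (t * s >= 1)."""
    return sum(max(0.0, 1.0 - t * s) for t, s in zip(y_true, scores)) / len(y_true)


def logistic_loss(y_true, scores):
    """Mean logistic loss; small but never exactly zero."""
    return sum(math.log(1.0 + math.exp(-t * s)) for t, s in zip(y_true, scores)) / len(y_true)


def cross_entropy(y_true, probs, eps=1e-15):
    """Mean log loss for 0/1 labels and predicted probabilities of class 1."""
    total = 0.0
    for t, p in zip(y_true, probs):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(y_true)


labels = [1, -1, 1, -1]          # made-up labels in {-1, +1}
scores = [2.0, -0.5, -1.0, -3.0]  # made-up raw decision values
predicted = [1 if s >= 0 else -1 for s in scores]

print("0-1 loss:     ", zero_one_loss(labels, predicted))  # 1 misclassified sample
print("hinge loss:   ", hinge_loss(labels, scores))        # 0.625
print("logistic loss:", round(logistic_loss(labels, scores), 4))
print("cross-entropy:", round(cross_entropy([1, 0], [0.9, 0.1]), 4))
```

Note how the first and last points contribute nothing to the hinge loss but still add a small amount to the logistic loss.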
3. Metrics for classification:
For every classification model prediction, a matrix called the confusion matrix can be constructed which demonstrates the number of test cases correctly and incorrectly classified. It looks something like this (considering 1 – Positive and 0 – Negative are the target classes):

              | Actual 0             | Actual 1
Predicted 0   | True Negatives (TN)  | False Negatives (FN)
Predicted 1   | False Positives (FP) | True Positives (TP)
 TN: Number of negative cases correctly classified
 TP: Number of positive cases correctly classified
 FN: Number of positive cases incorrectly classified as negative
 FP: Number of negative cases incorrectly classified as positive
4. Accuracy
Accuracy is the simplest metric and can be defined as the number of test cases correctly classified divided by the total number of test cases.
Accuracy = (TP + TN) / (TP + TN + FP + FN)
It can be applied to most generic problems but is not very useful when it comes to unbalanced datasets. For instance, if we’re detecting fraud in bank data, the ratio of fraud to non-fraud cases can be 1:99. In such cases, if accuracy is used, the model will turn out to be 99% accurate by predicting all test cases as non-fraud.
This is why accuracy can be a misleading indicator of model health, and for such a case, a metric is required that can focus on the fraud data points.
5. Precision
Precision is the metric used to identify the correctness of classification.
Precision = TP / (TP + FP)
Intuitively, this equation is the ratio of correct positive classifications to the total number of predicted positive classifications. The greater the fraction, the higher the precision, which means better ability of the model to correctly classify the positive class.
6. Recall
Recall tells us the number of positive cases correctly identified out of the total number of positive cases.
Recall = TP / (TP + FN)
7. F1 Score
F1 score is the harmonic mean of Recall and Precision, therefore it balances out the strengths of each. It’s useful in cases where both recall and precision can be valuable – like in the identification of plane parts that might require repairing. Here, precision will be required to save on the company’s cost (because plane parts are extremely expensive) and recall will be required to ensure that the machinery is stable and not a threat to human lives.
F1 Score = 2 * ((precision * recall) / (precision + recall))
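The four formulas above are easy to check on the fraud example. The sketch below is a hypothetical helper, not a library API: it takes the confusion-matrix counts directly, and returning 0.0 when a denominator is zero is a convention assumed here. The counts are made up to match the 1:99 ratio.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Assumed convention: a ratio with an empty denominator is reported as 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * (precision * recall) / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# 1,000 transactions, 10 of them fraud (the 1:99 ratio from the text).
# A model that labels everything as non-fraud:
always_negative = classification_metrics(tp=0, tn=990, fp=0, fn=10)

# A hypothetical detector that catches 8 frauds and raises 12 false alarms:
detector = classification_metrics(tp=8, tn=978, fp=12, fn=2)

print(always_negative)  # accuracy 0.99, but recall and F1 are 0.0
print(detector)         # slightly lower accuracy, recall 0.8
```

Accuracy ranks the useless model above the detector, while recall and F1 rank them the other way around.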
8. AUC-ROC
ROC curve is a plot of the true positive rate (recall) against the false positive rate (FP / (FP + TN)). AUC-ROC stands for Area Under the Receiver Operating Characteristic curve, and the higher the area, the better the model performance. If the curve is close to the 50% diagonal line, it suggests that the model predicts the output variable at random.
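As a rough illustration, the area under the ROC curve can be computed without drawing the curve: it equals the probability that a randomly chosen positive case gets a higher score than a randomly chosen negative one. The sketch below uses that pairwise formulation, which is quadratic in the number of samples and meant only for small examples; the function names and data are hypothetical.

```python
def roc_point(y_true, scores, threshold):
    """(false positive rate, true positive rate) at a given score threshold."""
    tp = sum(1 for t, s in zip(y_true, scores) if t == 1 and s >= threshold)
    fn = sum(1 for t, s in zip(y_true, scores) if t == 1 and s < threshold)
    fp = sum(1 for t, s in zip(y_true, scores) if t == 0 and s >= threshold)
    tn = sum(1 for t, s in zip(y_true, scores) if t == 0 and s < threshold)
    return fp / (fp + tn), tp / (tp + fn)


def roc_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


y_true = [0, 0, 1, 1]            # made-up labels
scores = [0.1, 0.4, 0.35, 0.8]   # made-up predicted scores

print(roc_point(y_true, scores, threshold=0.5))  # (0.0, 0.5)
print(roc_auc(y_true, scores))                   # 0.75
```

A value of 0.5 would correspond to the diagonal line, that is, random ranking.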
Again, you can compare each of those metrics in Neptune.
That’s what they do at Hypefactors, a media intelligence company:
“We use Neptune for most of our tracking tasks, from experiment tracking to uploading the artifacts. A very useful part of tracking was monitoring the metrics, now we could easily see and compare those F-scores and other metrics.” Andrea Duque, Data Scientist at Hypefactors
Okay, back to the “how to do it” part. You can use the charts view (as in the examples I provided earlier). But you can also do it in a side-by-side table format. Here’s what it looks like:
For convenience, Neptune allows you to show rows with diffs only and show cell changes (see top left corner of the above screenshot).
In general, whatever metadata you log to Neptune, you’ll most likely be able to compare it. Apart from the metric and parameter comparisons shown in this article, this also applies to images and dataset artifacts.
Here’s an overview of comparison options in the Neptune docs.
Production-based parameters
Until now we observed the comparable model features that take precedence in the development phase. Let’s dive into a few production-centric features that accelerate the production and processing time.
Time complexity
Depending on the use case, the decision of choosing a model can be primarily focused on the time complexity. For example, for a real-time solution, it’s best to avoid the KNN classifier, since it calculates the distance of new data points from the training points at the time of prediction, which makes it a slow algorithm. However, for solutions that require batch processing, a slow predictor is not a big issue.
Note that, depending on the chosen model, the time complexity might differ between the training and testing phases. For example, a decision tree has to estimate the decision points during training, whereas during prediction the model simply has to apply the conditions already available at the pre-decided decision points. So, if the solution requires frequent retraining, like in a time series solution, choosing a model that is fast during both training and testing will be the way to go.
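A quick way to compare models on this axis is to time the training and prediction steps separately. The sketch below uses a toy brute-force 1-nearest-neighbour model written for this example (fit_1nn and predict_1nn are hypothetical helpers, not a library API) to show where such a model spends its time; absolute timings will vary by machine.

```python
import time


def time_call(fn, *args):
    """Run fn(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


def fit_1nn(X, y):
    """'Training' a nearest-neighbour model only stores the data."""
    return list(X), list(y)


def predict_1nn(model, queries):
    """Brute force: every query is compared with every training point."""
    X, y = model
    predictions = []
    for q in queries:
        nearest = min(
            range(len(X)),
            key=lambda i: sum((a - b) ** 2 for a, b in zip(X[i], q)),
        )
        predictions.append(y[nearest])
    return predictions


# Made-up training data: 2,000 points with two features each.
X_train = [[float(i), float(i % 7)] for i in range(2000)]
y_train = [i % 2 for i in range(2000)]
queries = [[10.2, 3.1], [501.9, 4.0]]

model, fit_seconds = time_call(fit_1nn, X_train, y_train)
preds, predict_seconds = time_call(predict_1nn, model, queries)
print(f"fit: {fit_seconds:.6f}s, predict: {predict_seconds:.6f}s, predictions: {preds}")
```

For this kind of model the cost sits almost entirely in prediction and grows with the size of the training set, which is the opposite of the decision tree case described above.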
Space complexity
Citing the above example of KNN, every time the model needs to predict, it has to load the entire training data into the memory to compare distances. If the training data is sizable, this can become an expensive drain on the company’s resources such as RAM allotted for the particular solution or storage space. The RAM should always have enough room for processing and computation functions. Loading an overwhelming amount of data can be detrimental to the solution’s speed and processing capabilities.
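A back-of-the-envelope estimate is often enough to tell whether a model that keeps its training data in memory is feasible. The sketch below assumes a dense matrix of 64-bit floats (8 bytes per value) and an arbitrary rule of thumb that the data should take at most half of the RAM budget; both the function names and the headroom value are assumptions for illustration.

```python
GIB = 1024 ** 3


def dense_matrix_bytes(n_rows, n_cols, bytes_per_value=8):
    """Memory needed to hold a dense numeric matrix (8 bytes = float64)."""
    return n_rows * n_cols * bytes_per_value


def fits_in_budget(n_rows, n_cols, ram_budget_bytes, headroom=0.5):
    """True if the data uses at most `headroom` of the RAM budget."""
    return dense_matrix_bytes(n_rows, n_cols) <= headroom * ram_budget_bytes


# Hypothetical training set: 5 million rows with 200 features.
size = dense_matrix_bytes(5_000_000, 200)
print(f"training data: {size / GIB:.2f} GiB")
print("fits in  8 GiB:", fits_in_budget(5_000_000, 200, 8 * GIB))
print("fits in 32 GiB:", fits_in_budget(5_000_000, 200, 32 * GIB))
```

The same estimate can be repeated for each candidate model's stored state (for example, training data for KNN versus a fixed set of split rules for a tree) to compare their memory demands before deployment.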
Here, too, Neptune proves to be useful.
Andreas Malekos, Head of AI at Continuum Industries, says: “The ability to compare runs on the same graph was the killer feature, and being able to monitor production runs was an unexpected win that has proved invaluable.”
You can monitor the resource usage of each experiment:
And you can compare the usage for multiple experiments:
What next?
There’s no scarcity of comparable techniques to gauge the effectiveness of different machine learning models. However, the most important but often ignored requirement is to track the different comparable parameters such that the results can be confidently used and the chosen pipelines can be easily reproduced.
In this article, we explored a few popular methods for comparing machine learning models, but the list of these methods is much bigger, so if you didn’t find a good method for your project, keep looking!
Here are some resources I used when writing this article. They should give you extra context:
 How To Compare Machine Learning Algorithms in Python with scikit-learn
 How to Compare Machine Learning Algorithms
 Statistical Tests for Comparing Machine Learning and Baseline Performance
Also, check these examples and tutorials to see how to compare models in practice:
 How to Compare Groups of Runs and Identify the Best Performing Ones
 Tracking and Organizing Model-Training Runs
 How to Version and Compare Datasets
 How to Compare Images Between Runs
Finally, here are a few case studies. Give them a read to find out how others compare models and track experiments:
 Comparing CI/CD Pipeline Runs at Continuum Industries
 Training and Comparing Over 120k Models at deepsense.ai
 Monitoring and Comparing Metrics at Hypefactors
 Tracking and Comparing Multiple Pipelines at ReSpo.Vision