Machine learning has expanded rapidly in the last few years. Instead of simple, one-directional, or linear ML pipelines, today data scientists and developers run multiple parallel experiments that can get overwhelming even for large teams. Each experiment is expected to be recorded in an immutable and reproducible format, which results in endless logs with invaluable details.

We need to narrow down on techniques by comparing machine learning models thoroughly with parallel experiments. Using a well-planned approach is necessary to understand how to choose the right combination of algorithms and the data at hand.

So, in this article, we’re going to explore how to approach comparing ML models and algorithms.

### Read also

## The challenge of model selection

Each model or any machine learning algorithm has several features that process the data in different ways. Often the data that is fed to these algorithms is also different depending on previous experiment stages. But, since machine learning teams and developers usually record their experiments, there’s ample data available for comparison.

The challenge is to understand which parameters, data, and metadata must be considered to arrive at the final choice. It’s the classic paradox of having an overwhelming amount of details with no clarity.

Even more challenging, we need to understand if a parameter with a high value, say a higher metric score, actually means the model is better than one with a lower score, or if it’s only caused by statistical bias or misdirected metric design.

## The goal of comparing machine learning algorithms

Comparing machine learning algorithms is important in itself, but there are some not-so-obvious benefits of comparing various experiments effectively. Let’s take a look at the goals of comparison:

**Better performance**

The primary objective of model comparison and selection is definitely better performance of the machine learning software/solution. The objective is to narrow down on the best algorithms that suit both the data and the business requirements.

**Longer lifetime**

High performance can be short-lived if the chosen model is tightly coupled with the training data and fails to interpret unseen data. So, it’s also important to find the model that understands underlying data patterns so that the predictions are long-lasting and the need for re-training is minimal.

**Easier retraining**

When models are evaluated and prepared for comparisons, minute details, and metadata get recorded which come in handy during retraining. For example, if a developer can clearly retrace the reasons behind choosing a model, the causes of model failure will immediately pop out and re-training can start with equal speed.

**Speedy production**

With the model details available at hand, it’s easy to narrow down on models that can offer high processing speed and use memory resources optimally. Also during production, several parameters are required to configure the machine learning solutions. Having production-level data can be useful for easily aligning with the production engineers. Moreover, knowing the resource demands of different algorithms, it will also be easier to check their compliance and feasibility with respect to the organization’s allocated assets.

## Parameters of machine learning algorithms and how to compare them

Let’s dive right into analyzing and understanding how to compare the different characteristics of algorithms that can be used to sort and choose the best machine learning models. The comparable parameters have been divided into two high-level categories:

- development-based, and
- production-based parameters.

### Development-based parameters

#### Statistical tests

On a fundamental level, machine learning models are statistical equations that run at great speed on multiple data points to arrive at a conclusion. Therefore, conducting statistical tests on the algorithms is critical to set them right and also to understand if the model’s equation is the right fit for the dataset at hand. Here’s a handful of popular statistical tests that can be used to set the grounds for comparison:

**Null hypothesis testing:**Null hypothesis testing is used to determine if the differences in two data samples or metric performances are statistically significant or more-or-less equal and caused only by noise or coincidence.**ANOVA:**Analysis Of Variance, it’s similar to Linear Discriminant analysis with the exception of the fact that it uses one or more categorical features and one continuous target, providing the statistical test of whether the means of the different groups are similar or not.**Chi-Square:**It’s a statistical tool or test which can be used on groups of categorical features to evaluate the likelihood of association or correlation with the help of frequency distributions.**Student’s t-test:**It compares the averages or means of different samples from normal distributions when the standard deviation is unknown to determine if the differences are statistically significant.**Ten-fold cross-validation:**The 10-fold cross-validation compares the performance of each algorithm on different datasets that have been configured with the same random seed so as to maintain uniformity in testing. Next, a hypothesis test like the student’s paired t-test should be deployed to validate if the differences in metrics between the two models are statistically significant.

Here’s how Neptune can be leveraged to track hypothesis tests or even to integrate them with project management tools such as Jira:

#### Model features and objectives

To choose the best machine learning model for a given dataset, it’s essential to consider the features or parameters of the model. The parameters and model objectives help to gauge the model’s flexibility, assumptions, and learning style.

For example, if two linear regression models are compared, one model might be aiming to reduce the mean squared error, whereas another might be aiming to reduce the mean absolute error through objective functions. To understand if the second model is a better fit, we need to understand if the outliers in the data influence the results, or if they’re not supposed to affect the data. If the anomalies or the outliers must be considered, using the second model with objective function as the mean absolute error will be the right choice.

Similarly for classification, if two models (for example, decision tree and random forest) are considered, then the primary basis for comparison will be the degree of generalization that the model can achieve. A decision tree model with just one tree will have a limited ability to reduce variance through the max_depth parameter, whereas random forest will have an extended ability to bring generalization via both max_depth and n_estimators parameters.

There are several other behavioural features of the model that can be taken into account, like the type of assumptions made by the model, parametricity, speed, learning styles (tree-based vs non-tree based), and more.

Parallel coordinates can be used to see how different model parameters affect the metrics:

### Learn more

Check how to use Parallel coordinates comparison to see how model parameters affect the metrics.

#### Learning curves

Learning curves can help in determining if a model is on the correct learning trajectory of achieving the bias-variance tradeoff. It also provides a basis for comparing different machine learning models – a model with stable learning curves across both training and validation sets is likely going to perform well over a longer period on unseen data.

**Bias** is the assumption used by machine learning models to make the learning process easier. **Variance** is the measure of how much the estimated target variable will change with a change in training data. The ultimate goal is to reduce both bias and variance to a minimum – a state of high stability with few assumptions.

Bias and variance are indirectly proportional to each other, and the only way to reach a minimum point for both is at the intersection point. One way to understand if a model has achieved a significant level of trade-off is to see if its performance across training and testing datasets is nearly similar.

The best way to track the progress of model training is to use learning curves. These curves help to identify the optimal combinations of hyperparameters and assists massively in model selection and model evaluation. Typically, a learning curve is a way to track the learning or improvement in model performance on the y-axis and the time or experience on the x-axis.

The two most popular learning curves are:

**Training Learning Curve**– It effectively plots the evaluation metric score over time during a training process, thus helping to track the learning or progress of the model during training.**Validation Learning Curve**– In this curve, the evaluation metric score is plotted against time on the validation set.

Sometimes it might happen that the training curve shows an improvement but the validation curve shows stunted performance. This is indicative of the fact that the model is overfitting and needs to be reverted to the previous iterations. In other words, the validation learning curve identifies how well the model is ** generalizing**.

Therefore, there’s a tradeoff between the training learning curve and the validation learning curve and the model selection technique must rely upon the point where both the curves intersect and are at their lowest.

You can see an example learning curve here:

### Dig deeper

Check how to use Charts comparison to compare learning curves for metrics or losses.

#### Loss functions and metrics

Often there’s confusion between loss functions and metric functions. Loss functions are used for model optimization or model tuning, whereas metric functions are used for model evaluation and selection. However, since regression accuracy can’t be calculated, the same metrics are used to evaluate performance as well as model error for optimization.

Loss functions are passed as arguments to the models such that the models can get tuned to minimize the loss function. A high penalty is provided by the loss function when the model settles on incorrect judgement.

### Learn more

**Loss Functions and Metrics for Regression:**

**Mean Square Error:**measures the average of the squares of the errors or deviations, that is, the difference between the estimated and true value. It aids in imposing higher weights on outliers, thus reducing the issue of overfitting.**Mean Absolute Error:**It’s the absolute difference between the estimated value and true value. It decreases the weight for outlier errors when compared to the mean squared error.**Smooth Absolute Error:**It’s the absolute difference between the estimated value and true value for the predictions lying close to the real value, and it’s the square of the difference between the estimated and the true values of the outliers (or points far off from predicted values). Essentially, it’s a combination of MSE and MAE.

**Loss Functions for Classification:**

**0-1 loss function:**This is like counting the number of misclassified samples. There’s nothing more to it. It can easily be determined from a confusion matrix which shows the number of misclassifications and correct classifications. It’s designed to penalize misclassifications and to assign the smallest loss to the solution that has the greatest number of correct classifications.**Hinge loss function (L2 regularized):**The hinge loss is used for maximum-margin classification, most notably for support vector machines (SVMs). Basically, it measures squared distance from a margin between two different classes and the straight lines running parallel through the nearest points in the classes on either side of the margin.**Logistic Loss:**This function displays a similar convergence rate to the hinge loss function, and since it’s continuous (unlike Hinge Loss), gradient descent methods can be used. However, the logistic loss function doesn’t assign zero penalties to any point. Instead, functions that correctly classify points with high confidence are penalized less. This structure leads the logistic loss function to be sensitive to outliers in the data.**Cross entropy/log loss:**Measures the performance of a classification model whose output is a probability value between 0 and 1. Cross-entropy loss increases as the predicted probability diverges from the actual label.

There are several other loss functions that can be used to optimize machine learning models. The ones mentioned above are the primary ones that essentially built the foundation for the development of the model design.

**Metrics for Classification:**

For every classification model prediction, a matrix called the confusion matrix can be constructed which demonstrates the number of test cases correctly and incorrectly classified. It looks something like this (considering 1 – Positive and 0 – Negative are the target classes):

Actual 0 | Actual 1 | |

Predicted 0 | True Negatives (TN) | False Negatives (FN) |

Predicted 1 | False Positives (FP) | True Positives (TP) |

- TN: Number of negative cases correctly classified
- TP: Number of positive cases correctly classified
- FN: Number of positive cases incorrectly classified as negative
- FP: Number of negative cases correctly classified as positive

**Accuracy**

Accuracy is the simplest metric and can be defined as the number of test cases correctly classified divided by the total number of test cases.

It can be applied to most generic problems but is not very useful when it comes to unbalanced datasets. For instance, if we’re detecting fraud in bank data, the ratio of fraud to non-fraud cases can be 1:99. In such cases, if accuracy is used, the model will turn out to be 99% accurate by predicting all test cases as non-fraud.

This is why accuracy is a false indicator of model health, and for such a case, a metric is required that can focus on the fraud data points.

**Precision**

Precision is the metric used to identify the correctness of classification.

Intuitively, this equation is the ratio of correct positive classifications to the total number of predicted positive classifications. The greater the fraction, the higher the precision, which means better ability of the model to correctly classify the positive class.

**Recall**

Recall tells us the number of positive cases correctly identified out of the total number of positive cases.

**F1 Score**

F1 score is the harmonic mean of Recall and Precision, therefore it balances out the strengths of each. It’s useful in cases where both recall and precision can be valuable – like in the identification of plane parts that might require repairing. Here, precision will be required to save on the company’s cost (because plane parts are extremely expensive) and recall will be required to ensure that the machinery is stable and not a threat to human lives.

**AUC-ROC**

ROC curve is a plot of true positive rate (recall) against false positive rate (TN / (TN+FP)). AUC-ROC stands for Area Under the Receiver Operating Characteristics and the higher the area, the better the model performance. If the curve is somewhere near the 50% diagonal line, it suggests that the model randomly predicts the output variable.

### Production-based parameters

Until now we observed the comparable model features that take precedence in the development phase. Let’s dive into a few production-centric features that accelerate the production and processing time.

#### Time complexity

Depending on the use case, the decision of choosing a model can be primarily focused on the time complexity. For example, for a real-time solution, it’s best to avoid the K-NN classifier since it calculates the distance of new data points from the training points at the time of prediction which makes it a slow algorithm. However, for solutions that require batch processing, a slow predictor is not a big issue.

Note that the time complexities might differ during training and testing phases given the chosen model. For example, a decision tree has to estimate the decision points during training, whereas during prediction the model has to simply apply the conditions already available at the pre-decided decision points. So, if the solution requires frequent retraining like in a time series solution, choosing a model that has speed during both training and testing will be the way to go.

#### Space complexity

Citing the above example of K-NN, every time the model needs to predict, it has to load the entire training data into the memory to compare distances. If the training data is sizable, this can become an expensive drain on the company’s resources such as RAM allotted for the particular solution or storage space. The RAM should always have enough room for processing and computation functions. Loading an overwhelming amount of data can be detrimental to the solution’s speed and processing capabilities.

It’s easy to keep track of the resource usage of different experiments through Neptune’s user-friendly dashboard that helps in organizing ML experiments:

### Learn more

Check how to monitor your hardware consumption live.

**Online and offline learning**

Online learning is when a machine learning solution can instantly update the new data in real-time, whereas offline learning takes more time and re-learns the entire training set from scratch while updating model parameters. For real-time solutions, online learning is the most suitable choice.

**Parallel processing capability**

The ability of parallel processing is a boon for time-taking algorithms like K-NN, where distance calculation can be distributed across several machines. Also, the random forest model can benefit from parallel processing by distributing the estimators/trees. However, for certain machine learning models which are designed to be sequential, like gradient boosting, parallel processing isn’t possible since the result of one tree needs to be fed to the next one.

## Final note

There’s no scarcity of comparable techniques to gauge the effectiveness of different machine learning models. However, the most important but often ignored requirement is to track the different comparable parameters such that the results can be confidently used and the chosen pipelines can be easily reproduced.

In this article, we explored a few popular methods for comparing machine learning models, but the list of these methods is much bigger, so if you didn’t find a good method for your project, keep looking!

**Sources:**

- https://machinelearningmastery.com/compare-machine-learning-algorithms-python-scikit-learn/
- https://towardsdatascience.com/how-to-compare-machine-learning-algorithms-ccc266c4777
- https://towardsdatascience.com/statistical-tests-for-comparing-machine-learning-and-baseline-performance-4dfc9402e46f
- https://docs.neptune.ai/how-to-guides/experiment-tracking/organize-ml-experiments

**READ NEXT**

## ML Experiment Tracking: What It Is, Why It Matters, and How to Implement It

10 mins read | Author Jakub Czakon | Updated July 14th, 2021

Let me share a story that I’ve heard too many times.

”… We were developing an ML model with my team, we ran a lot of experiments and got promising results…

…unfortunately, we couldn’t tell exactly what performed best because we forgot to save some model parameters and dataset versions…

…after a few weeks, we weren’t even sure what we have actually tried and we needed to re-run pretty much everything”– unfortunate ML researcher.

And the truth is, when you develop ML models you will run a lot of experiments.

Those experiments may:

- use different models and model hyperparameters
- use different training or evaluation data,
- run different code (including this small change that you wanted to test quickly)
- run the same code in a different environment (not knowing which PyTorch or Tensorflow version was installed)

And as a result, they can produce completely different evaluation metrics.

Keeping track of all that information can very quickly become really hard. Especially if you want to organize and compare those experiments and feel confident that you know which setup produced the best result.

This is where ML experiment tracking comes in.

Continue reading ->