Gradient Boosted Machines and their variants offered by multiple communities have gained a lot of traction in recent years. This has been primarily due to the improvement in performance offered by decision trees as compared to other machine learning algorithms both in products and machine learning competitions.
Two of the most popular algorithms based on Gradient Boosted Machines are XGBoost and LightGBM. Although we could use trial and error to check which of the two gives us an edge over the other, it is also important to understand why and when to choose each of them.
In this blog, we will go over the fundamental differences between XGBoost and LightGBM in order to help us in our machine learning experiments. But before we dive into the algorithms, let’s quickly understand the fundamental concept of Gradient Boosting that is a part of both XGBoost and LightGBM.
What is Gradient Boosting?
Gradient Boosting refers to a methodology in machine learning where an ensemble of weak learners is combined to improve predictive performance. A weak learner is a model that performs only slightly better than random chance. Such models are typically shallow decision trees, and their outputs are combined for better overall results.
The hypothesis is to filter out instances that are difficult to accurately predict and develop new weak learners to handle them.
An initial model is trained and predictions are run on the whole dataset. The error between the actual values and the predictions is calculated, and the next model is trained to correct that error, so the instances that were predicted poorly receive the most attention. Each subsequent model attempts to fix the remaining error of the models before it, and several models are created in this way. We arrive at the final model by combining the weighted outputs of all the models.
Gradient Boosting can be applied in the following scenarios:
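- Regression – summing the (learning-rate-scaled) outputs of the weak learners to get the predicted value
- Classification – summing the outputs of the weak learners into a score that is converted into class probabilities, with the most probable class taken as the prediction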
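The iterative procedure described above can be sketched in a few lines of plain Python. This is only a minimal illustration for regression with squared-error loss, assuming a single numeric feature and one-split decision stumps as the weak learners; the function names and the toy data are made up for this example and do not come from XGBoost or LightGBM.

```python
def fit_stump(xs, residuals):
    """Fit a one-split stump: pick the threshold that minimises squared error."""
    best = None
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        if not left or not right:
            continue  # a split must leave data on both sides
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((r - left_mean) ** 2 for r in left)
        sse += sum((r - right_mean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def fit_gradient_boosting(xs, ys, n_estimators=20, learning_rate=0.3):
    """Each new stump is fitted to the residuals of the current ensemble."""
    base = sum(ys) / len(ys)  # initial model: predict the mean
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_estimators):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]

    def predict(x):
        # final model: base prediction plus the scaled sum of all stumps
        return base + learning_rate * sum(s(x) for s in stumps)

    return predict


def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(ys)


# toy data (hypothetical)
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 3.1, 3.0, 2.9, 6.2, 5.9]

one_round = fit_gradient_boosting(xs, ys, n_estimators=1)
many_rounds = fit_gradient_boosting(xs, ys, n_estimators=20)
print(mse(one_round, xs, ys), mse(many_rounds, xs, ys))
```

With squared-error loss the residuals are exactly the negative gradients of the loss, which is where the "gradient" in Gradient Boosting comes from. The training error after 20 rounds is lower than after a single round.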
Some of the most popular boosting algorithms widely used in enterprises and data science competitions are XGBoost, LightGBM, and CatBoost. Many data science practitioners spend a lot of time pondering over which algorithm to go for and often end up doing a lot of trial and error.
Owing to the ever-rising popularity of XGBoost and LightGBM, we are going to explore both of these from a practical standpoint as well as a little bit of theory to understand their pros and cons a little better.
LightGBM and XGBoost: the algorithms
XGBoost (eXtreme Gradient Boosting) is a machine learning algorithm that focuses on computation speed and model performance. It was introduced by Tianqi Chen and is currently a part of a wider toolkit by DMLC (Distributed Machine Learning Community). The algorithm can be used for both regression and classification tasks and has been designed to work with large and complicated datasets.
The model supports the following kinds of boosting:
- Gradient Boosting as controlled by the learning rate
- Stochastic Gradient Boosting that leverages sub-sampling at a row, column or column per split levels
- Regularized Gradient Boosting using L1 (Lasso) and L2 (Ridge) regularization
Some of the other features that are offered from a system performance point of view are:
- Using a cluster of machines to train a model using distributed computing
- Utilization of all the available cores of a CPU during tree construction for parallelization
- Out-of-core computing when working with datasets that do not fit into memory
- Making the best use of hardware with cache optimization
In addition to the above the framework:
- Accepts multiple types of input data
- Works well with sparse input data for tree and linear booster
- Supports the use of customized objective and evaluation functions
Similar to XGBoost, LightGBM (by Microsoft) is a distributed high-performance framework that uses decision trees for ranking, classification, and regression tasks.
The advantages are as follows:
- Faster training speed and good accuracy, resulting from LightGBM being a histogram-based algorithm that performs bucketing of values (it also requires less memory)
- Also compatible with large and complex datasets but is much faster during training
- Support for both parallel learning and GPU learning
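To make the histogram idea from the list above more concrete, here is a rough sketch of how bucketing can speed up split finding. It assumes equal-width bins and a simple squared-loss style gain; the helper names, the number of bins, and the toy data are all illustrative and are not LightGBM's actual implementation.

```python
def build_histogram(values, gradients, n_bins=4):
    """Bucket feature values into equal-width bins and sum gradients per bin."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 for a constant feature
    grad_sum = [0.0] * n_bins
    count = [0] * n_bins
    for v, g in zip(values, gradients):
        b = min(int((v - lo) / width), n_bins - 1)
        grad_sum[b] += g
        count[b] += 1
    return grad_sum, count


def best_bin_split(grad_sum, count):
    """Scan bin boundaries (not individual rows) for the best split."""
    total_g, total_n = sum(grad_sum), sum(count)
    best_gain, best_bin = 0.0, None
    left_g, left_n = 0.0, 0
    for b in range(len(count) - 1):
        left_g += grad_sum[b]
        left_n += count[b]
        right_g, right_n = total_g - left_g, total_n - left_n
        if left_n == 0 or right_n == 0:
            continue
        gain = left_g ** 2 / left_n + right_g ** 2 / right_n - total_g ** 2 / total_n
        if gain > best_gain:
            best_gain, best_bin = gain, b
    return best_bin, best_gain


# toy feature and gradients (hypothetical)
values = [0.1, 0.4, 0.35, 0.8, 2.9, 3.1, 3.5, 4.0]
gradients = [-1.0, -1.2, -0.9, -1.1, 1.0, 1.1, 0.9, 1.2]

grad_sum, count = build_histogram(values, gradients)
best_bin, best_gain = best_bin_split(grad_sum, count)
print(count, best_bin, round(best_gain, 2))
```

The point of the design is that the scan runs over a handful of bins instead of every distinct feature value, which is why a histogram-based learner needs less time and memory per split.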
In contrast to the level-wise (horizontal) growth in XGBoost, LightGBM carries out leaf-wise (vertical) growth, which results in more loss reduction and, in turn, often higher accuracy while being faster. But this may also result in overfitting on the training data, which can be handled using the max_depth parameter that limits how deep a tree is allowed to grow. Hence, XGBoost is generally considered capable of building more robust models than LightGBM.
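The difference between the two growth strategies can be illustrated with a toy simulation. In the sketch below the split gains are produced by a made-up function (an assumption purely for illustration, chosen so that one side of the tree is always more promising); nodes are identified by their path of 'L'/'R' steps from the root.

```python
import heapq


def gain(path):
    """Hypothetical gain from splitting the node reached by `path`."""
    return 10.0 / (1 + len(path) + 3 * path.count("R"))


def grow_leaf_wise(num_leaves):
    """Always split the current leaf with the highest gain."""
    heap = [(-gain(""), "")]  # max-heap via negated gains; root has path ""
    while len(heap) < num_leaves:
        _, path = heapq.heappop(heap)
        for step in "LR":
            child = path + step
            heapq.heappush(heap, (-gain(child), child))
    return sorted(path for _, path in heap)


def grow_level_wise(max_depth):
    """Split every leaf at the current depth before going deeper."""
    leaves = [""]
    for _ in range(max_depth):
        leaves = [leaf + step for leaf in leaves for step in "LR"]
    return leaves


leaf_wise = grow_leaf_wise(num_leaves=4)
level_wise = grow_level_wise(max_depth=2)
print(leaf_wise)   # ['LLL', 'LLR', 'LR', 'R']
print(level_wise)  # ['LL', 'LR', 'RL', 'RR']
```

Both trees end up with four leaves, but the leaf-wise tree reaches depth 3 by repeatedly following the most promising branch, while the level-wise tree stays balanced at depth 2. This is also why a depth limit is a useful guard against overfitting for leaf-wise growth.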
Structural differences between XGBoost and LightGBM
From our discussion so far, it feels like both of the algorithms perform pretty well in their own right. LightGBM is significantly faster than XGBoost but delivers almost equivalent performance. We might wonder, what are exactly the differences between LightGBM and XGBoost?
In this segment of the article, we are going to go over a few fundamental differences between the two algorithms – leaf growth, categorical feature handling, missing values handling, and respective feature importance methods.
LightGBM has a faster rate of execution along with being able to maintain good accuracy levels primarily due to the utilization of two novel techniques:
1. Gradient-Based One-Side Sampling (GOSS):
- In Gradient Boosted Decision Trees, the data instances have no native weight, so GOSS uses the gradient of each instance as a measure of its importance.
- Data instances with larger gradients contribute more towards information gain.
- To maintain the accuracy of the information, GOSS retains instances with larger gradients and performs random sampling on instances with smaller gradients.
- We can learn more about this concept in the article – What makes LightGBM lightning fast?
- The YouTube channel Machine Learning University also released a video on LightGBM speaking about GOSS.
2. Exclusive Feature Bundling (EFB):
- EFB is a near lossless method to reduce the number of effective features.
- Just like One-Hot encoded features, in the sparse space, many features rarely take non-zero values simultaneously.
- To reduce dimensionality, improve efficiency, and maintain accuracy, EFB bundles these features, and this bundle is called an Exclusive Feature Bundle.
- This thread on EFB and LightGBM’s paper can be referred to for better insight.
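Both techniques above can be sketched in a simplified form. The code below is an illustration under stated assumptions, not LightGBM's implementation: `goss_sample` keeps the instances with the largest absolute gradients, randomly samples from the rest, and up-weights the sampled ones by (1 - top_rate) / other_rate as described in the LightGBM paper; `bundle_exclusive_features` greedily groups columns that are never non-zero on the same row and, unlike the real algorithm, tolerates no conflicts at all. The function names, rates, and data are hypothetical.

```python
import random


def goss_sample(gradients, top_rate=0.2, other_rate=0.1, seed=0):
    """Return {row index: weight} for the rows kept by a GOSS-style sampler."""
    rng = random.Random(seed)
    n = len(gradients)
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    n_top = int(top_rate * n)
    n_other = int(other_rate * n)
    top = order[:n_top]                              # always keep large gradients
    sampled = rng.sample(order[n_top:], n_other)     # random subset of the rest
    small_weight = (1 - top_rate) / other_rate       # compensates for the sampling
    weights = {i: 1.0 for i in top}
    weights.update({i: small_weight for i in sampled})
    return weights


def bundle_exclusive_features(columns):
    """Greedily group columns that never take non-zero values on the same row."""
    bundles = []  # each entry: (column names, set of rows already non-zero)
    for name, values in columns.items():
        rows = {r for r, v in enumerate(values) if v != 0}
        for names, used_rows in bundles:
            if not rows & used_rows:   # no conflict: the column joins this bundle
                names.append(name)
                used_rows |= rows      # in-place update of the bundle's row set
                break
        else:
            bundles.append(([name], set(rows)))
    return [names for names, _ in bundles]


# hypothetical gradients: |gradient| grows with the row index
gradients = [(-1) ** i * i / 20 for i in range(20)]
weights = goss_sample(gradients)

# hypothetical sparse columns: three one-hot columns and one dense column
columns = {
    "color_red": [1, 0, 0, 0],
    "color_blue": [0, 1, 0, 0],
    "color_green": [0, 0, 1, 0],
    "size": [3, 1, 2, 5],
}
bundles = bundle_exclusive_features(columns)
print(sorted(weights), bundles)
```

In this toy run the four rows with the largest gradients are always retained, and the three one-hot columns collapse into a single bundle while the dense column stays on its own.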
On the other hand, XGBoost offers both a pre-sorted and a histogram-based algorithm for computing the best split, whereas LightGBM relies on its histogram-based algorithm combined with GOSS. The pre-sorted splitting works as follows:
- For each node, enumerate over all features
- For every feature, sort instances by the feature value
- Using a linear scan over the sorted values, decide the best split for that feature on the basis of information gain
- Pick the best split across all the features
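The steps above can be sketched as follows. This is a simplified illustration for a single node, assuming the regularised gain formula from the XGBoost paper with an L2 term `reg_lambda` and no `gamma`; the function name and the toy data are made up for the example.

```python
def best_split(rows, gradients, hessians, reg_lambda=1.0):
    """Exact greedy split search: returns (gain, feature index, threshold)."""

    def score(g, h):
        return g * g / (h + reg_lambda)

    total_g, total_h = sum(gradients), sum(hessians)
    best = (0.0, None, None)
    n_features = len(rows[0])
    for f in range(n_features):                       # enumerate all features
        # sort the instances by the value of this feature
        order = sorted(range(len(rows)), key=lambda i: rows[i][f])
        left_g = left_h = 0.0
        for pos in range(len(order) - 1):             # linear scan
            i, nxt = order[pos], order[pos + 1]
            left_g += gradients[i]
            left_h += hessians[i]
            if rows[i][f] == rows[nxt][f]:
                continue                              # cannot split equal values
            gain = (
                score(left_g, left_h)
                + score(total_g - left_g, total_h - left_h)
                - score(total_g, total_h)
            )
            if gain > best[0]:
                threshold = (rows[i][f] + rows[nxt][f]) / 2
                best = (gain, f, threshold)
    return best


# toy data (hypothetical): feature 0 separates the gradients cleanly
rows = [[1.0, 5.0], [2.0, 3.0], [3.0, 8.0], [4.0, 1.0]]
gradients = [-2.0, -2.0, 2.0, 2.0]
hessians = [1.0, 1.0, 1.0, 1.0]

gain, feature, threshold = best_split(rows, gradients, hessians)
print(feature, threshold, round(gain, 3))
```

The cost of this approach comes from sorting every feature and scanning every distinct value, which is what the histogram-based alternative avoids.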
Handling Categorical Features
Both LightGBM and XGBoost accept numerical features only. This means that the nominal features in our data need to be transformed into numerical features.
Let’s assume one of our features takes the values “Hot”, “Cold” or “Unknown”, encoded as 2, 1, or 0. Since this is a categorical feature and is not ordered (which means that reading 1 as greater than 0 is not a correct interpretation), the algorithm needs to avoid comparing the magnitude of the values and focus instead on creating rules based on equality.
XGBoost, by default, treats such variables as numerical variables with order and we don’t want that. Instead, if we create dummies for each of the categorical values (one-hot encoding), then XGBoost will be able to do its job correctly. But for larger datasets this is a problem, as encoding takes longer and increases the number of features.
On the other hand, LightGBM accepts a parameter to check which column is a categorical column and handles this issue with ease by splitting on equality. However, the H2O library provides an implementation of XGBoost that supports the native handling of categorical features.
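The two approaches can be contrasted with a small plain-Python sketch. The helper names are hypothetical and the equality split is a simplification of what a library does internally; it is only meant to show why one-hot encoding multiplies columns while an equality rule works on the original column.

```python
def one_hot(values):
    """One-hot encode a categorical column into one 0/1 column per category."""
    categories = sorted(set(values))
    return {f"is_{c}": [1 if v == c else 0 for v in values] for c in categories}


def equality_split(values, category):
    """Split row indices by the rule `value == category` (no ordering involved)."""
    left = [i for i, v in enumerate(values) if v == category]
    right = [i for i, v in enumerate(values) if v != category]
    return left, right


weather = ["Hot", "Cold", "Unknown", "Hot"]

encoded = one_hot(weather)
left, right = equality_split(weather, "Hot")

print(encoded)       # three columns replace the single original column
print(left, right)   # [0, 3] [1, 2]
```

With three categories the difference is small, but a column with thousands of distinct values would turn into thousands of sparse columns under one-hot encoding.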
Handling Missing Values
Both the algorithms treat missing values by assigning them to the side that reduces loss the most in each split.
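A minimal sketch of that idea, for one candidate split on one feature: the rows with a missing value are tried on each side, and the side that gives the larger gain becomes the default direction. The gain used here is a simple squared-loss style score and the function name and data are assumptions for illustration only.

```python
def best_default_direction(values, gradients, threshold):
    """Return 'left' or 'right', whichever side is better for missing values."""

    def score(gs):
        return sum(gs) ** 2 / len(gs) if gs else 0.0

    left = [g for v, g in zip(values, gradients) if v is not None and v <= threshold]
    right = [g for v, g in zip(values, gradients) if v is not None and v > threshold]
    missing = [g for v, g in zip(values, gradients) if v is None]

    gain_if_left = score(left + missing) + score(right)
    gain_if_right = score(left) + score(right + missing)
    return "left" if gain_if_left >= gain_if_right else "right"


# toy data (hypothetical): None marks a missing value
values = [1.0, 2.0, None, 8.0, 9.0, None]
gradients = [-1.0, -1.2, -0.9, 1.0, 1.1, -1.1]

direction = best_default_direction(values, gradients, threshold=5.0)
print(direction)  # left
```

Here the missing rows have gradients similar to the rows on the left of the threshold, so sending them left keeps each child more homogeneous.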
Feature Importance methods
Every feature in a dataset has some sort of importance/weightage in helping build an accurate model. The two frameworks offer the following methods for measuring it:
Gain:
- Gain refers to the relative contribution of a particular feature in the context of a particular tree.
- This can also be understood as the extent of relevant information that the model gains from a feature for making better predictions.
- Available both in XGBoost and LightGBM.
Split/Frequency/Weight:
- The Split method in LightGBM and the Frequency or Weight method in XGBoost calculate the relative count of times a particular feature occurs in all splits of the model’s trees. One issue with this method is that it is prone to bias when there are a large number of categories in categorical features.
- Available both in XGBoost and LightGBM.
Coverage:
- The relative number of observations related to a feature.
- Available only in XGBoost.
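To see how the gain-based and split-count-based methods can disagree, here is a small sketch that computes both from a list of split records. The records are hypothetical (feature name, gain of that split) and stand in for what a trained model would contain. In the real libraries these values are exposed through the boosters, for example via an `importance_type` argument.

```python
from collections import defaultdict


def feature_importance(splits):
    """Return (relative gain, relative split count) per feature."""
    gain = defaultdict(float)
    count = defaultdict(int)
    for feature, split_gain in splits:
        gain[feature] += split_gain
        count[feature] += 1
    total_gain = sum(gain.values())
    total_count = sum(count.values())
    by_gain = {f: g / total_gain for f, g in gain.items()}
    by_split = {f: c / total_count for f, c in count.items()}
    return by_gain, by_split


# hypothetical splits collected from all trees of a model
splits = [("age", 4.0), ("income", 1.0), ("age", 3.0), ("city", 2.0)]

by_gain, by_split = feature_importance(splits)
print(by_gain)
print(by_split)
```

In this toy example "age" accounts for 70% of the total gain but only 50% of the splits, so the ranking and the magnitudes depend on which method is chosen.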
Other key differences between XGBoost and LightGBM
The algorithm we want to use often depends upon the type of processing unit we have for running the models. Although XGBoost is comparatively slower than LightGBM on GPU, it is actually faster on CPU. LightGBM requires us to build the GPU distribution separately, while to run XGBoost on GPU we need to pass the ‘gpu_hist’ value to the ‘tree_method’ parameter when initializing the model (XGBoost 2.0 and later instead use the ‘hist’ tree method together with the ‘device’ parameter set to ‘cuda’).
When working in an institution with access to GPUs and strong CPUs, we should go for XGBoost as it is more scalable than LightGBM. But personally, I think LightGBM makes more sense as the training time saved can be used for better experimentation and feature engineering. We can then train our final model with XGBoost for model robustness.
A framework is as good as the community behind maintaining it. In the case of XGBoost, it is much easier to work with as compared to LightGBM solely due to the strong community behind it and the availability of a myriad of helpful sources. Since LightGBM still has a smaller community, when it comes to issues and features that are advanced, it becomes a bit difficult to navigate to the solution.
Since XGBoost has been around for longer and is one of the most popular algorithms for data science practitioners, it is extremely easy to work with due to the abundance of literature online surrounding it. This can either be in the form of framework documentation or errors/ issues faced by various users around the globe.
When looking at the two offerings at a high level, LightGBM’s documentation feels very comprehensive. But at the same time, XGBoost’s documentation feels very structured which in my opinion is easier to read and understand in comparison to LightGBM’s often wordy documentation, which of course is no deal-breaker as both of them keep improving.
Decision Tree-based algorithms can be complex and prone to overfitting. Depending on the data availability and various statistical metrics that give us an overall understanding of how the data is, it becomes important for us to do the right set of hyperparameter adjustments.
Since hyperparameter tuning/ optimization is a broad topic on its own, in the following subsections we are going to aim to get an overall understanding of some of the important hyperparameters for both XGBoost and LightGBM.
Here are the most important XGBoost parameters:
- n_estimators [default 100] – Number of trees in the ensemble. A higher value means more weak learners contribute towards the final output but increasing it significantly slows down the training time.
- max_depth [default 6] – This parameter decides the complexity of each tree. The lower the value, the lower the ability of the algorithm to pick up patterns (underfitting). A large value can make the model too complex and pick up patterns that do not generalize well (overfitting).
- min_child_weight [default 1] – We know that an extremely deep tree can deliver poor performance due to overfitting. The min_child_weight parameter regularises the tree by requiring a minimum sum of instance weights (hessian) in each child, so a split that would create a child below this threshold is not made. So, the higher the value of this parameter, the lower the chances of the model overfitting on the training data.
- learning_rate/ eta [default 0.3] – The learning rate shrinks the contribution of each new tree. Lowering the learning rate, although slower to train, usually improves the ability of the model to generalize, but it needs more trees to get there. If the value is too low, the model may fail to converge within the given number of trees.
- gamma/ min_split_loss [default 0] – This is a regularization parameter that can range from 0 to infinity. Higher the value, higher is the strength of regularization, lower are the chances of overfitting (but can underfit if it’s too large). Hence, this parameter varies across all types of datasets.
- colsample_bytree [default 1.0] – This parameter instructs the algorithm on the fraction of the total number of features/ predictors to be used for a tree during training. This means that every tree might use a different set of features for prediction and hence reduce the chances of overfitting and also improve the speed of training as not all the features are being used in every tree. The value ranges from 0 to 1.
- subsample [default 1.0] – Similar to colsample_bytree, the subsample parameter instructs the algorithm on the fraction of the total number of instances to be used for a tree during training. This also reduces the chances of overfitting and improves training time.
Find more parameters here.
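As a rough illustration of how the parameters above fit together, here is a configuration expressed as a plain Python dictionary. The values are assumptions picked only to show the shape of a more conservative setup (more trees, lower learning rate, row and column sub-sampling); they are not recommendations and should be tuned for each dataset.

```python
# Illustrative values only; the keys are the parameter names discussed above.
xgb_params = {
    "n_estimators": 300,       # more trees to compensate for a lower learning rate
    "max_depth": 4,            # shallower trees than the default
    "min_child_weight": 5,     # require more evidence before creating a child
    "learning_rate": 0.05,     # alias: eta
    "gamma": 1.0,              # alias: min_split_loss
    "colsample_bytree": 0.8,   # fraction of features sampled per tree
    "subsample": 0.8,          # fraction of rows sampled per tree
}


def check_fractions(params, keys=("colsample_bytree", "subsample")):
    """Sampling fractions must lie in the interval (0, 1]."""
    return all(0 < params[k] <= 1 for k in keys)


print(check_fractions(xgb_params))
```

A dictionary like this can be unpacked into the scikit-learn style estimator, for example `XGBClassifier(**xgb_params)`.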
Here are the most important LightGBM parameters:
- max_depth – Similar to XGBoost, this parameter instructs the trees to not grow beyond the specified depth. A higher value increases the chances for the model to overfit.
- num_leaves – This parameter is very important in terms of controlling the complexity of the tree. The value should be less than 2^(max_depth) as a leaf-wise tree is much deeper than a depth-wise tree for a set number of leaves. Hence, a higher value can induce overfitting.
- min_data_in_leaf – The parameter is used for controlling overfitting. A higher value can stop the tree from growing too deep but can also lead the algorithm to learn less (underfitting). According to the LightGBM’s official documentation, as a best practice, it should be set to the order of hundreds or thousands.
- feature_fraction – Similar to colsample_bytree in XGBoost
- bagging_fraction – Similar to subsample in XGBoost
Find more parameters here.
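The relationship between num_leaves and max_depth mentioned above is simple arithmetic: a binary tree of depth d can have at most 2^d leaves. The sketch below checks a candidate configuration against that bound. The values are illustrative assumptions rather than recommendations, and the helper names are made up for the example.

```python
def max_leaves_for_depth(max_depth):
    """A full binary tree of the given depth has 2 ** max_depth leaves."""
    return 2 ** max_depth


def is_reasonable(params):
    """num_leaves should stay below the full-tree bound for the chosen depth."""
    return params["num_leaves"] < max_leaves_for_depth(params["max_depth"])


# Illustrative values only; the keys are the parameter names discussed above.
lgbm_params = {
    "max_depth": 6,
    "num_leaves": 40,           # below 2 ** 6 = 64
    "min_data_in_leaf": 100,
    "feature_fraction": 0.8,
    "bagging_fraction": 0.8,
    "bagging_freq": 1,          # bagging only takes effect when this is > 0
}

print(max_leaves_for_depth(lgbm_params["max_depth"]), is_reasonable(lgbm_params))
```

Setting num_leaves close to or above the bound lets a leaf-wise tree become as complex as a fully grown level-wise tree of the same depth, which is where the overfitting risk comes from.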
Tradeoff between model performance and training time
When working with machine learning models, one big aspect involved in the experimentation phase is the baseline requirement of resources to train a complex model. While some might have access to some great hardware, often people have limitations to what they can use.
Let us quickly create dummy datasets with sample sizes from 1,000 all the way to 20,000 samples. We’ll take a test size of 20% from each of the dummy datasets to measure model performance. For every iteration, with the sample size stepped up by 1,000 samples, we want to check how much time it takes for an XGBoost Classifier to train in comparison to a LightGBM Classifier.
    import time

    import neptune.new as neptune
    from lightgbm import LGBMClassifier
    from sklearn.datasets import make_classification
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split
    from tqdm.notebook import tqdm
    from xgboost import XGBClassifier

    # initialising a logger instance
    run = neptune.init(
        project="common/xgboost-integration",
        api_token="ANONYMOUS",
        name="xgb-train",
        tags=["xgb-integration", "train"],
    )

    # configuration for our custom dataset
    min_samples = 1000
    max_samples = 20000
    step = 1000

    for sample_size in tqdm(range(int(min_samples / 0.8), int(max_samples / 0.8), step)):
        xgb_dummy = XGBClassifier(random_state=47)
        lgbm_dummy = LGBMClassifier(random_state=47)

        # logging the sample size
        run["metrics/comparison/sample_size"].log(sample_size)

        # generating the dataset of custom sample size (features X, labels y)
        X, y = make_classification(n_samples=sample_size)

        # splitting the data into train and test set
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.2, stratify=y
        )

        start = time.time()
        xgb_dummy.fit(X_train, y_train)
        end = time.time()

        # logging algorithm execution time
        run["metrics/comparison/xgb_runtime"].log(end - start)

        # logging model performance
        run["metrics/comparison/xgb_accuracy"].log(
            accuracy_score(y_test, xgb_dummy.predict(X_test)), step=sample_size
        )

        start = time.time()
        lgbm_dummy.fit(X_train, y_train)
        end = time.time()

        # logging algorithm execution time
        run["metrics/comparison/lgbm_runtime"].log(end - start)

        # logging model performance
        run["metrics/comparison/lgbm_accuracy"].log(
            accuracy_score(y_test, lgbm_dummy.predict(X_test)), step=sample_size
        )

    run.stop()
We were able to log the runtime and accuracy metrics for every sample size for which Neptune automatically generated charts for reference. To run the code, you can refer to this Google Colab Notebook.
Neptune dashboards are an easy way to combine all the results from our project for a comparative study in the end. For the purposes of this blog, I have compared the runtimes of the two algorithms as well as their respective performances – dashboard link.
From the figure, we can see that the training time for XGBoost kept on increasing with an increase in sample size almost linearly. On the other hand, the training time required by LightGBM has been a very small fraction of its contender. Interesting!
But what about the respective model performances? Let’s check for the gap between the accuracy scores between the two models at all these varying sample sizes.
From the above chart, we see a very surprising result. The accuracy scores for both models go hand-in-hand. The results indicate that not only is LightGBM faster, there is not much compromise in model performances. So does this mean we can just ditch XGBoost for LightGBM?
It all comes down to the availability of hardware resources and the bandwidth to figure things out. Although LightGBM gives good performance in a fraction of the time required by XGBoost, it still needs to improve on documentation and community strength. Also, if the hardware is available, since XGBoost scales better, we could, as discussed before, experiment using LightGBM, get an understanding of the parameters required, and train the final model as an XGBoost model.
Gradient Boosted Decision Trees (GBDTs) are one of the most popular choices of machine learning algorithms. XGBoost and LightGBM which are based on GBDTs have had great success both in enterprise applications and data science competitions.
Here are the key takeaways from our comparison:
- In XGBoost, trees grow depth-wise while in LightGBM, trees grow leaf-wise which is the fundamental difference between the two frameworks.
- XGBoost is backed by a large user base, which results in rich literature in the form of documentation and resolutions to issues, while LightGBM is yet to reach such a level of documentation.
- Both the algorithms perform similarly in terms of model performance but LightGBM training happens within a fraction of the time required by XGBoost.
- Fast training in LightGBM makes it the go-to choice for machine learning experiments.
- XGBoost requires a lot of resources to train on large amounts of data, which makes it an option mostly accessible to enterprises, while LightGBM is lightweight and can be used on modest hardware.
- LightGBM provides the option for passing feature names that are to be treated as categories and handles this issue with ease by splitting on equality.
- H2O’s implementation of XGBoost provides the above feature as well which is not yet provided by XGBoost’s original library.
- Hyperparameter tuning is extremely important in both algorithms.
Both the algorithms have set the gold standard in terms of output model performance and it’s completely up to the user to select primarily on the basis of the nature of categorical features and the size of data.