Gradient Boosted Machines and their variants, offered by multiple communities, have gained a lot of traction in recent years. This is primarily due to the performance edge that boosted decision trees hold over other machine learning algorithms, both in products and in machine learning competitions.
Two of the most popular algorithms based on Gradient Boosted Machines are XGBoost and LightGBM. While we could run a benchmark to check which of the two gives us an edge, it is also important to understand the why and the when of selecting one over the other.
In this blog, we will go over the fundamental differences between XGBoost and LightGBM in order to help us in our machine learning experiments. But before we dive into the algorithms, let’s quickly understand the fundamental concept of Gradient Boosting, which is a part of both XGBoost and LightGBM.
What is Gradient Boosting?
Gradient Boosting refers to a method in machine learning where an ensemble of weak learners is used to improve the model performance in terms of efficiency, accuracy, and interpretability. These learners are defined as having better performance than random chance. Such models are typically decision trees, and their outputs are combined for better overall results.

The hypothesis is to filter out instances that are difficult to accurately predict and develop new weak learners to handle them.
The initial model is trained, and predictions are run on the whole dataset. The error between the actual values and the predictions is calculated, and more weight is given to the incorrectly predicted instances. Subsequently, a new model attempts to fix the errors of the previous one, and in this way, several models are created. The final model is a weighted combination of all of them (see the sketch after the list below).
Gradient Boosting can be applied in the following scenarios:
- Regression: taking the average of the outputs by the weak learners
- Classification: finding the class prediction occurring the maximum number of times
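To make this concrete, here is a minimal from-scratch sketch of gradient boosting for regression with squared error, where each new tree is fit to the residuals of the current ensemble. It is only an illustration of the idea, not the exact procedure used by XGBoost or LightGBM:
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_fit(X, y, n_trees=100, learning_rate=0.1, max_depth=3):
    # Start from a constant prediction: the mean of the targets
    base_prediction = y.mean()
    prediction = np.full(len(y), base_prediction)
    trees = []
    for _ in range(n_trees):
        residuals = y - prediction  # errors of the current ensemble
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)
        prediction += learning_rate * tree.predict(X)  # shrink each tree's contribution
        trees.append(tree)
    return base_prediction, trees

def gradient_boost_predict(X, base_prediction, trees, learning_rate=0.1):
    # Sum the (shrunken) contributions of all weak learners
    return base_prediction + learning_rate * sum(tree.predict(X) for tree in trees)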
Some of the most popular boosting algorithms widely used in enterprises and data science competitions are XGBoost, LightGBM, and CatBoost. Many data science practitioners spend a lot of time pondering which algorithm to go for and often end up resorting to trial and error.
Owing to the ever-rising popularity of XGBoost and LightGBM, we are going to explore both of these from a practical standpoint as well as a bit of theory to understand their pros and cons better.
LightGBM and XGBoost: the algorithms
XGBoost
The model supports the following kinds of boosting:
- Gradient Boosting as controlled by the learning rate.
- Stochastic Gradient Boosting that leverages sub-sampling at a row, column, or column per split level.
- Regularized Gradient Boosting using L1 (Lasso) and L2 (Ridge) regularization.
Some of the other features that are offered from a system performance point of view are:
- Using a cluster of machines to train a model using distributed computing.
- Utilization of all the available cores of a CPU during tree construction for parallelization.
- Out-of-core computing when working with datasets that do not fit into memory.
- Making the best use of hardware with cache optimization.
In addition to the above, the framework:
- Accepts multiple types of input data.
- Works well with sparse input data for tree and linear booster methods.
- Supports the use of customized objective and evaluation functions.
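As a rough illustration, the sketch below maps these boosting variants onto XGBoost's scikit-learn interface; the values are arbitrary and only show which parameter controls which behavior:
from xgboost import XGBClassifier

model = XGBClassifier(
    learning_rate=0.1,      # plain gradient boosting, controlled by shrinkage
    subsample=0.8,          # stochastic boosting: row sub-sampling
    colsample_bytree=0.8,   # stochastic boosting: column sub-sampling per tree
    colsample_bylevel=0.8,  # column sub-sampling per split level
    reg_alpha=0.1,          # L1 (Lasso) regularization
    reg_lambda=1.0,         # L2 (Ridge) regularization
    n_jobs=-1,              # use all available CPU cores during tree construction
)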

LightGBM
Similar to XGBoost, LightGBM (by Microsoft) is a distributed high-performance framework that uses decision trees for ranking, classification, and regression tasks.
The advantages are as follows:
- Faster training speed and good accuracy, resulting from LightGBM being a histogram-based algorithm that buckets continuous values (which also requires less memory).
- Compatibility with large and complex datasets while being much faster during training.
- Support for both parallel learning and GPU learning
In contrast to the level-wise (horizontal) growth in XGBoost, LightGBM carries out leaf-wise (vertical) growth, which tends to reduce loss more per split and, in turn, yields higher accuracy while being faster. However, leaf-wise growth may also overfit the training data, which can be handled with the max_depth parameter that limits how deep trees can grow. In this sense, XGBoost is capable of building more robust models than LightGBM.
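For illustration, the sketch below contrasts the two growth strategies. Note that recent XGBoost versions also expose a leaf-wise ("lossguide") grow policy for the histogram tree method, and the values here are only examples:
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

xgb_levelwise = XGBClassifier(tree_method="hist", grow_policy="depthwise")  # XGBoost's default level-wise growth
xgb_leafwise = XGBClassifier(tree_method="hist", grow_policy="lossguide")   # LightGBM-style leaf-wise growth
lgbm_leafwise = LGBMClassifier(max_depth=7)  # leaf-wise by design; capping depth helps curb overfitting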

Structural differences between XGBoost and LightGBM
From our discussion so far, it feels like both of the algorithms perform pretty well in their own right. LightGBM is significantly faster than XGBoost but delivers almost equivalent performance. We might wonder, what are exactly the differences between LightGBM and XGBoost?
In this segment of the article, we are going to go over a few fundamental differences between the two algorithms: leaf growth, categorical feature handling, missing values handling, and respective feature importance methods.
Leaf growth
LightGBM executes faster while maintaining good accuracy levels, primarily thanks to the utilization of two novel techniques:
1. Gradient-Based One-Side Sampling (GOSS):
- In gradient-boosted decision trees, the data instances have no native weight, which is leveraged by GOSS.
- Data instances with larger gradients contribute more towards information gain.
- To maintain the accuracy of the information, GOSS retains instances with larger gradients and performs random sampling on instances with smaller gradients.
- We can learn more about this concept in the article – What makes LightGBM lightning fast?
- The YouTube channel Machine Learning University also released a video on LightGBM speaking about GOSS.
2. Exclusive Feature Bundling (EFB):
- EFB is a near-lossless method that reduces the number of effective features.
- Just like One-Hot encoded features, in the sparse space, many features rarely take non-zero values simultaneously.
- To reduce dimensionality, improve efficiency, and maintain accuracy, EFB bundles these features; this bundle is called an Exclusive Feature Bundle.
- The thread on EFB and LightGBM's paper can be referred to for better insight.
On the other hand, XGBoost uses pre-sorted and histogram-based algorithms to compute the best split, whereas LightGBM uses GOSS. Pre-sorted splitting works as follows:
- For each node, enumerate all features.
- For every feature, sort the instances by feature value.
- Use a linear scan to decide the best split for that feature based on information gain.
- Pick the best split across all features.
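As a quick, version-dependent illustration of how these split-finding strategies are selected (GOSS is enabled via boosting_type="goss" in older LightGBM releases and via data_sample_strategy="goss" from LightGBM 4.0 onwards):
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

xgb_presorted = XGBClassifier(tree_method="exact")  # pre-sorted, exact split finding
xgb_histogram = XGBClassifier(tree_method="hist")   # histogram-based split finding

lgbm_goss = LGBMClassifier(boosting_type="goss")    # histogram-based splits with GOSS sampling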
Handling categorical features
Both LightGBM and XGBoost accept numerical features only. This means that the nominal features in our data need to be transformed into numerical features.
Let's assume we want to predict whether it is "Hot", "Cold", or "Unknown", i.e., 2, 1, or 0. Since these are categorical values without an inherent order (1 being greater than 0 is not a meaningful interpretation here), the algorithm needs to avoid comparing magnitudes and instead create rules based on equality.
XGBoost, by default, treats such variables as ordered numerical variables, which we don't want. Instead, if we create a dummy variable for each categorical value (one-hot encoding), XGBoost can do its job correctly. However, this becomes a problem for larger datasets, as encoding takes longer.
On the other hand, LightGBM accepts a parameter indicating which columns are categorical and handles this issue with ease by splitting on equality. That said, the H2O library's implementation of XGBoost does support native handling of categorical features.
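Here is a small, hypothetical example (the "weather" column and the data are made up) of how this looks with LightGBM's scikit-learn interface, compared to one-hot encoding the same column for XGBoost:
import pandas as pd
from lightgbm import LGBMClassifier

df = pd.DataFrame({
    "weather": ["Hot", "Cold", "Unknown", "Hot"] * 50,
    "target": [1, 0, 0, 1] * 50,
})

# LightGBM: declare the column as categorical, and splits are made on equality
X = df[["weather"]].astype("category")
lgbm_model = LGBMClassifier().fit(X, df["target"], categorical_feature=["weather"])

# XGBoost: one-hot encode the nominal column before training
X_onehot = pd.get_dummies(df[["weather"]], columns=["weather"])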
Handling missing values
Both algorithms treat missing values by assigning them to the side that reduces loss the most in each split.
Feature importance methods
Gain
- Every feature in a dataset has some sort of importance or weight in helping build an accurate model.
- Gain refers to the relative contribution of a particular feature in the context of a particular tree.
- This can also be understood by the extent of relevant information that the model gains from a feature for making better predictions.
- Available both in XGBoost and LightGBM.
Split/frequency/weight
- Split (in LightGBM) or Frequency/Weight (in XGBoost) counts the relative number of times a particular feature occurs across all splits of the model's trees. One issue with this method is that it is prone to bias when categorical features have a large number of categories.
- Available both in XGBoost and LightGBM.
Coverage
- The relative number of observations per feature.
- Available only in XGBoost.
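Assuming we already have fitted classifiers called xgb_model (an XGBClassifier) and lgbm_model (an LGBMClassifier), the importance scores described above can be retrieved roughly like this:
# XGBoost exposes gain, weight (frequency), and cover on the underlying booster
booster = xgb_model.get_booster()
print(booster.get_score(importance_type="gain"))
print(booster.get_score(importance_type="weight"))
print(booster.get_score(importance_type="cover"))

# LightGBM exposes gain and split on the underlying booster
print(lgbm_model.booster_.feature_importance(importance_type="gain"))
print(lgbm_model.booster_.feature_importance(importance_type="split"))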
Other key differences between XGBoost and LightGBM
Processing unit
The algorithm we want to use depends upon the type of processing unit we have for running the models. Although XGBoost is comparatively slower than LightGBM on GPU, it is actually faster on CPU. LightGBM requires us to build the GPU distribution separately, while to run XGBoost on a GPU, we need to pass the ‘gpu_hist’ value to the ‘tree_method’ parameter when initializing the model.
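As a rough sketch (parameter names vary across versions, and both options require the appropriate GPU builds and drivers):
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

xgb_gpu = XGBClassifier(tree_method="gpu_hist")  # newer XGBoost releases use device="cuda" with tree_method="hist"
lgbm_gpu = LGBMClassifier(device="gpu")          # requires LightGBM built with GPU support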
When working in an organization with access to GPUs and strong CPUs, we could go for XGBoost, as it is more scalable than LightGBM. Personally, though, I think LightGBM makes more sense, as the training time saved can be used for better experimentation and feature engineering; the final model can then be trained with XGBoost for robustness.
Community
A framework is only as good as the community maintaining it. XGBoost is much easier to work with than LightGBM, largely due to the strong community behind it and the wealth of helpful resources available. Since LightGBM still has a smaller community, it can be harder to find solutions when it comes to advanced issues and features.
Documentation
Since XGBoost has been around for longer and is one of the most popular algorithms among data science practitioners, it is extremely easy to work with due to the abundance of literature available online. This can be in the form of framework documentation or of errors and issues faced by users around the globe.
When looking at the two offerings at a high level, LightGBM's documentation feels very comprehensive. At the same time, XGBoost's documentation feels more structured, which, in my opinion, makes it easier to read and understand than LightGBM's often wordy documentation. Of course, this is no deal-breaker, as both keep improving.
Important hyperparameters
Decision tree-based algorithms can be complex and prone to overfitting. Depending on the data available and the statistical metrics that give us an overall understanding of it, it becomes important to make the right hyperparameter adjustments.
Since hyperparameter tuning and optimization is a broad topic on its own, we aim to get an overall understanding of some of the important hyperparameters for both XGBoost and LightGBM in the following subsections.
XGBoost parameters
Here are the most important XGBoost parameters:
- n_estimators [default 100] – Number of trees in the ensemble. A higher value means more weak learners contribute towards the final output, but increasing it slows down the training time significantly.
- max_depth [default 3] – This parameter decides the complexity of the algorithm. The lower the value assigned, the lower the ability of the algorithm to pick up most patterns (underfitting). A large value can make the model too complex and pick patterns that do not generalize well (overfitting).
- min_child_weight [default 1] – We know that an extremely deep tree can deliver poor performance due to overfitting. The min_child_weight parameter regularizes the model by requiring a minimum sum of instance weights in a child node before a further split is made, which indirectly limits how deep a tree grows. So, the higher the value of this parameter, the lower the chances of the model overfitting on the training data.
- learning_rate/eta [default 0.3] – The learning rate shrinks the contribution of each tree. Lowering it, although slower to train, generally improves the model's ability to pick up patterns and generalize, but if the value is too low, the model may struggle to converge within the given number of trees.
- gamma/min_split_loss [default 0] – This is a regularization parameter that can range from 0 to infinity. The higher the value, the higher the strength of regularization and the lower the chances of overfitting (but the model can underfit if it's too large). The best value therefore varies across datasets.
- colsample_bytree [default 1.0] – This parameter instructs the algorithm to use a fraction of the total number of features or predictors to be used for a tree during training. This means that every tree might use a different set of features for prediction and hence reduce the chances of overfitting and also improve the speed of training as not all the features are being used in every tree. The value ranges from 0 to 1.
- subsample [default 1.0] – Similar to colsample_bytree, the subsample parameter instructs the algorithm on the fraction of the total number of instances to be used for a tree during training. This also reduces the chances of overfitting and improves training time.
Here you can find more XGBoost parameters.
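Putting these together, an example configuration might look like the following; the values are arbitrary and only illustrate which parameter does what:
from xgboost import XGBClassifier

xgb_model = XGBClassifier(
    n_estimators=200,       # number of trees in the ensemble
    max_depth=4,            # limits tree complexity
    min_child_weight=5,     # minimum sum of instance weights required in a child
    learning_rate=0.1,      # shrinkage applied to each tree's contribution
    gamma=0.5,              # minimum loss reduction required to make a split
    colsample_bytree=0.8,   # fraction of features sampled per tree
    subsample=0.8,          # fraction of rows sampled per tree
)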
LightGBM parameters
Here are the most important LightGBM parameters:
- max_depth – Similar to XGBoost, this parameter prevents trees from growing beyond the specified depth. A higher value increases the chances of the model overfitting.
- num_leaves – This parameter is very important in terms of controlling the complexity of the tree. The value should be less than 2^(max_depth), as a leaf-wise tree is much deeper than a depth-wise tree for a set number of leaves. Hence, a higher value can induce overfitting.
- min_data_in_leaf – This parameter is used to control overfitting. A higher value can stop the tree from growing too deep but can also lead the algorithm to learn less (underfitting). According to LightGBM's official documentation, as a best practice for large datasets, it should be set to the order of hundreds or thousands.
- feature_fraction – Similar to colsample_bytree in XGBoost
- bagging_fraction – Similar to subsample in XGBoost
Find more parameters in the LightGBM documentation.
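Similarly, an example LightGBM configuration might look like this; again, the values are arbitrary and only illustrate the parameters discussed above:
from lightgbm import LGBMClassifier

lgbm_model = LGBMClassifier(
    max_depth=7,            # cap on tree depth
    num_leaves=70,          # keep below 2**max_depth (128 here)
    min_data_in_leaf=100,   # minimum samples per leaf, to control overfitting
    feature_fraction=0.8,   # fraction of features sampled per tree
    bagging_fraction=0.8,   # fraction of rows sampled per iteration
    bagging_freq=1,         # bagging_fraction only takes effect when bagging_freq > 0
)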
Tradeoff between model performance and training time
When working with machine learning models, one big aspect involved in the experimentation phase is the baseline requirement of resources to train a complex model. While some might have access to some great hardware, people often have limitations on what they can use.
Let us quickly create dummy datasets with sample sizes ranging from 1,000 all the way to 20,000 samples. We'll hold out 20% of each dummy dataset as a test set to measure model performance. In every iteration, with the sample size stepped up by 1,000, we check how long an XGBoost classifier takes to train compared to a LightGBM classifier.
The best way to compare the results of both classifiers is to use an experiment-tracking solution like Neptune. Getting started with Neptune only takes a couple of minutes, so please sign up and create your first project. We will discuss the role of Neptune in the next section.
Disclaimer
Please note that this article references a deprecated version of Neptune.
For information on the latest version with improved features and functionality, please visit our website.
After you’ve saved your Neptune credentials as environment variables, install the necessary libraries for our experiments in a virtual environment:
pip install neptune xgboost lightgbm scikit-learn numpy tqdm neptune-xgboost neptune-lightgbm python-dotenv
Then, in a new Python script or Jupyter notebook, extract your credentials into variables:
import os
from dotenv import load_dotenv

load_dotenv()  # loads the credentials from a local .env file, if you stored them there

api_token = os.getenv("NEPTUNE_API_TOKEN")
project = os.getenv("NEPTUNE_PROJECT_NAME")
Now, import the necessary modules and libraries:
import neptune
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import time
import warnings
from tqdm.notebook import tqdm
warnings.filterwarnings('ignore')
That’s a lot of dependencies, so let’s go through them in more detail:
– neptune: A platform for experiment tracking and model registry.
– LGBMClassifier from lightgbm: LightGBM’s implementation of a gradient boosting classifier.
– XGBClassifier from xgboost: XGBoost’s implementation of a gradient boosting classifier.
– make_classification from sklearn.datasets: A function to generate synthetic classification datasets.
– train_test_split from sklearn.model_selection: A utility to split datasets into training and testing subsets.
– accuracy_score from sklearn.metrics: A function to calculate the accuracy of classification predictions.
– time: A module for time-related functions, used here for measuring execution time.
– warnings: A module to handle warning messages, used here to ignore warnings.
– tqdm from tqdm.notebook: A progress bar library, specifically the version for Jupyter notebooks.
Now, we create a function to create runs:
def create_run(name):
    # Initialize a Neptune run identified by a custom run ID, using the credentials extracted earlier
    run = neptune.init_run(
        project=project,
        api_token=api_token,
        custom_run_id=name,
    )
    return run
A run object in Neptune corresponds to one machine-learning experiment. Since we are training two different models, we will create two run objects:
# Creating two separate experiments
lgbm_run = create_run('LightGBM')
xgb_run = create_run('XGBoost')
# Configuration for our custom dataset
min_samples = 1000
max_samples = 20000
step = 1000
Now, we will start a for loop. In each iteration of the loop, we create synthetic classification datasets, growing in size by 1000 samples, and split them into training and test sets.
for sample_size in tqdm(range(min_samples, max_samples + step, step)):
    # Generating the dataset of custom sample size
    X, y = make_classification(n_samples=sample_size)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y)
Then, we train both algorithms with default hyperparameters:
for sample_size in tqdm(range(min_samples, max_samples + step, step)):
    # Generating the dataset of custom sample size
    ...

    # XGBoost training
    xgb_model = XGBClassifier(random_state=42, verbosity=0)
    start = time.time()
    xgb_model.fit(X_train, y_train)
    end = time.time()
    xgb_runtime = end - start
    xgb_accuracy = accuracy_score(y_test, xgb_model.predict(X_test))

    # LightGBM training
    lgbm_model = LGBMClassifier(random_state=42, verbosity=-1)
    start = time.time()
    lgbm_model.fit(X_train, y_train)
    end = time.time()
    lgbm_runtime = end - start
    lgbm_accuracy = accuracy_score(y_test, lgbm_model.predict(X_test))
Afterward, we log the runtime and accuracy of both models using the earlier-created run objects:
for sample_size in tqdm(range(min_samples, max_samples + step, step)):
    # Generating the dataset of custom sample size
    ...

    # XGBoost training
    ...

    # LightGBM training
    ...

    # Logging
    lgbm_run["metrics/comparison/runtime"].append(lgbm_runtime, step=sample_size)
    lgbm_run["metrics/comparison/accuracy"].append(lgbm_accuracy, step=sample_size)
    xgb_run["metrics/comparison/accuracy"].append(xgb_accuracy, step=sample_size)
    xgb_run["metrics/comparison/runtime"].append(xgb_runtime, step=sample_size)
We use the append() method of the run object, which logs a metric to the provided namespace. Logging the same metric over different iterations under the same namespace creates a line chart with the sample size on the X-axis, as you will see in a moment.
Now, we stop both run objects:
xgb_run.stop()
lgbm_run.stop()
💡 You can find the full code for the experiments in this GitHub Gist.
Once both runs finish, you can visit your Neptune project dashboard. Here is what my Neptune project looks like:
From here, you can display the accuracy and runtime of both models by clicking on the Eye icons:
Let’s look at the two plots more closely.
From the figure, we can see that the training time for XGBoost keeps increasing with the sample size. On the other hand, the training time required by LightGBM remains a very small fraction of XGBoost's. Interesting!
But what about the respective model performances? Let's check the gap between the accuracy scores of the two models across these varying sample sizes.
From the above chart, we see a rather surprising result: the accuracy scores of both models go hand in hand. This indicates that not only is LightGBM faster, but there is also not much of a compromise in model performance. So does this mean we can just ditch XGBoost for LightGBM?
It all comes down to the availability of hardware resources and the bandwidth to figure things out. Although LightGBM delivers good performance in a fraction of the time required by XGBoost, it still needs to catch up in terms of documentation and community strength. Also, if the hardware is available, since XGBoost scales better, as discussed before, we could experiment with LightGBM, get an understanding of the parameters required, and then train the final model as an XGBoost model.
Summary
Gradient Boosted Decision Trees (GBDTs) are one of the most popular choices of machine learning algorithms. XGBoost and LightGBM, which are based on GBDTs, have had great success both in enterprise applications and data science competitions.
Here are the key takeaways from our comparison:
- In XGBoost, trees grow depth-wise, while in LightGBM, trees grow leaf-wise, which is the fundamental difference between the two frameworks.
- XGBoost benefits from a large user base, resulting in extensive documentation and a wealth of resources for issue resolution. LightGBM, while powerful, has not yet achieved the same level of comprehensive documentation and community support.
- Both the algorithms perform similarly in terms of model performance, but LightGBM training happens within a fraction of the time required by XGBoost.
- Fast training in LightGBM makes it the go-to choice for machine learning experiments.
- XGBoost requires a lot of resources to train on large amounts of data, which can make it a less accessible option for teams with limited hardware, while LightGBM is lightweight and can be used on modest hardware.
- LightGBM provides the option for passing feature names that are to be treated as categories and handles this issue with ease by splitting on equality.
- H2O’s implementation of XGBoost provides the above feature as well, which is not yet provided by XGBoost’s original library.
- Hyperparameter tuning is extremely important in both algorithms.
Both algorithms have set the gold standard in terms of model performance, and the choice ultimately rests with the user, primarily based on the nature of the categorical features and the size of the data.

