XGBoost: Everything You Need to Know

Gradient-boosted trees (GBM) learn from data without a pre-specified model form, and they do it through supervised learning: each tree is fitted to labeled examples. XGBoost is a popular gradient-boosting library that supports GPU training, distributed computing, and parallelization. It’s precise, it adapts well to many types of data and problems, it has excellent documentation, and overall it’s very easy to use.

At the moment it’s the de facto standard for getting accurate results from predictive modeling on structured (tabular) data. It’s one of the fastest gradient-boosting libraries available for R, Python, and C++, and it delivers very high accuracy.

We’re going to explore how to use the model, meanwhile using Neptune to present and detail some best practices for ML project management in general.  

If you’re not sure that XGBoost is a great choice for you, follow along with the tutorial until the end, and then you’ll be able to make a fully informed decision.

Ensemble algorithms – explanation

Ensemble learning combines several learners (models) to improve overall performance, increasing predictiveness and accuracy in machine learning and predictive modeling.

Technically speaking, the power of ensemble models is simple: they can combine thousands of smaller learners trained on subsets of the original data. This leads to some interesting effects:

  • The variance of the general model decreases significantly thanks to bagging
  • The bias also decreases due to boosting 
  • And overall predictive power improves because of stacking  

Types of ensemble methods

Ensemble methods can be classified into two groups based on how the sub-learners are generated:

  1. Sequential ensemble methods – learners are generated sequentially. These methods use the dependency between base learners. Each learner influences the next one, so a general pattern of behavior can be deduced across the sequence. A popular example of sequential ensemble algorithms is AdaBoost.
  2. Parallel ensemble methods – learners are generated in parallel. The base learners are created independently to study and exploit the effects related to their independence and reduce error by averaging the results. An example implementing this approach is Random Forest.

Ensemble methods can use homogeneous learners (learners from the same family) or heterogeneous learners (learners from multiple sorts, as accurate and diverse as possible).

Homogeneous and heterogeneous ML algorithms

Generally speaking, homogeneous ensemble methods use a single type of base learning algorithm. Diversity comes from the training data, for example by assigning different weights to training samples, rather than from the learners themselves.

Heterogeneous ensembles on the other hand consist of members having different base learning algorithms which can be combined and used simultaneously to form the predictive model. 

A general rule of thumb: 

  • Homogeneous ensembles use the same feature selection with a variety of data and distribute the dataset over several nodes.
  • Heterogeneous ensembles use different feature selection methods with the same data

Important characteristics of ensemble algorithms

Bagging

Bagging decreases overall variance by averaging the predictions of multiple estimators. Several subsets are sampled from the original dataset randomly with replacement, and a different learner is trained on each one; this is the core idea of bootstrap aggregation. Bagging normally uses a voting mechanism for classification (as in Random Forest) and averaging for regression.

Note: Remember that some learners are stable and less sensitive to training perturbations. Such learners, when combined, don’t help the general model to improve generalization performance.
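To make bootstrap aggregation more concrete, here is a minimal sketch in plain Python. The toy data, the 1-nearest-neighbour base learner, and all function names are assumptions chosen for illustration; they are not part of any library.

```python
import random
import statistics

def bootstrap_sample(data, rng):
    """Draw len(data) points from data uniformly at random, with replacement."""
    return [rng.choice(data) for _ in data]

def fit_1nn(train):
    """Return a high-variance base learner: 1-nearest-neighbour regression."""
    def predict(x):
        nearest = min(train, key=lambda point: abs(point[0] - x))
        return nearest[1]
    return predict

def bagged_predict(learners, x):
    """Regression bagging: average the predictions of all base learners."""
    return statistics.fmean(learner(x) for learner in learners)

rng = random.Random(0)
# Hypothetical toy data: y = 2x plus Gaussian noise.
data = [(x / 10, 2 * x / 10 + rng.gauss(0, 0.5)) for x in range(50)]

# Train 25 learners, each on its own bootstrap sample.
learners = [fit_1nn(bootstrap_sample(data, rng)) for _ in range(25)]

single_predictions = [learner(2.5) for learner in learners]
bagged_prediction = bagged_predict(learners, 2.5)
print(min(single_predictions), bagged_prediction, max(single_predictions))
```

The individual learners disagree because each one saw a different resample; the bagged prediction is their average, so it always falls inside the range of the individual predictions.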

Boosting

This technique fits weak learners (learners that have poor predictive power and do only slightly better than random guessing) to specific weighted versions of the original dataset. Higher weights are given to samples that were misclassified earlier.

Learner predictions are then combined with voting mechanisms in case of classification or weighted sum for regression.

Intuition behind the boosting algorithm. Credit: Sirakorn Lamyai

Well-known boosting algorithms 

AdaBoost

AdaBoost stands for Adaptive Boosting. The logic implemented in the algorithm is: 

  • First-round classifiers (learners) are all trained using weighted coefficients that are equal,
  • In subsequent boosting rounds, the adaptive process increases the weights of data points that were misclassified by the learners in previous rounds and decreases the weights of correctly classified ones.
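The reweighting logic can be sketched in a few lines of plain Python. This follows the standard AdaBoost update for binary classification; the function name and the five-sample example are assumptions made for illustration.

```python
import math

def adaboost_round(weights, is_correct):
    """One AdaBoost reweighting step for a binary weak learner.

    weights: current sample weights (summing to 1).
    is_correct: True where the weak learner classified the sample correctly.
    Assumes the weighted error is strictly between 0 and 1.
    """
    # Weighted error of the current weak learner.
    error = sum(w for w, ok in zip(weights, is_correct) if not ok)
    # Learner weight: larger when the learner makes fewer weighted mistakes.
    alpha = 0.5 * math.log((1 - error) / error)
    # Shrink the weights of correct samples, grow those of misclassified ones.
    updated = [w * math.exp(-alpha if ok else alpha)
               for w, ok in zip(weights, is_correct)]
    total = sum(updated)
    return alpha, [w / total for w in updated]

# First round: all five samples start with equal weights.
weights = [0.2] * 5
is_correct = [True, True, True, True, False]

alpha, new_weights = adaboost_round(weights, is_correct)
print(alpha, new_weights)
```

After this round the single misclassified sample carries half of the total weight, so the next weak learner is pushed to get it right.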

If you’re curious about the algorithm’s description, take a look at this:

AdaBoost algorithm description in pseudocode. Credit: Gokhan Tur

Gradient Boosting

Gradient Boosting generalizes boosting to any differentiable loss function. At each boosting stage, a new weak learner is fitted to reduce the loss of the current model. Gradient boosting can be used for either classification or regression.

What is XGBoost architecture?

XGBoost stands for Extreme Gradient Boosting. It’s a parallelized and carefully optimized version of the gradient boosting algorithm. Boosting rounds are still sequential, but the work inside each round (finding the best splits) is parallelized, which hugely improves training time.

Instead of training the best possible single model on the data (like in traditional methods), we train many small models in sequence, each one correcting the errors of the ones before it, and add up their predictions.

For many cases, XGBoost is better than usual gradient boosting algorithms. The Python implementation gives access to a vast number of inner parameters to tweak for better precision and accuracy.

Some important features of XGBoost are:

  • Parallelization: The model is implemented to train with multiple CPU cores.
  • Regularization: XGBoost includes L1 and L2 regularization penalties to avoid overfitting, so the model can generalize adequately.
  • Non-linearity: XGBoost can detect and learn from non-linear data patterns.
  • Cross-validation: Built-in and comes out-of-the-box.
  • Scalability: XGBoost can run distributed on clusters such as Hadoop and Spark, so you can process enormous amounts of data. It’s also available for many programming languages like C++, Java, Python, and Julia.

How does the XGBoost algorithm work?

  • Consider a function or estimate $F_0$. To start, we build a sequence $F_1, F_2, \dots$ derived from the function gradients. The equation below models a particular form of gradient descent:

    $$F_{m+1} = F_m - \gamma_m \nabla L(F_m)$$

    $L$ represents the loss function to minimize, so $-\nabla L(F_m)$ gives the direction in which the function decreases. $\gamma_m$ is the rate of change fitted to the loss function; it’s equivalent to the learning rate in gradient descent, and it is expected to approximate the behaviour of the loss suitably.
  • To iterate over the model and find the optimal definition, we need to express the whole formula as a sequence and find an effective function that will converge to the minimum of the loss. This function will serve as an error measure to help us decrease the loss and keep the performance over time. The sequence $F_m$ converges to the minimum of $L$. This particular notation defines the error function that applies when evaluating a gradient boosting regressor.
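The gradient-descent view described above can be sketched from scratch. The code below is a simplified illustration for squared-error loss with one-split trees (stumps) as weak learners; it is not how XGBoost itself is implemented, and all names and the toy data are assumptions.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression tree (stump) to the residuals by least squares."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def gradient_boost(xs, ys, n_rounds=50, learning_rate=0.1):
    """Gradient boosting for squared-error loss with stumps as weak learners."""
    # Initial estimate: the mean minimizes squared error for a constant model.
    initial = sum(ys) / len(ys)
    predictions = [initial] * len(xs)
    stumps, losses = [], []
    for _ in range(n_rounds):
        # For L = (y - F)^2 / 2, the negative gradient is the residual y - F.
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # Take a small step in the direction of the negative gradient.
        predictions = [p + learning_rate * stump(x)
                       for x, p in zip(xs, predictions)]
        losses.append(sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys))

    def model(x):
        return initial + learning_rate * sum(s(x) for s in stumps)

    return model, losses

# Hypothetical toy problem: learn y = x^2 on ten points.
xs = list(range(10))
ys = [x ** 2 for x in xs]
model, losses = gradient_boost(xs, ys)
print(losses[0], losses[-1])
```

Each round fits a stump to the current residuals and adds a scaled copy of it to the model, so the training loss shrinks from one round to the next.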

For more details and an in-depth look at the mathematics behind Gradient Boosting, I recommend this post by Krishna Kumar Mahto.

Other Gradient Boosting methods

Gradient Boosting Machine (GBM)

GBM combines predictions from multiple decision trees; all of its weak learners are decision trees. Each node of those trees picks the best split among the features available to it, which can optionally be a random subset of all features. As it’s a Boosting algorithm, each new tree learns from the errors made by the previous ones.

Useful reference  -> Understanding Gradient Boosting Machines

Light Gradient Boosting Machine (LightGBM)

LightGBM can handle huge amounts of data. It’s one of the fastest algorithms for both training and prediction, and it generalizes well across a wide range of problems. It scales well to large numbers of cores, and it’s open source, so you can use it in your projects for free.

Useful reference -> Understanding LightGBM Parameters (and How to Tune Them)

Categorical Boosting (CatBoost)

This Gradient Boosting variant is designed specifically to handle categorical variables. The CatBoost object can handle categorical variables or numeric variables, as well as datasets with mixed types, without requiring you to encode the categories by hand first.

Useful reference -> CatBoost: A machine learning library to handle categorical (CAT) data automatically

Integrate XGBoost with Neptune 

Start creating your project

Install the required Neptune client libraries:

pip install neptune-client 

Install the Neptune notebook plugin to save all your work and link it with the Neptune UI platform:

pip install -U neptune-notebooks

Enable jupyter integration:

jupyter nbextension enable --py neptune-notebooks

Register on the Neptune platform, get your API key, and connect your notebook with your Neptune session:

To complete the setup, import the Neptune client library in your notebook and initialize the connection calling the neptune.init() method:

import neptune
neptune.init(project_qualified_name='aymane.hachcham/XGBoostStudy')

Get your dataset ready to use

To illustrate some of the work we can do with XGBoost, I picked a dataset with factors that will determine and predict exam scores for school students. The dataset is based on data from students who studied at different school levels for four to six months, including their weekly amount of hours and how well they did.

We’ll use the dataset to explore how much influence the quality of studying has on students’ exam scores. It contains the results obtained in three exams, together with several social, economic, and financial factors.

The dataset is presented with the following features:

  • Gender
  • Race / Ethnicity
  • Parental level of Education
  • Lunch
  • Test preparation course

We can take a closer look at the distributions regarding ethnicity and gender:

  • Math, Reading and Writing Scores by Ethnicity:
import plotly.graph_objects as go

# data_exams is assumed to be a pandas DataFrame holding the exam dataset.
# Grouping by Ethnicity (total score per group)
groupe_math_eth = data_exams.groupby('race/ethnicity')['math score'].sum().reset_index()
groupe_reading_eth = data_exams.groupby('race/ethnicity')['reading score'].sum().reset_index()
groupe_writing_eth = data_exams.groupby('race/ethnicity')['writing score'].sum().reset_index()


fig = go.Figure(data=[
    go.Bar(name='Math Score', x=groupe_math_eth['race/ethnicity'], y=groupe_math_eth['math score']),
    go.Bar(name='Reading Score', x=groupe_reading_eth['race/ethnicity'], y=groupe_reading_eth['reading score']),
    go.Bar(name='Writing Score', x=groupe_writing_eth['race/ethnicity'], y=groupe_writing_eth['writing score'])
])
# Change the bar mode
fig.update_layout(barmode='group')
fig.show()

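Group C stands out by all scoring metrics. Keep in mind that these bars are sums, so they reflect the number of students in each group as well as their scores. Still, this suggests a priori that there may be differences in study performance across race/ethnicity groups.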

  • Math, Reading and Writing scores by Gender
# Grouping by Gender
groupe_math_gen = data_exams.groupby(['gender'])['math score'].sum().reset_index()
groupe_reading_gen = data_exams.groupby('gender')['reading score'].sum().reset_index()
groupe_writing_gen = data_exams.groupby('gender')['writing score'].sum().reset_index()

fig = go.Figure(data=[
    go.Bar(name='Math Score', x=groupe_math_gen['gender'], y=groupe_math_gen['math score']),
    go.Bar(name='Reading Score', x=groupe_reading_gen['gender'], y=groupe_reading_gen['reading score']),
    go.Bar(name='Writing Score', x=groupe_writing_gen['gender'], y=groupe_writing_gen['writing score'])
])
# Change the bar mode
fig.update_layout(barmode='group')
fig.show()

On the contrary, we find out that gender doesn’t influence the overall scoring too much in all three disciplines. It seems like gender is not a decisive factor.

Sample your data

To sample the data, we’ll use the DMatrix object from the XGBoost Python library. The process of sampling the data into subsets is called data segregation. DMatrix is an internal data structure that helps with memory management and data optimization. We’ll split the data into training and testing subsets and then start training.

import xgboost as xgb
from sklearn.model_selection import train_test_split

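# The last three columns are the math, reading, and writing scores (the targets).
# .copy() lets us modify features later without pandas chained-assignment warnings.
features = data_exams.iloc[:, :-3].copy()
target = data_exams.iloc[:, -3:]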

Transform Categorical variables to Numeric:

All the features are still registered as categorical values. To train the model, we need to associate to each variable a numerical code to identify it, and thus transform the 5 feature columns accordingly.

  1. First, convert object data types to string using pandas. astype() returns a new Series, so the result has to be assigned back:
categorical_columns = ['gender', 'race/ethnicity', 'parental level of education', 'lunch', 'test preparation course']
for column in categorical_columns:
    features[column] = features[column].astype(str)
  2. Transform the Gender into 0/1 values:
features['gender'] = features['gender'].map({'male': 0, 'female': 1})
  3. Transform the parental level of education:
education_codes = {
    'high school': 1,
    'some college': 2,
    'some high school': 3,
    "associate's degree": 4,
    "bachelor's degree": 5,
    "master's degree": 6
}
features['parental level of education'] = features['parental level of education'].map(education_codes)
  4. Transform the lunch values:
features['lunch'] = features['lunch'].map({'standard': 0, 'free/reduced': 1})
  5. Transform the test preparation course:
features['test preparation course'] = features['test preparation course'].map({'none': 0, 'completed': 1})
  6. Mark race/ethnicity, the only column left as text, as a pandas category so that DMatrix can handle it with enable_categorical=True:
features['race/ethnicity'] = features['race/ethnicity'].astype('category')

Next, we split the data. The features are the columns we’ll use for our predictions, and the targets are the last 3 columns, representing the Math, Reading, and Writing scores obtained by these students.

x_train, x_test, y_train, y_test = train_test_split(features, target, test_size=0.30, random_state=123)
d_matrix_train = xgb.DMatrix(x_train, y_train, enable_categorical=True)
d_matrix_test = xgb.DMatrix(x_test, y_test, enable_categorical=True)

Connect your script to Neptune

  • Start by creating a new project, give it a name and a meaningful description 👉 read how.
  • Connect your script using your Neptune credentials 👉 read how.
import neptune.new as neptune
experiment = neptune.init(project='aymane.hachcham/XGBoost-Complete-Guide', api_token='API TOKEN')
  • Head over to your jupyter notebook and enable the Neptune extensions to get access to your project 👉 read how.

Note: I highly encourage you to take a look at the new documentation released for the new Neptune Python API. It’s well-structured and very intuitive. 

Train the model

Once we’ve correctly segregated our training and testing samples, we’re ready to start fitting the data to the model. During the training process, we’ll log all evaluation results as they arrive to have a real-time understanding of our model performance.

To start logging with Neptune, we create an experiment and a list of hyperparameters that will define each model version.

import neptune
from neptunecontrib.monitoring.xgboost import neptune_callback

params = {
    'max_depth': 7,
    'learning_rate': 0.1,
    'objective': 'reg:squarederror'
}
# The number of boosting rounds is not a booster parameter;
# it is passed to xgb.train() separately.

neptune.create_experiment(
    name='XGBoost-V1',
    tags=['XGBoost', 'Version1'],
    params=params
)

Start the training process using the parameters we set up earlier and DMatrix data. We also add the neptune_callback() function, which automates all the necessary work to monitor in real-time the loss and eval metrics.

watchlist = [(d_matrix_test, 'test'), (d_matrix_train, 'train')]
num_round = 20

xgb.train(params, d_matrix_train, num_round, watchlist, callbacks=[neptune_callback()])
neptune.stop()

Once your project is connected to the platform, Neptune tracks and monitors the current state of your physical resources, like the GPU, Memory, CPU, and GPU memory. It’s very convenient to always keep a closer eye on your resources while you do training.

Let’s take a closer look and try to explain the XGBRegressor hyperparameters:

  • learning_rate: Controls the weight given to each new estimator when it’s added to the model. A small learning_rate makes the model learn more slowly and usually calls for more boosting rounds.
  • max_depth: The maximum depth of the estimators (trees in this case). Manipulate this parameter carefully, because large values can cause the model to overfit.
  • alpha: A specific type of penalty regularization (L1) to help avoid overfitting.
  • n_estimators: The number of estimators the model will be built upon. This is the name used by the scikit-learn API (XGBRegressor).
  • num_boost_round: The number of boosting stages in the native API, passed as an argument to xgb.train(). n_estimators and num_boost_round describe the same quantity; keep in mind that it should be re-tuned each time you update another parameter.
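As a rough sketch of how these settings map onto the two XGBoost Python APIs, the same configuration can be written in both styles. The values are placeholders for illustration, not tuned recommendations, and the variable names are hypothetical.

```python
# Native API: booster parameters go in a dict passed to xgb.train();
# the number of boosting rounds is a separate num_boost_round argument.
native_params = {
    'max_depth': 7,
    'learning_rate': 0.1,
    'alpha': 0.0,                      # L1 penalty
    'objective': 'reg:squarederror',
}
native_num_boost_round = 20

# scikit-learn API: everything is a keyword argument of xgb.XGBRegressor;
# the number of boosting rounds is called n_estimators,
# and the L1 penalty is called reg_alpha.
sklearn_kwargs = {
    'max_depth': 7,
    'learning_rate': 0.1,
    'reg_alpha': 0.0,
    'n_estimators': 20,
    'objective': 'reg:squarederror',
}

print(native_params, native_num_boost_round)
print(sklearn_kwargs)
```

Keeping the two spellings side by side helps avoid passing a name from one API to the other, where it would be ignored or rejected.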

If you head back to Neptune, under All metadata you’ll see the model registered and all the initial parameters logged in.

Once training is launched, you can supervise the train and eval metric logs by clicking on the current experiment and heading to the ‘Charts’ section. The intuitive and ergonomic UI that Neptune offers alongside the tight integration with a plethora of ML algorithms and frameworks enables you to quickly iterate and improve over different versions of your model. 

Using Neptune Callback, we can access even more information about our models, like the feature importance graph and the structure of the inner estimators.

Feature importance graph 

The XGBRegressor ranks the features used for the prediction by importance. A benefit of using gradient boosting is that after the boosted trees are constructed, it is relatively straightforward to retrieve importance scores for each attribute. This can be done by computing a feature importance graph, which visualizes the relative contribution of each feature within the boosted trees.

This article explains how to compute importance scores for each feature using the GBM package in R -> Gradient Boosted Feature Selection

We can see that Race/Ethnicity and parental level of education are the features the model relies on most when predicting a student’s scores.

Versioning your model

You can version and store multiple versions of your model in a binary format in Neptune. Neptune automatically saves the current version of the model once the training is finished, so you can always get back and load previous model versions to compare.

Under ‘All Metadata -> Artifacts’, you’ll find all relevant metadata that’s been stored.

Collaborate and share work with your team

The platform also lets you cross-compare all your experiments in a seamless manner. Simply check the experiments you want and a specific view will appear that shows all required information.

You can share any Neptune experiment by just sending a link – read more here.

Note: Using the team plan, you can share all your work and projects with your teammates. 

How to hyper-tune the XGBRegressor

The most thorough way of dealing with parameter tuning, when time and resources are not an issue, is to run a gigantic Grid Search on all the parameters and wait for the algorithm to output the optimal solution. It’s a good approach if you’re working with a small or intermediate dataset. But for bigger datasets, this approach can very quickly turn into a nightmare and consume too much time and too many resources.

Tips for hyper-tuning XGBoost when dealing with huge datasets

A well known saying among data scientists goes like this: “You can make your model do wonders if your data has some signal and if it doesn’t have a signal, it doesn’t”.

The most straightforward approach I suggest when having vast amounts of training data is to try to manually research the features that have significant predictive impact. 

  • Firstly, try to reduce your features. 100 or 200 features is a lot, so you should try to narrow the scope of feature selection. You could also rely on SelectKBest, which scores each feature according to a specific criterion and keeps the K highest-scoring ones.
  • Bad performance can also be related to the quality of your testing dataset. The test data might represent a completely different subset of data as compared to your train data. Therefore, you should try doing cross-validation so that the R-squared value you measure is sufficiently reliable.
  • Finally, if you see that your hyperparameter tuning is still having minimal impact, try switching to simpler regression methods like Linear, Lasso, or Elastic Net, instead of sticking to XGBRegressor.
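As an illustration of the first tip, here is a small sketch of SelectKBest from scikit-learn on synthetic data. The data, the choice of f_regression as the scoring function, and k=2 are assumptions for the example; on a real dataset you would pick them to fit your problem.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic data (hypothetical): 10 features, only columns 0 and 3 matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.1, size=200)

# Score every feature with a univariate F-test and keep the 2 best.
selector = SelectKBest(score_func=f_regression, k=2)
X_selected = selector.fit_transform(X, y)

selected_indices = selector.get_support(indices=True)
print(selected_indices)
print(X_selected.shape)
```

The reduced matrix can then be passed to XGBoost in place of the full feature set, which shrinks the search space before any hyperparameter tuning starts.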

Since the data for our example isn’t that big, we can afford the first option, Grid Search. And since the goal here is to show the most reliable option for model tuning that you can leverage, we’ll go for it without hesitation. Keep in mind that if you know which hyperparameters have more impact on the data, you’ll have a much smaller scope of work.

Grid Search method 

Fortunately, XGBoost implements the scikit-learn API, so it’s very easy to use Grid Search and start rolling out the optimal results for the model based on the set of original hyperparameters.

Let’s create a range of values that each hyperparameter can take:

parameters = {
    'learning_rate': [0.1, 0.01, 0.05],
    'max_depth': range(2, 10, 1),
    # In the scikit-learn API, n_estimators sets the number of boosting rounds,
    # so no separate num_boost_round entry is needed.
    'n_estimators': [60, 220, 500, 1000]
}

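Configure your GridSearchCV. Since we’re predicting continuous scores, this is a regression problem, so we evaluate with a regression metric, the root mean squared error (scikit-learn exposes it as 'neg_root_mean_squared_error'), rather than ROC AUC, which applies only to classification. The results are compared across 10-fold cross-validation.

import xgboost as xgb
from sklearn.model_selection import GridSearchCV

# Base estimator whose hyperparameters the grid search will vary.
# enable_categorical mirrors the DMatrix setting used earlier.
estimator = xgb.XGBRegressor(objective='reg:squarederror', tree_method='hist', enable_categorical=True)

grid_search = GridSearchCV(
    estimator=estimator,
    param_grid=parameters,
    scoring='neg_root_mean_squared_error',
    n_jobs=10,
    cv=10,
    verbose=True
)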

Launch training:

grid_search.fit(x_train, y_train)

The processing steps will look something like this:

Fitting 10 folds for each of 96 candidates, totalling 960 fits
[Parallel(n_jobs=10)]: Using backend LokyBackend with 10 concurrent workers.
[Parallel(n_jobs=10)]: Done  30 tasks      | elapsed:   11.0s
[Parallel(n_jobs=10)]: Done 180 tasks      | elapsed:   40.1s
[Parallel(n_jobs=10)]: Done 430 tasks      | elapsed:  1.7min
[Parallel(n_jobs=10)]: Done 780 tasks      | elapsed:  3.1min
[Parallel(n_jobs=10)]: Done 960 out of 960 | elapsed:  4.0min 
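The counts in this log follow directly from the size of the grid. As a quick check, assuming a grid with 3 learning rates, 8 depths, and 4 values for n_estimators (an assumption chosen to match the 96 candidates reported above), the arithmetic works out like this:

```python
from itertools import product

# Assumed grid: 3 x 8 x 4 combinations.
grid = {
    'learning_rate': [0.1, 0.01, 0.05],
    'max_depth': list(range(2, 10)),
    'n_estimators': [60, 220, 500, 1000],
}
n_folds = 10

# Every combination of values is one candidate model.
candidates = [dict(zip(grid, values)) for values in product(*grid.values())]
# Each candidate is trained once per cross-validation fold.
total_fits = len(candidates) * n_folds

print(len(candidates), total_fits)  # 96 960
```

Adding one more hyperparameter with 4 values would multiply the number of fits by 4, which is why Grid Search gets expensive so quickly on larger datasets.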

The Best estimator can be accessed using the field best_estimator_ :

xgb_best = grid_search.best_estimator_

XGBoost pros and cons

Advantages

  • Gradient Boosting comes with an easy to read and interpret algorithm, making most of its predictions easy to handle.
  • Boosting is a resilient and robust method that, combined with regularization, curbs over-fitting quite easily.
  • XGBoost performs very well on small and medium-sized structured datasets, including data with subgroups, as long as there aren’t too many features.
  • It is a great approach to go for because the large majority of real-world problems involve classification and regression, two tasks where XGBoost is the reigning king. 

Disadvantages 

  • XGBoost does not perform so well on sparse and unstructured data.
  • A common thing often forgotten is that Gradient Boosting is very sensitive to outliers since every classifier is forced to fix the errors in the predecessor learners. 
  • The boosting procedure itself is hard to parallelize across trees. This is because the estimators base their correctness on previous predictors, so only the work within each boosting round can be streamlined.

Conclusion

We’ve covered many aspects of Gradient Boosting, starting from a theoretical point of view to a more practical path. Now you can see how easy it is to add experiment tracking and model management to your XGBoost training and hyper-tuning with Neptune. 

As always, I’ll leave you with some useful references below, so you can expand your pool of knowledge even more and improve your coding skills. 

Stay tuned for more content!
