How to Organize Your XGBoost Machine Learning (ML) Model Development Process – Best Practices

Posted January 11, 2021

XGBoost is a top gradient boosting library that is available in Python, Java, C++, R, and Julia. 

The library offers support for GPU training, distributed computing, parallelization, and cache optimization. Developers also love it for its execution speed, accuracy, efficiency, and usability.

However, when you are developing machine learning models in any framework, XGBoost included, you may end up trying a bunch of parameter configurations and feature versions to get satisfactory performance. Managing all these configurations with spreadsheets and naming conventions can be a real pain in the neck.

There are tools that can help developers organize their machine learning model development process. This article will focus on showing you how you can do that in one of the top ML experiment management tools, Neptune.

Let me show you how to add experiment management on top of your current model development setup.


CHECK ALSO
Neptune’s integration with XGBoost – documentation


What ML model development with XGBoost typically looks like

Get the dataset ready

Before we can train any model, we need a dataset. For this illustration, we will generate a classification dataset using Scikit-learn. In real life, you probably have some features prepared already, and you will load those instead.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
import pandas as pd

X, y = make_classification(n_samples=100000, n_features=10, n_redundant=0, n_informative=8, random_state=1)
X = pd.DataFrame(X, columns=["F1","F2","F3","F4","F5","F6","F7","F8","F9","F10"])
y = pd.DataFrame(y, columns=["Target"])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

Train the XGBoost model

Next, we’ll import XGBoost and set up our parameters. Since this is a binary classification problem, we use the logistic objective. After that, we initialize the classifier with those parameters. You can also pass in the parameters from a YAML file, as sketched after the code below.

params = {"objective":"binary:logistic",'colsample_bytree': 0.3,'learning_rate': 0.1,
                'max_depth': 5, 'alpha': 10}
classification = xgb.XGBClassifier(**params)
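
If you keep hyperparameters in a YAML file instead, a minimal sketch of loading them could look like this (assuming a hypothetical params.yaml holding the same keys, and PyYAML installed):

import yaml

# params.yaml is a hypothetical file with the same keys as the dict above
with open("params.yaml") as f:
    params = yaml.safe_load(f)

classification = xgb.XGBClassifier(**params)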

The next step is to train the model on the training set.

classification.fit(X_train, y_train)

After training, we need to save the model so that we can use it during the deployment process.

import joblib  # sklearn.externals.joblib was removed in recent scikit-learn versions

joblib.dump(classification, 'classifier.pkl')
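
During deployment, the saved model can be loaded back just as easily; a one-line sketch:

classifier = joblib.load('classifier.pkl')  # returns the fitted XGBClassifier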

Next, we generate predictions on the test set and show the classification report.

from sklearn.metrics import classification_report

predictions = classification.predict(X_test)
print(classification_report(y_test, predictions))

Finally, we convert the predictions into a dataframe and save them as a CSV file, for future reference or deep-dive error analysis.

pd.DataFrame(predictions, columns=["Predictions"]).to_csv("predict.csv")

Now, let me show you how to get all of that versioned and easy to manage with Neptune.


RELATED ARTICLES
📌 Best Tools to Manage Machine Learning Projects
📌 Machine Learning Model Management in 2020 and Beyond – Everything That You Need to Know


Organizing ML development in Neptune

Install packages and set up Neptune 

We’ll be using Neptune in Jupyter Notebooks, so we need both the Neptune Client and Neptune Jupyter extension. 

Run the following command in your terminal to install Neptune:

pip install neptune-client

Configuring Neptune for Jupyter Notebooks is essential because it enables us to save our Notebook checkpoints to Neptune. If you are not using Notebooks, you can skip this part.

Next, set up the Notebooks extension:

pip install neptune-notebooks

Once it’s installed, you have to enable the extension so that the integration with your Jupyter Notebook can work:

jupyter nbextension enable --py neptune-notebooks

If you are working in JupyterLab, you can install the extension there as well.

Since we’re installing packages, let’s also get the Neptune Contrib package out of the way. It contains a callback that will let us log our metrics, the model, and feature importances to Neptune while training the XGBoost model. 

pip install neptune-contrib[monitoring]

Connect your script to Neptune 

At this point, you need a free Neptune AI account to initialize Neptune in your Jupyter Notebook. For that to work, you need a Neptune API Key. 

Once you’re logged in, you can get the key by clicking your profile image. 

Neptune getting started

All set? Let’s jump over to our Jupyter Notebook and do the initialization. 

The first step is to connect our Notebook to Neptune by clicking the Neptune logo.

Neptune connect

You will now be prompted to enter your API Token. Once the connection is successful you can upload your Notebook to Neptune by clicking the upload button. 

Neptune configure API

After that, we use `neptune.init` to initialize the communication and connect the current script/notebook with your project in Neptune.

import neptune

neptune.init(project_qualified_name='mwitiderrick/sandbox', api_token='YOUR_API_KEY')

In this case, I use the ‘sandbox’ project that’s created automatically when you sign up. However, you can create a new project in the “Projects” tab.

Neptune projects tab
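
A side note on the API token: hardcoding it in a Notebook is risky if you ever share the file. As an alternative, neptune-client can also read the token from the NEPTUNE_API_TOKEN environment variable. Set it in your shell before starting Jupyter:

export NEPTUNE_API_TOKEN='YOUR_API_KEY'

Then you can leave the api_token argument out:

neptune.init(project_qualified_name='mwitiderrick/sandbox')  # token is read from the environment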

Create an experiment and save hyperparameters 

The first thing we need to do to start logging to Neptune is to create an experiment. It’s a namespace to which you can log metrics, predictions, visualizations, and anything else (full list here). 

Let’s create an experiment and log model hyperparameters. 

experiment = neptune.create_experiment(name='xgb', tags=['train'], params=params)

Running neptune.create_experiment outputs a link to that experiment in Neptune. 

You can click on it to see the training process live. 

Right now, not much is logged but we can see hyperparameters in the parameters section.

The parameters tab shows the parameters used to train the XGBoost model. 

Neptune parameters tab

When working in Notebooks, once you are done running the experiment, ensure that you run neptune.stop() to finish the current work (in scripts the experiment is stopped automatically).
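
For example, as the last cell of your Notebook:

neptune.stop()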

Create Neptune callback and pass it to `fit` 

In order to log the training metrics to Neptune, we use an out-of-the-box callback from the neptune-contrib library. It’s pretty cool because it’s the only thing we need to add at the training stage. 

With that callback set up, Neptune takes care of the rest. 

from neptunecontrib.monitoring.xgboost import neptune_callback

We train the model by calling the fit method and passing in the parameters we defined earlier, including the Neptune Callback. 

classification.fit(X_train, y_train, callbacks=[neptune_callback()], eval_set=[(X_test, y_test)])
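
For reference, the same callback also works with XGBoost's native training API. A minimal sketch (num_boost_round is an arbitrary choice here):

import xgboost as xgb

# The native API expects DMatrix objects
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

# neptune_callback() logs evaluation metrics after every boosting round
bst = xgb.train(params, dtrain, num_boost_round=100,
                evals=[(dtest, 'test')],
                callbacks=[neptune_callback()])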

Once the training is complete, head back to Neptune and check the logs.

While logged in, you can click the project you’re currently working on to see all your experiments.

Neptune experiments view

In the monitoring tab, CPU and RAM usage are displayed in real time.

Neptune monitoring tab

Clicking on a single experiment will show you the logs for that particular experiment. In the charts section, you can see the training and validation charts.

Neptune experiment charts

With Neptune, you can also zoom in on various sections of the training charts to analyze them in detail.

Neptune charts detail

The “Logs” section shows the logs that were used to generate these graphs. 

Notice the feature importance graph that we get automatically as a result of using the Neptune callback.

Neptune feature importance graph

Once you have a couple of experiments in your project, they can be compared. You do this by selecting the experiments and hitting the compare button.

Neptune compare experiments

Version dataset

Logging a hash of your dataset to Neptune can also be very useful, since it lets you track which version of the data each experiment ran on. This can be done with the help of Python’s hashlib module and Neptune’s set_property function.

import hashlib
neptune.set_property('x_train_version', hashlib.md5(X_train.values).hexdigest())
neptune.set_property('y_train_version', hashlib.md5(y_train.values).hexdigest())
neptune.set_property('x_test_version', hashlib.md5(X_test.values).hexdigest())
neptune.set_property('y_test_version', hashlib.md5(y_test.values).hexdigest())

After that, you can see the versions under the details tab of your project. 

Neptune versions

Version model binary

You can also save various versions of the model to Neptune by using neptune.log_artifact(). However, since we are using neptune_callback(), the model is logged to Neptune automatically after the last boosting iteration.

Neptune artifacts

We can also log the model we saved earlier. 

neptune.log_artifact('classifier.pkl')
Neptune log model

Version whatever else you think you will need

Neptune also offers the ability to log other things, such as model explainers and interactive charts (for example, a ROC curve).

Logging the explainer is done using the log_explainer function. 

from neptunecontrib.api import log_explainer, log_global_explanations
import dalex as dx

expl = dx.Explainer(classification, X, y, label="XGBoost")

log_global_explanations(expl, numerical_features=["F1","F2","F3","F4","F5","F6","F7","F8","F9","F10"])

log_explainer('explainer.pkl', expl)

After doing this, the pickled explainer and the charts will be available in the artifacts section of the experiment.

Neptune artifacts
Neptune artifacts 2

Organize experiments in a dashboard

Neptune allows you to choose what you want to see on the dashboard by adding or removing columns. By clicking Manage Columns, you can add new columns to your dashboard.

Neptune columns 1

If you would like to remove a certain column, just click the x button next to it. You can also sort by a certain column using the up and down arrows next to it; these indicate whether you are sorting in ascending or descending order.

Neptune columns

The platform also enables you to group your experiments into views. For example, you can choose experiments with a certain tag and save those experiments as a new view. Once you do that, the new view will always be accessible to you. 

Neptune tags

Collaborate on ML experiments with your team

You can share any of your Neptune experiments by inviting your teammates to collaborate.

Neptune invite team

You can also share the project with the world by making it public. Once the project is public you can freely share the link with anyone. 

Neptune share project

Note:

When using the team plan you can share your private projects with your teammates. The Team Plan is also free for research, non-profit organizations, and Kagglers.

You can share whatever you do in the Neptune app. For example, I can share my experiment comparison by sending a link.

Neptune share comparison

Download model artifacts programmatically

With Neptune, you can download files from your experiment, or even single items, directly from your Python code. For example, you can download a single file using the download_artifact method. Here, we fetch the experiment object and download the classifier that we uploaded earlier; it is stored in a model folder in our current working directory.

project = neptune.init('mwitiderrick/sandbox', api_token='YOUR_TOKEN')
my_exp = project.get_experiments(id='SAN-21')[0]
my_exp.download_artifact("classifier.pkl", "model")

This is useful when you want to operationalize your models and fetch them directly from your experiment repo. But putting models in production is a story for another day 🙂

Conclusion

Hopefully, this has shown you how easy it is to add experiment tracking and model versioning to your XGBoost training scripts using Neptune. 

Specifically, we covered how to:

  • set up Neptune 
  • use Neptune Callbacks to log our XGBoost training session
  • analyze and compare experiments in Neptune
  • version various items on Neptune 
  • collaborate with team members 
  • download your artifacts from Neptune

Hopefully, with all this information, developing XGBoost models will now be cleaner and more manageable.

Thanks for reading!


NEXT STEPS

Get started with Neptune in 5 minutes

If you are looking for an experiment tracking tool you may want to take a look at Neptune. 

It takes literally 5 minutes to set up, and as one of our happy users said:

“Within the first few tens of runs, I realized how complete the tracking was – not just one or two numbers, but also the exact state of the code, the best-quality model snapshot stored to the cloud, the ability to quickly add notes on a particular experiment. My old methods were such a mess by comparison.” – Edward Dixon, Data Scientist @intel

To get started follow these 4 simple steps. 

Step 1

Install the client library.

pip install neptune-client

Step 2

Connect to the tool by adding a snippet to your training code. 

For example:

import neptune

neptune.init(...) # credentials
neptune.create_experiment() # start logger

Step 3

Specify what you want to log:

neptune.log_metric('accuracy', 0.92)

for prediction_image in worst_predictions:
    neptune.log_image('worst predictions', prediction_image)

Step 4

Run your experiment as you normally would:

python train.py

And that’s it!

Your experiment is logged to a central experiment database and displayed in the experiment dashboard, where you can search, compare, and drill down to whatever information you need.

Get your free account ->