
How to Organize Your XGBoost Machine Learning (ML) Model Development Process – Best Practices

XGBoost is a top gradient boosting library that is available in Python, Java, C++, R, and Julia. 

The library offers support for GPU training, distributed computing, parallelization, and cache optimization. Developers also love it for its execution speed, accuracy, efficiency, and usability.

However, when you are developing machine learning models in any framework, XGBoost included, you may end up trying a bunch of parameter configurations and feature versions to get a satisfactory performance. Managing all these configurations with spreadsheets and naming conventions can be a real pain in the neck. 

There are tools that can help developers organize their machine learning model development process. This article will focus on showing you how you can do that in one of the top ML experiment management tools, Neptune.

Let me show you how to add experiment management on top of your current model development setup.


CHECK ALSO
Neptune’s integration with XGBoost – documentation


What ML model development with XGBoost typically looks like

Get the dataset ready

Before we can train any model, we need a dataset. For this illustration, we will generate a classification dataset using Scikit-learn. In real life, you probably have some features prepared already and will simply load those.

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Generate a synthetic binary classification dataset
X, y = make_classification(n_samples=100000, n_features=10, n_redundant=0, n_informative=8, random_state=1)
X = pd.DataFrame(X, columns=["F1", "F2", "F3", "F4", "F5", "F6", "F7", "F8", "F9", "F10"])
y = pd.DataFrame(y, columns=["Target"])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

Train the XGBoost model

Next, we’ll import XGBoost and set up our parameters. Since this is a binary classification problem, we use the logistic objective. After that, we initialize the classifier with those parameters. You can also pass in the parameters from a YAML file, as sketched after the code below.

params = {"objective":"binary:logistic",'colsample_bytree': 0.3,'learning_rate': 0.1,
                'max_depth': 5, 'alpha': 10}
classification = xgb.XGBClassifier(**params)
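
If you prefer to keep hyperparameters out of the code, here is a minimal sketch of loading them from a YAML file instead. The params.yaml file name is a hypothetical example, and the snippet assumes PyYAML is installed and that `import xgboost as xgb` from above is in place.

# params.yaml (hypothetical file):
# objective: "binary:logistic"
# colsample_bytree: 0.3
# learning_rate: 0.1
# max_depth: 5
# alpha: 10

import yaml  # PyYAML

with open("params.yaml") as f:
    params = yaml.safe_load(f)

classification = xgb.XGBClassifier(**params)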

The next step is to train the model on the training set.

classification.fit(X_train, y_train)

After training, we need to save the model so that we can use it during the deployment process.

# joblib is a standalone package (sklearn.externals.joblib was removed in newer scikit-learn versions)
import joblib

joblib.dump(classification, 'classifier.pkl')

Next, we generate predictions on the test set, evaluate the model, and show the classification report.

from sklearn.metrics import classification_report

predictions = classification.predict(X_test)
print(classification_report(y_test, predictions))

Finally, we convert the predictions into a dataframe and save them as a CSV file for future reference or deep-dive error analysis.

import pandas as pd
pd.DataFrame(predictions, columns=["Predictions"]).to_csv("predict.csv")

Now, let me show you how to get all that versioned, and easy to manage with Neptune.


RELATED ARTICLES
📌 Best Tools to Manage Machine Learning Projects
📌 Machine Learning Model Management in 2020 and Beyond – Everything That You Need to Know


Organizing ML development in Neptune

Install packages and set up Neptune 

We’ll be using Neptune in Jupyter Notebooks, so we need both the Neptune Client and Neptune Jupyter extension. 

Run the following command in your terminal to install Neptune:

pip install neptune-client

Configuring Neptune for Jupyter Notebooks is essential because it enables us to save our Notebook checkpoints to Neptune. If you are not using Notebooks you can skip this part. 

Next, set up the Notebooks extension:

pip install neptune-notebooks

Once it’s installed, you have to enable the extension so that the integration with your Jupyter Notebook can work:

jupyter nbextension enable --py neptune-notebooks

If you are working in JupyterLab, the Neptune extension can be installed there as well.
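
At the time of writing, this was typically done through the JupyterLab extension manager; treat the exact command below as an assumption and check the Neptune documentation for the current instructions:

jupyter labextension install neptune-notebooks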

Since we’re installing packages, let’s also get the Neptune Contrib package out of the way. It contains a callback that will let us log our metrics, the model, and feature importances to Neptune while training the XGBoost model. 

pip install neptune-contrib[monitoring]

Connect your script to Neptune 

At this point, you need a free Neptune account to initialize Neptune in your Jupyter Notebook. For that to work, you need your Neptune API token.

Once you’re logged in, you can get the token by clicking your profile image.

neptune getting started

All set? Let’s jump over to our Jupyter Notebook and do the initialization. 

The first step is to connect our Notebook to Neptune by clicking the Neptune logo.

Neptune connect

You will now be prompted to enter your API Token. Once the connection is successful you can upload your Notebook to Neptune by clicking the upload button. 

Neptune configure API

After that, we use `neptune.init` to initialize the communication and connect the current script/notebook with your project in Neptune.

import neptune

neptune.init(project_qualified_name='mwitiderrick/sandbox', api_token='YOUR_API_KEY')

In this case, I use the ‘sandbox’ project that’s created automatically when you sign up. However, you can create a new project from the “Projects” tab.

Neptune projects tab
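
If you prefer not to hardcode the API token in the notebook, the neptune-client library can also read it from the NEPTUNE_API_TOKEN environment variable. Here is a minimal sketch, assuming a bash-like shell.

# Set once in your terminal:
#   export NEPTUNE_API_TOKEN='YOUR_API_KEY'

import neptune

# With the environment variable set, api_token can be omitted here
neptune.init(project_qualified_name='mwitiderrick/sandbox')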

Create an experiment and save hyperparameters 

The first thing we need to do to start logging to Neptune is to create an experiment. It’s a namespace to which you can log metrics, predictions, visualizations, and anything else (see the Neptune documentation for the full list).

Let’s create an experiment and log model hyperparameters. 

experiment = neptune.create_experiment(name='xgb', tags=['train'], params=params)

Running neptune.create_experiment outputs a link to that experiment in Neptune. 

You can click on it to see the training process live. 

Right now, not much is logged but we can see hyperparameters in the parameters section.

The parameters tab shows the parameters used to train the XGBoost model. 

Neptune parameters tab

When working in Notebooks, once you are done running the experiment, make sure you run neptune.stop() to finish the current work (in scripts, the experiment is stopped automatically).
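
For example, as the last cell of your notebook:

# Stop the experiment explicitly when working in a notebook
neptune.stop()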

Create Neptune callback and pass it to `fit` 

In order to log the training metrics to Neptune, we use an out-of-the-box callback from the neptune-contrib library. It’s pretty cool because it’s the only thing we need to add at the training stage. 

With that callback set up, Neptune takes care of the rest. 

from neptunecontrib.monitoring.xgboost import neptune_callback

We train the model by calling the fit method and passing in the parameters we defined earlier, including the Neptune Callback. 

classification.fit(X_train, y_train, callbacks=[neptune_callback()], eval_set=[(X_test, y_test)])

Once the training is complete, head back to Neptune and check the logs.

While logged in, you can click the project you’re currently working on to see all your experiments.

Neptune experiments view

On the monitoring tab, the CPU and RAM usage are displayed in real-time.

Neptune monitoring tab

Clicking on a single experiment will show you the logs for that particular experiment. In the charts section, you can see the training and validation charts.

Neptune experiment charts

With Neptune, you can also zoom in on various sections of the training to analyze them in detail.

Neptune charts detail

The “Logs” section shows the raw values that were used to generate these charts.
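
The callback takes care of the training metrics, but you can also log your own values with neptune.log_metric, for example an evaluation metric computed on the test set. This is a minimal sketch that assumes the experiment is still running.

from sklearn.metrics import accuracy_score

# Log a custom evaluation metric to the current experiment
test_accuracy = accuracy_score(y_test, classification.predict(X_test))
neptune.log_metric('test_accuracy', test_accuracy)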

Notice the feature importance graph that we get automatically as a result of using the Neptune callback.

Neptune feature importance graph

Once you have a couple of experiments in your project, they can be compared. You do this by selecting the experiments and hitting the compare button.

Neptune compare experiments

Version dataset

Logging a hash of your dataset in Neptune can also be very useful, as it lets you track which version of the data each experiment was run on. This can be done with the help of Python’s hashlib module and Neptune’s set_property function.

import hashlib
neptune.set_property('x_train_version', hashlib.md5(X_train.values).hexdigest())
neptune.set_property('y_train_version', hashlib.md5(y_train.values).hexdigest())
neptune.set_property('x_test_version', hashlib.md5(X_test.values).hexdigest())
neptune.set_property('y_test_version', hashlib.md5(y_test.values).hexdigest())

After that, you can see the versions under the details tab of your project. 

Neptune versions

Version model binary

You can also save various versions of the model to Neptune by using neptune.log_artifact(). However, since we are using neptune_callback(), the model is automatically logged to Neptune after the last boosting iteration.

Neptune artifacts

We can also log the model we saved earlier. 

neptune.log_artifact('classifier.pkl')
Neptune log model

Version whatever else you think you will need

Neptune also offers the ability to log other objects, such as model explainers and interactive charts like ROC curves.

Logging the explainer is done using the log_explainer function. 

from neptunecontrib.api import log_explainer, log_global_explanations
import dalex as dx

expl = dx.Explainer(classification, X, y, label="XGBoost")

log_global_explanations(expl, numerical_features=["F1","F2","F3","F4","F5","F6","F7","F8","F9","F10"])

log_explainer('explainer.pkl', expl)

After doing this the pickled explainer and charts will be available in the artifacts section of the experiment.

Neptune artifacts
Neptune artifacts 2

Organize experiments in a dashboard

Neptune allows you to choose what you want to see on the dashboard: you can add or remove columns. Clicking Manage Columns lets you add new columns to your dashboard.

Neptune columns 1

If you would like to remove a certain column, just click the x button next to it. You can also sort by a certain column using the up and down arrows next to it, which indicate whether the sort is ascending or descending.

Neptune columns

The platform also enables you to group your experiments into views. For example, you can select experiments with a certain tag and save them as a new view. Once you do that, the new view will always be accessible to you. Tags can also be added from code, as sketched below.

Neptune tags
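
Besides tagging experiments in the UI, you can add tags from code with the client’s append_tag method; a minimal sketch, assuming the experiment is still running:

# Tag the current experiment so it can be grouped into a saved view later
neptune.append_tag('xgboost')
neptune.append_tag('baseline')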

Collaborate on ML experiments with your team

You can share any of your Neptune experiments by inviting your teammates to collaborate.

Neptune invite team

You can also share the project with the world by making it public. Once the project is public you can freely share the link with anyone. 

Neptune share project

Note:

When using the Team Plan, you can share your private projects with your teammates. The Team Plan is also free for research, non-profit organizations, and Kagglers.

You can share whatever you do in the Neptune app. For example, I can share my experiment comparison by sending a link.

Neptune share comparison

Download model artifacts programmatically

With Neptune, you can download files from your experiment, or even single items, directly from your Python code. For example, you can download a single file using the download_artifact method. Here, we fetch the experiment object and download the classifier that we uploaded earlier; it is saved into a model folder in our current working directory.

project = neptune.init('mwitiderrick/sandbox', api_token='YOUR_TOKEN')
my_exp = project.get_experiments(id='SAN-21')[0]

# Download the saved classifier into a local "model" folder
my_exp.download_artifact("classifier.pkl", "model")

This is useful when you want to operationalize your models and fetch them directly from your experiment repository. But putting models in production is a story for another day 🙂

Conclusion

Hopefully, this has shown you how easy it is to add experiment tracking and model versioning to your XGBoost training scripts using Neptune. 

Specifically, we covered how to:

  • set up Neptune 
  • use Neptune Callbacks to log our XGBoost training session
  • analyze and compare experiments in Neptune
  • version various items on Neptune 
  • collaborate with team members 
  • download your artifacts from Neptune

Hopefully, with all this information, developing XGBoost models will now be cleaner and more manageable.

Thanks for reading!

