How to Organize Your XGBoost Machine Learning (ML) Model Development Process: Best Practices
XGBoost is a top gradient boosting library that is available in Python, Java, C++, R, and Julia.
The library offers support for GPU training, distributed computing, parallelization, and cache optimization. Developers also love it for its execution speed, accuracy, efficiency, and usability.
However, when you are developing machine learning models in any framework, XGBoost included, you may end up trying a bunch of parameter configurations and feature versions to get a satisfactory performance. Managing all these configurations with spreadsheets and naming conventions can be a real pain in the neck.
There are tools that can help developers organize their machine learning model development process. This article will focus on showing you how you can do that in one of the top ML experiment management tools, Neptune.
Let me show you how to add experiment management on top of your current model development setup.
What ML model development with XGBoost typically looks like
Get the dataset ready
Before we can train any model, we need a dataset. For this illustration, we will generate a classification dataset using Scikit-learn. In real life, you probably already have features prepared, and you would load those instead.
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Generate a synthetic binary classification dataset
X, y = make_classification(n_samples=100000, n_features=10, n_redundant=0, n_informative=8, random_state=1)

# Wrap the arrays in DataFrames with named columns
X = pd.DataFrame(X, columns=["F1", "F2", "F3", "F4", "F5", "F6", "F7", "F8", "F9", "F10"])
y = pd.DataFrame(y, columns=["Target"])

# Hold out a test set for evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
Train the XGBoost model
Next, we import XGBoost and set up our parameters. Since this is a binary classification problem, we use the logistic objective. After that, we initialize the classifier with those parameters. You can also pass in the parameters from a YAML file, as shown in the sketch after the next snippet.
import xgboost as xgb

params = {"objective": "binary:logistic", "colsample_bytree": 0.3, "learning_rate": 0.1,
          "max_depth": 5, "alpha": 10}
classification = xgb.XGBClassifier(**params)
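If you prefer to keep the configuration outside the code, a minimal sketch of the YAML option could look like this. The `params.yaml` file name and its contents are assumptions for illustration, and reading it requires the PyYAML package:

import yaml  # provided by the PyYAML package
import xgboost as xgb

# Hypothetical params.yaml mirroring the dictionary above:
# objective: "binary:logistic"
# colsample_bytree: 0.3
# learning_rate: 0.1
# max_depth: 5
# alpha: 10
with open("params.yaml") as f:
    params = yaml.safe_load(f)

classification = xgb.XGBClassifier(**params)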
The next step is to train the model on the training set.
classification.fit(X_train, y_train)
After training, we need to save the model so that we can use it during the deployment process.
import joblib  # sklearn.externals.joblib is deprecated; use the standalone joblib package

joblib.dump(classification, 'classifier.pkl')
Next, we generate predictions on the test set and print the classification report.
from sklearn.metrics import classification_report

predictions = classification.predict(X_test)
print(classification_report(y_test, predictions))
Finally, we convert the predictions into a dataframe and save them as a CSV file for future reference or deep-dive error analysis (a sketch of which follows below).
import pandas as pd
pd.DataFrame(predictions, columns=["Predictions"]).to_csv("predict.csv")
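As an illustration of what such a deep dive could start from (just a sketch on my part, using the `predictions`, `X_test`, and `y_test` objects from the snippets above), you could join the predictions back onto the test features and keep only the misclassified rows:

# Sketch: collect misclassified test rows for later inspection
results = X_test.copy()
results["Target"] = y_test.values.ravel()
results["Prediction"] = predictions
errors = results[results["Target"] != results["Prediction"]]
errors.to_csv("errors.csv", index=False)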
Now, let me show you how to get all of that versioned and easy to manage with Neptune.
Organizing ML development in Neptune
Install packages and set up Neptune
We’ll be using Neptune in Jupyter Notebooks, so we need both the Neptune Client and Neptune Jupyter extension.
Run the following command in your terminal to install Neptune:
pip install neptune-client
Configuring Neptune for Jupyter Notebooks is essential because it enables us to save our Notebook checkpoints to Neptune. If you are not using Notebooks you can skip this part.
Next, set up the Notebooks extension:
pip install neptune-notebooks
Once it’s installed, you have to enable the extension so that the integration with your Jupyter Notebook can work:
jupyter nbextension enable --py neptune-notebooks
If you are working in JupyterLab, you can install the extension there as well.
Since we’re installing packages, let’s also get the Neptune Contrib package out of the way. It contains a callback that will let us log our metrics, the model, and feature importances to Neptune while training the XGBoost model.
pip install neptune-contrib[monitoring]
Connect your script to Neptune
At this point, you need a free Neptune AI account to initialize Neptune in your Jupyter Notebook. For that to work, you need a Neptune API token.
Once you're logged in, you can get the token by clicking your profile image.

All set? Let’s jump over to our Jupyter Notebook and do the initialization.
The first step is to connect our Notebook to Neptune by clicking the Neptune logo.

You will now be prompted to enter your API Token. Once the connection is successful you can upload your Notebook to Neptune by clicking the upload button.

After that, we use `neptune.init` to initialize the communication and connect the current script/notebook with your project in Neptune.
import neptune
neptune.init(project_qualified_name='mwitiderrick/sandbox', api_token='YOUR_API_KEY')
In this case, I use the ‘sandbox’ project that’s created automatically when you sign up. However, you can create a new project from the “Projects” tab.
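Hard-coding the token in a Notebook makes it easy to leak; a common alternative (my suggestion, not something Neptune requires) is to read it from an environment variable:

import os
import neptune

# Assumes you exported the token beforehand, e.g.
# export NEPTUNE_API_TOKEN='YOUR_API_KEY'
neptune.init(project_qualified_name='mwitiderrick/sandbox',
             api_token=os.getenv('NEPTUNE_API_TOKEN'))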

Create an experiment and save hyperparameters
The first thing we need to do to start logging to Neptune is to create an experiment. It’s a namespace to which you can log metrics, predictions, visualizations, and anything else (see the full list of metadata types you can log and display in Neptune).
Let’s create an experiment and log model hyperparameters.
experiment = neptune.create_experiment(name='xgb', tags=['train'], params=params)
Running `neptune.create_experiment` outputs a link to that experiment in Neptune. You can click on it to see the training process live.
Right now, not much is logged, but the Parameters tab already shows the hyperparameters used to train the XGBoost model.

When working in Notebooks, once you are done running the experiment, make sure you run `neptune.stop()` to finish the current work (in scripts, the experiment is stopped automatically), as in the sketch below.
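A minimal sketch of that pattern (wrapping the run in try/finally is my own suggestion, so the experiment gets closed even if training fails):

experiment = neptune.create_experiment(name='xgb', tags=['train'], params=params)
try:
    classification.fit(X_train, y_train)
finally:
    # Always close the experiment when running inside a Notebook
    neptune.stop()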
Create Neptune callback and pass it to `fit`
In order to log the training metrics to Neptune, we use an out-of-the-box callback from the neptune-contrib library. It’s pretty cool because it’s the only thing we need to add at the training stage.
With that callback set up, Neptune takes care of the rest.
from neptunecontrib.monitoring.xgboost import neptune_callback
We train the model by calling the `fit` method, passing in the Neptune callback and an evaluation set.
classification.fit(X_train, y_train, callbacks=[neptune_callback()], eval_set=[(X_test, y_test)])
Once the training is complete, head back to Neptune and check the logs.
While logged in, you can click the project you’re currently working on to see all your experiments.

On the monitoring tab, the CPU and RAM usage are displayed in real-time.

Clicking on a single experiment will show you the logs for that particular experiment. In the Charts section, you can see the training and validation charts.

With Neptune, you can also zoom in on various sections of the training to analyze them in detail.

The “Logs” section shows the logs that were used to generate these graphs.
Notice the feature importance graph that we get automatically as a result of using the Neptune callback.

Once you have a couple of experiments in your project, they can be compared. You do this by selecting the experiments and hitting the compare button.

Version dataset
Versioning your dataset hash in Neptune can also be very useful: it lets you track which version of the dataset each experiment was run against. This can be done with the help of Python’s `hashlib` module and Neptune’s `set_property` function.
import hashlib
neptune.set_property('x_train_version', hashlib.md5(X_train.values).hexdigest())
neptune.set_property('y_train_version', hashlib.md5(y_train.values).hexdigest())
neptune.set_property('x_test_version', hashlib.md5(X_test.values).hexdigest())
neptune.set_property('y_test_version', hashlib.md5(y_test.values).hexdigest())
After that, you can see the versions under the details tab of your project.
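If you also want to keep the data itself, not just its hash, one option (an extra step, not part of the original workflow) is to log the splits as artifacts alongside the hashes:

# Optional: store the actual splits so each hash can be traced back to concrete data
X_train.to_csv("X_train.csv", index=False)
y_train.to_csv("y_train.csv", index=False)
neptune.log_artifact("X_train.csv")
neptune.log_artifact("y_train.csv")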

Version model binary
You can also save various versions of the model to Neptune by using `neptune.log_artifact()`. However, since we are using `neptune_callback()`, the model is logged to Neptune automatically after the last boosting iteration.

We can also log the model we saved earlier.
neptune.log_artifact('classifier.pkl')

Version whatever else you think you will need
Neptune also offers the ability to log other things, such as model explainers and interactive charts like ROC curves.
Logging the explainer is done with the `log_explainer` function.
from neptunecontrib.api import log_explainer, log_global_explanations
import dalex as dx
expl = dx.Explainer(classification, X, y, label="XGBoost")
log_global_explanations(expl, numerical_features=["F1","F2","F3","F4","F5","F6","F7","F8","F9","F10"])
log_explainer('explainer.pkl', expl)
After doing this, the pickled explainer and the charts will be available in the Artifacts section of the experiment.


Organize experiments in a dashboard
Neptune allows you to choose what you want to see on the dashboard. By clicking Manage Columns, you can add new columns to your dashboard.

If you would like to remove a certain column, just click the x button next to it. You can also sort by a certain column using the up and down arrows next to it, which toggle between ascending and descending order.

The platform also enables you to group your experiments into views. For example, you can choose experiments with a certain tag and save those experiments as a new view. Once you do that, the new view will always be accessible to you.

Collaborate on ML experiments with your team
You can share any of your Neptune experiments by inviting your teammates to collaborate.

You can also share the project with the world by making it public. Once the project is public you can freely share the link with anyone.

Note:
With the Team Plan, you can share your private projects with your teammates. The Team Plan is also free for research, non-profit organizations, and Kagglers.
You can share whatever you do in the Neptune app. For example, I can share my experiment comparison by sending a link.


Download model artifacts programmatically
With Neptune, you can download files from your experiment, even single items, directly from your Python code. For example, you can download a single file with the `download_artifact` method: we fetch the experiment object and then download the artifact we need, in this case the classifier that we uploaded earlier. The classifier is stored in a model folder in our current working directory.
project = neptune.init(project_qualified_name='mwitiderrick/sandbox', api_token='YOUR_TOKEN')

# Fetch the experiment by its ID and download the saved classifier into a local "model" folder
my_exp = project.get_experiments(id='SAN-21')[0]
my_exp.download_artifact("classifier.pkl", "model")
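From there, you could load the downloaded model straight back into memory, for example with joblib (assuming the file landed in the model folder as above):

import joblib

# Load the classifier that was just downloaded from Neptune
clf = joblib.load("model/classifier.pkl")
print(clf.predict(X_test[:5]))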
This is useful when you want to operationalize your models and fetch them directly from your experiment repo. But putting models in production is a story for another day 🙂
Conclusion
Hopefully, this has shown you how easy it is to add experiment tracking and model versioning to your XGBoost training scripts using Neptune.
Specifically, we covered how to:
- set up Neptune
- use Neptune Callbacks to log our XGBoost training session
- analyze and compare experiments in Neptune
- version various items on Neptune
- collaborate with team members
- download your artifacts from Neptune
With all this information, developing XGBoost models should now be cleaner and more manageable.
Thanks for reading!