How to Organize Your ML Development in an Efficient Way

One major issue that every data scientist and ML practitioner will eventually encounter is workflow management. Testing different scenarios and use cases, logging information and details, sharing and comparing results from a particular set of samples, visualizing the data, and keeping track of insights are all key components of data science workflow management. They help the business and enable you to scale any data science project.

Data scientists know well that testing one version of an ML algorithm is not enough. Our field strongly relies on empiricism, so we need to test and compare multiple versions of the same algorithm, with different hyperparameters and feature selections.

All of this generates metadata, which needs to be stored properly. To do this, I use a platform that can manage all of that for me: Neptune. It comes with a complete client library that you can seamlessly integrate into your code. It also gives you access to a web-based UI where all your data is logged and available.

To give you a tour of what Neptune has to offer, I simulated a real use-case scenario with a prepared online dataset. We’ll be running different analytics and ML processes to see how well Neptune can support you in daily work. 

Setting up Neptune environment 

Before you start integrating Neptune into your project, it's useful to know how to install the packages and libraries, and how to connect your Jupyter notebook to your Neptune account.

First, let’s create a conda virtual environment where we’ll be installing all the required neptune libraries:

conda create --name neptune python=3.6

Install neptune-client library:

pip install neptune-client

Install Neptune Notebooks to save all our work to Neptune’s web client:

pip install -U neptune-notebooks

Enable jupyter integration with the following extension:

jupyter nbextension enable --py neptune-notebooks

Get your API key, and connect your notebook with your Neptune session:

ML development - API

Once you’ve successfully connected your notebook, you’ll need to create a personal project where all your experiments will live:

ML development - new project

To complete the setup, import the neptune client library in your notebook, and initialize the connection calling the neptune.init() method:

import neptune
neptune.init(project_qualified_name='aymane.hachcham/CaseStudyOnlineRetail')

You can also check a video made by Kamil, a former data scientist at Neptune AI, which thoroughly explains the previous details.

Note: I include code where it’s most instructive. If you want to check the full code and the notebooks, feel free to visit my GitHub repo: Neptune-Retail

Exploring the dataset

We’ll be taking a look at an online retail dataset, publicly available at Kaggle. The dataset records various customers from all around the world that use an online selling platform. Each record informs about an order to purchase a specific product. 

The dataset appears as follows:

ML development - dataset

To start loading the dataset, I created a small Python DataETLManager class to read the CSV file, extract the main features, and transform them into a usable pandas dataframe:

import logging
import os

import pandas as pd


class DataETLManager:
    def __init__(self, root_dir: str, csv_file: str):
        if os.path.exists(root_dir):
            if csv_file.endswith('.csv'):
                self.csv_file = os.path.join(root_dir, csv_file)
            else:
                logging.error('The file is not in csv format')
                exit(1)
        else:
            logging.error('The root dir path does not exist')
            exit(1)

        self.retail_df = pd.read_csv(self.csv_file, sep=',', encoding='ISO-8859-1')

    def extract_data(self):
        return self.retail_df

    def fetch_columns(self):
        return self.retail_df.columns.tolist()

    def data_description(self):
        return self.retail_df.describe()

    def fetch_categorical(self, categorical=False):
        # Numeric columns are identified by dtype; everything else is treated as categorical
        numeric_columns = self.retail_df.select_dtypes(include='number').columns
        if categorical:
            return self.retail_df.drop(columns=numeric_columns)
        return self.retail_df[numeric_columns]

    def transform_data(self):
        # Drop exact duplicate rows, keeping the last occurrence:
        data = self.retail_df.drop_duplicates(keep='last')

        # Fill null values with per-column defaults:
        data = data.fillna({
            'InvoiceNo': 0,
            'Description': 'No Description',
            'StockCode': '----',
            'Quantity': 0,
            'UnitPrice': 0.00,
            'CustomerID': 0,
            'Country': 'None',
        })

        # Parse dates; a placeholder such as '00/00/0000 00:00' cannot be parsed,
        # so missing or malformed dates are left as NaT instead:
        data['InvoiceDate'] = pd.to_datetime(data['InvoiceDate'], errors='coerce')

        self.data_transfomed = data

The important columns that we can leverage to start building our internal core metrics are: 

  • The InvoiceDate
  • Quantity
  • UnitPrice
  • CustomerID
  • Country

Start by loading the dataset using the DataETLManager:

etl_manager = DataETLManager(root_dir='./Data', csv_file='OnlineRetail.csv')
etl_manager.extract_data()
etl_manager.transform_data()

dataset = etl_manager.data_transfomed

For a retail business, the core value lies in the revenue the platform generates through customer orders. We can compute the monthly revenue by multiplying UnitPrice by Quantity and aggregating the result by the month of the InvoiceDate:

dataset['Profit'] = dataset['Quantity'] * dataset['UnitPrice']
# Group by calendar month rather than by raw invoice timestamp
invoice_month = dataset['InvoiceDate'].dt.to_period('M').astype(str)
revenue = dataset.groupby(invoice_month)['Profit'].sum().reset_index()
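
To make the aggregation concrete, here is a minimal sketch on a three-row toy frame. The `orders` data and the `Revenue` column name are made up for illustration; the point is only that grouping by calendar month, rather than by raw timestamp, is what produces one figure per month:

```python
import pandas as pd

# Hypothetical toy orders: two in January 2011, one in February 2011
orders = pd.DataFrame({
    'InvoiceDate': pd.to_datetime(['2011-01-03 09:15', '2011-01-20 14:00', '2011-02-02 11:30']),
    'Quantity': [2, 1, 5],
    'UnitPrice': [3.0, 10.0, 1.5],
})

orders['Revenue'] = orders['Quantity'] * orders['UnitPrice']

# Collapse each timestamp to its calendar month, e.g. '2011-01'
month = orders['InvoiceDate'].dt.to_period('M').astype(str)
monthly_revenue = orders.groupby(month)['Revenue'].sum()
print(monthly_revenue)
```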

We could also visualize how the revenue evolves across the months by plotting the following chart:

import plotly.graph_objects as go
import plotly.offline as pyoff

pyoff.init_notebook_mode()

# Plot the revenue frame computed above
data = go.Scatter(
    x=revenue['InvoiceDate'],
    y=revenue['Profit']
)

layout = go.Layout(
    xaxis={"type": "category"},
    title='Monthly Revenue'
)

fig = go.Figure(data, layout)
pyoff.iplot(fig)
ML development - graph

Since we’re mainly targeting customers, one metric that deserves attention is the number of active customers that our platform retains. We will conduct our experiments exclusively on UK customers, as they constitute the majority of the data sample.

ML development - chart customers

To study active customer retention, we need to check how many distinct customers placed orders each month:

uk_customers = dataset.query("Country=='United Kingdom'").reset_index(drop=True)
# Count unique UK customers per calendar month
invoice_month = uk_customers['InvoiceDate'].dt.to_period('M').astype(str)
activeCustomers = uk_customers.groupby(invoice_month)['CustomerID'].nunique().reset_index()

The distribution appears to be quite monotonic with a peak in November 2011.

For our case study, we would like to properly segment those customers. This way, we could efficiently manage the portfolio and dissect the different levels of value each group actually offers. 

We should also keep in mind that as the business grows in size, it won’t be possible to have an intuition about each and every customer. At that stage, human judgments about which customers to pursue won’t work, and the business will have to use a data-driven approach to build a proper strategy.

In the next section, we’ll be digging deeper into the different metrics and analysis that we can leverage to appropriately segment our customer base. 

ML development - chart users



Extract metrics and run analytics on the data

In this section we’ll thoroughly analyse the data. We want to segment the whole customer base according to financial criteria. At the end of this section, we should be able to profile and know our customer purchase behavior. 

As we’ve already initialized our project in Neptune, we’ll kick off our first experiment by logging the statistics we pull out throughout this section. You can think of a Neptune experiment as a namespace to which you can log metrics, predictions, visualizations, and anything else you might need.

Start by initializing the parameters for this experiment, and call the create_experiment() method.

params = {
    'n_clusters':4,
    'max_iterations': 1000,
    'first_metric': 'Recency',
    'second_metric': 'Frequency',
    'third_metric': 'Monetary Value',
    'users': 'UK'
}

neptune.create_experiment(
    name='KMeans-UK-Users',
    tags=['KMeans-UK'],
    params=params
)

Once you run the notebook cell, you can head to the website. If you open the experiment we’ve just created, under Parameters you will find the values properly logged and ready to track further actions.

ML development - parameters

In order to segment our customer base according to profitability and growth potential, we’ll focus on three main factors that shape our customers’ financial behavior. Together, they constitute the so-called RFM Score:

  • Recency of use: A metric to monitor how recent the user activity is
  • Frequency of use: How often users purchase products on the platform
  • Monetary Value: Literally, how profitable they are

First, we need to derive the metrics from the dataset. Then, we’ll cluster those data points so that we can group customers by similarity into categories ranging from highly valuable to less valuable. Insights from customer segmentation are used to develop tailor-made marketing campaigns and to design marketing strategy.

For this task, the K-Means clustering algorithm remains a very powerful tool. Its simplicity of use, along with its performance, strikes a good balance for our use case.

For a detailed explanation on how K-Means works, I recommend this article that perfectly does the job: K-Means Clustering — Explained

RFM Score

The idea is to measure how many days have passed since each customer's last purchase, that is, the number of days of recorded inactivity on the platform. We can calculate it as the overall latest purchase date in the dataset minus each customer's own latest purchase date.

Create the customer data frame we’ll be working on:

customers = pd.DataFrame(dataset['CustomerID'].unique())
customers.columns = ['CustomerID']

Aggregate the Max Invoice Date:

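## Recency ##
# Merge on CustomerID so that each customer is matched with their own last purchase date
last_purchase = dataset.groupby('CustomerID', as_index=False)['InvoiceDate'].max()
last_purchase.columns = ['CustomerID', 'LastPurchaseDate']
customers = pd.merge(customers, last_purchase, on='CustomerID')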

Generate the Recency Score:

# Generating R Score
customers['Recency'] = (customers['LastPurchaseDate'].max() - customers['LastPurchaseDate']).dt.days
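
As a quick sanity check of the recency definition, here is a small sketch on made-up orders (the `orders` frame and its dates are purely illustrative). The customer with the most recent purchase gets a recency of 0 days, and every other customer is measured against that date:

```python
import pandas as pd

# Hypothetical orders for three customers
orders = pd.DataFrame({
    'CustomerID': [1, 2, 1, 3],
    'InvoiceDate': pd.to_datetime(['2011-01-05', '2011-02-01', '2011-03-10', '2011-03-20']),
})

# Latest purchase per customer, indexed by CustomerID
last_purchase = orders.groupby('CustomerID')['InvoiceDate'].max()

# Days between the overall latest purchase and each customer's latest purchase
recency = (last_purchase.max() - last_purchase).dt.days
print(recency)
```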

Customer Recency:

ML development - table customers

Now that we have the corresponding table, it would be a good idea to log it to our Neptune experiment.

To do so, we can call the log_table() function from neptunecontrib as follows:

from neptunecontrib.api import log_table
log_table('Recency English Users', recency_UK)
ML development - neptune log table

Now you can proceed to apply K-Means to cluster our Recency distribution. Before that, we need to define the number of clusters that will best suit our needs. One way to do it is the Elbow method: fit K-Means for a range of cluster counts, look at the inertia of each fit, and pick the point where adding more clusters stops reducing the inertia substantially.

from sklearn.cluster import KMeans

kmeans_metrics = {}

for k in range(1, 10):
    # KMeans expects a 2-D input, hence the double brackets
    kmeans = KMeans(n_clusters=k, max_iter=1000).fit(customers[['Recency']])
    customers["clusters"] = kmeans.labels_
    kmeans_metrics[k] = kmeans.inertia_
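
To see what inertia measures and why the curve bends, here is a dependency-light sketch of Lloyd's algorithm on 1-D data using NumPy only. It is not scikit-learn's implementation: the `kmeans_1d` helper, its quantile-based initialisation, and the toy `recency_days` values are all assumptions made for illustration. Inertia is the sum of squared distances from each point to its assigned center:

```python
import numpy as np

def kmeans_1d(values, k, n_iter=100):
    """Plain Lloyd's algorithm on 1-D data; returns (centers, labels, inertia)."""
    x = np.asarray(values, dtype=float)
    # Deterministic initialisation: spread the starting centers over the quantiles
    centers = np.quantile(x, np.linspace(0, 1, k + 2)[1:-1])
    for _ in range(n_iter):
        # Assign every point to its nearest center
        labels = np.argmin(np.abs(x[:, None] - centers[None, :]), axis=1)
        # Move each center to the mean of its points (keep it if the cluster is empty)
        new_centers = np.array([
            x[labels == j].mean() if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    labels = np.argmin(np.abs(x[:, None] - centers[None, :]), axis=1)
    inertia = float(((x - centers[labels]) ** 2).sum())
    return centers, labels, inertia

# Three obvious groups of recency values (in days)
recency_days = [1, 2, 3, 50, 51, 52, 200, 201, 202]
inertias = {k: kmeans_1d(recency_days, k)[2] for k in range(1, 5)}
for k, value in inertias.items():
    print(k, round(value, 1))
```

On this toy data the inertia drops sharply up to three clusters and barely improves after that, which is the "elbow" the method looks for.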

Let’s plot the values in Neptune, so we can check in detail how the curve evolves:

for val in kmeans_metrics.values():
    neptune.log_metric('Kmeans_Inertia_Values', val)

Neptune automatically logs the values in the Logs section and generates a graph chart accordingly.

ML development - neptune chart

According to the graph, the optimal number of clusters is 4. So we’ll proceed using 4 clusters for the three metrics.

K-Means for Recency:

kmeans = KMeans(n_clusters=4)
kmeans.fit(customers[['Recency']])
customers['RecencyCluster'] = kmeans.predict(customers[['Recency']])

Let’s log the Recency Distribution and the predicted clusters.

# Logging the obtained clusters:
for cluster in customers['RecencyCluster']:
    neptune.log_metric('UK Recency Clusters', cluster)
    
# Logging the recency distribution:    
for rec in customers['Recency']:
    neptune.log_metric('Recency in days', rec)
ML development - neptune metrics

If you zoom in closely on the Recency days graph, you’ll notice that the values range between 50 and 280 days.

ML development - recency

We can get more information about the cluster distribution by taking a look at some general statistics:

ML development - recency

We can notice that customers in cluster 2 are more recent than those in cluster 1. 

Let’s advance our investigations by computing the other clusters for frequency and Monetary Value respectively. We’ll try to have a more high-level comparison between the three metrics.

Aggregate the number of orders by customer:

## Frequency ##
# Count invoice lines per customer and attach them to the existing customers frame,
# so that the Recency columns computed earlier are kept
freq = dataset.groupby('CustomerID', as_index=False)['InvoiceDate'].count()
freq.columns = ['CustomerID', 'Frequency']
customers = pd.merge(customers, freq, on='CustomerID')

K-Means for Frequency Score:

kmeans = KMeans(n_clusters=4)
kmeans.fit(customers[['Frequency']])
customers['FrequencyCluster'] = kmeans.predict(customers[['Frequency']])

When combining with the previous frames we obtain the following table:

ML development - table freq rec

Aggregate the sum of profit generated by each customer:

## MonetaryValue ##
dataset['Profit'] = dataset['UnitPrice'] * dataset['Quantity']
mv = dataset.groupby('CustomerID', as_index=False)['Profit'].sum()
mv.columns = ['CustomerID', 'MonetaryValue']
customers = pd.merge(customers, mv, on='CustomerID')

# Rename by name rather than by position, so the cluster columns are left untouched
customers = customers.rename(columns={'LastPurchaseDate': 'lastPurchase'})
ML development - table

Then we group all the metrics together, to have a general overview.

K-Means for Monetary Value:

kmeans = KMeans(n_clusters=4)
kmeans.fit(customers[['MonetaryValue']])
customers['MonetaryCluster'] = kmeans.predict(customers[['MonetaryValue']])

To build a general RFM Score that takes all of these values into account, we sum the three cluster labels into a single overall score. We then segment customers according to the range their score falls in.

Three segments:

  • Low Value: Scores from 0-2
  • Mid Value: Scores from 3-5
  • High Value: Scores from 6-9

# Forming the RFM Overall Score:
customers['RFMScore'] = customers['RecencyCluster'] + customers['FrequencyCluster'] + customers['MonetaryCluster']
customers['UserSegment'] = 'Low'

# User Classification regarding the RFM Score:
customers.loc[customers['RFMScore'] <= 2, 'UserSegment'] = 'Low'
customers.loc[customers['RFMScore'] > 2, 'UserSegment'] = 'Mid'
customers.loc[customers['RFMScore'] > 5, 'UserSegment'] = 'High'
ML development - overall score
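
One caveat when summing cluster labels: K-Means assigns label numbers arbitrarily, so the sum is only meaningful if a higher label consistently means a "better" cluster for every metric. A minimal sketch of one way to enforce this is shown below. The `order_cluster` helper is hypothetical (it is not part of pandas or scikit-learn), and the toy labels are made up; for Recency, fewer days is better, so the cluster with the largest mean gets label 0:

```python
import pandas as pd

def order_cluster(df, cluster_col, value_col, ascending=True):
    """Relabel clusters 0..k-1 by the mean of value_col (hypothetical helper)."""
    means = df.groupby(cluster_col)[value_col].mean().sort_values(ascending=ascending)
    # The cluster that comes first in the sorted order gets label 0, and so on
    mapping = {old: new for new, old in enumerate(means.index)}
    out = df.copy()
    out[cluster_col] = out[cluster_col].map(mapping)
    return out

# Toy data: arbitrary labels as K-Means might return them
toy = pd.DataFrame({
    'Recency': [5, 7, 300, 310, 100],
    'RecencyCluster': [0, 0, 1, 1, 2],
})

# Most recent customers (smallest Recency) should end up with the highest label
ordered = order_cluster(toy, 'RecencyCluster', 'Recency', ascending=False)
print(ordered)
```

For Frequency and Monetary Value, where larger values are better, the same helper would be called with `ascending=True`.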

The best part comes when we plot clusters and visualize how they’re distributed, comparing the Frequency and Recency metric with the Monetary Value that they generate.

ML development - RFM Segmentation
RFM Segmentation | Source: Customer Segmentation

Both metrics clearly indicate that the more recent and frequent users are, the more profitable they are. So, we should improve retention for the high value users (in red), and make decisions based on that criterion. Also, by improving the user retention rate, we immediately impact their frequency and recency on the platform. This means that we should also work on user engagement.

Organizing ML development in Neptune

In this section we’ll take advantage of one excellent feature that Neptune offers: ML integrations. In our case we’ll be looking closely at XGBoost, since Neptune helps with all the technicalities, like:

  • Metrics logging after each boosting iteration
  • Model logging after training
  • Feature importance 
  • Tree visualization after last boosting iteration

eXtreme Gradient Boosting (XGBoost) is an optimized and parallelized open-source implementation of gradient boosting, created by Tianqi Chen, a PhD student at the University of Washington. XGBoost uses decision trees (like random forest) to solve classification (binary & multi-class), ranking, and regression problems. We’re in the area of supervised learning algorithms here.

The idea for this section is to predict Customer Lifetime Value, another important metric to evaluate our customer portfolio. The platform invests in customers through acquisition costs, promotions, discounts, and so on. We should closely watch the currently profitable customers, and predict how they’ll evolve in the future.

For this experiment, we’ll be targeting a group of customers during a 9 month period. We will train an XGBoost model with the data of 3 months and try to predict the next 6 months. 

Segregate the data

3 Month users:

from datetime import date

uk = dataset.query("Country=='United Kingdom'").reset_index(drop=True)
uk['InvoiceDate'] = pd.to_datetime(uk['InvoiceDate'])

users_3m = uk[(uk['InvoiceDate'].dt.date >= date(2010, 12, 1)) & (uk['InvoiceDate'].dt.date < date(2011, 4, 1))].reset_index(drop=True)

6 Month users:

users_6m = uk[(uk['InvoiceDate'].dt.date >= date(2011, 4, 1)) & (uk['InvoiceDate'].dt.date < date(2011, 12, 1))].reset_index(drop=True)

Now, on the 3 Month data frame, apply the same aggregations we made before. Focus on Frequency, Recency and Monetary Value. Also, compute the same cluster rules that we used with K-Means.

ML development - table cluster

To create the Lifetime Value metric, we aggregate the revenue generated by each customer in the 6 Month user group:

users_6m['Profit'] = users_6m['UnitPrice'] * users_6m['Quantity']
aggr = {'Profit': 'sum'}
customers_6 = users_6m.groupby('CustomerID', as_index=False).agg(aggr)
customers_6.columns = ['CustomerID', 'LTV']

Then generate K-Means clusters according to that metric:

kmeans = KMeans(n_clusters=3)
kmeans.fit(customers_6[['LTV']])
customers_6['LTVCluster'] = kmeans.predict(customers_6[['LTV']])
ML development - table LTVC

Start the training process

Merge the 3 Month table with the 6 Month table to obtain a single data frame, from which we’ll derive the training and validation sets in further steps.

classification = pd.merge(customers_3, customers_6, on='CustomerID', how='left')
classification.fillna(0, inplace=True)
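
The left merge keeps every customer from the 3 Month table, even those who bought nothing in the following 6 months; filling the resulting gaps with 0 gives them an LTV of zero. A minimal sketch on made-up frames (`features` and `targets` are illustrative names, not the tutorial's variables):

```python
import pandas as pd

# Hypothetical feature table (3 Month window) and target table (6 Month window)
features = pd.DataFrame({'CustomerID': [1, 2, 3], 'RFMScore': [5, 2, 7]})
targets = pd.DataFrame({'CustomerID': [1, 3], 'LTV': [120.0, 900.0]})

# Customer 2 has no purchases in the target window, so LTV is NaN after the merge
merged = pd.merge(features, targets, on='CustomerID', how='left')

# Treat "no purchases" as zero lifetime value
merged = merged.fillna(0)
print(merged)
```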

Our goal is to come up with classification segments for the LTVCluster relying on core predictive features, such as: MVCluster, FrequencyCluster, RFMScore and Monetary Value. 

However, we don’t yet know their relevance and predictive power. For that matter, we need to run some attribute relevance analysis. 

Attribute relevance analysis

When running attribute relevance analysis, we pursue two goals: identifying the variables with the greatest impact on the target variable, and understanding the relationship between the most important predictors and the target variable. To run this kind of analysis, you can use the Information Value (IV) and Weight of Evidence (WoE) approaches.

Note: For more in-depth review of both WoE and IV, I strongly recommend this medium article on Churn Analysis: Churn Analysis Using Information Value and Weight of Evidence, by Klaudia Nazarko.
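
For a rough idea of how these two quantities are computed, here is a small sketch using only the standard library. It assumes a binary target (for example, "belongs to the highest LTV cluster" versus "does not"), which is a simplification of the three-class target used in this tutorial. The `woe_iv` function, the sign convention (events over non-events), and the toy data are all assumptions for illustration:

```python
import math
from collections import Counter

def woe_iv(feature, target):
    """Return ({category: WoE}, IV) for a binary target (1 = event, 0 = non-event).

    Assumes every category contains at least one event and one non-event;
    otherwise the logarithm is undefined and some smoothing would be needed.
    """
    events = Counter(f for f, t in zip(feature, target) if t == 1)
    non_events = Counter(f for f, t in zip(feature, target) if t == 0)
    total_events = sum(events.values())
    total_non_events = sum(non_events.values())

    woe, iv = {}, 0.0
    for category in sorted(set(feature)):
        event_share = events[category] / total_events
        non_event_share = non_events[category] / total_non_events
        # WoE: log ratio of the category's share of events to its share of non-events
        woe[category] = math.log(event_share / non_event_share)
        # IV: sum over categories of (share difference) * WoE
        iv += (event_share - non_event_share) * woe[category]
    return woe, iv

# Toy data: segment of each customer and whether they ended up with a high LTV
segment = ['Low'] * 4 + ['High'] * 4
is_high_ltv = [0, 0, 0, 1, 1, 1, 1, 0]
woe, iv = woe_iv(segment, is_high_ltv)
print(woe, iv)
```

The larger the IV, the better the feature separates events from non-events; flipping the sign convention of WoE leaves the IV unchanged.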

In our case, we’ll proceed by looking at the correlation between all features, and check Information Value for the MVCluster and the RFMScore.

Correlation Matrix:

import pandas as pd
import seaborn as sn
import matplotlib.pyplot as plt

# Correlations are only defined for numeric columns
numeric = classification.select_dtypes(include='number')
numeric.corr()['LTVCluster'].sort_values(ascending=False)

corrMatrix = numeric.corr()
sn.heatmap(corrMatrix, annot=False)
plt.show()
ML development - correlation matrix



We indeed observe that the features most correlated with the LTVCluster are Frequency, Monetary Value and Recency, which makes sense.

In the same way, according to WoE and IV analysis, the MVCluster and RFMScore appear to have more predictive power than the rest.

ML development - analysis

Finally, in order to proceed to further training, we need to convert categorical variables to numeric. One way to quickly do it is by using pd.get_dummies():

classification = pd.get_dummies(classification)
ML development - dummies

The UserSegment column is gone, but we have new numerical ones that represent it. We have converted it to 3 different columns with 0 and 1, and made it usable for our machine learning model.
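
The effect of pd.get_dummies() is easy to check on a tiny frame. In this sketch the `toy` data is made up; depending on the pandas version, the indicator columns hold 0/1 integers or booleans, but in both cases exactly one indicator is set per row:

```python
import pandas as pd

toy = pd.DataFrame({
    'CustomerID': [1, 2, 3],
    'UserSegment': ['Low', 'High', 'Mid'],
})

# Numeric columns pass through; the text column becomes one indicator per category
encoded = pd.get_dummies(toy)
indicator_columns = [c for c in encoded.columns if c.startswith('UserSegment_')]
print(encoded)
```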

Train XGBoost

Create the experiment

Start by creating a new experiment inside the project we initialized earlier. In this section, we’ll be training multiple versions of XGBoost on our data. Each version will be set up with specific hyper-parameters.




Eventually, we’ll try to compare different experiments for even more insights. You can always check Neptune docs to find any relevant resources and documentation, in case you need it to follow along.

params = {
    'max_depth':5,
    'learning_rate':0.1,
    'objective': 'multi:softprob',
    'n_jobs':-1, 
    'num_class':3
}

neptune.create_experiment(
    name='XGBoost-V1',
    tags=['XGBoost', 'Version1'],
    params=params
)

According to the hyper-parameters, we need an XGBoost model capable of multi-class classification (hence we’re using the multi:softprob objective function). We’re aiming specifically for three classes, one per customer LTV cluster.
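
With multi:softprob, the model returns one probability per class for each sample, whereas multi:softmax returns only the predicted class. The sketch below illustrates the underlying softmax transformation with the standard library; the raw scores are made up and this is not XGBoost's internal code:

```python
import math

def softmax(scores):
    """Convert raw per-class scores into probabilities that sum to 1."""
    shifted = [s - max(scores) for s in scores]  # subtract the max for numerical stability
    exps = [math.exp(s) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]

raw_scores = [2.0, 1.0, 0.1]  # made-up scores for the three LTV clusters
probabilities = softmax(raw_scores)

# The predicted class is the one with the highest probability
predicted_class = max(range(len(probabilities)), key=probabilities.__getitem__)
print(probabilities, predicted_class)
```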

Split the data

Split the data into training and testing sets:

from sklearn.model_selection import train_test_split

X = classification.drop(['LTV', 'LTVCluster', 'lastPurchase'], axis=1)
Y = classification['LTVCluster']  # Target
# Split the data into two sets: train and eval
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.05, random_state=56)

Instantiate the XGBoost DMatrix data loaders so that we can conveniently pass our data to the model:

import xgboost as xgb

dtrain = xgb.DMatrix(x_train, label=y_train)
dtest = xgb.DMatrix(x_test, label=y_test)

Use the XGBClassifier with the Neptune callback and log all metrics

It’s time to fit the data to our model. We’ll use an XGBClassifier, and we’ll log all the metrics in real time in the experiment dashboard. Leveraging Neptune’s tight integration with different gradient boosting libraries, we’re able to monitor the performance and progress very easily.

# neptune_callback comes from the neptune-contrib package
from neptunecontrib.monitoring.xgboost import neptune_callback

multi_class_XGB = xgb.XGBClassifier(**params)
multi_class_XGB.fit(x_train, y_train, eval_set=[(x_test, y_test)], callbacks=[neptune_callback()])

neptune.stop()

There is a very good video that explains how Neptune XGBoost integration works: Integrations – XGBoost.

If we head back to Neptune and click on the experiment we’ve created, we can visualize the loss charts and the feature importance graph.

ML development - visualization

If we want to see how well our model scores on the testing set, we can print a classification report using the sklearn.metrics package.

from sklearn.metrics import classification_report
predict = multi_class_XGB.predict(x_test)
print(classification_report(y_test, predict))
ML development - metrics
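
The report lists precision and recall for each class. To make those numbers less opaque, here is a standard-library sketch that computes them by hand on made-up labels; the `per_class_precision_recall` function is illustrative and not part of scikit-learn:

```python
from collections import Counter

def per_class_precision_recall(y_true, y_pred):
    """Return {label: (precision, recall)} computed from two label sequences."""
    true_positive, predicted, actual = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        actual[t] += 1
        predicted[p] += 1
        if t == p:
            true_positive[t] += 1

    report = {}
    for label in sorted(set(y_true) | set(y_pred)):
        # Precision: of everything predicted as this class, how much was right
        precision = true_positive[label] / predicted[label] if predicted[label] else 0.0
        # Recall: of everything that truly is this class, how much was found
        recall = true_positive[label] / actual[label] if actual[label] else 0.0
        report[label] = (precision, recall)
    return report

# Made-up labels for the three LTV clusters
y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 2, 0]
report = per_class_precision_recall(y_true, y_pred)
print(report)
```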

Although we’re quite satisfied with the previous results, we can still create another experiment, and tweak or change the hyper-parameters somewhat to get even better results.

params2 = {
    'max_depth':5,
    'learning_rate':0.1,
    'objective': 'multi:softprob',
    'n_jobs':-1, 
    'num_class':3, 
    'eta':0.5,  # note: in XGBoost, eta is an alias of learning_rate, so this overlaps the value set above
    'gamma': 0.1,
    'lambda':1,
    'alpha':0.35, 
}

neptune.create_experiment(
    name='XGBoost-V2',
    tags=['XGBoost', 'Version2'],
    params=params2,
)

Let’s train:

multi_class_XGB = xgb.XGBClassifier(**params2)
multi_class_XGB.fit(
    x_train, 
    y_train, 
    eval_set=[(x_test, y_test)],
    callbacks=[neptune_callback()])

neptune.stop()

Check accuracy on training and testing sets:

print('Accuracy on Training Set: ', multi_class_XGB.score(x_train, y_train))
print('Accuracy on Testing Set: ', multi_class_XGB.score(x_test[x_train.columns], y_test))

Overall, pretty decent results, almost identical to the previous experiment.

Comparing both experiments

Neptune lets us select multiple experiments and compare them in a dashboard:

ML development - neptune dashboard

We can observe the two experiments side by side, and compare how the parameters of each experiment impact the loss on the training and evaluation sets.

Version your model

An interesting feature of Neptune is the ability to version model binaries, so that we can keep track of different versions while performing our experiments. 

neptune.log_artifact('xgb_classifier.pkl')

However, when neptune_callback() is used during training, the model from the last boosting iteration is automatically logged to Neptune.

ML development - neptune artifacts
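
If you do log a model file manually, the file has to exist on disk before neptune.log_artifact() is called. A minimal sketch of that step with the standard library is shown below; the dictionary is only a stand-in for the trained classifier, and the temporary directory is an assumption made so the example runs anywhere:

```python
import os
import pickle
import tempfile

# Stand-in for a trained model; in the tutorial this would be multi_class_XGB
model = {'name': 'xgb_classifier', 'max_depth': 5}

# Serialize the model to a file
model_path = os.path.join(tempfile.mkdtemp(), 'xgb_classifier.pkl')
with open(model_path, 'wb') as f:
    pickle.dump(model, f)

# The file now exists and its path could be passed to neptune.log_artifact(model_path).
# Loading it back confirms that the saved version is usable:
with open(model_path, 'rb') as f:
    restored = pickle.load(f)
```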

Conclusion 

The main goal of this tutorial was to help you quickly get started with Neptune. The tool is very easy to use, and it’s hard to get lost in the UI.

I hope this tutorial was useful to you, as I designed it to cover different aspects of real data science use-cases. I’ll leave you some references to check if you feel that your thirst for knowledge still needs quenching.

Also, don’t forget to check the Neptune documentation website and their YouTube channel, where they have in-depth coverage of all the tools that you’ll need to start working more efficiently.

Don’t forget to check my GitHub repo for the full code from this tutorial: Neptune-Retail

