
How to Organize Your ML Development in an Efficient Way

10 min
21st August, 2023

One major issue that every data scientist and ML practitioner will eventually encounter is workflow management. Testing different scenarios and use cases, logging information and details, sharing and comparing results from a particular set of samples, visualizing the data, and keeping track of insights are all key components of data science workflow management. They support the business and enable you to scale any data science project.

Data scientists know well that testing one version of an ML algorithm is not enough. Our field strongly relies on empiricism, so we need to test and compare multiple versions of the same algorithm with different hyperparameter settings and feature selections.

All of this generates metadata, which needs to be stored properly. To do this, I use Neptune, a platform that manages all of that for me. It comes with a complete client library that you can integrate into your code, and it gives you access to a web-based UI where all your data is logged and available. Check this short explainer to learn what Neptune does.

To give you a tour of what Neptune has to offer, I simulated a real use-case scenario with a prepared online dataset. We’ll be running different analytics and ML processes to see how well Neptune can support you in daily work.

To start integrating Neptune into every aspect of your project, you first need to install the packages and libraries and connect your Jupyter Notebook to your Neptune account.

Here’s the installation documentation that will get you up to speed.

Next, I recommend you check the first steps guide in the docs. It explains how to create a project and how to add Neptune to your code.

Note: I include code where it’s most instructive. If you want to check the full code and the notebooks, feel free to visit my GitHub repo, Neptune-Retail. However, please note it was created in March 2021, so some code might be outdated now. For the most up-to-date examples, please refer to the Neptune documentation.

Exploring the dataset

We’ll be taking a look at an online retail dataset, publicly available on Kaggle. The dataset records customers from all around the world who use an online selling platform. Each record describes an order for a specific product.

The dataset appears as follows:

ML development - dataset

To start loading the dataset, I created a small Python DataETLManager class that reads the CSV file, extracts the main features, and transforms them into a usable pandas dataframe:

import logging
import os

import pandas as pd


class DataETLManager:
    def __init__(self, root_dir: str, csv_file: str):
        if os.path.exists(root_dir):
            if csv_file.endswith('.csv'):
                self.csv_file = os.path.join(root_dir, csv_file)
            else:
                logging.error('The file is not in csv format')
        else:
            logging.error('The root dir path does not exist')

        self.retail_df = pd.read_csv(self.csv_file, sep=',', encoding='ISO-8859-1')

    def extract_data(self):
        return self.retail_df

    def fetch_columns(self):
        return self.retail_df.columns.tolist()

    def data_description(self):
        return self.retail_df.describe()

    def fetch_categorical(self, categorical=False):
        # Numeric columns, as detected by pandas
        numeric_columns = self.retail_df.select_dtypes(include='number').columns
        if categorical:
            categorical_columns = list(set(self.retail_df.columns) - set(numeric_columns))
            categorical_df = self.retail_df[categorical_columns]
            return categorical_df
        else:
            non_categorical = list(numeric_columns)
            return self.retail_df[non_categorical]

    def transform_data(self):
        data = self.retail_df

        # Checking and eliminating redundant information:
        data.drop_duplicates(keep='last', inplace=True)

        # Fill null values (assigning back avoids chained in-place updates):
        data['InvoiceNo'] = data['InvoiceNo'].fillna(0)
        data['Description'] = data['Description'].fillna('No Description')
        data['StockCode'] = data['StockCode'].fillna('----')
        data['Quantity'] = data['Quantity'].fillna(0)
        data['UnitPrice'] = data['UnitPrice'].fillna(0.00)
        data['CustomerID'] = data['CustomerID'].fillna(0)
        data['Country'] = data['Country'].fillna('None')

        # Format value columns (missing or unparseable dates become NaT):
        data['InvoiceDate'] = pd.to_datetime(data['InvoiceDate'], errors='coerce')

        self.data_transfomed = data

The important columns that we can leverage to start building our internal core metrics are: 

  • The InvoiceDate
  • Quantity
  • UnitPrice
  • CustomerID
  • Country

Start by loading the dataset using the DataETLManager:

etl_manager = DataETLManager(root_dir='./Data', csv_file='OnlineRetail.csv')
# transform_data() must run first: it is what sets data_transfomed
etl_manager.transform_data()

dataset = etl_manager.data_transfomed

For a retail business, the core value lies in the revenue the platform generates through customer orders. We can compute monthly revenue by multiplying UnitPrice by Quantity and aggregating the result by the month of the InvoiceDate:

dataset['Profit'] = dataset['Quantity'] * dataset['UnitPrice']
# Group by calendar month rather than by individual invoice timestamp
revenue = dataset.groupby(dataset['InvoiceDate'].dt.to_period('M'))['Profit'].sum().reset_index()

We could also visualize how the revenue evolves across the months by plotting the following chart:

import plotly.graph_objects as go

# x and y are taken from the revenue frame computed above
data = go.Scatter(
    x=revenue['InvoiceDate'].astype(str),
    y=revenue['Profit']
)

layout = go.Layout(
    xaxis={"type": "category"},
    title='Monthly Revenue'
)

fig = go.Figure(data=[data], layout=layout)
fig.show()
ML development - graph

Since we’re mainly targeting customers, one metric worth attention is the number of active customers that our platform retains. We will conduct our experiments exclusively on UK customers, as they constitute the majority of the data sample.

ML development - chart customers

To study active customer retention, we need to check how many distinct customers placed orders each month:

uk_customers = dataset.query("Country=='United Kingdom'").reset_index(drop=True)
# Count distinct UK customers per calendar month
activeCustomers = uk_customers.groupby(uk_customers['InvoiceDate'].dt.to_period('M'))['CustomerID'].nunique().reset_index()

The distribution appears to be quite monotonic with a peak in November 2011.
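As a minimal sketch of the monthly aggregation described above, the snippet below counts distinct customers per calendar month on a small made-up table. The toy rows and the `YearMonth` and `ActiveCustomers` column names are assumptions for illustration and do not come from the original notebook.

```python
import pandas as pd

# Toy orders (made-up values) with the same column names as the retail dataset
orders = pd.DataFrame({
    "CustomerID": [1, 1, 2, 3, 3, 3],
    "InvoiceDate": pd.to_datetime([
        "2011-10-03", "2011-10-20", "2011-10-21",
        "2011-11-02", "2011-11-15", "2011-11-30",
    ]),
})

# Bucket each order into its calendar month, then count distinct customers
orders["YearMonth"] = orders["InvoiceDate"].dt.to_period("M")
active = (
    orders.groupby("YearMonth")["CustomerID"]
    .nunique()
    .reset_index(name="ActiveCustomers")
)

print(active)
```

Counting with `nunique()` rather than `count()` matters here: a customer who orders three times in one month should still count as one active customer.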

For our case study, we would like to properly segment those customers. This way, we could efficiently manage the portfolio and dissect the different levels of value each group actually offers. 

We should also keep in mind that as the business grows in size, it won’t be possible to have an intuition about each and every customer. At that stage, human judgments about which customers to pursue won’t work, and the business will have to use a data-driven approach to build a proper strategy.

In the next section, we’ll be digging deeper into the different metrics and analysis that we can leverage to appropriately segment our customer base. 

ML development - chart users



Extract metrics and run analytics on the data

In this section, we’ll thoroughly analyze the data. We want to segment the whole customer base according to financial criteria. At the end of this section, we should be able to profile our customers and understand their purchase behavior.

As we’ve already initialized our project in Neptune, we’ll kick off our first experiment logging the statistics we’ll be pulling out throughout this section. You can think of a Neptune experiment as a namespace to which you can log metrics, predictions, visualizations, and anything else you might need. 

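Start by initializing the parameters for this experiment, and call the neptune.init_run() method.

import neptune

params = {
    'max_iterations': 1000,
    'first_metric': 'Recency',
    'second_metric': 'Frequency',
    'third_metric': 'Monetary Value',
    'users': 'UK'
}

run = neptune.init_run(
    # project name and API token go here
)
# Log the parameters so they appear under Parameters in the UI
run['parameters'] = params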

Once you run the notebook cell, you can head to the website. If you open the experiment we’ve just created, you will find the values properly logged under Parameters and ready to track further actions.

ML development - parameters

To segment our customer base according to profitability and growth potential, we’ll focus on three factors that shape our customers’ financial behavior. Together they make up the so-called RFM Score:

  • Recency of use: A metric to monitor how recent the user activity is
  • Frequency of use: How often do users purchase products on the platform
  • Monetary Value: Literally, how profitable they are
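To make the three factors concrete, here is a minimal sketch that computes them in a single pass on a small made-up order table. The toy values and the use of pandas named aggregation are my own assumptions for illustration; the step-by-step version used in this tutorial follows in the next sections.

```python
import pandas as pd

# Toy orders (made-up values); column names follow the retail dataset
orders = pd.DataFrame({
    "CustomerID": [1, 1, 2, 2, 2, 3],
    "InvoiceDate": pd.to_datetime([
        "2011-11-01", "2011-12-01", "2011-09-10",
        "2011-10-10", "2011-11-10", "2011-06-01",
    ]),
    "Quantity": [2, 1, 5, 5, 5, 1],
    "UnitPrice": [10.0, 20.0, 1.0, 1.0, 1.0, 3.0],
})
orders["Profit"] = orders["Quantity"] * orders["UnitPrice"]

# Reference date: the most recent purchase in the whole table
snapshot = orders["InvoiceDate"].max()

rfm = orders.groupby("CustomerID").agg(
    LastPurchaseDate=("InvoiceDate", "max"),   # input for Recency
    Frequency=("InvoiceDate", "count"),        # number of order lines
    MonetaryValue=("Profit", "sum"),           # total amount spent
)
# Recency = days of inactivity relative to the reference date
rfm["Recency"] = (snapshot - rfm["LastPurchaseDate"]).dt.days

print(rfm[["Recency", "Frequency", "MonetaryValue"]])
```

Note that a lower Recency is better (the customer bought recently), while higher Frequency and Monetary Value are better. This difference in direction matters later when the three metrics are combined into one score.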

First, we need to derive the metrics from the dataset. Then, we’ll cluster those data points to group customers by similarity into categories ranging from high to low value. Insights from customer segmentation are used to develop tailor-made marketing campaigns and to design marketing strategy.

For this task, the K-Means clustering algorithm remains a very powerful tool. Its simplicity of use, along with its performance, strikes a good balance for our use case.

For a detailed explanation on how K-Means works, I recommend this article that perfectly does the job: K-Means Clustering — Explained.

RFM Score

The idea behind Recency is to measure how many days have passed since a customer’s last purchase, that is, the number of days of recorded inactivity on the platform. We can calculate it as the overall most recent purchase date in the dataset minus each customer’s own most recent purchase date.

Create the customer data frame we’ll be working on:

customers = pd.DataFrame(dataset['CustomerID'].unique())
customers.columns = ['CustomerID']

Aggregate the Max Invoice Date:

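## Recency ##
aggregatR = {'InvoiceDate': 'max'}
last_purchase = dataset.groupby('CustomerID', as_index=False).agg(aggregatR)
last_purchase.columns = ['CustomerID', 'LastPurchaseDate']
customers = pd.merge(customers, last_purchase, on='CustomerID')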

Generate the Recency Score:

# Generating R Score
customers['Recency'] = (customers['LastPurchaseDate'].max() - customers['LastPurchaseDate']).dt.days

Customer Recency:

ML development - table customers

As we have the corresponding table, it would be a good idea to log it to our Neptune experiment.

To do so, we can call the upload() method as follows:

from neptune.types import File

# recency_UK refers to the customer recency table built above
recency_UK = customers
run["Recency English Users"].upload(File.as_html(recency_UK))
ML development - neptune log table

Now you can proceed to apply K-Means to cluster our Recency distribution. Before that, we need to define the number of clusters that will best suit our needs. One way to do it is the Elbow Method: plot the inertia (the within-cluster sum of squared distances) against the number of clusters and pick the point where adding more clusters stops reducing inertia noticeably.
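The elbow is usually read off the chart by eye, but the idea can also be sketched in code. The example below uses synthetic one-dimensional data with three obvious groups, and a simple threshold rule of my own (`pick_elbow`, a hypothetical helper, not a scikit-learn function) that returns the smallest k after which the inertia gain drops below 10% of the k=1 inertia. Treat it as an illustration of the reasoning rather than a standard algorithm.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic 1-D data: three well-separated groups (illustrative values only)
rng = np.random.default_rng(0)
values = np.concatenate([rng.normal(loc, 0.5, 50) for loc in (0.0, 10.0, 20.0)])
X = values.reshape(-1, 1)  # KMeans expects a 2-D array

# Inertia (within-cluster sum of squares) for k = 1..6
inertia = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertia[k] = model.inertia_


def pick_elbow(inertia, tol=0.10):
    """Hypothetical heuristic: smallest k whose next step improves
    inertia by less than `tol` of the k=1 inertia."""
    ks = sorted(inertia)
    base = inertia[ks[0]]
    for k, k_next in zip(ks, ks[1:]):
        if (inertia[k] - inertia[k_next]) / base < tol:
            return k
    return ks[-1]


best_k = pick_elbow(inertia)
print(inertia)
print("Chosen k:", best_k)
```

On real data the curve is rarely this clean, so it is still worth looking at the plot before settling on a value.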

from sklearn.cluster import KMeans

kmeans_metrics = {}

for k in range(1, 10):
    # KMeans expects a 2-D input, hence the double brackets
    kmeans = KMeans(n_clusters=k, max_iter=1000).fit(customers[['Recency']])
    customers["clusters"] = kmeans.labels_
    kmeans_metrics[k] = kmeans.inertia_

Let’s plot the values in Neptune, so we can check in detail how the curve evolves:

for val in kmeans_metrics.values():
    # The field name below is illustrative
    run["kmeans_inertia"].append(val)

Neptune automatically logs the values in the Logs section and generates a graph chart accordingly.

ML development - neptune chart

According to the graph, the optimal number of clusters is 4. So we’ll proceed using 4 clusters for the three metrics.

K-Means for Recency:

kmeans = KMeans(n_clusters=4).fit(customers[['Recency']])
customers['RecencyCluster'] = kmeans.predict(customers[['Recency']])

Let’s log the Recency Distribution and the predicted clusters.

# Logging the obtained clusters:
for cluster in customers['RecencyCluster']:
    run["UK Recency Clusters"].append(cluster)

# Logging the recency distribution:
for rec in customers['Recency']:
    run["Recency in days"].append(rec)
ML development - neptune metrics

If you zoom in on the Recency days graph, you’ll notice that the values range between 50 and 280 days.

ML development - recency

We can learn more about the cluster distribution by taking a look at some general statistics:

ML development - recency

We can notice that customers in cluster 2 are more recent than those in cluster 1. 
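Statistics like these can be produced with a grouped `describe()`. The sketch below runs it on a small made-up table; the values are illustrative and are not the ones shown in the screenshot above.

```python
import pandas as pd

# Toy recency values with already-assigned cluster labels (made-up data)
toy = pd.DataFrame({
    "Recency": [1, 3, 10, 20, 200, 240],
    "RecencyCluster": [0, 0, 1, 1, 2, 2],
})

# Count, mean, spread and quartiles of Recency within each cluster
stats = toy.groupby("RecencyCluster")["Recency"].describe()
print(stats)
```

Comparing the `mean` column across clusters is a quick way to tell which cluster holds the most recent customers.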

Let’s advance our investigations by computing the other clusters for frequency and Monetary Value respectively. We’ll try to have a more high-level comparison between the three metrics.

Aggregate the number of orders by Customer:

# Keep building on the customers frame so the Recency columns are preserved

## Frequency ##
aggregatF = {'InvoiceDate': 'count'}
freq = dataset.groupby('CustomerID', as_index=False).agg(aggregatF)
freq.columns = ['CustomerID', 'Frequency']
customers = pd.merge(customers, freq, on='CustomerID')

K-Means for Frequency Score:

kmeans = KMeans(n_clusters=4).fit(customers[['Frequency']])
customers['FrequencyCluster'] = kmeans.predict(customers[['Frequency']])

When combining with the previous frames we obtain the following table:

ML development - table freq rec

Aggregate the sum of profit generated by each customer:

## MonetaryValue ##
dataset['Profit'] = dataset['UnitPrice'] * dataset['Quantity']
aggregatMV = {'Profit': 'sum'}
mv = dataset.groupby('CustomerID', as_index=False).agg(aggregatMV)
# Rename before merging so the existing columns keep their names
mv.columns = ['CustomerID', 'MonetaryValue']
customers = pd.merge(customers, mv, on='CustomerID')
ML development - table

Then we group all the metrics together, to have a general overview.

K-Means for Monetary Value:

kmeans = KMeans(n_clusters=4).fit(customers[['MonetaryValue']])
customers['MonetaryCluster'] = kmeans.predict(customers[['MonetaryValue']])

To get an overall RFM Score that takes into account all the values we’ve just gathered, we sum the three cluster labels into a single Overall Score. We then segment customers according to the score ranges obtained.
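One caveat worth spelling out: K-Means assigns cluster numbers arbitrarily, so summing raw labels only gives a meaningful score if each set of labels is first ordered from least to most valuable. The sketch below shows one way to do that relabeling. The `order_cluster` helper and the toy data are assumptions for illustration and are not taken from the original notebook.

```python
import pandas as pd


def order_cluster(df, cluster_col, value_col, ascending=True):
    """Relabel clusters 0..k-1 by the mean of `value_col`.

    With ascending=True, the cluster with the lowest mean becomes 0.
    Hypothetical helper for illustration.
    """
    means = df.groupby(cluster_col)[value_col].mean().sort_values(ascending=ascending)
    mapping = {old: new for new, old in enumerate(means.index)}
    out = df.copy()
    out[cluster_col] = out[cluster_col].map(mapping)
    return out


# Toy data: raw labels are in arbitrary order (made-up values)
toy = pd.DataFrame({
    "Recency": [5, 7, 300, 280, 90, 100],
    "RecencyCluster": [0, 0, 2, 2, 1, 1],
})

# Lower recency is better, so the most recent cluster should get the highest label
ordered = order_cluster(toy, "RecencyCluster", "Recency", ascending=False)
print(ordered)
```

For Frequency and Monetary Value, where higher is better, the same helper would be called with `ascending=True`, so that a higher label always means a more valuable customer.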

Three segments:

  • Low Value: Scores from 0-2
  • Mid Value: Scores from 3-5
  • High Value: Scores from 6-9

# Forming the RFM Overall Score:
customers['RFMScore'] = customers['RecencyCluster'] + customers['FrequencyCluster'] + customers['MonetaryCluster']
customers['UserSegment'] = 'Low'

# User Classification regarding the RFM Score:
customers.loc[customers['RFMScore'] <= 2, 'UserSegment'] = 'Low'
customers.loc[customers['RFMScore'] > 2, 'UserSegment'] = 'Mid'
customers.loc[customers['RFMScore'] > 5, 'UserSegment'] = 'High'
ML development - overall score

The best part comes when we plot clusters and visualize how they’re distributed, comparing the Frequency and Recency metric with the Monetary Value that they generate.

ML development - RFM Segmentation
RFM Segmentation | Source: Customer Segmentation

Both metrics clearly indicate that the more recent and frequent users are also the more profitable ones. So, we should improve retention for the high value users (in red), and make decisions based on that criterion. Also, by improving the user retention rate, we immediately impact their frequency and recency on the platform. This means that we should also work on user engagement.

Organizing ML development in Neptune

In this section, we’ll take advantage of one excellent feature that Neptune offers: ML integrations. In our case, we’ll be looking closely at XGBoost, since Neptune helps with all the technicalities, like:

  • Metrics logging after each boosting iteration
  • Model logging after training
  • Feature importance 
  • Tree visualization after last boosting iteration

eXtreme Gradient Boosting is an optimized and parallelized open-source implementation of gradient boosting, created by Tianqi Chen, a PhD student at the University of Washington. XGBoost uses decision trees (like random forest) to solve classification (binary & multi-class), ranking, and regression problems. We’re in the area of supervised learning algorithms here.

The idea for this section is to predict Customer Lifetime Value, another important metric to evaluate our customer portfolio. The platform invests in customers through acquisition costs, promotions, discounts, and so on. We should keep track of currently profitable customers and predict how they’ll evolve in the future.

For this experiment, we’ll be targeting a group of customers during a 9 month period. We will train an XGBoost model with the data of 3 months and try to predict the next 6 months. 

Segregate the data

3 Month users:

from datetime import datetime

uk = dataset.query("Country=='United Kingdom'").reset_index(drop=True)
uk['InvoiceDate'] = pd.to_datetime(uk['InvoiceDate'])

# Compare against datetime objects: recent pandas versions reject ordering
# comparisons between datetime64 columns and plain date objects
users_3m = uk[(uk['InvoiceDate'] >= datetime(2010, 12, 1)) & (uk['InvoiceDate'] < datetime(2011, 4, 1))].reset_index(drop=True)

6 Month users:

users_6m = uk[(uk['InvoiceDate'] >= datetime(2011, 4, 1)) & (uk['InvoiceDate'] < datetime(2011, 12, 1))].reset_index(drop=True)
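The two filters above follow the same pattern, so they can be wrapped in a small helper. The sketch below is one possible way to do it, using half-open intervals so that no order can fall into both windows. The `split_by_window` function and the toy rows are assumptions for illustration; the boundary dates mirror the ones in the code above.

```python
import pandas as pd


def split_by_window(df, start, split, end, date_col="InvoiceDate"):
    """Return (feature_window, target_window) using half-open intervals
    [start, split) and [split, end). Hypothetical helper for illustration."""
    start, split, end = (pd.Timestamp(x) for x in (start, split, end))
    features = df[(df[date_col] >= start) & (df[date_col] < split)]
    target = df[(df[date_col] >= split) & (df[date_col] < end)]
    return features.reset_index(drop=True), target.reset_index(drop=True)


# Toy orders (made-up values), including two rows right at the boundary
orders = pd.DataFrame({
    "CustomerID": [1, 2, 3, 4],
    "InvoiceDate": pd.to_datetime(
        ["2010-12-15", "2011-03-31", "2011-04-01", "2011-11-30"]
    ),
})

early, late = split_by_window(orders, "2010-12-01", "2011-04-01", "2011-12-01")
print(len(early), len(late))
```

Keeping the feature window strictly before the target window is what prevents information from the prediction period leaking into the training features.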

Now, on the 3 Month data frame, apply the same aggregations we made before. Focus on Frequency, Recency and Monetary Value. Also, compute the same cluster rules that we used with K-Means.

ML development - table cluster

To create the LifeTime Value metric, we’ll aggregate the revenue generated by each customer in the 6 Month user group:

users_6m['Profit'] = users_6m['UnitPrice'] * users_6m['Quantity']
aggr = {'Profit': 'sum'}
customers_6 = users_6m.groupby('CustomerID', as_index=False).agg(aggr)
customers_6.columns = ['CustomerID', 'LTV']

Then generate K-Means clusters according to that metric:

kmeans = KMeans(n_clusters=3).fit(customers_6[['LTV']])
customers_6['LTVCluster'] = kmeans.predict(customers_6[['LTV']])
ML development - table LTVC

Start the training process

Merge the 3 Month table with the 6 Month table to get a single data frame, from which we’ll build the training and validation sets used in the next steps.

classification = pd.merge(customers_3, customers_6, on='CustomerID', how='left')
classification.fillna(0, inplace=True)

Our goal is to come up with classification segments for the LTVCluster relying on core predictive features, such as: MVCluster, FrequencyCluster, RFMScore and Monetary Value. 

However, we don’t yet know their relevance and predictive power. For that matter, we need to run some attribute relevance analysis. 

Attribute relevance analysis

Running attribute relevance analysis, we’ll consider two important functions: recognition of variables with the greatest impact on the target variable, and understanding relations between the most important predictor and target variable. In order to run this kind of analysis, you can use the Information Value and Weight of Evidence approaches.

Note: For a more in-depth review of both WoE and IV, I strongly recommend this Medium article on Churn Analysis: Churn Analysis Using Information Value and Weight of Evidence, by Klaudia Nazarko.

In our case, we’ll proceed by looking at the correlation between all features, and check Information Value for the MVCluster and the RFMScore.
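For readers who have not used WoE and IV before, here is a minimal sketch of the calculation for a binary target. It uses one common convention, WoE = ln(share of non-events / share of events) per category, with IV being the sum over categories of (share of non-events − share of events) × WoE. Some references flip the ratio, which changes the sign of WoE but not IV. The toy data and the binarized `is_high_ltv` target are assumptions for illustration; the tutorial’s real target, LTVCluster, has three classes and would need to be binarized first.

```python
import numpy as np
import pandas as pd


def information_value(feature, target):
    """WoE per category and total IV for a binary target (1 = event).

    Assumes every category contains at least one event and one non-event;
    otherwise the logarithm is undefined and smoothing would be needed.
    """
    counts = pd.crosstab(feature, target)
    dist_non_event = counts[0] / counts[0].sum()
    dist_event = counts[1] / counts[1].sum()
    woe = np.log(dist_non_event / dist_event)
    iv = ((dist_non_event - dist_event) * woe).sum()
    return woe, iv


# Toy data (made-up): segment "A" is mostly events, "B" mostly non-events
segment = pd.Series(["A", "A", "A", "A", "B", "B", "B", "B"], name="segment")
is_high_ltv = pd.Series([1, 1, 1, 0, 1, 0, 0, 0], name="is_high_ltv")

woe, iv = information_value(segment, is_high_ltv)
print(woe)
print(round(iv, 4))
```

A feature whose categories separate events from non-events well gets a high IV, while a feature with the same event share in every category gets an IV of zero.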

Correlation Matrix:

import pandas as pd
import seaborn as sn
import matplotlib.pyplot as plt

# numeric_only=True skips text columns such as UserSegment
corrMatrix = classification.corr(numeric_only=True)
sn.heatmap(corrMatrix, annot=False)
plt.show()
ML development - correlation matrix



We indeed observe that the features most correlated with the LTVCluster are the Frequency, Monetary Value and Recency, which makes sense.

In the same way, according to WoE and IV analysis, the MVCluster and RFMScore appear to have more predictive power than the rest.

ML development - analysis

Finally, in order to proceed to training, we need to convert categorical variables to numeric ones. A quick way to do it is pd.get_dummies():

classification = pd.get_dummies(classification)
ML development - dummies

The UserSegment column is gone, but we have new numerical columns that represent it. We have converted it into 3 different columns of 0s and 1s, which makes it usable for our machine learning model.
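As a small illustration of what this encoding does, the sketch below applies `pd.get_dummies()` to a made-up three-row table. The `dtype=int` argument is my addition: recent pandas versions produce boolean columns by default, and passing it gives the 0/1 columns described above.

```python
import pandas as pd

# Toy frame (made-up values) with one categorical column
toy = pd.DataFrame({
    "CustomerID": [1, 2, 3],
    "UserSegment": ["Low", "Mid", "High"],
})

# dtype=int forces 0/1 columns; recent pandas versions default to booleans
encoded = pd.get_dummies(toy, columns=["UserSegment"], dtype=int)
print(encoded)
```

Each row has exactly one 1 across the three new columns, marking the segment that customer belongs to.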

Train XGBoost

Create the experiment

Start by creating a new experiment inside the previous project we’ve initialized. In this section, we’ll be training our data with multiple versions of XGBoost. Each version will be set up with specific hyper-parameters. 



Eventually, we’ll try to compare different experiments for even more insights. You can always check Neptune docs to find any relevant resources and documentation, in case you need it to follow along.

params = {
    'objective': 'multi:softprob',
    # ... remaining hyper-parameters
}

run = neptune.init_run(
    # ... project name and API token
    tags=['XGBoost', 'Version1'],
)

According to the hyper-parameters, we need an XGBoost model capable of multi-class classification (hence the multi:softprob objective function). We’re aiming specifically for three classes, one per customer LTV cluster.
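With the multi:softprob objective, the model produces one probability per class for each row, and the predicted class is the one with the highest probability. The NumPy sketch below illustrates that last step with made-up scores for two customers. It is a simplified picture of the idea, not XGBoost’s internal code.

```python
import numpy as np


def softmax(scores):
    """Row-wise softmax: turns raw class scores into probabilities."""
    shifted = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


# Made-up raw scores for 2 customers and 3 LTV clusters
raw_scores = np.array([
    [2.0, 0.5, 0.1],
    [0.2, 0.3, 3.0],
])

probabilities = softmax(raw_scores)              # one probability per class, per row
predicted_class = probabilities.argmax(axis=1)   # most likely cluster for each row

print(probabilities.round(3))
print(predicted_class)
```

Having the full probability vector, rather than just the winning class, is useful when you want to set your own decision threshold or inspect how confident the model is.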

Split the data

Split the data into training and testing sets:

from sklearn.model_selection import train_test_split

X = classification.drop(['LTV', 'LTVCluster', 'lastPurchase'], axis=1)
Y = classification['LTVCluster'] # Target
# Split the Data in two sets: Train and Eval
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.05, random_state=56)

Instantiate the XGB DMatrix data loaders so that we can conveniently pass our data to the model:

import xgboost as xgb

dtrain = xgb.DMatrix(x_train, label=y_train)
dtest = xgb.DMatrix(x_test, label=y_test)

Use XGBClassifier Neptune CallBack and log all metrics

It’s time to fit the data to our model. We’ll use an XGBClassifier, and we’ll log all the metrics in real time in the experiment Dashboard. Leveraging Neptune’s tight integration with all different sorts of gradient boosting algorithms, we’re able to monitor the performance and progress very easily.

multi_class_XGB = xgb.XGBClassifier(**params)
multi_class_XGB.fit(
    x_train, y_train,
    eval_set=[(x_test, y_test)],
    callbacks=[neptune_callback()],
)


There is a very good video that explains how Neptune XGBoost integration works: How to Use to Track Experimentation: An Example With Structured Data and XGBoost.

If we head back to Neptune and click on the experiment we’ve created, we can visualize the chart for the loss, the loss metric, and the feature importance graph.

ML development - visualization

If we want to see how well our model scores on the testing set, we can print a classification report using the sklearn.metrics package.

from sklearn.metrics import classification_report,confusion_matrix
predict = multi_class_XGB.predict(x_test)
print(classification_report(y_test, predict))
ML development - metrics
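The import above also brings in `confusion_matrix`, which complements the classification report by showing which classes get confused with which. Here is a minimal sketch on made-up labels; the values are illustrative and are not the model’s actual predictions.

```python
from sklearn.metrics import confusion_matrix

# Made-up true and predicted LTV clusters for six customers
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2])
print(cm)
```

The diagonal holds the correct predictions. An off-diagonal entry in row i, column j counts customers from cluster i that were predicted as cluster j.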

Although we’re quite satisfied with the previous results, we can still create another experiment, and tweak or change the hyper-parameters somewhat to get even better results.

params2 = {
    'objective': 'multi:softprob',
    'gamma': 0.1,
    # ... remaining hyper-parameters
}

run = neptune.init_run(
    # ... project name and API token
    tags=['XGBoost', 'Version2'],
)

Let’s train:

multi_class_XGB = xgb.XGBClassifier(**params2)
multi_class_XGB.fit(
    x_train, y_train,
    eval_set=[(x_test, y_test)],
    callbacks=[neptune_callback()],
)


Check accuracy on training and testing sets:

print('Accuracy on Training Set: ', multi_class_XGB.score(x_train, y_train))
print('Accuracy on Testing Set: ', multi_class_XGB.score(x_test[x_train.columns], y_test))

Overall, pretty decent results, almost identical to the previous experiment.

Comparing both experiments

Neptune lets us select multiple experiments and compare them in a dashboard:

ML development - neptune dashboard

We can observe the two experiments side by side, and compare how the parameters of each experiment actually impact the loss on the training, testing and validation sets.


The main goal of this tutorial was to help you quickly get started with Neptune. The tool is very easy to use, and it’s hard to get lost in the UI.

I hope this tutorial was useful to you, as I designed it to cover different aspects of real data science use-cases. I’ll leave you some references to check if you feel that your thirst for knowledge still needs quenching:

Also, don’t forget to check the Neptune documentation website and their YouTube channel, where they have in-depth coverage of all the tools that you’ll need to start working more efficiently.

Don’t forget to check my GitHub repo for the full code from this tutorial: Neptune-Retail
