MLOps Blog

In-Depth ETL in Machine Learning Tutorial – Case Study With Neptune

10 min
21st August, 2023

Most of the time, as data scientists, we think that our core value is our ability to figure out a machine learning algorithm that solves a task. In reality, model training is just the final part of a large body of work, mainly with data, that’s required just to start building a model.

Before ML solutions or highly sophisticated neural networks, an entire data infrastructure should be already in place, tested and ready. Unfortunately, we often take for granted the enormous work needed in the upstream part of the whole data pipeline. 

Data should be our main concern, along with finding a way to transform it into actionable insights. Nowadays, data is considered a high value resource to any company seeking to lead in the rapidly evolving world. 

In this article, I’ll present a set of concepts and processes for building a data infrastructure, and different building blocks of a successful data pipeline. I’ll focus on:

  • Explaining general-purpose data pipelines 
  • ETL (Extract, Transform, Load) processes involving ML practices
  • Automating data workflows using Apache Airflow
  • Training and managing different ML model solutions in Neptune

To make the article more practical and instructive, I’ll use a Credit Scoring dataset to run all the experiments. I’m a strong believer in learning by doing, so let’s jump right into it! 

Note: The complete code to follow along can be found in my Github repo. Don’t hesitate to clone it and try it yourself.  

Data pipelines and workflows

As usually in business, we often deal with large amounts of data for very diverse purposes. Building robust pipelines can get quite complex, especially when data is scarce and transformation processes involve a lot of technicalities. 

Efficient data transportation relies on three important blocks:

  1. Data Producers

Data source points where raw data lies ready to be fetched.

  1. Transformation and Transportation workflows

ETL sub processes that involve stacks of extraction, transformation and data loading layers to route the data to its corresponding end points.

  1. Data Consumers

Final end points that utilize clean and preprocessed information to perform high end taks.

ETL Schema
ETL Schema | Source: InfoLibrarian

Pipelines are very generic and serve different purposes depending on the business plan. They usually share the general concepts, but the implementation will be different case-by-case. 

In our case, the data set is already prepared. What we need is to engineer an ETL process that transforms our data according to what we pretend to do with it.

Read more

How to Build ETL Data Pipeline in ML

Exploring the credit scoring dataset 

In this section, we’ll thoroughly analyse a credit scoring dataset. The dataset will help us implement and test our ETL workflows. 

Banks use credit scoring models to assign credit scores to customers. These scores represent the customers’ overall creditworthiness.The credit score came from the United States in the early 1950s, for creditors to assess the financial strength of clients with a history of credits. Today, this technique has become a real financial institution. This rating is for all Americans with a social security number.

The dataset we’ll be using, “Give Me Some Credit”, is from Kaggle. I’ll give a brief overview of the data structure, explain the nature of each feature, and expose a set of general statistics.

Data features

  • Revolving Utilization Of Unsecured Lines: Total balance on credit cards and personal lines of credit except for real estate, and no installment debt, like car loans, divided by the sum of credit limits
  • Age: Age of borrower in years
  • The number of times 30-59 days past due not worse: Number of times borrower has been 30-59 days past due, but no worse, in the last 2 years
  • Debt Ratio: Monthly debt payments, alimony, living costs divided by monthly gross income
  • Monthly Income: Monthly income per person
  • The number of open credit lines and loans: Number of Open loans (installment like car loan or mortgage) and lines of credit (e.g. credit cards)
  • The number of times 90 days Late: Number of times borrower has been 90 days or more past due
  • The number of real estate loans or lines: Number of mortgage and real estate loans including home equity lines of credit
  • The number of times 60-89 days past due not worse: Number of times borrower has been 60-89 days past due, but no worse, in the last 2 years
  • The number of dependents: Number of dependents in family excluding themselves (spouse, children, etc.)


  • Serious Delinquency in 2 years: Person experienced 90 days past due delinquency or worse

The target indicates the delinquency of a debtor measured in a time window of two years. A value of 1 means that the borrower is delinquent and has defaulted on his loans for the last 2 years. Value of 0 means that the borrower is a good customer and has repaid debts on time for the last two years.

Commonly, most financial industry data contains missing values, or values that don’t make sense for a particular characteristic. This particular dataset is no exception, with inconsistencies regarding the debt and credit balance ratios, and values way over what should be admitted. 

So, we’ll apply data transformations to get rid of all discrepancies that would alter the results in the modelization and training stages.

For now, let’s focus on the descriptive analysis:

Unsurprisingly, the target characteristic is highly unbalanced, which will lead to serious problems in model training. Having 86.3% of debtors classified as bad payers will induce the model to overfit for this particular class, and completely disregard the other.

By looking at the Age distribution, we clearly observe that the age group between the 40s and 50s contains most samples, and the other remaining groups are more or less equally balanced.

Age distribution
Age distribution regarding Frequency and Cumulative Frequency

Age analyses show that the most indebted segment of the population is between 35 and 70 years of age. On top of that, debt is always higher for populations with the lowest monthly salaries.

Debt ratio
Debt ratio distribution regarding Age and Monthly Income

The segment of the population that accumulates the most financial resources is between 45 and 60, which is quite logical and plausible.

Monthly Income distribution by Age
Monthly Income distribution by Age

Revolving debts follow a trend quite similar to long-term debts, with populations that are quite young, and always more and more indebted. The most obvious correlations in the data are those between short and long term debts, age group and salary.

Revolving credits ratio by Age and Monthly Income

Now that we have fairly more knowledge and understanding about the dataset, we can proceed to the ETL processes for data extraction and transformation.

Build and automate your ETL process

In this section, we’ll implement an ETL workflow that extracts data from a csv file to a usable pandas dataframe, and we’ll apply all the required transformations to have a clean and prepared dataset for the ML models. To make things more efficient, we’ll try to simulate an automation cycle for the process to run through.

This section will introduce concepts such as Airflow Directed Acyclic Graphs for process automation, and Factor Analysis procedures to implement data transformations.

Data extraction 

We want to extract the data from the csv file, and make it usable for our experimental purposes. To do, first we create a small Python Data Manager class that will take care of parsing the csv, extract, and format any relevant data for our analysis.

class DataETLManager:
    def __init__(self, root_dir: str, csv_file: str):
        if os.path.exists(root_dir):
            if csv_file.endswith('.csv'):
                self.csv_file = os.path.join(root_dir, csv_file)
                logging.error('The file is not in csv format')
            logging.error('The root dir path does not exist')

        self.credit_scoring_df = pd.read_csv(self.csv_file, sep=',', encoding='ISO-8859-1')
    def extract_data(self):
        return self.credit_scoring_df

    def fetch_columns(self):
        return self.credit_scoring_df.columns.tolist()

    def data_description(self):
        return self.credit_scoring_df.describe()

    def fetch_categorical(self, categorical=False):
        if categorical:
            categorical_columns = list(set(self.credit_scoring_df.columns) - set(self.credit_scoring_df._get_numerical_data().columns))
            categorical_df = self.credit_scoring_df[categorical_columns]
            return categorical_df
            non_categorical = list(set(self.credit_scoring_df._get_numerical_data().columns))
            return self.credit_scoring_df[non_categorical]

Three important methods to take a closer look at:

  • extract_data(): returns the credit scoring data frame we’ve just created
  • fetch_columns(): returns all the columns in the data
  • data_description(): useful if we want a have a quick glance at the structure of the data
  • fetch_categorical(): returns the categorical values in the data frame

We’ll change the column names to be shorter:

credit_df=credit_df.drop('Unnamed: 0', axis=1)
credit_df.columns = ['Target', 'Revolving', 'Age', '30-59PastDue', 'DbtRatio', 'Income', 'NumOpenLines', 'Num90DayLate', 'NumRealEstLines', '60-89PastDueNoW', 'FamMemb']

The result looks like this:

Credit Scoring datatset
Credit Scoring datatset

Data transformation

We’ll focus on two transformation phases: 

  • Preprocessing transformations
  • Analytics transformations

The idea is that we absolutely need to preprocess incoming raw data, eliminate duplicates, drop null and missing values. Furthermore, doing univariate analysis we’ll soon observe that many of the samples had out of range values for ratio variables. Typically, we need to detect and delete outliers. 

With usable data, we’ll start implementing Factor Analysis to extract insightful features that best explain variance and correlation. 

Preprocessing transformations

Dropping duplicates and missing values:

def transform_data(self):
        # 1 Drop Duplicates:
        self.credit_scoring_df.drop_duplicates(keep='last', inplace=True)

        # 2 Drop null values:
        self.credit_scoring_df.dropna(how='all', inplace=True)

Removing outliers: 

  • For the age feature which is a continuous variable from 0 to a maximum of 100. There are certain records, with a value of zero, that don’t make sense and can’t be qualified as a borrower, the person must be an adult of 18 years.
  • Checking for upper bounding and top-coding methods for the debt and revolving ratios with values greater than 1, which means that all values above the upper band will be dropped.
  • Exclude characteristics or records that have significant (more than 50%) missing values from the model, especially if the level of missing is important enough concerning the rate of data unbalancing, which is quite high.
clean_credit = self.credit_scoring_df.loc[self.credit_scoring_df['Revolving'] <= 1]
clean_credit = clean_credit.loc[clean_credit['DbtRatio'] <= 1]
clean_credit = clean_credit.loc[clean_credit['Age'] <= 100]
clean_credit = clean_credit.loc[clean_credit['Age'] >= 18]
clean_credit = clean_credit.loc[clean_credit['FamMemb'] < 20]

Analytics transformations

After cleaning the data, we’ll standardize the data with mean=0 and std deviation=1, except for the binary dependent variable which is the Target. 

def normalize(dataset):
    return dataNorm

clean_scaled_df = normalize(clean_credit)

The data values range from 0 to 1:

Normalized data values
Normalized data values

Start with factor analysis

Before segregating the data for training phases, we want to check correlation between all the variables, to get an idea on how the variables will be combined. 

Factor analysis is a linear statistical model. It’s used to explain the variance among observed variables, and condense a set of observed variables into the unobserved variable, called factors. Observed variables are modeled as a linear combination of factors and error terms. 

First, let’s run an adequacy test to check whether the dataset is suitable for Factor Analysis or not. We’ll perform Barlett’s test of sphericity.

from scipy.stats import chi2, pearsonr
import numpy as np

def barlett_test(frame: pd.DataFrame):
    # print dataframe info:

    col, row = frame.shape
    x_corr = frame.corr()

    corr_det = np.linalg.det(x_corr)
    chi_measure = -np.log(corr_det) * (col - 1 - (2 * row + 5) / 6)
    degrees_of_freedom = row * (row - 1) / 2
    p_value = chi2.sf(statistic, degrees_of_freedom)
    return chi_measure, p_value
chi_square, p_value = barlett_test(clean_scaled_df)
(1003666.113272278, 0.0)

Highly Suitable: The observed correlation is different from the identity matrix, H0 and H1 hypothesis are verified.

We’ll use the factor_analyzer python package, install it with the following command:

pip install factor_analyzer

Fit the data to the FactorAnalyzer class that we’ll run Kaiser criterion internal statistics to come up with EigenValues in the data.

from factor_analyzer import FactorAnalyzer

fa = FactorAnalyzer()

# EigenValues:
ev, v = fa.get_eigenvalues()
eigen_values = pd.DataFrame(ev)

Original EigenValues:

There are four principal components whose eigenvalue is more than 1:

Eigen Values
Eigen Values after Running Factor Analysis

Plotting a Scree plot we can easily visualize the four relevant factors we need:

import matplotlib.pyplot as plt

plt.title('Scree Plot')
Scree plot for the 4 factors
Scree plot for the 4 factors

Learn more

Check how to log charts generated in matplotlib to Neptune.

Let’s perform Factor Analysis rotation on those 4 factors to have a better interpretation. Rotation can be orthogonal or oblique. It helps re-distributing the communalities between the observed variables with a clear pattern of loadings.

fac_rotation = FactorAnalyzer(n_factors=4, rotation='varimax')

Factor 1
Factor 2
Factor 3
Factor 4

SS Loading





Prop Variance





Cum Variance





We conclude that the variance explained by these four components is 51.4%, which is satisfactory. Based on this, we decide to use four factors for further analysis.

Displaying the 4 Factors with their respective observable features, we see that:

Factors and Observable Features
Factors and Observable Features side by side

I Observation

  • The variables 30-59PastDue, 60-89PastDueNoW and Num90DayLate load high on Factor1. This means that this factor causes the borrower to be more and more delinquent. Hence we can name this factor as Financial Struggle.

II Observation

  • The variables DbtRatio, NumOpenLines and NumRealEstLines load high on Factor2. Which means that this factor causes the borrower to engage in more debt and more credit lines. We can name it Finance Requirements.

III Observation

  • Variables Age, FamMemb and Revolving load high on Factor3 with Age being indirectly proportional to Factor 3. This factor has been named as Expendable Income because as age increases, expendable income increases.

IV Observation

  • The Variables Income and FamMemb load high on Factor4. So we can easily name it Behavioral LifeStyle because as income increases, so does our lifestyle and the lifestyle of our family members. 

So, now we have the four factors for ML analysis:

  • Financial Struggle
  • Finance Requirements
  • Expendable Income
  • Behavioral Lifestyle

Automate the ETL process with DAGs

Once the Extract and Transform pipelines are formed and ready for deployment, we can start thinking of a way to store the previous data in a database. 

The Load process comes into picture, to simplify and for the purposes of illustration, we’re just going to load the previous transformed data frame into a local MySQL database using the Sql_Alchemy python package. 

pip install SQLAlchemy
from sqlalchemy.engine import create_engine

def load_data(self):
        database_config = {
            'username': 'your_username',
            'password': 'your_pass',
            'host': '',

        # create database connection using engine:
        engine = create_engine('mysql+mysqlconnector://{}:{}@{}:{}/{}'.format(

        data_to_load = type(pd.DataFrame())(self.credit_scoring_df)
            data_to_load.to_sql('Credit Scoring', con=engine, if_exists='append', index=False)
        except Exception as err:

Airflow DAGs: Directed Acyclic Graphs

Airflow schedules automated data workflows that include multiple tasks or processes that share specific dependencies. The Airflow documentation defines DAGs as follows:

In Airflow, a DAG – or a Directed Acyclic Graph – is a collection of all the tasks you want to run, organized in a way that reflects their relationships and dependencies. A DAG is defined in a Python script, which represents the DAGs structure (tasks and their dependencies) as code

To install and start using Airflow, you can check the website, which explains each step thoroughly: AirFlow Doc.

In our case, we’ll code a small DAG file to simulate the automation of our ETL. We’ll schedule the DAG to run daily starting from the 2021/03/25. The DAG will have three PythonOperators representing the Extract, Transform and Load functions. 

from datetime import timedelta, datetime
from airflow import DAG
from airflow.operators.python import PythonOperator
from etl_process import DataETLManager, DATA_PATH

default_dag_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2021, 3, 9),
    'email': [''],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 2,
    'retry_delay': timedelta(minutes=1),

etl_dag = DAG(
    description='DAG for ETL retail process',

The PyhtonOperators in charge of running each ETL process:

etl_manager = DataETLManager(DATA_PATH, 'OnlineRetail.csv')

extract = PythonOperator(

transform = PythonOperator(

load = PythonOperator(

Finally, we define tasks correlation: Extract, then Transform, then Load into the Database.

extract >> transform >gt; load

Run predictions and benchmark results

In this part, we’ll segregate the transformed and reduced data we’ve obtained into training and testing sets, and use three ensemble algorithms to predict serious delinquency using the 4 factors that we previously pulled out. 

We’ll be integrating all our ML development with Neptune, to keep track and compare all the metrics that we’re getting along the way.

Using three ML models, we’ll be able to compare the results and see where each approach can perform better. We’ll be testing three ML models for this particular task:

  • Logistic Regression
  • Decision Trees
  • Extreme Gradient Boosting

Set Neptune for development

Install the required neptune client libraries to start integrating your code with Neptune. Check the Neptune documentation to learn how to do it.

Logistic regression

The main goal of regression models is to find a combination between the predictors that best fit a set of data according to a specific mathematical criterion.

The most common regression model used in the case of a categorical response data problem is logistic regression. Typically, logistic regression uses, in its basic form, a logistic function to model a binary dependent variable based on various predictors.

The training of the model is notably straightforward. We used the well-known machine learning library SciKit-Learn to implement the logistic model. The effort made previously on the data processing will help us quite a lot during this part.

The table with previously defined factors looks like this:

ETL final_table
Table that aggregates the values for the 4 factors
credit_scoring_final = pd.DataFrame(new_data_frame, columns=['Financial_Struggle', 'Finance_Requirements', 'Expendable_Income', 'Behavioral_LifeStyle'])

Segregate the data in training and testing sets:

X = credit_scoring_final
x_train, x_test, y_train, y_test = train_test_split(X, target, test_size=0.25, random_state=56)

Train with Logistic Regression:

from sklearn.linear_model import LogisticRegression

model = LogisticRegression(C=0.00026366508987303583, class_weight=None, dual=False, max_iter=100, multi_class='auto', n_jobs=None, penalty='l1',
random_state=None, solver='saga'), y_train)

Test Results:

from sklearn.metrics import accuracy_score, classification_report

predictions = model.predict(x_test)
test_accuracy = accuracy_score(y_test, predictions)
Factor 4






ROC Curve and Confusion Matrix:

ROC Curve and Confusion Matrix
ROC Curve and Confusion Matrix for the Logistic Regression

The model will always predict largely better bad debtors over good debtors, because of their proportion in the data. It’s inevitable, and regression models can’t surpass this limitation, no matter the level of engineering involved.


EXtreme Gradient Boosting (XGBoost) is an optimized and parallelized open source implementation of gradient boosting created by Tianqi Chen, a PhD student at the University of Washington. 

XGBoost uses decision trees (like random forest) to solve classification (binary & multi-class), ranking, and regression problems. So, we’re in the area of supervised learning algorithms here.

Start by initializing the neptune experiment:

import neptune
from from neptune.integrations.xgboost import NeptuneCallback 

params_attempt3 = {
    'colsample_bytree': 0.7,
    'gamma': 0.3,
    'n_estimator': 100,
    'objective': 'binary:logistic',
    'eval_metric': 'error'

run = neptune.init_run(
    name='CreditScoring XGB',
    tags=['XGBoost', 'Credit Scoring'],

Split the data and Instantiate the DMatrix data loaders:

x_train, x_test, y_train, y_test = train_test_split(X, target, test_size=0.25, random_state=56)

dtrain = xgb.DMatrix(x_train, label=y_train)
dtest = xgb.DMatrix(x_test, label=y_test)

Let’s start training the model and track each metric using Neptune XGBoost integration.

import xgboost as xgb
import neptune
from neptune.integrations.xgboost import NeptuneCallback

xgb_classifer = xgb.XGBClassifier(**params_attempt3)
    eval_set=[(x_test, y_test)],
    callbacks=[NeptuneCallback(log_tree=[0, 1])])


Head back to Neptune to check the loss metrics and the Feature importance graph:

Charts of train and eval loss
Charts of train and eval loss: Neptune web UI
Feature importance graph
Feature importance graph for the 4 factors

We can even take a look at the Internal estimators of the model: the Graph of trees that XGBoost uses internally.

Internal XGBoost estimators
Internal XGBoost estimators (click to enlarge)

XGBoost Results:






ROC Curve and Confusion Matrix for the XGBoost performer

Decision tree

The advantage of decision trees is that they’re simple to interpret, very quick to train, non-parametric, and require very little data preprocessing.They can be calculated automatically, by supervised learning algorithms capable of automatically selecting the discriminating variables within unstructured and potentially large data.

We’ll be testing a light implementation of Decision Trees for this prepared dataset, and then benchmark the three models to see which performs better.

Train the model

from sklearn.tree import DecisionTreeClassifier

classifier = DecisionTreeClassifier(), y_train)


preds = classifier.predict(x_test)
print('Accuracy Score: ', metrics.accuracy_score(y_test, preds))





ROC Curve and Confusion Matrix for the Decision Tree model

Benchmark of the three models

After training Logistic Regression (LR), Decision Tree (DT) and eXtreme Gradient Boosting (XGBoost) multiple times to avoid the single-result bias, the best performing iteration of each model will be used to do the comparison.


Logistic regression



Decision tree






We see that all the models are pretty close in terms of accuracy, even though XGBoost did better overall. The most interesting metric is the F1-Score which is better at evaluating the level of confusion of the models in terms of identifying the right classes. XGBoost did better at predicting the right classes and differentiating the “bad” and “good” debtors.


Throughout this tutorial, we did an overall tour of different aspects involved in data science work, like:

  • Data Preprocessing and Transformation,
  • Data pipeline automation,
  • Statistical Analysis and Dimensionality reduction,
  • ML development and training,
  • Benchmarking multiple model variants and picking the right one,
  • ML workflow using Neptune.

I hope this tutorial was useful to you, as I’ve designed it to fully cover different aspects of real data science use-cases.  If you feel that your thirst for knowledge still needs quenching go ahead check the references below. Have fun!


Was the article useful?

Thank you for your feedback!