
Data Cleaning Process: What Should It Look Like?

8 min
18th August, 2023

Data cleaning is the process of preparing data for analysis by modifying, adding to, or removing parts of it. This process is also commonly referred to as data preprocessing. It's important for data scientists and machine learning engineers to be skilled at data cleaning, because every insight they or their models derive from the data depends directly on the quality of the preprocessing done on it.

In this blog post, we'll walk through the concept of data cleaning, with a particular focus on categorical features.

Along the way, I'll explain each step in as much detail as possible, with illustrations and code demonstrations to aid understanding.

By the end of this article, you'll not only know what data cleaning is, you'll also have a solid grasp of the steps involved, how to carry them out, and above all how to get the best from your data. So, let's dive in.

Why do we need to clean our data?

At times, data needs to be cleaned or preprocessed before we can extract useful information from it. Most real-life data has many inconsistencies, such as missing values and non-informative features, so there's a constant need to clean it before use if we want to get the best from it.

Take, for example, the fact that machine learning models work only with numbers. If we intend to use data that has text, datetime features, or other non-numeric features, we need to find a way to represent them as numbers without losing the information they carry.

Say for example:

A simple linear regression model maps an independent variable x to a dependent variable y using a linear function such as: y = wx + b

Where:

w is the weight, which can be any number,

and b is the bias, which is also a number.

If you're given the data [small, medium, big], you'll find that it's impossible to compute: y = w*small + b,

but if you encode small as 1, medium as 2, and big as 3, thereby transforming your data to [1, 2, 3], you'll discover that you're not only able to compute y = w*1 + b, but that the information encoded in the categories is still retained. Cases like this, and more, are what data cleaning and preprocessing are all about.
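As a quick illustration, here is a minimal sketch of that mapping in Python; the mapping itself and the values of w and b are arbitrary choices made just for this example:

# hand-picked ordinal mapping for the categories (an arbitrary choice for illustration)
encoding = {'small': 1, 'medium': 2, 'big': 3}

data = ['small', 'medium', 'big']
x_values = [encoding[value] for value in data]   # [1, 2, 3]

w, b = 2.0, 0.5                                  # arbitrary example weight and bias
y_values = [w * x + b for x in x_values]         # now y = w*x + b is computable
print(y_values)                                  # [2.5, 4.5, 6.5]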

What are the steps in data cleaning?

Data cleaning is the collective name for a series of actions we perform on our data to get it ready for analysis.

Some of the steps in data cleaning are:

  • Handling missing values
  • Encoding categorical features
  • Outlier detection
  • Transformations, and more

Handling missing values


One of the most common problems encountered while preprocessing data is missing values. They are very common and can result from:

  1. Omission by the person filling out the form, whether intentional or unintentional, or because the question simply doesn't apply to them.
  2. Omission by the person who entered the data into the computer.

A data scientist who isn't careful could draw the wrong inference from data that contains missing values, which is why we need to study this phenomenon and learn to tackle it effectively.

Early machine learning libraries such as scikit-learn did not allow missing values to be passed into them. This posed a challenge, as the data scientist had to iterate over many approaches to handling missing values before passing the data to a scikit-learn model. This barrier has been lifted by more recent machine learning models and frameworks, most notably gradient boosting machines such as XGBoost, CatBoost, LightGBM, and many more.

One approach I personally fancy is CatBoost's, which lets you choose among three options (Forbidden, Min, and Max). Forbidden treats missing values as errors, Min treats them as smaller than all other values in the particular feature (column), and Max treats them as larger than all other values. This way, we are certain that missing values are also considered during the fundamental decision-tree splitting of the features.
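Here's a minimal sketch of how this option can be set; it assumes the catboost package is installed and that X_train and y_train already exist:

from catboost import CatBoostClassifier

# nan_mode controls how CatBoost treats missing values:
# 'Forbidden' raises an error, while 'Min'/'Max' treat them as smaller/larger
# than every other value of the feature when searching for splits
model = CatBoostClassifier(nan_mode='Min', verbose=0)
# model.fit(X_train, y_train)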

LightGBM and XGBoost also handle missing values in pretty convenient ways. However, when preprocessing, try as many methods as possible: handling the missing values yourself, relying on libraries like the ones we've discussed above, or using other tools such as AutoML frameworks.


Missing values generally fall into three categories which are:

  1. Missing completely at random (MCAR): A value is missing completely at random if we have no information, reason, or anything whatsoever that can help account for why it is missing. For example, “I am a very sleepy graduate student who accidentally knocks coffee on some of the written surveys we’ve collected, so we lose all of the data that we otherwise would have had.”
  2. Missing at random (MAR): A value is missing at random if we have information, reason, or anything whatsoever (especially from other known values) that can help account for why it is missing. For example, “I administer a survey that includes a question about someone’s income. Those who are female are less likely to respond to the question about income.”
  3. Not missing at random (NMAR): The value of the variable that’s missing is related to the reason it’s missing. For example, “I administer a survey that includes a question about someone’s income. Those with low incomes are significantly less likely to respond to that question.” Hence, we know why such data points may be missing.

How do we detect missing values in Python?

Before we handle missing values, it's only logical to learn how to detect them, and to decide how to handle our data based on the volume of missing values, how much data we have, and so on. One approach I enjoy and use often is to set a threshold on a column's missing-value rate to decide whether it's repairable or beyond repair (50-60% does it for me).

Below, you’ll see a function that implements the idea we discussed in the previous paragraph.

def missing_removal(df, thresh, confirm=None):
    # collect columns whose fraction of missing values exceeds the threshold
    holder = {}
    for col in df.columns:
        rate = df[col].isnull().sum() / df.shape[0]
        if rate > thresh:
            holder[col] = rate
    if confirm:
        # drop every column whose missing-value rate is above the threshold
        df.drop(columns=list(holder), inplace=True)
        return df
    else:
        print(f'Number of columns with NaN rates above the threshold: {len(holder)}')
        return holder

Quick note:

If confirm is set to True, all columns whose missing-value percentage is higher than the threshold are dropped; if confirm is set to None or False, the function returns a dictionary of those columns and their missing-value rates. Try it out on your data, as in the sketch below.
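For example, here's a minimal usage sketch, assuming df is a pandas DataFrame you've already loaded (the file path is just a placeholder):

import pandas as pd

df = pd.read_csv('data.csv')   # placeholder path, just for illustration

# inspect which columns have more than 50% missing values
report = missing_removal(df, thresh=0.5)
print(report)

# actually drop those columns
df = missing_removal(df, thresh=0.5, confirm=True)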

Now that we're able to detect the missing values in our data, and have dropped the irredeemable columns and kept the redeemable ones, let's go ahead and discuss how to actually fix them.

There are two ways to go about this:

1. Statistical imputation:

This is a method that has proven to be effective over time. All you do is fill the missing entries in a column with the mean, median, or mode of that column. It works, trust me.

Scikit-learn provides a class called SimpleImputer to handle our missing values this way.

Below is a short demonstration that should help you understand statistical imputation better.

import pandas as pd
from sklearn.impute import SimpleImputer

# store the columns of the dataframe
cols = df.columns

# instantiate the SimpleImputer
# depending on how you want to impute, you can use most_frequent, mean, median or constant
imputer = SimpleImputer(strategy='most_frequent')
df = imputer.fit_transform(df)

# fit_transform returns an array, so convert back to a dataframe
df = pd.DataFrame(df, columns=cols)

2. Multiple Imputation by Chained Equations (MICE):

In this method, a column and its missing values are modelled as a function of the other columns in the data. This process is repeated until the difference between the previously calculated values and the current ones becomes small enough to fall below a given tolerance.

Scikit-learn provides a class called IterativeImputer (still experimental) to handle our missing values this way.

Below is a short demonstration that hopefully solidifies this concept for you.

# IterativeImputer is still experimental, so it has to be enabled explicitly
from sklearn.experimental import enable_iterative_imputer  # noqa
from sklearn.impute import IterativeImputer
import pandas as pd

# store the columns of the dataframe
cols = df.columns

# instantiate IterativeImputer
imputer = IterativeImputer(max_iter=3)
df = imputer.fit_transform(df)

# convert back to dataframe
df = pd.DataFrame(df, columns=cols)

Hopefully, all we’ve discussed so far is enough for you to handle missing values effectively.

For more on missing values, check this awesome material by Matt Brems.

Encoding categorical features


What are categorical features?

Well, my friends, categorical features are features that take on only discrete values. They do not take on continuous values such as 3.45 or 2.67. A categorical feature could have values like big, medium, and small; a 1-5 ranking; Yes and No; 1 and 0; yellow, red, blue; and so on. They basically represent categories, e.g. age groups, countries, colors, gender, etc.

Oftentimes they come in text format; other times they come in number format (and people often fail to identify them in that format).

How to identify categorical features?

This shouldn't be a big deal. Most of the time, categorical features come in text format, and that's all you need to identify them.

But what if they're a form of ranking and actually come in number format? What you can do is check the number of unique values in a column and compare it to the number of rows. Say, for example, a column has 2,000 rows and just 5 or 10 unique values; you most likely don't need a sage to tell you that it's a categorical column. There's no hard rule to this, it's merely intuition, and you could be right or wrong, so feel free to try things out. A quick sketch of this check is shown below.
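Here's a minimal sketch of that heuristic, assuming df is your pandas DataFrame; the 5% cut-off is an arbitrary choice made just for this example:

# flag columns that have very few unique values relative to the number of rows
likely_categorical = []
for col in df.columns:
    unique_ratio = df[col].nunique() / len(df)
    if unique_ratio < 0.05:            # arbitrary threshold for this example
        likely_categorical.append(col)

print(likely_categorical)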

Classes of encoding

Given a categorical feature, as we've examined above, we're faced with the problem of converting the unique categories in each feature to numbers while not losing the information encoded in them. There are various methods of encoding categorical features, depending on some observable properties of the feature.

There are also two classes of categorical features:

  1. Ordinal categorical features: There is an inherent order or relationship between the categories of the feature, e.g. size (small, medium, big), age groups, and so on.
  2. Nominal categorical features: There's no sensible order between the categories of the feature, e.g. countries, cities, names, etc.

The two classes above are approached differently. Some of the methods I use for nominal categorical features are:

  • One-hot encoding
  • Frequency/count encoding
  • Target mean encoding
  • Ordered integer encoding
  • Binary encoding
  • Leave one out encoding
  • Weight of evidence encoding

For ordinal categorical features, we only use:

  • Label encoding or ordinal encoding

Now we'll work through some of these methods one after the other. We'll also implement them in Python where possible, using a Python library called category_encoders.

You can install the library using pip install category_encoders.

One-hot encoding

One-hot encoding is one of the most effective ways of encoding nominal categorical features. This method creates a new binary column for each category in the feature. Ideally, we drop one of these columns to avoid collinearity between the columns; therefore, a feature with K unique categories is represented by K-1 binary columns.

A downside to this method is that it expands the feature space when a feature has many unique categories or when there are many categorical features in the data.

[Illustration: one-hot encoding]

The illustration above explains the concept of one-hot encoding. Encoding features in this manner eliminates any form of hierarchy between the categories.

An implementation of this method in python is described below:

import pandas as pd

# cat_cols is the list of categorical columns we wish to encode
data = pd.get_dummies(data, columns=cat_cols, drop_first=True)
# check https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html for more info

Frequency/count encoding

This method is equally effective. It introduces a hierarchy among the categories of a nominal feature based on how often each category occurs in the feature (column). Frequency encoding is very similar to count encoding; the difference is that count encoding can take any value, while frequency encoding is normalized to the range 0 to 1.

[Illustration: frequency/count encoding]

An implementation of this method in python is described below:

# Frequency encoding
# cols is the list of columns we wish to encode
# df is the DataFrame
def freq_enc(df, cols):
    for col in cols:
        # map each category to its relative frequency, rounded to 4 decimal places
        df[col] = df[col].map(round(df[col].value_counts() / len(df), 4))
    return df

# Count encoding
def count_enc(df, cols):
    for col in cols:
        # map each category to the number of times it occurs
        df[col] = df[col].map(df[col].value_counts())
    return df

Target mean encoding

The idea behind this method is actually very clever. It has something unique about it in the sense that it uses the target column in its computation, which is something you rarely find in general machine learning practice. Each category in the categorical feature is replaced by the mean of the target column over the rows belonging to that category.

The method is an excellent one, but it can lead to over-fitting if it encodes too much information about the target. Hence, before using this method, we need to make sure the categorical feature isn't highly correlated with the target column.

The test data is then encoded by mapping the values computed and stored from the train data onto it.

[Illustration: target mean encoding]

An implementation of this method in python is described below:

import numpy as np

# target mean encoding
# cols is the list of columns we wish to encode
# target_col is the name of the target column (str)
def target_mean_enc(df, cols, target_col):
    mean_holder = {}
    for col in cols:
        col_mean = {}
        for cat in df[col].unique():
            # mean of the target over the rows belonging to this category
            col_mean[cat] = np.mean(df[df[col] == cat][target_col])
        mean_holder[col] = col_mean
    return mean_holder

The function above returns a dictionary containing the mean values for each column that is encoded. The dictionary is then mapped onto the data, as shown below:

[Illustration: mapping the target means onto the data]
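Here's a minimal sketch of that mapping step, assuming df, cols, and target_col are the same objects used with the function above:

mean_holder = target_mean_enc(df, cols, target_col)

# replace each category with its stored target mean
for col in cols:
    df[col] = df[col].map(mean_holder[col])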

Ordered integer encoding

This method is very similar to target mean encoding, except that it goes a step further and ranks the categories based on the magnitude of their target mean.

[Illustration: ordered integer encoding]

After implementing the ordered integer encoding, the dataframe looks like this:

[Illustration: the dataframe after ordered integer encoding]

An implementation of this method in python is described below:

# cols is the list of columns we wish to encode
# target_col is the name of the target column (str)
def ordered_integer_encoder(data, cols, target_col):
    mean_holder = {}
    for col in cols:
        # rank categories by their target mean, in ascending order
        labels = list(enumerate(data.groupby([col])[target_col].mean().sort_values().index))
        col_mean = {value: order for order, value in labels}
        mean_holder[col] = col_mean
    return mean_holder

Leave one out encoding

This method is also very similar to target mean encoding, except that for each row it calculates the target mean of the category while leaving that particular row out. Plain target mean encoding is still used on the test data (check the category_encoders documentation for more).

 An implementation of this method in python is described below:

from category_encoders import LeaveOneOutEncoder

# 'target' is a stand-in name for your target column
encoder = LeaveOneOutEncoder(cols=['STATUS'])
encoder.fit(train_data, train_data['target'])
train = encoder.transform(train_data)
test = encoder.transform(test_data)

Binary encoding

Binary encoding works in a unique way. It is somewhat similar to one-hot encoding in that it also creates extra features. First, it assigns ranks to the unique categories based on the order in which they occur in the data (fret not, this rank has no meaning). The ranks are then converted to the base-two number system (binary), and the individual binary digits are split into separate columns. For example, a category ranked 5 becomes 101 in binary, which is stored across three columns as 1, 0, and 1.

A demo of this method using category_encoders library is described below:

from category_encoders import BinaryEncoder

encoder = BinaryEncoder(cols=['STATUS'])
encoder.fit(train_data)  # fit on the training data only
train = encoder.transform(train_data)
test = encoder.transform(test_data)

Weight of evidence encoding

This is a method that has been used in credit risk analysis for up to seven decades now. It is commonly used to transform features for a logistic regression task, as it helps to reveal correlations that may not be visible to us beforehand. This method can only be used in a classification task.

It simply transforms a categorical column by applying ln(p(good)/p(bad)) to each category in the column,

where:

p(good) is the probability of one class of the target column, say p(1),

and p(bad) is the probability of the second class, which could just be p(0).
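For intuition, here's a minimal sketch that applies the formula above by hand; it assumes a binary target column named target and adds a small constant to avoid division by zero:

import numpy as np

def manual_woe(df, col, target_col='target', eps=1e-6):
    woe_map = {}
    for cat in df[col].unique():
        subset = df[df[col] == cat]
        p_good = (subset[target_col] == 1).mean()   # share of class 1 in this category
        p_bad = (subset[target_col] == 0).mean()    # share of class 0 in this category
        woe_map[cat] = np.log((p_good + eps) / (p_bad + eps))
    return woe_map

# df['STATUS'] = df['STATUS'].map(manual_woe(df, 'STATUS'))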

[Illustration: weight of evidence encoding]

An implementation of this method using category_encoders is described below:

from category_encoders import WOEEncoder

# WOEEncoder needs the binary target at fit time; 'target' is a stand-in name
encoder = WOEEncoder(cols=['STATUS'])
encoder.fit(train_data, train_data['target'])
train = encoder.transform(train_data)
test = encoder.transform(test_data)

Label encoding or ordinal encoding

This method is used for encoding ordinal categorical features: we simply assign a number to each category based on the magnitude we can infer from it.

[Illustration: label encoding]

An implementation of this method in python is described below:

df['size'] = df['size'].map({'small': 1, 'medium': 2, 'big': 3})
df

Data cleaning tools and libraries

During the course of the article, references were made to certain machine learning libraries such as scikit-learn and category_encoders. Some other Python libraries that help with data preprocessing include NumPy, pandas, seaborn, Matplotlib, imbalanced-learn (imblearn), and so on.

However, if you're the type that doesn't fancy writing too much code, you can check out tools such as OpenRefine, Trifacta Wrangler, and so on.

Conclusion

The code implementation, with clear examples of most of the concepts we've discussed so far, can be found here.

I hope that this article was able to give you a strong insight into the concept of categorical encoding and how to fill missing values.

Which should you use? 

Data cleaning is an iterative process, hence try out as many of them as you like and stick to the one that works best for your data.
