According to International Data Corporation's (IDC) 2020 forecast, 59 zettabytes of data would be created, consumed, captured and copied in 2020 alone. The forecast gets more interesting when we go back to 2012, when IDC predicted that the entire digital universe would reach only 40 zettabytes by 2020 — that is, cumulatively, not in a single year.
This wide gap between forecasted and actual numbers has its reasons, the biggest being the COVID-19 pandemic. Global quarantine sent everyone online, and data generation spiked dramatically. Not surprisingly, the years 2020 to 2024 have already been named the years of the COVID-19 Data Bump. This calls for people, processes and technology that can manage and optimize data to extract profitable insights, and extract them fast.
Skills like data management and data processing are becoming extremely valuable, which is why this article is a comprehensive guide to data preprocessing steps and techniques, to get you going in this new world.
What is data preprocessing?
A considerable chunk of any data-related project is data preprocessing: data scientists spend around 80% of their time preparing and managing data. Data preprocessing is the method of analyzing, filtering, transforming and encoding data so that a machine learning algorithm can understand and work with the processed output.
Why is data preprocessing necessary?
Algorithms that learn from data are simply statistical equations operating on values from the database. So, as the popular saying goes, "garbage in, garbage out". Your data project can only be successful if the data going into the machines is of high quality.
In data extracted from real-world scenarios, there’s always noise and missing values. This happens due to manual errors, unexpected events, technical issues, or a variety of other obstacles. Incomplete and noisy data can’t be consumed by algorithms, because they’re usually not designed to handle missing values, and the noise causes disruption in the true pattern of the sample. Data preprocessing aims to solve these problems by thorough treatment of the data at hand.
How to go about data preprocessing?
Before we get into the details of step-by-step data preprocessing, let's take a look at a few tools and libraries that make data preprocessing much more manageable to execute!
Tools and libraries
Data preprocessing steps can be simplified through tools and libraries that make the process easier to manage and execute. Without these libraries, one-liner solutions could take hours of coding to develop and optimize.
Data Preprocessing with Python: Python is a programming language that supports countless open source libraries that can compute complex operations with a single line of code. For instance, for smart imputation of missing values, you need only use scikit-learn's impute package. Or, for scaling datasets, just call the MinMaxScaler function from the preprocessing package, which offers countless other data preprocessing functions.
Automunge is another great tool, built as a Python library, that prepares tabular data for the direct application of machine learning algorithms.
Data Preprocessing with R: R is a programming language mostly used for research and academic purposes. It has multiple packages which, much like Python libraries, support data preprocessing steps considerably.
Data Preprocessing with Weka: Weka is a software suite that supports data mining and data preprocessing through built-in data preprocessing tools and machine learning models for intelligent mining.
Data Preprocessing with RapidMiner: Similar to Weka, RapidMiner is an open source software that has various efficient tools for supporting data preprocessing.
Now that we have the appropriate tools to support multiple functions, let’s dive deep into the data preprocessing steps.
Purpose of data preprocessing
After you have properly gathered the data, it needs to be explored, or assessed, to spot key trends and inconsistencies. The main goals of Data Quality Assessment are:
- Get a data overview: understand the data formats and the overall structure in which the data is stored. Also, find properties of the data like the mean, median, quantiles and standard deviation. These details can help identify irregularities in the data.
- Identify missing data: missing data is common in most real-world datasets. It can disrupt true data patterns, and even lead to more data loss when entire rows and columns are removed because of a few missing cells in the dataset.
- Identify outliers or anomalous data: some data points fall far out of the predominant data patterns. These points are outliers, and might need to be discarded to get predictions with higher accuracies, unless the primary purpose of the algorithm is to detect anomalies.
- Remove inconsistencies: just like missing values, real-world data also has multiple inconsistencies, like incorrect spellings, incorrectly populated columns and rows (e.g., salary populated in the gender column), duplicated data and much more. Sometimes these inconsistencies can be treated through automation, but most often they need a manual check-up.
Below are some popular data pre-processing techniques that can help you meet the above goals:
Handling missing values
Missing values are a recurrent problem in real-world datasets because real-life data has physical and manual limitations. For example, if data is captured by sensors from a particular source, the sensor might stop working for a while, leading to missing data. Similarly, different datasets have different issues that cause missing data points.
We need to handle these missing values to optimally leverage available data. These are some tried and tested ways:
- Drop samples with missing values: this works well when the number of samples is high and the count of missing values in a single row/sample is also high. It is not recommended in other cases, since it leads to heavy data loss.
- Replace missing values with zero: sometimes this technique works for basic datasets, where zero is a natural base value signifying absence. In most cases, however, zero signifies a value in itself; for example, if a sensor generates temperature values and the dataset belongs to a tropical region, populating missing values with 0 would mislead the model. Use 0 as a replacement only when the dataset is independent of its effect. For example, in phone bill data, a missing value in the billed-amount column can be replaced by zero, since it might indicate that the user didn't subscribe to a plan that month.
- Replace missing value with mean, median or mode: you can deal with the above problem, resulting from using 0 incorrectly, by using statistical functions like mean, median or mode as a replacement for missing values. Even though they’re also assumptions, these values make more sense and are closer approximations when compared to one single value like 0.
- Interpolate the missing values: interpolation helps to generate values inside a range based on a given step size. For instance, if there are 9 missing values in a column between cells with values 0 and 10, interpolation will populate the missing cells with numbers from 1 to 9. Understandably, the dataset needs to be sorted according to a more reliable variable (like the serial number) before interpolation.
- Extrapolate missing values: extrapolation helps to populate values which are beyond a given range, like the extreme values of a feature. Extrapolation takes the help of another variable (usually the target variable) to compare the variable in question and populate it with a guided reference.
- Build a model with other features to predict the missing values: by far the most intuitive of all techniques we’ve mentioned. Here, an algorithm studies all the variables except the actual target variable (since that would lead to data leakage). The target variable for this algorithm becomes the feature with missing values. The model, if well trained, can predict the missing points and provide the closest approximations.
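Most of these techniques (short of model-based imputation) reduce to one-liners in pandas. A minimal sketch, using invented sensor readings purely for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical sensor readings with gaps (illustrative values only)
s = pd.Series([20.0, np.nan, 22.0, np.nan, np.nan, 25.0])

dropped = s.dropna()              # drop samples with missing values
zero_filled = s.fillna(0)         # replace missing values with zero
mean_filled = s.fillna(s.mean())  # replace with the mean (median/mode work the same way)
interpolated = s.interpolate()    # linear interpolation between known neighbours
```

For the model-based approach, scikit-learn's SimpleImputer wraps the statistical strategies behind a fit/transform interface, and its experimental IterativeImputer predicts each feature's missing values from the other features.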
Feature scaling
Different columns can be present in different ranges. For example, there can be a column with a unit of distance, and another with a unit of currency. These two columns will have starkly different ranges, making it difficult for any machine learning model to reach an optimal computation state.
In more technical terms, if one considers using Gradient Descent, it will take longer for the gradient descent algorithm to converge, since it has to process different ranges that are far apart. The same is demonstrated in the figure below.
The diagram on the left has scaled features. This means that features are brought down to values which are comparable with one another, so the optimization function doesn't have to take major leaps to reach the optimal point. Scaling is not necessary for algorithms (like the decision tree) which are not distance-based. Distance-based models, however, must have scaled features without exception.
Some popular scaling techniques are:
- Min-Max Scaler: min-max scaler shrinks the feature values between any range of choice. For example, between 0 and 5.
- Standard Scaler: a standard scaler assumes that the variable is normally distributed and then scales it down so that the standard deviation is 1 and the distribution is centered at 0.
- Robust Scaler: robust scaler works best when there are outliers in the dataset. It scales the data with respect to the inter-quartile range after removing the median.
- Max-Abs Scaler: similar to min-max scaler, but instead of a given range, the feature is scaled to its maximum absolute value. The sparsity of the data is preserved since it does not center the data.
scikit-learn's preprocessing package provides a one-liner solution for each of the above scaling methods.
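A quick sketch of those one-liners, on a toy matrix invented to show two features on starkly different ranges:

```python
import numpy as np
from sklearn.preprocessing import (MaxAbsScaler, MinMaxScaler,
                                   RobustScaler, StandardScaler)

# Two features on very different ranges (illustrative values)
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0],
              [4.0, 400.0]])

minmax = MinMaxScaler(feature_range=(0, 5)).fit_transform(X)  # any range of choice
standard = StandardScaler().fit_transform(X)                  # mean 0, std 1
robust = RobustScaler().fit_transform(X)                      # median/IQR based
maxabs = MaxAbsScaler().fit_transform(X)                      # divide by max |value|
```

After scaling, both columns land on comparable scales, which is exactly what distance-based models need.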
Handling outliers
Outliers are data points that do not conform with the predominant pattern observed in the data. They can cause disruptions in the predictions by taking the calculations off the actual pattern.
Outliers can be detected and treated with the help of box-plots. Box plots are used to identify the median, interquartile ranges and outliers. To remove the outliers, the maximum and minimum range needs to be noted, and the variable can be filtered accordingly.
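The same whisker bounds a box plot draws can be computed directly. A minimal sketch, with an invented series in which 1000 is the obvious outlier:

```python
import pandas as pd

values = pd.Series([10, 12, 11, 13, 12, 1000, 11, 12])  # 1000 is anomalous

q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1                                   # interquartile range
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # standard box-plot whiskers

filtered = values[values.between(lower, upper)]  # outlier removed
```

Whether the filtered points should actually be dropped depends, as noted above, on whether anomaly detection is the goal.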
Categorical encoding
Sometimes, data is in a format that can’t be processed by machines. For instance, a column with string values, like names, will mean nothing to a model that depends only on numbers. So, we need to process the data to help the model interpret it. This method is called categorical encoding. There are multiple ways in which you can encode categories. Here are a few to get you started:
- Label/Ordinal Encoding: assigns values from 1 to n in an ordinal (sequential) manner, where ‘n’ is the number of unique categories in the column. If a column has 3 city names, label encoding will assign the values 1, 2 and 3 to the different cities. This method is not recommended when the categorical values have no inherent order, like cities, but it works well with ordered categories, like student grades.
- One hot encoding: when data has no inherent order, you can use one hot encoding. One hot encoding generates one column for every category and assigns a positive value (1) in whichever row that category is present, and 0 when it’s absent. The disadvantage is that multiple features get generated from one feature, making the data bulky. This is not a problem when the number of categories is small.
- Binary Encoding: this solves the bulkiness of one hot encoding. Every categorical value gets converted to its binary representation, and for each binary digit a new column is created. This compresses the number of columns compared to one hot encoding. With 100 values in a categorical column, one hot encoding will create 100 (or 99) new columns, whereas binary encoding will create only 7, since 7 binary digits can represent up to 128 distinct values.
- BaseN Encoding: this is similar to binary encoding, with the only difference of base. Instead of base 2 as with binary, any other base can be used for baseN encoding. The higher the base number, the higher the information loss, but the encoder’s compression power will also keep increasing. A fair trade-off.
- Hashing: hashing means generating values from a category with the use of mathematical functions. It’s like one hot encoding (with a true/false function), but with a more complex function and fewer dimensions. There is some information loss in hashing due to collisions of resulting values.
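The first two encoders are built into pandas; binary, baseN and hashing encoders live in the third-party category_encoders package. A minimal sketch, with invented city names:

```python
import math
import pandas as pd

df = pd.DataFrame({"city": ["Delhi", "Mumbai", "Delhi", "Chennai"]})

# Label/ordinal encoding: one integer code per category
df["city_code"] = df["city"].astype("category").cat.codes

# One hot encoding: one column per category
one_hot = pd.get_dummies(df["city"], prefix="city")

# Column counts: one hot needs one column per category,
# binary encoding only needs log2 as many
columns_needed = math.ceil(math.log2(100))  # 100 categories fit in 7 binary columns
```

Note that cat.codes starts at 0 and orders categories alphabetically; for truly ordinal data you would pass an explicit category order instead.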
Target-based encoders borrow information from the target variable and map it to the categories in a column. This technique is of great use when the number of categories is significantly high. In such cases, if classic encoders are used, they will increase the dimensionality many-fold and trouble the model unnecessarily. When the number of features is extremely high, a problem called the curse of dimensionality reduces the efficiency of machine learning models, since they’re still not adept enough to handle large volumes of features.
- Target Encoding: the mean of only those rows in the target variable that correspond to a certain category in the feature is mapped to that category. Data leakage and overfitting issues must be taken care of by keeping the test data separate.
- Weight of Evidence Encoding: Weight of Evidence (WoE) is a measure of the extent to which a piece of evidence (or a value) supports or negates a presumed hypothesis. WoE is usually used to encode continuous variables with the binning technique.
- Leave One Out Encoding: similar to target encoding, but it leaves out the value of the current sample when calculating the mean. This helps in avoiding the influence of outliers and anomalous data.
- James-Stein Encoding: takes the weighted average of the corresponding target means along with the mean of the entire target variable. This helps to reduce both overfitting and underfitting. The weights are decided based on the estimated variance of values. If the variance is high for the values which make up a mean, it will indicate that that particular mean is not very reliable.
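Plain target encoding reduces to a groupby-mean computed on the training split; leave-one-out simply excludes the current row from that mean. A minimal sketch, with toy data and column names invented for illustration:

```python
import pandas as pd

train = pd.DataFrame({
    "city":   ["A", "A", "B", "B", "B", "C"],
    "target": [1,   0,   1,   1,   0,   1],
})

# Target encoding: each category mapped to the mean target of its own rows.
# Compute the means on the training split only, to avoid data leakage
# into the test set.
means = train.groupby("city")["target"].mean()
train["city_encoded"] = train["city"].map(means)
```

At prediction time, the same stored means are mapped onto the test split; categories unseen in training are usually given the global target mean.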
Feature creation and aggregation
New features can be created from raw features. For example, if two features called ‘total time’ and ‘total distance’ are available, you can create the feature of speed. This gives a new perspective to the model, which can now detect a logical relation between speed and the target variable.
Similarly, you can do this to build other intuitive features, like kilometers binned on the basis of weekdays and weekends, or speed during rush hour. In case of deep learning models, neural network layers can identify complex relationships between raw features, so we don’t need to feed formula-based features to DL models.
Similarly, features can be appropriately aggregated to reduce data bulk, and also to create relevant information. For instance, in a time series model for rain forecast, the data has to be aggregated based on the day so that the total measure of rain per day can be assessed. Several records of rain measurements throughout the day will not add much value to the time series model.
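Both ideas, deriving speed from distance and time, and collapsing records per day, take one line each in pandas. The trip data below is invented for illustration:

```python
import pandas as pd

trips = pd.DataFrame({
    "day":            ["2021-01-01", "2021-01-01", "2021-01-02"],
    "total_distance": [10.0, 30.0, 5.0],   # kilometres
    "total_time":     [0.5,  1.0,  0.25],  # hours
})

# Feature creation: a new, more informative feature from two raw ones
trips["speed"] = trips["total_distance"] / trips["total_time"]  # km/h

# Aggregation: collapse several records into one value per day
daily_distance = trips.groupby("day")["total_distance"].sum()
```

The same groupby pattern handles the rain-forecast case: aggregate intra-day measurements into one total per day before feeding the time series model.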
Feature selection
Machine learning models are not adept enough to handle a large number of features, following the rule of “garbage in, garbage out”. In their paper, “An Introduction to Variable and Feature Selection,” Guyon and Elisseeff write:
The objective of variable selection is three-fold: improving the prediction performance of the predictors, providing faster and more cost-effective predictors, and providing a better understanding of the underlying process that generated the data.
Using irrelevant and redundant features will make the model unnecessarily complex, and might even lower prediction scores. The main advantages of feature selection are:
- Reduced processing time: the smaller the volume of data, the faster the training and prediction. More features mean more data for the model to learn from, and increased learning time.
- Improved accuracy: when a model has no irrelevant variables to consider, irrelevant factors can’t distort the model’s predictions.
- Reduced Overfitting: the lower the number of irrelevant variables and redundant data, the lesser the propagation of noise across the model’s decisions.
Feature selection can be conducted on many levels. Primarily, feature selection techniques can be segregated into Univariate and Multivariate techniques.
In every technique that falls under the umbrella of univariate selection, each feature is individually studied and its relationship with the target variable is taken into account. Here are a few univariate feature selection techniques:
Variance thresholding
Variance is the measure of change in a given feature. For example, if all the samples in a feature have the same values, it would mean that the variance of that feature is zero. It’s essential to understand that a column which doesn’t have enough variance is as good as a column with all ‘nan’ or missing values. If there’s no change in the feature, it’s impossible to derive any pattern from it. So, we check the variance and eliminate any feature that shows low or no variation.
Variance thresholds might be a good way to eliminate features in datasets, but in cases where there are minority classes (say, 5% 0s and 95% 1s), even good features can have very low variance and still end up being very strong predictors. So be advised: keep the target ratio in mind, and use correlation methods before eliminating features solely based on variance.
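scikit-learn's VarianceThreshold automates this elimination. A minimal sketch on an invented matrix whose first column never changes:

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([[1.0, 5.0],
              [1.0, 6.0],
              [1.0, 7.0],
              [1.0, 8.0]])  # first column is constant: zero variance

selector = VarianceThreshold(threshold=0.0)  # keep features with variance > 0
X_reduced = selector.fit_transform(X)        # only the second column survives
```

Raising the threshold removes low-variance (not just zero-variance) features, which is where the minority-class caveat above applies.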
Correlation
Correlation is a univariate analysis technique. It detects linear relationships between two variables. Think of correlation as a measure of proportionality: it simply measures how the increase or decrease of one variable affects the other.
One major disadvantage of correlation is that it can’t properly capture non-linear relationships, not even the strong ones. The images below demonstrate the pros and cons of correlation:
The second plot shows a strong sine-wave relationship between the dependent and independent variables. However, correlation (-0.389) can barely capture the strong dependence. On the other hand, we get a high correlation score (0.935) when there’s a linear dependency, even though it’s not as strong as the sine-dependency.
There are various correlation techniques, but essentially all of them try to track down the presence of linear relationships between two variables. There are three popular types of correlation techniques:
- Pearson Correlation: this is the simplest of all and comes with a lot of assumptions. The variables in question must be normally distributed, and have a linear relationship with each other. Also, the data must be equally distributed about the regression line. However, in spite of several assumptions, Pearson correlation works well for most data.
- Spearman Rank Correlation: it’s based on the assumption that variables are measured on an ordinal (rank-wise) scale. It tracks the variability of a variable that can be mapped to the other variable under observation.
- Kendall Rank Correlation: this method is a measure of the dependence between two variables and follows mostly the same assumption as Spearman’s, with the exception of measuring correlation based on probability instead of variability. It’s the difference between the probability that the two variables under observation are in order, and the probability that they’re not.
You have to run correlation tests not only with respect to the target (or dependent) variable, but also with respect to all the independent variables. For instance, if two variables have correlations of 90% and 85% with the target respectively, and a correlation of 95% with each other, then it will be beneficial to drop one of them, preferably the one with the lower correlation with the target. You need to do this to get rid of redundant information in the data, so that the model is less biased.
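All three correlation types are a method argument away in pandas. On an invented, perfectly linear pair, every method returns 1:

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3, 4, 5],
                   "y": [2, 4, 6, 8, 10]})  # y is perfectly linear in x

pearson = df["x"].corr(df["y"], method="pearson")
spearman = df["x"].corr(df["y"], method="spearman")
kendall = df["x"].corr(df["y"], method="kendall")

# For redundancy checks between independent variables, df.corr()
# returns the full pairwise correlation matrix in one call.
```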
Mutual information
Mutual information solves the problem caused by correlation. It effectively captures any non-linear relationship between given variables. This helps us eliminate the features which show no significant relationship with the target variable, helping us strengthen the predictive model. The idea behind mutual information is information gain. It asks the question: how much information on one variable can be extracted from another? Or, how much movement (increase or decrease) of a variable can be tracked using another variable?
For the same relationships, sine-wave and linear, mutual information scores are as follows:
Mutual information correctly suggests a strong sine-wave relationship, making it more competent when it comes to capturing information about non-linear relationships. Mutual information for sine relationship is 0.781, whereas correlation for the same was -0.389.
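The sine-wave comparison can be reproduced with scikit-learn's mutual_info_regression on synthetic data (exact scores vary with the sample and estimator settings):

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 500)
y = np.sin(x)  # strong, but entirely non-linear, dependence

corr = np.corrcoef(x, y)[0, 1]  # correlation barely registers the relationship
mi = mutual_info_regression(x.reshape(-1, 1), y, random_state=0)[0]  # MI scores it highly
```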
Chi-square test
The Chi-square test is a statistical tool that can be used on groups of categorical features to evaluate the likelihood of association, or correlation, with the help of frequency distributions.
Multivariate selection techniques
Also referred to as wrapper methods, multivariate selection techniques take a group of features at a time and test the group’s competence in predicting the target variable.
Forward Selection: this method starts with a minimal number of features and measures how well the set performs at prediction. With every iteration, it adds another variable, selected on the basis of best performance (compared to all other variables). The set with the best performance among all sets is finalized as the feature set.
Backward Elimination: similar to forward selection, but in the reverse direction. Backward elimination starts with all possible features and measures performance at every iteration, eliminating extraneous variables, or variables that perform poorly when paired with the rest of the feature set.
Backward Elimination is usually the preferred method compared to forward selection. This is because in forward selection, the suppression effect can get in the way. Suppression effect occurs when one variable can be utilized optimally only when another variable is held constant. In other words, with a newly added feature, it might so happen that an already existing feature in the set is rendered insignificant. To avoid this, p-value can be used, even though it’s a somewhat controversial statistical tool.
Recursive Feature Elimination: similar to backward elimination, but it replaces the iterative approach with a recursive approach. It’s a greedy optimization algorithm, and works with a model to find the optimized feature set.
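A minimal sketch of RFE with scikit-learn, on synthetic data where only 2 of 5 features carry signal:

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic regression data: 5 features, only 2 genuinely informative
X, y = make_regression(n_samples=200, n_features=5, n_informative=2,
                       noise=0.1, random_state=0)

rfe = RFE(estimator=LinearRegression(), n_features_to_select=2)
rfe.fit(X, y)
mask = rfe.support_  # boolean mask over the columns the search retained
```

RFE repeatedly fits the estimator and prunes the weakest feature(s) until the requested number remains, which is the recursive twist on backward elimination described above.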
Linear Discriminant Analysis (LDA): helps you find a linear combination of features that separates two or more classes of a categorical variable.
ANOVA: aka ‘analysis of variance’, it’s similar to LDA, but uses one or more categorical features and one continuous target. It’s a statistical test for whether the means of different groups are similar or not.
Embedded methods: some machine learning models come with in-built feature selection methods. They’re engineered to apply coefficients to features based on the performance of the features in terms of predicting the target. Poorly performing variables get very low or zero as their coefficient, which omits them from the learned equation (almost) entirely. Examples of embedded methods are Ridge and Lasso models, which are linear models used in regression problems.
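A minimal sketch of the embedded idea with Lasso, on synthetic data where only the first of three features matters; the L1 penalty drives the other coefficients to (near) zero:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 3.0 * X[:, 0] + 0.01 * rng.normal(size=200)  # features 1 and 2 are pure noise

lasso = Lasso(alpha=0.1).fit(X, y)
coefs = lasso.coef_  # irrelevant features end up with (near) zero coefficients
```

Inspecting coef_ after fitting is the embedded selection step: features whose coefficients are zeroed out have effectively been dropped by the model itself.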
With that, we’ve reached the end of this guide. Please note that every step mentioned here has several more subtopics that deserve their own articles. The steps and techniques we’ve explored are the most used and popular working methods.
Once you’re done with data preprocessing, your data can be split into training, testing and validation sets for model fitting and model prediction phases. Thanks for reading, and good luck with your models!