
The Best Feature Engineering Tools

When it comes to predictive models, the data always needs to be well represented. In the real world, datasets are raw and need plenty of work. If a model is to understand a dataset for supervised or unsupervised learning, there are several operations you need to perform first, and this is where feature engineering comes in.

In this article, we’ll discuss:

  • What is feature engineering
  • Types of problems in feature engineering
  • Open source tools for feature engineering
  • Comparison of feature engineering tools

Feature engineering examples

Let’s start with a couple of examples. Here, we have a categorical feature column with fruit names: ‘banana’, ‘pineapple’, and ‘unknown’. We can label encode it:

| Fruits | Fruits (encoded) |
| --- | --- |
| banana | 1 |
| pineapple | 2 |
| banana | 1 |
| unknown | 0 |
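Here’s a minimal sketch of this label encoding with pandas, using an explicit mapping so the codes match the table above:

import pandas as pd

df = pd.DataFrame({'Fruits': ['banana', 'pineapple', 'banana', 'unknown']})

# Explicit mapping so that 'unknown' gets the code 0, as in the table above
mapping = {'unknown': 0, 'banana': 1, 'pineapple': 2}
df['Fruits'] = df['Fruits'].map(mapping)
print(df)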

However, many predictive models, especially linear ones, understand this feature better if we decompose it into three separate features by one-hot encoding it:

| Fruits | f_banana | f_pineapple | f_unknown |
| --- | --- | --- | --- |
| banana | 1 | 0 | 0 |
| pineapple | 0 | 1 | 0 |
| banana | 1 | 0 | 0 |
| unknown | 0 | 0 | 1 |
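A minimal sketch of the same decomposition with pandas’ get_dummies (the 'f' prefix reproduces the column names above):

import pandas as pd

df = pd.DataFrame({'Fruits': ['banana', 'pineapple', 'banana', 'unknown']})

# One binary column per category: f_banana, f_pineapple, f_unknown
one_hot = pd.get_dummies(df['Fruits'], prefix='f')
print(one_hot.astype(int))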

In the first example, we took a feature that made no sense to the machine learning algorithm and transformed it into numbers. Now, in the second example, we’ll perform a more complex operation. Let’s take the famous Titanic dataset, where, based on certain attributes, we predict whether a passenger survived or not.

We have a column called ‘Name’. Names contain titles like ‘mr.’, ‘mrs.’, ‘lord’, or ‘master’, which might have impacted a person’s chances of survival. We can use this information to engineer a new feature based on the titles in passenger names.

Let’s see how we can do this with a small block of code:

import re
from collections import Counter

import pandas as pd

df = pd.read_csv('./train.csv')

# Split every passenger name into individual words
words = []
for name in df['Name']:
    words.extend(name.split())

# Strip non-alphabetic characters and lowercase each word
words = [' '.join(re.findall(r"[a-zA-Z]+", word)).lower() for word in words]

# Count each word's occurrences and keep only the frequent ones
counts = Counter(words)
print({word: count for word, count in counts.items() if count > 50})

Output:

{'miss': 182, 'mr': 521, 'mrs': 129, 'william': 64}

We can see that the occurrences of ‘miss’, ‘mr’, and ‘mrs’ are high. We can take these three titles plus ‘unknown’ (or, if you want to do more work, find rarer titles like ‘master’ or ‘lord’) and make a categorical feature out of them. A high number of occurrences means many data points carry these values, so there can be some relation between these titles and the target column. We can also deduce, for example, that women had a higher survival rate, or that a person with the title ‘lord’ was more likely to survive. And so this attribute, which was just a bunch of names, is now an important feature.
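Continuing from the code above, here’s a minimal sketch of turning those titles into a categorical feature. The extract_title helper and the ‘Title’ column name are mine, just for illustration:

# Titles that occur frequently enough to keep as their own category
frequent_titles = {'mr', 'mrs', 'miss'}

def extract_title(name):
    # Scan the cleaned, lowercased words of a name for a frequent title
    for word in re.findall(r"[a-zA-Z]+", name.lower()):
        if word in frequent_titles:
            return word
    return 'unknown'

df['Title'] = df['Name'].apply(extract_title)
print(df['Title'].value_counts())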

Now that you’ve seen it in practice, let’s move on to a bit of theory behind feature engineering.

What is feature engineering?

Feature Engineering is the art of creating features from raw data, so that predictive models can deeply understand the dataset and perform well on unseen data. Feature engineering is not a generic method that you can apply on all datasets in the same way. Different datasets require different approaches. 

The representation of datasets for machine learning algorithms is different in each case. In case of images, important features can be shapes, lines, and edges. For audio, it can be certain words that make a difference. 

A good example of engineering features from images is autoencoders, which automatically learn what kind of features the model will understand best. An autoencoder takes an image as input and tries to reproduce the same image as output, so the layers in between learn a latent representation of the image. These latent representations are better understood by neural networks and can be used to train better models.
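To make that concrete, here’s a minimal sketch of a dense autoencoder in Keras. The 784-dimensional input (a flattened 28×28 image) and the layer sizes are my assumptions for illustration:

from tensorflow import keras

# Encoder: compress a flattened 28x28 image into 32 latent values
inputs = keras.Input(shape=(784,))
latent = keras.layers.Dense(32, activation='relu')(inputs)

# Decoder: reconstruct the original 784 values from the latent vector
outputs = keras.layers.Dense(784, activation='sigmoid')(latent)

autoencoder = keras.Model(inputs, outputs)
autoencoder.compile(optimizer='adam', loss='binary_crossentropy')

# The input doubles as the training target, which forces the middle
# layer to learn a compact latent representation:
# autoencoder.fit(x_train, x_train, epochs=10, batch_size=256)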

Types of problems in feature engineering

Before going into tools for feature engineering, we’ll look at some of the operations that we can perform. Just remember that the best approach depends on the problem statement.

Feature extraction:

Feature extraction is the process of making new features that are composites of the existing ones. One of the best examples of feature extraction is dimensionality reduction.

There can be millions of features in a dataset with audio, images, or even a tabular one. While a lot of them can be redundant, there is also the problem of model complexity. 

For some machine learning algorithms, training time grows steeply as the number of features increases. In such cases, we use feature extraction or dimensionality reduction.

There are algorithms like PCA, t-SNE, and others that can be used to reduce feature dimensionality. They combine different features using mathematical operations, while trying to keep the information intact.

Let’s see an example of feature extraction while using PCA in Scikit-learn:

import pandas as pd
from sklearn.decomposition import PCA

df = pd.DataFrame([[2,4,6,8], [4,8,12,16]])
print(df)

Output:

   0  1   2   3
0  2  4   6   8
1  4  8  12  16

dr = PCA(n_components=2)
reduced_df = dr.fit_transform(df)
print(reduced_df)

Output:

array([[ 5.47722558e+00,  6.66133815e-16],
       [-5.47722558e+00,  6.66133815e-16]])

In the code above we used PCA to reduce the dimension of the above dataframe from 4 to 2. 

Feature selection:

Some features are more important, and others are so redundant that they don’t affect the model at all. We can score them based on a chosen metric, and arrange them in order of importance. Then, eliminate the unimportant ones. 

This can also be a recursive process: after feature selection, we train the model, calculate the accuracy score, and then do feature selection again, iterating until we find the final number of features to keep in the dataset. This process is commonly called recursive feature elimination.
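Scikit-learn implements this idea as RFE; here’s a minimal sketch on toy regression data:

from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Toy data: 10 features, of which only 5 are informative
X, y = make_regression(n_samples=100, n_features=10, n_informative=5, random_state=0)

# Recursively drop the weakest feature until 5 remain
selector = RFE(LinearRegression(), n_features_to_select=5)
selector.fit(X, y)
print(selector.support_)   # boolean mask of the kept features
print(selector.ranking_)   # rank 1 = selected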

Some of the commonly used feature scoring functions are: 

  • F-score, 
  • mutual information score, 
  • Chi-square score. 

F-score captures the linear relationship between a feature and the target column, and scores features accordingly. Using the score of each feature, we can eliminate the ones with a lower F-score. Similarly, the mutual information score can capture both linear and non-linear relationships between a feature and the target column, but it needs more samples.

Chi-square is a statistical test of the independence of two events. A low chi-square value suggests that the two variables (feature and target) are independent, while a high value suggests they are dependent, which makes the feature important.
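As a minimal sketch, scikit-learn’s SelectKBest can rank features by their chi-square score (here on the Iris dataset, whose features are non-negative, as chi2 requires):

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)

# Keep the 2 features with the highest chi-square score
selector = SelectKBest(chi2, k=2)
X_new = selector.fit_transform(X, y)
print(selector.scores_)  # one score per original feature
print(X_new.shape)       # (150, 2)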

The above are univariate feature selection methods. There are also algorithms based on trees, which can calculate impurity-based feature importance, or on lasso regression, which shrinks the coefficients of unimportant features towards zero.

Features can also be dropped based on the correlation between them. If two features are highly correlated, it makes sense to drop one of them as we’ll reduce the dimensionality of the dataset. 
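Here’s a minimal sketch of that idea: compute the absolute correlation matrix and drop one feature from each highly correlated pair (the 0.95 threshold is my choice for illustration):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [2, 4, 6, 8, 10],  # perfectly correlated with 'a'
    'c': [5, 3, 8, 1, 7],
})

# Absolute correlations; keep only the upper triangle to avoid
# flagging both features of a pair
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop one feature from every pair correlated above 0.95
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
print(to_drop)  # ['b']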

Now, let’s look at a really simple F-score example:

import pandas as pd
from sklearn.feature_selection import f_regression

df = pd.DataFrame([[1,12,2], [2,34,4], [3,87,6]])
print(df)

Output:

   0   1  2
0  1  12  2
1  2  34  4
2  3  87  6

scores, _ = f_regression(df.iloc[:,0:2], df.iloc[:,-1])
print(scores)

Output:

[4.50359963e+15 1.75598335e+01]

We can see the huge difference between the F-scores of columns ‘0’ and ‘1’ with respect to the target column ‘2’. Notice how the dataframe was created: each value in column ‘0’ is half of the respective value in column ‘2’, while column ‘1’ contains fairly random values. The F-score is high between column ‘0’ and the target column ‘2’, but low between column ‘1’ and column ‘2’. We can say that column ‘0’ explains the target column ‘2’ better, hence the higher score.

Feature construction:

Some features only make sense to predictive models after some work, like we saw in the first and second examples. This is called feature construction: building more powerful features from the existing features in a dataset.

For example, we might have the domain knowledge for some feature that if the value is high enough, it falls into a different category than if it’s lower. 

Let’s say we have the count of trees in an area, with a maximum of 300 trees. We can categorize:

  • 0-100 trees as 1, 
  • 101-200 trees as 2,
  • 201-300 trees as 3. 

Categorizing them like this would remove the noise.
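A minimal sketch of this binning with pandas (the tree_count column and its values are made up for illustration):

import pandas as pd

df = pd.DataFrame({'tree_count': [45, 150, 270, 90, 210]})

# Bin the raw counts into the three categories described above
df['tree_category'] = pd.cut(df['tree_count'], bins=[0, 100, 200, 300], labels=[1, 2, 3])
print(df)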

We can aggregate features or decompose them (like we did with one-hot encoding). Either way, we are creating new, better features out of the existing ones.

Tools for feature engineering

If you’re working on a very specific problem, with a dataset in a dedicated project, then I would suggest working on the data manually. But for generic problems, not everyone has the time to sit and engineer features. So, in this section, we’ll look at some of the tools that automate feature engineering.

Featuretools


Featuretools is one of the most popular libraries for automated feature engineering. It supports a lot of functionalities, including:

  • feature selection, 
  • feature construction, 
  • using relational databases to create new features,
  • etc.

Apart from these, it provides a whole lot of primitives, which are basic transformations such as max, sum, and mode. These are useful operations: say you have to find the mean time between events in a log file, you can use the primitives to do that.

But one of the most important aspects of featuretools is that it uses deep feature synthesis (DFS) to construct features. 

Let’s understand what DFS is. This algorithm needs entities. Think of entities as multiple interconnected data tables. Then it stacks primitives, and performs transformations on the columns. 

These operations can mimic the kind of transformations that humans do. The length of the stack of primitives is considered the depth, hence the name deep feature synthesis. Let’s look at an example:

Fig. 1 – An example of DFS in action | Source

In this figure, we start with a table where a price is defined for each “ProductID”. The first operation joins this table to the table with “OrderID”s. In the next transformation, the ‘sum’ primitive aggregates prices per unique “OrderID”, and the corresponding “CustomerID” is picked from another table for each “OrderID”. In the third transformation, the “OrderID”s are removed and the “CustomerID”s are made unique using the average operation, which gives an average price for each customer. Here, DFEAT are the direct features and RFEAT are the relational features.

This is a great library for creating baseline models, since it can mimic what humans do manually. Once the baseline is achieved, you’ll know which direction you want to move in.

Let’s solve one example using DFS to understand the Featuretools API:

import featuretools as ft
es = ft.demo.load_retail()
print(es)

 
Entityset: demo_retail_data
  Entities:
    order_products [Rows: 1000, Columns: 7]
    products [Rows: 606, Columns: 3]
    orders [Rows: 67, Columns: 5]
    customers [Rows: 50, Columns: 2]
  Relationships:
    order_products.product_id -> products.product_id
    order_products.order_id -> orders.order_id
    orders.customer_name -> customers.customer_name

I loaded the demo retail data from Featuretools. Now that we have the entity set, let’s apply DFS and get some new features:

feature_matrix, feature_defs = ft.dfs(entityset=es,
                                      target_entity="orders",
                                      agg_primitives=["sum", "mean"],
                                      max_depth=3)
print(feature_matrix)

The target_entity argument defines which entity we create new features for, and agg_primitives are the transformations that will be applied. More depth means more features. You can run feature selection after this step to find the best features.

Featuretools is by far the best feature engineering tool I’ve come across. There are many papers on various other methods, but most of them don’t have open-source implementations yet.

AutoFeat

AutoFeat is another good open-source feature engineering library. It automates feature synthesis, feature selection, and fitting a linear machine learning model.

The algorithm behind AutoFeat is quite simple. It generates non-linear features, for example log(x), x², or x³, and uses different exponents (negative, positive, and fractional) while creating the feature space, which results in exponential growth of the feature space. Categorical features are converted into one-hot encoded features.

With so many generated features, it’s necessary to select the important ones. AutoFeat first removes highly correlated features, then relies on L1 regularization, dropping features with low coefficients (features with low weights after training a linear/logistic regression with L1 regularization). This process of removing correlated features and pruning with L1 regularization is repeated several times until only a few features are left: the ones that actually describe the dataset.
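As a quick, hedged sketch of the API, assuming autofeat’s AutoFeatRegressor (see the linked notebook below for a complete example):

import numpy as np
from autofeat import AutoFeatRegressor

# Toy data where the target depends non-linearly on the inputs
X = np.random.rand(100, 2)
y = np.log(X[:, 0] + 0.1) + X[:, 1] ** 3

# feateng_steps controls how many rounds of feature synthesis run
model = AutoFeatRegressor(feateng_steps=2)
X_new = model.fit_transform(X, y)  # dataframe with the engineered features
print(X_new.columns)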

Refer to the notebook in this link for an example of AutoFeat.

TSFresh


Next on our list is TSFresh, a library focused on time series data. It includes both feature synthesis and feature selection. The library contains more than 60 feature extractors, including operations like the global maximum, standard deviation, and Fast Fourier transform coefficients. These transformations can turn 6 original features into 1,200, which is why the library also ships a feature selector that removes the redundant ones. It’s a really useful library for time series data.
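Here’s a minimal sketch of the extraction API, assuming a long-format dataframe with one row per time step:

import pandas as pd
from tsfresh import extract_features

# Two short time series in long format, identified by the 'id' column
df = pd.DataFrame({
    'id':    [1, 1, 1, 2, 2, 2],
    'time':  [0, 1, 2, 0, 1, 2],
    'value': [1.0, 2.0, 3.0, 5.0, 4.0, 3.0],
})

# Extracts hundreds of features (max, std, FFT coefficients, ...) per series
features = extract_features(df, column_id='id', column_sort='time')
print(features.shape)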

You can find a good quick start in the documentation.

FeatureSelector

FeatureSelector is a Python library for feature selection. It’s a small library with pretty basic options: it identifies features to drop based on missing values, single unique values, collinearity, and zero or low importance. It uses tree-based learning algorithms from ‘lightgbm’ to calculate feature importance. The library also includes a number of visualization methods, which can help you get more insights into the dataset.
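A hedged sketch of the typical workflow, based on the library’s README (method names and thresholds may differ across versions):

import numpy as np
import pandas as pd
from feature_selector import FeatureSelector

# Toy data: 'b' duplicates 'a', and 'const' has a single unique value
a = np.random.rand(100)
train = pd.DataFrame({'a': a, 'b': 2 * a, 'const': 1.0})
labels = pd.Series((a > 0.5).astype(int))

fs = FeatureSelector(data=train, labels=labels)

# Flag problematic features, one criterion at a time
fs.identify_missing(missing_threshold=0.6)
fs.identify_single_unique()
fs.identify_collinear(correlation_threshold=0.98)
fs.identify_zero_importance(task='classification', eval_metric='auc')
fs.identify_low_importance(cumulative_importance=0.99)

# Drop everything flagged by any of the methods above
train_removed = fs.remove(methods='all')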

Here’s a link to the example code for the library.

OneBM

OneBM, or One Button Machine, works with relational data. It starts by joining different tables incrementally and identifies the type of each feature, for example time series, categorical, or numerical. Then it applies a set of pre-defined feature engineering operations.

The downside is that there is no open-source implementation for OneBM.

Cognito

Cognito is promising in theory, but unfortunately there’s no open-source code available. The concept is quite similar to TSFresh: it applies a bunch of transformations recursively on features, and since this exponentially increases the dimensionality of the data, feature selection is used afterwards.

Comparison

To finish, let’s compare these libraries so you can see which will fit your work:

| Tool | Type of database supported | Feature engineering | Feature selection | Open-source implementation | Time series support |
| --- | --- | --- | --- | --- | --- |
| Featuretools | Relational tables | Yes | Yes | Yes | Yes |
| AutoFeat | Single table | Yes | Yes | Yes | No |
| TSFresh | Single table | Yes | Yes | Yes | Yes |
| FeatureSelector | Single table | No | Yes | Yes | No |
| OneBM | Relational tables | Yes | Yes | No | Yes |
| Cognito | Single table | Yes | Yes | No | No |

Featuretools can fulfill most of your requirements. TSFresh works specifically on time series data, so I would prefer to use it while working with such datasets. 

Conclusion

I hope that now you understand feature engineering, and know which tools you want to try out next. 

Feature engineering is still one of those problems that are hard to automate. Even though there are libraries, the best results are achieved when features are engineered manually. Feature engineering is usually the least discussed problem, but it’s a really important one. 

It’s difficult to know which representations of features predictive models understand best. Autoencoders and restricted Boltzmann machines are a step towards learning such representations automatically. The future will surely bring interesting developments in this area.

