Feature Selection Methods and How to Choose Them

Have you ever found yourself sitting in front of the screen wondering what kind of features will help your machine learning model learn its task best? I bet you have. Data preparation tends to consume vast amounts of data scientists’ and machine learning engineers’ time and energy, and making the data ready to be fed to the learning algorithms is no small feat. 

One of the crucial steps in the data preparation pipeline is feature selection. You might know the popular adage: garbage in, garbage out. What you feed your models with is at least as important as the models themselves, if not more so.

In this article, we will:

  • look at the place of feature selection among other feature-related tasks in the data preparation pipeline,
  • discuss the multiple reasons why it is so crucial for any machine learning project’s success,
  • go over different approaches to feature selection and discuss some tricks and tips to improve their results,
  • take a glimpse under the hood of Boruta, the state-of-the-art feature selection algorithm, to check out a clever way to combine different feature selection methods,
  • and look into how feature selection is leveraged in the industry.

Let’s dive in!

What is feature selection, and what is it not?

Let’s kick off by defining our object of interest. 

What is feature selection? In a nutshell, it is the process of selecting the subset of features to be used for training a machine learning model. 

This is what feature selection is, but it is equally important to understand what feature selection is not: it is neither feature extraction/feature engineering, nor is it dimensionality reduction.

Feature extraction and feature engineering are two terms describing the same process of creating new features from the existing ones based on domain knowledge. This yields more features than were originally there, and it should be performed before feature selection. First, we can do feature extraction to come up with many potentially useful features, and then we can perform feature selection in order to pick the best subset that will indeed improve the model’s performance.

Dimensionality reduction is yet another concept. It is somewhat similar to feature selection as both aim at reducing the number of features. However, they differ significantly in how they achieve this goal. While feature selection chooses a subset of original features to keep and discards others, dimensionality reduction techniques create projections of original features onto a fewer-dimensional space, thus creating a completely new set of features. Dimensionality reduction, if desired, should be run after feature selection, but in practice, it is either one or the other.

Now we know what feature selection is and how it corresponds to other feature-related data preparation tasks. But why do we even need it?

7 reasons why we need feature selection

A popular claim is that modern machine learning techniques do well without feature selection. After all, a model should be able to learn that particular features are useless, and it should focus on the others, right? 

Well, this reasoning makes sense to some extent. Linear models could, in theory, assign a weight of zero to useless features, and tree-based models should learn quickly not to make splits on them. In practice, however, many things can go wrong with training when the inputs are irrelevant or redundant – more on these two terms later. On top of this, there are many other reasons why simply dumping all the available features into the model might not be a good idea. Let’s look at the seven most prominent ones.

1. Irrelevant and redundant features

Some features might be irrelevant to the problem at hand. This means they have no relation with the target variable and are completely unrelated to the task the model is designed to solve. Discarding irrelevant features will prevent the model from picking up on spurious correlations they might carry, thus fending off overfitting.

Redundant features are a different animal, though. Redundancy implies that two or more features share the same information, and all but one can be safely discarded without information loss. Note that an important feature can also be redundant in the presence of another relevant feature. Redundant features should be dropped, as they might pose many problems during training, such as multicollinearity in linear models.

2. Curse of dimensionality

Feature selection techniques are especially indispensable in scenarios with many features but few training examples. Such cases suffer from what is known as the curse of dimensionality: in a very high-dimensional space, each training example is so far from all the other examples that the model cannot learn any useful patterns. The solution is to decrease the dimensionality of the features space, for instance, via feature selection.
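A quick way to get a feel for this effect is to look at how distances behave as the number of dimensions grows. The sketch below uses uniformly distributed random points and a hypothetical helper, `relative_contrast`, both chosen only for illustration. It compares how much farther the farthest point is than the nearest one.

```python
import numpy as np

rng = np.random.default_rng(0)


def relative_contrast(n_dims, n_points=200):
    """(max - min) / min distance from the first point to all the others."""
    points = rng.uniform(size=(n_points, n_dims))
    distances = np.linalg.norm(points[1:] - points[0], axis=1)
    return (distances.max() - distances.min()) / distances.min()


contrast_low = relative_contrast(n_dims=2)
contrast_high = relative_contrast(n_dims=1000)
print(contrast_low, contrast_high)
```

In two dimensions, the nearest neighbor is much closer than the farthest one. In a thousand dimensions, all points end up almost equally far away, so the notion of a "similar" example carries little information.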

3. Training time

The more features, the more training time. The specifics of this trade-off depend on the particular learning algorithm being used, but in situations where retraining needs to happen in real-time, one might need to limit oneself to a couple of best features.

4. Deployment effort

The more features, the more complex the machine learning system becomes in production. This poses multiple risks, including but not limited to high maintenance effort, entanglement, undeclared consumers, or correction cascades.

5. Interpretability

With too many features, we lose the explainability of the model. While not always the primary modeling goal, interpreting and explaining the model’s results are often important and, in some regulated domains, might even constitute a legal requirement. 

6. Occam’s Razor

According to this so-called law of parsimony, simpler models should be preferred over the more complex ones as long as their performance is the same. This also has to do with the machine learning engineer’s nemesis, overfitting. Less complex models are less likely to overfit the data.

7. Data-model compatibility

Finally, there is the issue of data-model compatibility. While, in principle, the approach should be data-first, which means collecting and preparing high-quality data and then choosing a model which works well on this data, real life may have it the other way around. 

You might be trying to reproduce a particular research paper, or your boss might have suggested using a particular model. In this model-first approach, you might be forced to select features that are compatible with the model you set out to train. For instance, many models don’t work with missing values in the data. Unless you know your imputation methods well, you might need to drop the incomplete features.

Different approaches to feature selection

All the different approaches to feature selection can be grouped into four families of methods, each coming with its pros and cons. There are unsupervised and supervised methods. The latter can be further divided into the wrapper, filter, and embedded methods. Let’s discuss them one by one.

Feature selection methods | Source: author

Unsupervised feature selection methods

Just like unsupervised learning looks for patterns in unlabeled data, unsupervised feature selection methods do not make use of any labels. In other words, they don’t need access to the target variable of the machine learning model.

How can we claim a feature to be unimportant for the model without analyzing its relation to the model’s target, you might ask. Well, in some cases, this is possible. We might want to discard the features with:

  • Zero or near-zero variance. Features that are (almost) constant provide little information to learn from and thus are irrelevant.
  • Many missing values. While dropping incomplete features is not the preferred way to handle missing data, it is often a good start, and if too many entries are missing, it might be the only sensible thing to do since such features are likely inconsequential.
  • High multicollinearity. Multicollinearity means a strong correlation between different features, which might signal redundancy issues.

Unsupervised methods in practice

Let’s now discuss the practical implementation of unsupervised feature selection methods. Just like most other machine learning tasks, feature selection is served very well by the scikit-learn package, and in particular by its `sklearn.feature_selection` module. However, in some cases, one needs to reach out to other places. Here, as well as for the remainder of the article, let’s denote by `X` an array or data frame with all potential features as columns and observations as rows, and by `y` the vector of targets.

  • The `sklearn.feature_selection.VarianceThreshold` transformer will by default remove all zero-variance features. We can also pass a threshold as an argument to make it remove features whose variance is lower than the threshold.
from sklearn.feature_selection import VarianceThreshold


sel = VarianceThreshold(threshold=0.05)
X_selection = sel.fit_transform(X)
  • In order to drop the columns with missing values, pandas’ `.dropna(axis=1)` method can be used on the data frame.
X_selection = X.dropna(axis=1)
  • To remove features with high multicollinearity, we first need to measure it. A popular multicollinearity measure is the Variance Inflation Factor or VIF. It is implemented in the statsmodels package.
from statsmodels.stats.outliers_influence import variance_inflation_factor


vif_scores = [variance_inflation_factor(X.values, feature) for feature in range(len(X.columns))]

By convention, columns with a VIF larger than 10 are considered as suffering from multicollinearity, but another threshold may be chosen if it seems more reasonable.
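Computing the scores is only half of the job: we still need to decide which columns to drop. One possible strategy, sketched below, is to repeatedly remove the column with the highest VIF until all remaining scores fall below the threshold. Note the assumptions: `drop_high_vif` is a hypothetical helper, and instead of the statsmodels function it computes VIFs as the diagonal of the inverse correlation matrix, which corresponds to regressing each feature on the others with an intercept. Its values can therefore differ from the statsmodels call above, which uses the columns exactly as passed.

```python
import numpy as np
import pandas as pd


def drop_high_vif(X, threshold=10.0):
    """Iteratively drop the column with the highest VIF until all are below threshold."""
    X = X.copy()
    while X.shape[1] > 1:
        # VIF of each column = diagonal of the inverse correlation matrix
        corr = np.corrcoef(X.values, rowvar=False)
        vif = pd.Series(np.diag(np.linalg.inv(corr)), index=X.columns)
        if vif.max() <= threshold:
            break
        X = X.drop(columns=vif.idxmax())
    return X


# Synthetic demo data (illustrative only)
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(300, 3)), columns=["a", "b", "c"])
demo["a_copy"] = demo["a"] + rng.normal(scale=0.01, size=300)  # near-duplicate of "a"

kept = drop_high_vif(demo)
print(list(kept.columns))
```

Dropping one column at a time matters here: removing a single member of a collinear pair is usually enough to bring the VIF of the other one back down.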

Wrapper feature selection methods

Wrapper methods refer to a family of supervised feature selection methods that use a model to score different subsets of features and finally select the best one. Each new subset is used to train a model whose performance is then evaluated on a hold-out set. The feature subset which yields the best model performance is selected. A major advantage of wrapper methods is the fact that they tend to provide the best-performing feature set for the particular chosen type of model. 

At the same time, however, this comes with a limitation. Wrapper methods are likely to overfit to the model type, and the feature subsets they produce might not generalize should one want to try them with a different model.

Another significant disadvantage of wrapper methods is their large computational needs. They require training a large number of models, which might require some time and computing power. 

Popular wrapper methods include:

  • Backward selection, in which we start with a full model comprising all available features. In subsequent iterations, we remove one feature at a time, always the one whose removal yields the largest gain (or the smallest loss) in a model performance metric, until we reach the desired number of features.
  • Forward selection, which works in the opposite direction: we start from a null model with zero features and add them greedily one at a time to maximize the model’s performance.
  • Recursive Feature Elimination, or RFE, which is similar in spirit to backward selection. It also starts with a full model and iteratively eliminates the features one by one. The difference is in the way the features to discard are chosen. Instead of relying on a model performance metric from a hold-out set, RFE makes its decision based on feature importance extracted from the model. This could be feature weights in linear models, impurity decrease in tree-based models, or permutation importance (which is applicable to any model type).
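To illustrate the greedy logic behind forward selection, here is a bare-bones sketch written by hand. It rests on a few assumptions: `forward_selection` is a hypothetical helper, the scoring model is a plain linear regression, and candidates are compared using 5-fold cross-validated R² rather than a single hold-out set.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score


def forward_selection(X, y, n_features_to_select):
    """Greedily add the feature that most improves the cross-validated score."""
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < n_features_to_select:
        # Score every candidate subset: already selected features plus one new one
        scores = {
            candidate: cross_val_score(
                LinearRegression(), X[:, selected + [candidate]], y, cv=5
            ).mean()
            for candidate in remaining
        }
        best = max(scores, key=scores.get)
        selected.append(best)
        remaining.remove(best)
    return selected


# Synthetic demo data: only columns 1 and 4 drive the target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 6))
y_demo = 4 * X_demo[:, 1] + 2 * X_demo[:, 4] + rng.normal(scale=0.1, size=200)

chosen = forward_selection(X_demo, y_demo, n_features_to_select=2)
print(chosen)
```

The nested loop makes the computational cost of wrappers easy to see: every candidate feature in every iteration requires training and evaluating a fresh set of models.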

Wrapper methods in practice

When it comes to wrapper methods, scikit-learn has got us covered:

  • Backward and forward feature selection can be implemented with the SequentialFeatureSelector transformer. For instance, in order to use the k-Nearest-Neighbor classifier as the scoring model in forward selection, we could use the following code snippet:
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=3)
sfs = SequentialFeatureSelector(knn, n_features_to_select=3, direction="forward")
sfs.fit(X, y)
X_selection = sfs.transform(X)
  • Recursive Feature Elimination is implemented in a very similar fashion. Here is a snippet implementing RFE based on feature importance from a Support Vector Classifier.
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

svc = SVC(kernel="linear")
rfe = RFE(svc, n_features_to_select=3)
rfe.fit(X, y)
X_selection = rfe.transform(X)

Filter feature selection methods

Filter methods are another member of the supervised family. They can be thought of as a simpler and faster alternative to wrappers. In order to evaluate the usefulness of each feature, they simply analyze its statistical relation with the model’s target, using measures such as correlation or mutual information as a proxy for the model performance metric.

Not only are filter methods faster than wrappers, but they are also more general since they are model-agnostic; they won’t overfit to any particular algorithm. They are also pretty easy to interpret: a feature is discarded if it has no statistical relationship to the target.

On the other hand, however, filter methods have one major drawback. They look at each feature in isolation, evaluating its relation to the target. This makes them prone to discarding useful features that are weak predictors of the target on their own but add a lot of value to the model when combined with other features.
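A classic toy illustration of this drawback is a target built as the XOR of two binary features. The data below is synthetic, and the decision tree is just one convenient way to show that the two features together carry all the information.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
x1 = rng.integers(0, 2, size=1000)
x2 = rng.integers(0, 2, size=1000)
y = x1 ^ x2  # the target is the XOR of two binary features

# Each feature on its own is (almost) uncorrelated with the target
corr_x1 = abs(np.corrcoef(x1, y)[0, 1])
corr_x2 = abs(np.corrcoef(x2, y)[0, 1])

# Together, however, they determine the target exactly
X_pair = np.column_stack([x1, x2])
accuracy = DecisionTreeClassifier(random_state=0).fit(X_pair, y).score(X_pair, y)

print(corr_x1, corr_x2, accuracy)
```

A correlation-based filter would likely discard both features here, even though a model that sees them together can fit the target perfectly.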

Filter methods in practice

Let’s now take a look at implementing various filter methods. These will need some more glue code to implement. First, we need to compute the desired correlation measure between each feature and the target. Then, we would sort all features according to the results and keep the desired number (top-K or top-30%) of the ones with the strongest correlation. Luckily, scikit-learn provides some utilities to help in this endeavour.

  • To keep the top 2 features with the strongest Pearson correlation with the target, we can run:
from sklearn.feature_selection import r_regression, SelectKBest

# SelectKBest keeps the highest scores, so strongly negative correlations rank last here
X_selection = SelectKBest(r_regression, k=2).fit_transform(X, y)
  • Similarly, to keep the top 30% of features, we would run:
from sklearn.feature_selection import r_regression, SelectPercentile

X_selection = SelectPercentile(r_regression, percentile=30).fit_transform(X, y)

The `SelectKBest` and `SelectPercentile` methods will also work with custom or non-scikit-learn correlation measures, as long as they return a vector of length equal to the number of features, with a number for each feature denoting the strength of its association with the target. Let’s now take a look at how to calculate all the different correlation measures out there (we will discuss what they mean and when to choose which later).

  • Spearman’s Rho, Kendall Tau, and point-biserial correlation are all available in the scipy package. This is how to get their values for each feature in X.
from scipy import stats

# X is assumed to be a NumPy array here; for a data frame, use X.iloc[:, f] instead
rho_corr = [stats.spearmanr(X[:, f], y).correlation for f in range(X.shape[1])]
tau_corr = [stats.kendalltau(X[:, f], y).correlation for f in range(X.shape[1])]
pbs_corr = [stats.pointbiserialr(X[:, f], y).correlation for f in range(X.shape[1])]
  • Chi-Squared, Mutual Information, and ANOVA F-score are all in scikit-learn. Note that mutual information has a separate implementation, depending on whether the target is nominal or not.
from sklearn.feature_selection import chi2
from sklearn.feature_selection import mutual_info_regression
from sklearn.feature_selection import mutual_info_classif
from sklearn.feature_selection import f_classif

chi2_corr = chi2(X, y)[0]
f_corr = f_classif(X, y)[0]
mi_reg_corr = mutual_info_regression(X, y)
mi_class_corr = mutual_info_classif(X, y)
  • Cramer’s V can be obtained from a recent scipy version (1.7.0 or higher).
from scipy.stats.contingency import association, crosstab

# association() expects a contingency table, so cross-tabulate each feature with y first
v_corr = [association(crosstab(X[:, f], y)[1], method="cramer") for f in range(X.shape[1])]
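As mentioned earlier, `SelectKBest` and `SelectPercentile` accept custom score functions. Here is a minimal sketch of how one of the measures above could be plugged in. The `abs_spearman` helper and the synthetic data are assumptions made for illustration; taking the absolute value means that strong negative associations count as strong, too.

```python
import numpy as np
from scipy import stats
from sklearn.feature_selection import SelectKBest


def abs_spearman(X, y):
    """Score function: absolute Spearman correlation of each column with y."""
    X = np.asarray(X)
    return np.array([abs(stats.spearmanr(X[:, f], y)[0]) for f in range(X.shape[1])])


# Synthetic demo data: the target is a monotonic function of column 2
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 5))
y_demo = np.exp(X_demo[:, 2]) + rng.normal(scale=0.01, size=300)

selector = SelectKBest(abs_spearman, k=1).fit(X_demo, y_demo)
selected = selector.get_support(indices=True)
print(selected)
```

The score function only needs to return one number per feature, so the same pattern should work for the other measures listed above.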

Embedded feature selection methods

The final approach to feature selection we will discuss is to embed it into the learning algorithm itself. The idea is to combine the best of both worlds: speed of the filters, while getting the best subset for the particular model just like from a wrapper.

Embedded methods in practice

The flagship example is the LASSO regression. It is basically just regularized linear regression, in which an L1 penalty on the feature weights is added to the loss function, shrinking the weights towards zero. As a result, many features end up with weights of zero, meaning they are discarded from the model, while the rest with non-zero weights are included.
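In scikit-learn, one way to turn a LASSO fit into a feature selector is the `SelectFromModel` transformer. The sketch below uses synthetic data and an arbitrarily chosen regularization strength (`alpha=0.5`); in a real project, this value would need tuning.

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic demo data: only columns 0 and 3 drive the target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 8))
y_demo = 5 * X_demo[:, 0] - 3 * X_demo[:, 3] + rng.normal(scale=0.1, size=200)

# Scale first, so that the penalty treats all features equally
X_scaled = StandardScaler().fit_transform(X_demo)

# Features whose LASSO weight is (numerically) zero are dropped
selector = SelectFromModel(Lasso(alpha=0.5)).fit(X_scaled, y_demo)
kept = selector.get_support(indices=True)
print(kept)
```

The regularization strength controls how aggressive the selection is: the larger the `alpha`, the more weights are pushed all the way to zero.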

The problem with embedded methods is that there are not that many algorithms out there with feature selection built-in. Another example next to LASSO comes from computer vision: auto-encoders with a bottleneck layer force the network to disregard some of the least useful features of the image and focus on the most important ones. Other than that, there aren’t many useful examples.

Filter feature selection methods: useful tricks & tips

As we have seen, wrapper methods are slow, computationally heavy, and model-specific, and there are not many embedded methods. As a result, filters are often the go-to family of feature selection methods. 

At the same time, they require the most expertise and attention to detail. While embedded methods work out of the box and wrappers are fairly simple to implement (especially when one just calls scikit-learn functions), filters ask for a pinch of statistical sophistication. Let us now turn our attention to filter methods and discuss them in more detail.

Filter methods need to evaluate the statistical relationship between each feature and the target. As simple as it may sound, there’s more to it than meets the eye. There are many statistical methods to measure the relationship between two variables. To know which one to choose in a particular case, we need to think back to our first STATS101 class and brush up on data measurement levels.

Data measurement levels

In a nutshell, a variable’s measurement level describes the true meaning of the data and the types of mathematical operations that make sense for these data. There are four measurement levels: nominal, ordinal, interval, and ratio.

Data measurement levels | Source
  • Nominal features, such as color (“red”, “green” or “blue”) have no ordering between the values; they simply group observations based on them. 
  • Ordinal features, such as education level (“primary”, “secondary”, “tertiary”) denote order, but not the differences between particular levels (we cannot say that the difference between “primary” and “secondary” is the same as the one between “secondary” and “tertiary”). 
  • Interval features, such as temperature in degrees Celsius, keep the intervals equal (the difference between 25 and 20 degrees is the same as between 30 and 25). 
  • Finally, ratio features, such as price in USD, are characterized by a meaningful zero, which allows us to calculate ratios between two data points: we can say that $4 is twice as much as $2.

In order to choose the right statistical tool to measure the relation between two variables, we need to think about their measurement levels.

Measuring correlations for various data types

When the two variables we compare, i.e., the feature and the target, are both either interval or ratio, we are allowed to use the most popular correlation measure out there: the Pearson correlation, also known as Pearson’s r.

This is great, but Pearson correlation comes with two drawbacks: it assumes both variables are normally distributed, and it only measures the linear correlation between them. When the correlation is non-linear, Pearson’s r won’t detect it, even if it’s really strong. 

You might have heard about the Datasaurus dataset compiled by Alberto Cairo. It consists of 13 pairs of variables, each with the same very weak Pearson correlation of -0.06. As it quickly becomes obvious once we plot them, the pairs are actually correlated pretty strongly, albeit in a non-linear way.

The Datasaurus dataset by Alberto Cairo | Source
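We don’t need a dinosaur to see this effect. In the toy example below (synthetic data, chosen for illustration), one variable is fully determined by the other, and yet the Pearson correlation is approximately zero.

```python
import numpy as np

x = np.linspace(-1, 1, 201)  # symmetric around zero
y = x ** 2                   # y is fully determined by x, but not linearly

pearson_r = np.corrcoef(x, y)[0, 1]
print(pearson_r)  # approximately zero
```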

When non-linear relations are to be expected, one of the alternatives to Pearson’s correlation should be taken into account. The two most popular ones are:

  1. Spearman’s rank correlation (Spearman’s Rho)

Spearman’s rank correlation is an alternative to Pearson correlation for ratio/interval variables. As the name suggests, it only looks at the rank values, i.e. it compares the two variables in terms of the relative positions of particular data points within the variables. It is able to capture non-linear relations as long as they are monotonic, but there are no free lunches: we lose some information due to only considering the rank instead of the exact data points.

  2. Kendall rank correlation (Kendall Tau)

Another rank-based correlation measure is the Kendall rank correlation. It is similar in spirit to Spearman’s correlation but formulated in a slightly different way (Kendall’s calculations are based on concordant and discordant pairs of values, as opposed to Spearman’s calculations based on deviations). Kendall is often regarded as more robust to outliers in the data.

If at least one of the compared variables is of ordinal type, Spearman’s or Kendall rank correlation is the way to go. Due to the fact that ordinal data contains only the information on the ranks, they are both a perfect fit, while Pearson’s linear correlation is of little use.
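The following sketch compares the three measures on a strictly increasing but strongly non-linear relation (the exponential function is an arbitrary choice made for illustration).

```python
import numpy as np
from scipy import stats

x = np.linspace(0, 10, 100)
y = np.exp(x)  # strictly increasing, strongly non-linear

pearson_r = stats.pearsonr(x, y)[0]
spearman_rho = stats.spearmanr(x, y)[0]
kendall_tau = stats.kendalltau(x, y)[0]

print(pearson_r, spearman_rho, kendall_tau)
```

Since the ranks of `x` and `y` agree perfectly, both rank-based measures report a perfect association, while Pearson’s r falls visibly short of one.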

Another scenario is when both variables are nominal. In this case, we can choose from a couple of different correlation measures:

  • Cramer’s V, which captures the association between the two variables in a number ranging from zero (no association) to one (one variable completely determined by the other).
  • Chi-Squared statistic, commonly used for testing for dependence between two variables. Lack of dependence suggests the particular feature is not useful.
  • Mutual information, a measure of mutual dependence between two variables that seeks to quantify the amount of information that one can extract from one variable about the other.

Which one to choose? There is no one-size-fits-all answer. As usual, each method comes with some pros and cons. Cramer’s V is known to overestimate the association’s strength. Mutual information, being a non-parametric method, requires larger data samples to yield reliable results. Finally, the Chi-Squared does not provide information about the strength of the relationship, but rather only whether it exists or not.

We have discussed scenarios in which the two variables we compare are both interval or ratio, when at least one of them is ordinal, and when we compare two nominal variables. The final possible encounter is to compare a nominal variable with a non-nominal one.

In such cases, the two most widely-used correlation measures are:

  • ANOVA F-score, a chi-squared equivalent for the case when one of the variables is continuous while the other is nominal,
  • Point-biserial correlation, a correlation measure especially designed to evaluate the relationship between a binary and a continuous variable.

Once again, there is no silver bullet. The F-score only captures linear relations, while point-biserial correlation makes some strong normality assumptions that might not hold in practice, undermining its results.

Having said all that, which method should one choose in a particular case? The table below will hopefully provide some guidance in this matter.

| Variable 1 | Variable 2 | Method | Comments |
|---|---|---|---|
| Interval / ratio | Interval / ratio | Pearson’s r | Only captures linear relations, assumes normality |
| Interval / ratio | Interval / ratio | Spearman’s Rho | When nonlinear relations are expected |
| Interval / ratio | Interval / ratio | Kendall Tau | When nonlinear relations are expected |
| Interval / ratio | Ordinal | Spearman’s Rho | Based on ranks only, captures nonlinearities |
| Interval / ratio | Ordinal | Kendall Tau | Like Rho, but more robust to outliers |
| Nominal | Nominal | Cramer’s V | May overestimate correlation strength |
| Nominal | Nominal | Chi-Squared | No info on correlation’s strength |
| Nominal | Nominal | Mutual Information | Requires many data samples |
| Nominal | Interval / ratio / ordinal | F-score | Only captures linear relations |
| Nominal | Interval / ratio / ordinal | Point-biserial | Makes strong normality assumptions |

Comparison of different methods
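If you find yourself consulting this table often, it can be encoded as a small lookup function. The helper below is purely hypothetical (its name, the level labels, and the returned strings are made up for illustration), and it simply mirrors the rows of the table.

```python
# Hypothetical helper encoding the comparison table; names are illustrative only
NUMERIC = {"interval", "ratio"}


def suggest_measures(level_1, level_2):
    """Return candidate association measures for two measurement levels."""
    levels = {level_1, level_2}
    if levels <= NUMERIC:
        return ["Pearson's r", "Spearman's Rho", "Kendall Tau"]
    if "nominal" not in levels:  # at least one ordinal variable, none nominal
        return ["Spearman's Rho", "Kendall Tau"]
    if levels == {"nominal"}:
        return ["Cramer's V", "Chi-Squared", "Mutual Information"]
    return ["F-score", "Point-biserial"]  # nominal vs. non-nominal


print(suggest_measures("ratio", "ordinal"))
print(suggest_measures("nominal", "interval"))
```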

Take no prisoners: Boruta needs no human input

When talking about feature selection, we cannot fail to mention Boruta. Back in 2010, when it was first published as an R package, it quickly became famous as a revolutionary feature selection algorithm.

Why is Boruta a game-changer?

All the other methods we have discussed so far require a human to make an arbitrary decision. Unsupervised methods need us to set the variance or VIF threshold for feature removal. Wrappers require us to decide on the number of features we want to keep upfront. Filters need us to choose the correlation measure and the number of features to keep as well. Embedded methods have us select regularization strength. Boruta needs none of these.

Boruta is a simple yet statistically elegant algorithm. It uses feature importance measures from a random forest model to select the best subset of features, and it does so via introducing two clever ideas.

  1. First, the importance scores of features are not compared to one another. Rather, the importance of each feature competes against the importance of its randomized version. To achieve this, Boruta randomly permutes each feature to construct its “shadow” version. 

Then, a random forest is trained on the whole feature set, including the new shadow features. The maximum feature importance among the shadow features serves as a threshold. Of the original features, only those whose importance is above this threshold score a point. In other words, only features that are more important than random vectors are awarded points. 

This process is repeated iteratively multiple times. Since each time the random permutation is different, the threshold also differs, and so different features might score points. After multiple iterations, each of the original features has some number of points to its name. 

  2. The final step is to decide, based on the number of points each feature scored, whether it should be kept or discarded. Here enters the other of Boruta’s two clever ideas: we can model the scores using a binomial distribution.

Each iteration is assumed to be a separate trial. If the feature scored in a given iteration, it is a vote to keep it; if it did not, it’s a vote to discard it. A priori, we have no idea whatsoever whether a feature is important or not, so the expected percentage of trials in which the feature scores is 50%. Hence, we can model the number of points scored with a binomial distribution with p=0.5. If our feature scores significantly more times than this, it is deemed important and kept. If it scores significantly fewer times, it’s deemed unimportant and discarded. If it scores in around 50% of trials, its status is unresolved, but for the sake of being conservative, we can keep it.

For example, if we let Boruta run for 100 trials, the expected score of each feature would be 50. If it’s closer to zero, we discard it; if it’s closer to 100, we keep it.

Boruta example | Source: author 
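To make the two ideas more tangible, here is a heavily simplified sketch of the procedure described above. It is not the actual Boruta implementation: the `boruta_sketch` helper, the number of trials, the forest size, and the significance level are all assumptions, and the real algorithm includes further refinements that are left out here.

```python
import numpy as np
from scipy import stats
from sklearn.ensemble import RandomForestRegressor


def boruta_sketch(X, y, n_trials=20, alpha=0.05, seed=0):
    """Simplified Boruta: count wins against shadow features, then test the counts."""
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    hits = np.zeros(n_features, dtype=int)
    for trial in range(n_trials):
        # Shadow features: every column shuffled independently of the others
        shadows = rng.permuted(X, axis=0)
        forest = RandomForestRegressor(n_estimators=50, random_state=trial)
        forest.fit(np.hstack([X, shadows]), y)
        importances = forest.feature_importances_
        # A feature scores a point if it beats the best shadow feature
        threshold = importances[n_features:].max()
        hits += importances[:n_features] > threshold
    # One-sided binomial tests against p = 0.5
    keep = stats.binom.sf(hits - 1, n_trials, 0.5) < alpha   # P(score >= hits)
    discard = stats.binom.cdf(hits, n_trials, 0.5) < alpha   # P(score <= hits)
    return hits, keep, discard


# Synthetic demo data: only columns 0 and 1 drive the target
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(300, 6))
y_demo = 3 * X_demo[:, 0] + 2 * X_demo[:, 1] + rng.normal(scale=0.5, size=300)

hits, keep, discard = boruta_sketch(X_demo, y_demo)
print(hits, keep, discard)
```

Features flagged by neither `keep` nor `discard` are the unresolved ones mentioned above.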

Boruta has proven very successful in many Kaggle competitions and is always worth trying out. It has also been successfully used for predicting energy consumption for building heating or predicting air pollution.

There is a very intuitive Python package to implement Boruta, called BorutaPy (now part of scikit-learn-contrib). The package’s GitHub readme demonstrates how easy it is to run feature selection with Boruta.

Which feature selection method to choose? Build yourself a voting selector

We have discussed many different feature selection methods. Each of them has its own strengths and weaknesses, makes its own assumptions, and arrives at its conclusions in a different fashion. Which one to choose? Or do we have to choose? In many cases combining all these different methods together under one roof would make the resulting feature selector stronger than each of its subparts.

The inspiration

One way to do it is inspired by ensembles of decision trees. In this class of models, which includes random forests and many popular gradient boosting algorithms, one trains multiple different models and lets them vote on the final prediction. In a similar spirit, we can build ourselves a voting selector.

The idea is simple: implement a couple of feature selection methods we have discussed. Your choice could be guided by your time, computational resources, and data measurement levels. Just run as many different methods as you conveniently can afford. Then, for each feature, write down the percentage of selection methods that suggest keeping this feature in the data set. If more than 50% of the methods vote to keep the feature, keep it – otherwise, discard it.

The idea behind this approach is that while some methods might make wrong judgments with regard to some of the features due to their intrinsic biases, the ensemble of methods should get the set of useful features right. Let’s see how to implement it in practice!

The implementation

Let’s build a simple voting selector that ensembles three different feature selection methods:

  • A filter method based on Pearson correlation.
  • An unsupervised method based on multicollinearity.
  • A wrapper, Recursive Feature Elimination.

Let’s take a look at what such a voting selector might look like. 

Making the imports.

from itertools import compress

import pandas as pd
from sklearn.feature_selection import RFE, r_regression, SelectKBest
from sklearn.svm import SVR
from statsmodels.stats.outliers_influence import variance_inflation_factor

Next, our VotingSelector class comprises four methods on top of the init constructor. Three of them implement the three feature selection techniques we would like to ensemble:

  • _select_pearson() for Pearson correlation filtering
  • _select_vif() for the Variance Inflation Factor-based unsupervised approach
  • _select_rfe() for the RFE wrapper

Each of these methods takes the feature matrix X and the targets y as inputs. The VIF-based method will not use the targets, but we use this argument anyway to keep the interface consistent across all methods so that we can conveniently call them in a loop later. On top of that, each method accepts a keyword arguments dictionary which we will use to pass method-dependent parameters. Having parsed the inputs, each method calls the appropriate sklearn or statsmodels functions which we have discussed before, to return the list of feature names to keep.

The voting magic happens in the select() method. There, we simply iterate over the three selection methods, and for each feature, we record whether it should be kept (1) or discarded (0) according to this method. Finally, we take the mean over these votes. For each feature, if this mean is greater than the voting threshold of 0.5 (which means that at least two out of three methods voted to keep a feature), we keep it. 

Here is the code for the entire class.

class VotingSelector:
    """Ensemble of feature selectors that vote on which features to keep."""

    def __init__(self):
        # Map each selector name to the method implementing it
        self.selectors = {
            "pearson": self._select_pearson,
            "vif": self._select_vif,
            "rfe": self._select_rfe,
        }
        self.votes = None

    @staticmethod
    def _select_pearson(X, y, **kwargs):
        # Filter: keep the k features with the highest Pearson correlation with y
        selector = SelectKBest(r_regression, k=kwargs.get("n_features_to_select", 5)).fit(X, y)
        return selector.get_feature_names_out()

    @staticmethod
    def _select_vif(X, y, **kwargs):
        # Unsupervised: keep features whose VIF does not exceed the threshold (y is unused)
        return [
            X.columns[feature_index]
            for feature_index in range(len(X.columns))
            if variance_inflation_factor(X.values, feature_index) <= kwargs.get("vif_threshold", 10)
        ]

    @staticmethod
    def _select_rfe(X, y, **kwargs):
        # Wrapper: Recursive Feature Elimination driven by a linear SVR
        svr = SVR(kernel="linear")
        rfe = RFE(svr, n_features_to_select=kwargs.get("n_features_to_select", 5))
        rfe.fit(X, y)
        return rfe.get_feature_names_out()

    def select(self, X, y, voting_threshold=0.5, **kwargs):
        votes = []
        for selector_name, selector_method in self.selectors.items():
            features_to_keep = selector_method(X, y, **kwargs)
            # One row of 0/1 votes per selector, one column per feature
            votes.append(
                pd.DataFrame([int(feature in features_to_keep) for feature in X.columns]).T
            )
        self.votes = pd.concat(votes)
        self.votes.columns = X.columns
        self.votes.index = self.selectors.keys()
        # Keep the features whose share of "keep" votes exceeds the threshold
        features_to_keep = list(compress(X.columns, self.votes.mean(axis=0) > voting_threshold))
        return X[features_to_keep]

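Let’s see it working in practice. We will load the infamous Boston Housing data, which is built into scikit-learn versions before 1.2 (load_boston was deprecated in 1.0 and removed in 1.2, so the snippet below needs an older release).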

from sklearn.datasets import load_boston
boston = load_boston()
X = pd.DataFrame(boston["data"], columns=boston["feature_names"])
y = boston["target"]

Now, running feature selection is as easy as this:

vs = VotingSelector()
X_selection = vs.select(X, y)

As a result, we get the feature matrix with only three features left.

      ZN  CHAS     RM
0    18.0   0.0  6.575
1     0.0   0.0  6.421
2     0.0   0.0  7.185
3     0.0   0.0  6.998
4     0.0   0.0  7.147
..    ...   ...    ...
501   0.0   0.0  6.593
502   0.0   0.0  6.120
503   0.0   0.0  6.976
504   0.0   0.0  6.794
505   0.0   0.0  6.030
[506 rows x 3 columns]

We can also take a look at how each of our methods has voted by printing vs.votes.

        CRIM  ZN  INDUS  CHAS  NOX  RM  AGE  DIS  RAD  TAX  PTRATIO  B  LSTAT
pearson     0   1      0     1    0   1    0    1    0    0        0  1      0
vif         1   1      0     1    0   0    0    0    0    0        0  0      0
rfe         0   0      0     1    1   1    0    0    0    0        1  0      1

We might not be happy with only 3 out of the initial 13 columns left. Luckily, we can easily make the selection less restrictive by modifying the parameters of the particular methods. This can be done by simply adding appropriate arguments to the call to select, thanks to how we pass kwargs around.

The Pearson and RFE methods need a pre-defined number of features to keep. The default has been 5, but we might want to increase it to 8. We can also modify the VIF threshold, that is, the value of the Variance Inflation Factor above which we discard a feature due to multicollinearity. By convention, this threshold is set at 10, but increasing it to, say, 15 will result in more features being kept.

vs = VotingSelector()
X_selection = vs.select(X, y, n_features_to_select=8, vif_threshold=15)

This way, we have seven features left, as printing vs.votes confirms.

        CRIM  ZN  INDUS  CHAS  NOX  RM  AGE  DIS  RAD  TAX  PTRATIO  B  LSTAT
pearson     1   1      0     1    0   1    1    1    1    0        0  1      0
vif         1   1      1     1    0   0    0    1    0    0        0  0      1
rfe         1   0      1     1    1   1    0    1    0    0        1  0      1

Our VotingSelector class is a simple but generic template which you can extend to an arbitrary number of feature selection methods. As a possible extension, you could also treat all the arguments passed to select() as hyperparameters of your modeling pipeline and optimize them so as to maximize the performance of the downstream model.

Feature selection at Big Tech

Large technology companies such as the GAFAM group and the like, with their thousands of machine learning models in production, are prime examples of how feature selection is done in the wild. Let’s see what these tech giants have to say about it!

Google

Rules of ML is a handy compilation of best practices in machine learning from around Google. In it, Google’s engineers point out that the number of parameters the model can learn is roughly proportional to the amount of data it has access to. Hence, the less data we have, the more features we need to discard. Their rough guidelines (derived from text-based models) are to use a dozen features with 1000 training examples or 100,000 features with 10 million training examples.

Another crucial point in the document concerns model deployment issues, which can also affect feature selection. 

  • First, your set of features to select from might be constrained by what will be available in production at inference time. You may be forced to drop a great feature from training if it isn’t there for the model when it goes live. 
  • Second, some features might be prone to data drift. While the topic of tackling drift is a complex one, sometimes the best solution might be to remove the problematic feature from the model altogether.

Facebook

A couple of years ago, in 2019, Facebook came up with its own feature selection algorithm suited to neural networks in order to save computational resources while training large-scale models. They further tested this algorithm on their own Facebook News Feed dataset so as to rank relevant items as efficiently as possible while working with lower-dimensional input. You can read all about it here.

Parting words

Thanks for reading till the end! I hope this article convinced you that feature selection is a crucial step in the data preparation pipeline and gave you some guidance as to how to approach it. 

Don’t hesitate to hit me up on social media to discuss the topics covered here or any other machine learning topics, for that matter. Happy feature selection!
