# Dimensionality Reduction for Machine Learning 10 min Nilesh Barla 25th January, 2023 Data forms the foundation of any machine learning algorithm, without it, Data Science can not happen. Sometimes, it can contain a huge number of features, some of which are not even required. Such redundant information makes modeling complicated. Furthermore, interpreting and understanding the data by visualization gets difficult because of the high dimensionality. This is where dimensionality reduction comes into play.

1. What is dimensionality reduction?
2. What is the curse of dimensionality?
3. Tools and libraries used for dimensionality reduction
4. Algorithms used for dimensionality reduction
5. Applications

## What is dimensionality reduction?

Dimensionality reduction is the task of reducing the number of features in a dataset. In machine learning tasks like regression or classification, there are often too many variables to work with. These variables are also called features. The higher the number of features, the more difficult it is to model them, this is known as the curse of dimensionality. This will be discussed in detail in the next section.

Additionally, some of these features can be quite redundant, adding noise to the dataset and it makes no sense to have them in the training data. This is where feature space needs to be reduced.

The process of dimensionality reduction essentially transforms data from high-dimensional feature space to a low-dimensional feature space. Simultaneously, it is also important that meaningful properties present in the data are not lost during the transformation.

Dimensionality reduction is commonly used in data visualization to understand and interpret the data, and in machine learning or deep learning techniques to simplify the task at hand.

### Curse of dimensionality

It is well known that ML/DL algorithms need a large amount of data to learn invariance, patterns, and representations. If this data comprises a large number of features, this can lead to the curse of dimensionality. The curse of dimensionality, first introduced by Bellman, describes that in order to estimate an arbitrary function with a certain accuracy the number of features or dimensionality required for estimation grows exponentially. This is especially true with big data which yields more sparsity

Sparsity in data is usually referred to as the features having a value of zero; this doesn’t mean that the value is missing. If the data has a lot of sparse features then the space and computational complexity increase. Oliver Kuss  shows that the model trained on sparse data performed poorly in the test dataset. In other words, the model during the training learns noise and they are not able to generalize well. Hence they overfit.

When the data is sparse, observations or samples in the training dataset are difficult to cluster as high-dimensional data causes every observation in the dataset to appear equidistant from each other. If data is meaningful and non-redundant, then there will be regions where similar data points come together and cluster, furthermore they must be statistically significant.

Issues that arise with high dimensional data are:

1. Running a risk of overfitting the machine learning model.
2. Difficulty in clustering similar features.
3. Increased space and computational time complexity.

Non-sparse data or dense data on the other hand is data that has non-zero features. Apart from containing non-zero features they also contain information that is both meaningful and non-redundant.

To tackle the curse of dimensionality, methods like dimensionality reduction are used. Dimensional reduction techniques are very useful to transform sparse features to dense features. Furthermore, dimensionality reduction is also used to clean the data and feature extraction.

## Tools and library

The most popular library for dimensionality reduction is scikit-learn (sklearn). The library consists of three main modules for dimensionality reduction algorithms:

1. Decomposition algorithms
• Principal Component Analysis
• Kernel Principal Component Analysis
• Non-Negative Matrix Factorization
• Singular Value Decomposition
2. Manifold learning algorithms
• t-Distributed Stochastic Neighbor Embedding
• Spectral Embedding
• Locally Linear Embedding
3. Discriminant Analysis
• Linear Discriminant Analysis

When it comes to deep learning, algorithms like autoencoders can be constructed to reduce dimensions and learn features and representations. Frameworks like Pytorch, Pytorch Lightning, Keras, and TensorFlow are used to create autoencoders.

### Recommended for you

How to Keep Track of Scikit-Learn Model Training Metadata

Knowledge Distillation: Principles, Algorithms, Applications

The Best ML Frameworks & Extensions For Scikit-learn

## Algorithms for dimensionality reduction

### Decomposition algorithms

Decomposition algorithm in scikit-learn involves dimensionality reduction algorithms. We can call various techniques using the following command:

`from sklearn.decomposition import PCA, KernelPCA, NMF`

#### Principal Component Analysis (PCA)

Principal Component Analysis, or PCA, is a dimensionality-reduction method to find lower-dimensional space by preserving the variance as measured in the high dimensional input space. It is an unsupervised method for dimensionality reduction.

PCA transformations are linear transformations. It involves the process of finding the principal components, which is the decomposition of the feature matrix into eigenvectors. This means that PCA will not be effective when the distribution of the dataset is non-linear.

Let’s understand PCA with python code.

```def pca(X=np.array([]), no_dims=50):

print("Preprocessing the data using PCA...")
(n, d) = X.shape
Mean = np.tile(np.mean(X, 0), (n, 1))
X = X - Mean
(l, M) = np.linalg.eig(np.dot(X.T, X))
Y = np.dot(X, M[:, 0:no_dims])
return Y```

PCA implementation is quite straightforward. We can define the whole process into just four steps:

1. Standardization: The data has to be transformed to a common scale by taking the difference between the original dataset with the mean of the whole dataset. This will make the distribution 0 centered.
2. Finding covariance: Covariance will help us to understand the relationship between the mean and original data.
3. Determining the principal components: Principal components can be determined by calculating the eigenvectors and eigenvalues. Eigenvectors are a special set of vectors that help us to understand the structure and the property of the data that would be principal components. The eigenvalues on the other hand help us to determine the principal components. The highest eigenvalues and their corresponding eigenvectors make the most important principal components.
4. Final output: It is the dot product of the standardized matrix and the eigenvector. Note that the number of columns or features will be changed.

Reducing the number of variables of data not only reduces complexity but also decreases the accuracy of the machine learning model. However, with a smaller number of features it is easy to explore, visualize and analyze, it also makes machine learning algorithms computationally less expensive. In simple words, the idea of PCA is to reduce the number of variables of a data set, while preserving as much information as possible.

Let’s also take a look at the modules and functions sklearn provides for PCA.

```from sklearn.datasets import load_digits
digits.data.shape```

(1797, 64)

The data consists of 8×8 pixel images, which means that they are 64-dimensional. To gain some understanding of the relationships between these points, we can use PCA to project them to lower dimensions, like 2-D:

```from sklearn.decomposition import PCA

pca = PCA(2)  # project from 64 to 2 dimensions
projected = pca.fit_transform(digits.data)
print(digits.data.shape)
print(projected.shape)```

(1797, 64)

(1797, 2)

Now, let’s plot the first two principal components.

```plt.scatter(projected[:, 0], projected[:, 1],
c=digits.target, edgecolor='none', alpha=0.5,
cmap=plt.cm.get_cmap('spectral', 10))
plt.xlabel('component 1')
plt.ylabel('component 2')
plt.colorbar();```

We can see that PCA optimally found the principal components that can quite effectively cluster similar distributions, for the most part.

#### Kernel PCA (KPCA)

The PCA transformations we described previously are linear transformations that are ineffective with the non-linear distribution. To deal with non-linear distribution, the basic idea is to use the kernel trick.

A kernel trick is simply a method to project non-linear data onto a higher dimensional space and separate different distributions of data. Once the distributions are separated we can use PCA to separate them linearly.

Kernel PCA uses a kernel function ϕ that calculates the dot product of the data for non-linear mapping. In other words, the function ϕ maps the original d-dimensional features into a larger, k-dimensional feature space by creating non-linear combinations of the original features.

Let assume a dataset x that contains two features x1 and x2:

After applying the kernel trick we get:

To get a more intuitive understanding of Kernel PCA let’s define a feature space that cannot be linearly separated.

```​​from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA
np.random.seed(0)
X, y = make_circles(n_samples=400, factor=.3, noise=.05)```

Now, let’s plot and see our dataset.

```plt.figure(figsize=(15,10))
plt.subplot(1, 2, 1, aspect='equal')
plt.title("Original space")
reds = y == 0
blues = y == 1

plt.scatter(X[reds, 0], X[reds, 1], c="red",
s=20, edgecolor='k')
plt.scatter(X[blues, 0], X[blues, 1], c="blue",
s=20, edgecolor='k')
plt.xlabel("\$x_1\$")
plt.ylabel("\$x_2\$")```

As you can see in this dataset the two classes cannot be separated linearly. Now, let’s define kernel PCA and see how it separates this feature space.

```kpca = KernelPCA(kernel="rbf", fit_inverse_transform=True, gamma=10, )
X_kpca = kpca.fit_transform(X)
plt.subplot(1, 2, 2, aspect='equal')
plt.scatter(X_kpca[reds, 0], X_kpca[reds, 1], c="red",
s=20, edgecolor='k')
plt.scatter(X_kpca[blues, 0], X_kpca[blues, 1], c="blue",
s=20, edgecolor='k')
plt.title("Projection by KPCA")
plt.xlabel(r"1st principal component in space induced by \$phi\$")
plt.ylabel("2nd component")```

After applying KPCA, it is able to linearly separate the two classes in the dataset.

#### Singular Value Decomposition (SVD)

The singular value decomposition or SVD is a factorization method of a real or complex matrix. It is efficient when working with a sparse dataset; a dataset having a lot of zero entries. This type of dataset is usually found in the Recommender Systems, rating, and reviews dataset, et cetera.

The idea of SVD is that every matrix of shape nXp factorizes into A = USVT, where U is the orthogonal matrix, S is a diagonal matrix and  VT is also an orthogonal matrix.

The advantage of SVD is that the orthogonal matrices capture the structure of the original matrix A which means that their properties do not change when multiplied by other numbers. This can help us approximate A.

Now let’s understand SVD using code. To get a better understanding of the algorithm we will use a face dataset that scikit-learn provides.

```from sklearn.datasets import fetch_lfw_people
lfw_people = fetch_lfw_people(min_faces_per_person=70, resize=0.4)```

Plot the images to get an idea of what we are working with.

```X = lfw_people.images.reshape(img_count, img_width * img_height)
X0_img = X.reshape(img_height, img_width)

plt.imshow(X0_img, cmap=plt.cm.gray)```

Create a function for easy visualization of images.

```def draw_img(img_vector, h=img_height, w=img_width):
plt.imshow( img_vector.reshape((h,w)), cmap=plt.cm.gray)
plt.xticks(())
plt.yticks(())
draw_img(X)```

Before applying SVD it is better to standardize the data.

```from sklearn.preprocessing import StandardScaler

scaler = StandardScaler(with_std=False)
Xstd = scaler.fit_transform(X)```

After standardizing this is how the image looks.

It is worth noting that we can always retrieve the original image by performing the inverse transformation.

```Xorig = scaler.inverse_transform(Xstd)
draw_img(Xorig)```

Now, we can apply the SVD function from NumPy and decompose the matrix into three matrices.

```from numpy.linalg import svd

U, S, VT = svd(Xstd)```

To check that the function works we can always perform a matrix multiplication of the three matrices.

```US = U*S
Xhat = US @ VT[0:1288,:]

# inverse transform Xhat to reverse standardization
Xhat_orig = scaler.inverse_transform(Xhat)
draw_img(Xhat_orig)```

Now, let’s perform dimensionality reduction. To do that we just need to reduce the number of features from the orthogonal matrices.

```Xhat_500 = US[:, 0:500] @ VT[0:500, :]
# inverse transform Xhat to reverse standardization
Xhat_500_orig = scaler.inverse_transform(Xhat_500)
# draw recovered image
draw_img(Xhat_500_orig)```

We can further reduce more features and see the results.

```Xhat_100 = US[:, 0:100] @ VT[0:100, :]
# inverse transform Xhat to reverse standardization
Xhat_100_orig = scaler.inverse_transform(Xhat_100)
# draw recovered image
draw_img(Xhat_100_orig)```

Now let’s create a function that would allow us to reduce the dimensions of the image.

```def dim_reduce(US_, VT_, dim=100):

Xhat_ = US_[:, 0:dim] @ VT_[0:dim, :]

return scaler.inverse_transform(Xhat_)```

Plotting images with a different number of features.

```dim_vec = [50, 100, 200, 400, 800]

plt.figure(figsize=(1.8 * len(dim_vec), 2.4))

for i, d in enumerate(dim_vec):
plt.subplot(1, len(dim_vec), i + 1)
draw_img(dim_reduce(US, VT, d))```

As you can see the first image contains the least number of features yet it can still construct the abstract version of the image and as we increase the features, we eventually obtain the original image. This proves that SVD can retain the basic structure of the data.

#### Non-negative Matrix Factorization (NMF)

NMF is an unsupervised machine learning algorithm. When a non-negative input matrix X of dimension mXn is given to the algorithm, it is decomposed into the product of two non-negative matrices W and H. W is of the dimension mXp while H is of the dimension pXn.

Where Y = W.H

From the equation above you can see that to factorize the matrix, we need to minimize the distance. The most widely used distance function is the squared Frobenius norm; this is an extension of the Euclidean norm to matrices.

It is also worth noting that this problem is not solvable in general which is why it is approximated. As it turns out, NMF is good for parts-based representation of the dataset i.e. NMF provides an efficient, distributed representation, and can aid in the discovery of the structure of interest within the data.

Let’s understand NMF with code. We will use the same data that we used in SVD.

First, we will fit the model to the data.

```from sklearn.decomposition import NMF
model = NMF(n_components=200, init='nndsvd', random_state=0)
W = model.fit_transform(X)
V = model.components_```

NMF takes a bit of time to decompose the data. Once the data is decomposed we can then visualize the factorized components.

```num_faces = 20
plt.figure(figsize=(1.8 * 5, 2.4 * 4))

for i in range(0, num_faces):
plt.subplot(4, 5, i + 1)
draw_img(V[i])```

From the image above we can see that NMF is very efficient to capture the underlying structure of the data. It is also worth mentioning that NMF captures only the linear attributes.

1. Data compression and visualization
2. Robustness to noise
3. Easier to interpret

### Manifold learning

So far we have seen approaches that only involved linear transformation. But what do we do when we have a non-linear dataset?

Manifold learning is a type of unsupervised learning that seeks to perform dimensionality reduction of a non-linear dataset. Again, scikit-learn offers a module that consists of various nonlinear dimensionality reduction techniques. We can call those classes or techniques through this command:

`from sklearn.manifold import TSNE, LocallyLinearEmbedding, SpectralEmbedding`

#### t-Distributed Stochastic Neighbor Embedding (t-SNE)

t-Distributed Stochastic Neighbor Embedding or t-SNE is a dimensionality reduction technique well suited for data visualization. Unlike PCA which simply maximizes the variance, t-SNE minimizes the divergence between two distributions. Essentially, it recreates the distribution of a high-dimensional space in a low-dimensional space rather than maximizing variance or even using a kernel trick.

We can get a high-level understanding of t-SNE in three simple steps:

1. It first creates a probability distribution for the high-dimensional samples.
2. Then, it defines a similar distribution for the points in the low-dimensional embedding.
3. Finally, it tries to minimize the KL-divergence between the two distributions.

Now let’s understand it with code. For t-SNE, we will use the MNIST dataset again. Firstly, we import TSNE and then the data as well.

```from sklearn.manifold import TSNE

print(digits.data.shape)

# There are 10 classes (0 to 9) with almost 180 images in each class
# The images are 8x8 and hence 64 pixels(dimensions)

#Displaying what the standard images look like
for i in range(0,5):
plt.figure(figsize=(5,5))
plt.imshow(digits.images[i])
plt.show()```

We will then store the digits in order using np.vstack.

```X = np.vstack([digits.data[digits.target==i] for i in range(10)])
Y = np.hstack([digits.target[digits.target==i] for i in range(10)])```

We will apply t-SNE to the dataset.

`digits_final = TSNE(perplexity=30).fit_transform(X)`

We will now create a function to visualize the data.

```def plot(x, colors):
palette = np.array(sb.color_palette("hls", 10))  #Choosing color palette

# Create a scatter plot.
f = plt.figure(figsize=(8, 8))
ax = plt.subplot(aspect='equal')
sc = ax.scatter(x[:,0], x[:,1], lw=0, s=40,c=palette[colors.astype(np.int)])
# Add the labels for each digit.
txts = []
for i in range(10):
# Position of each label.
xtext, ytext = np.median(x[colors == i, :], axis=0)
txt = ax.text(xtext, ytext, str(i), fontsize=24)
txt.set_path_effects([pe.Stroke(linewidth=5, foreground="w"),
pe.Normal()])
txts.append(txt)
return f, ax, txts```

Now we perform data visualization on the transformed dataset.

`plot(digits_final,Y)`

As it can be seen, t-SNE clusters the data beautifully. Compared to PCA, t-SNE performs well on nonlinear data. The drawback with t-SNE is that when the data is big it consumes a lot of time. So it is better to perform PCA followed by t-SNE.

#### Locally Linear Embedding (LLE)

Locally Linear Embedding or LLE is a non-linear and unsupervised machine learning method for dimensionality reduction. LLE takes advantage of the local structure or topology of the data and preserves it on a lower-dimensional feature space.

LLE optimizes faster but fails on noisy data.

Let’s break the whole process into three simple steps:

1. Find the nearest neighbors of the data points.
2. Construction of a weight matrix, by approximating each data point as a weighted linear combination of its k-nearest neighbors and minimizing the squared distance between them and their linear representation.
3. Map the weights into a lower-dimensional space by using the eigenvector-based optimization technique. Dimensionality reduction technique: LLE | Source: S. T. Roweis and L. K. Saul, Nonlinear dimensionality reduction by locally linear embedding

#### Spectral embedding

Spectral Embedding is another non-linear dimensionality reduction technique that also happens to be an unsupervised machine learning algorithm. Spectral embedding aims to find clusters of different classes based on the low-dimensional representations.

We can again break the whole process into three simple steps:

1. Preprocessing: Construct a Laplacian matrix representation of the data or graph.
2. Decomposition: Compute eigenvalues and eigenvectors of the constructed matrix and then map each point to a lower-dimensional representation. Spectral embedding makes use of the second smallest eigenvalue and its corresponding eigenvector.
3. Clustering: Assign points to two or more clusters, based on the representation. Clustering is usually done using k-means clustering.

Applications: Spectral Embedding finds its application in image segmentation.

### Discriminant Analysis

Discriminant Analysis is another module that scikit-learn provides. It can be called using the following command:

`from sklearn.discriminant_analysis import LinearDiscriminantAnalysis`

#### Linear Discriminant Analysis (LDA)

LDA is an algorithm that is used to find a linear combination of features in a dataset. Like PCA, LDA is also a linear transformation-based technique. But unlike PCA it is a supervised learning algorithm.

LDA computes the directions, i.e. linear discriminants that can create decision boundaries and maximize the separation between multiple classes. It is also very effective for multi-class classification tasks.

To have a more intuitive understanding of LDA, consider plotting a relationship of two classes as shown in the image below.

One way to solve this problem is to project all the data points in the x-axis.

This approach will lead to information loss and would be redundant.

A better approach will be to compute the distance between all the points in the data and fit a new linear line that passes through them. This new line can now be used to project all the points.

This new line minimizes the variance and classifies the two classes efficiently by maximizing the distance between them.

LDA can be used for multivariate data as well. It makes data inference quite simple. LDA can be computed using the following 5 steps:

1. Compute the d-dimensional mean vectors for the different classes from the dataset.
2. Compute the scatter matrices (in-between-class and within-class scatter matrices). the Scatter matrix is used to make estimates of the covariance matrix. This is done when the covariance matrix is difficult to calculate or joint variability of two random variables is difficult to calculate.
3. Compute the eigenvectors (e1, e2, e3…ed) and corresponding eigenvalues (λ1,λ2,…,λd) for the scatter matrices.
4. Sort the eigenvectors by decreasing eigenvalues and choose k eigenvectors with the largest eigenvalues to form a d×k dimensional matrix W (where every column represents an eigenvector).
5. Use this d×k eigenvector matrix to transform the samples onto the new subspace. This can be summarized by the matrix multiplication: Y=X×W (where X is an n×d dimensional matrix representing the n samples, and y are the transformed n×k-dimensional samples in the new subspace).

## Applications of dimentionality reduction

Dimensionality reduction finds its way in many real-life applications some of which are:

• Customer relationship management
• Text categorization
• Image retrieval
• Intrusion detection
• Medical image segmentation

• It helps in data compression by reducing features.
• It reduces storage.
• It makes machine learning algorithms computationally efficient.
• It also helps remove redundant features and noise.
• It tackles the curse of dimensionality

• It may lead to some amount of data loss.
• Accuracy is compromised.

## Final thoughts

In this article, we learned about dimensionality reduction and also about the curse of dimensionality. We touched on the different algorithms that are used in dimensionality reduction with mathematical details and through code as well.

It is worth mentioning these algorithms are supposed to be used based on the task at hand. For instance, if the nature of your data is linear then use decomposition methods otherwise use manifold learning techniques.

It is considered to be a good practice to first visualize the data and then decide which method to use. Also, do not restrict yourself to one method but explore differently and see which one is the most suitable.