Machine learning is a subfield of Artificial Intelligence, where we try to build intelligent systems that have the function and behavior of our brain. Through ML, we try to build machines that can compute, extract patterns, automate routine tasks, diagnose biological anomalies, and prove scientific theories and hypotheses.

Unlike rule-based AI systems, machine learning doesn’t rely on hard-coded rules to reach a solution. Instead, it strengthens AI with the idea that a system can extract knowledge from the information it is given and arrive at the core idea without being explicitly programmed.

With the advent of machine learning, we can now design algorithms that can “tackle problems involving knowledge of the real world and make decisions that appear subjective.”

(Ian Goodfellow, “Deep Learning”)

Data plays a key role in all machine learning problems.

Why is it so important?

Data is a discrete record of an underlying, continuous series of events, and the patterns that describe those events are hidden in it. If a machine can extract the patterns that represent a particular event, we can say that the machine has learned that information. When new data is then fed into it, it can provide appropriate solutions and predictions.

## Representation Learning

Imagine an engineer designing an ML algorithm to predict malignant cells based on brain scans. To design the algorithm, the engineer has to rely heavily on patient data, because that’s where all the answers are.

Each observation or feature in that data describes the attributes of the patient. The machine learning algorithm that predicts the outcome has to learn how each feature correlates with the different outcomes: benign or malignant.

So in case of any noise or discrepancies in the data, the outcome can be totally different, which is the problem with most machine learning algorithms. Most machine learning algorithms have a superficial understanding of the data.

So what is the solution?

**Provide the machine with a more abstract representation of the data.**

For many tasks, it is very hard to know in advance which features should be extracted. Alan Turing and his colleagues, while deciphering the Enigma code, relied on patterns that appeared regularly in the messages. This is where the idea of representation learning truly comes into view.

In representation learning, the machine is provided with data and it learns the representation by itself. It’s a method of finding a representation of the data – the features, the distance function, the similarity function – that dictates how the predictive model will perform.

Representation learning works by reducing high-dimensional data into low-dimensional data, making it easier to find patterns, anomalies, and also giving us a better understanding of the behavior of the data altogether.

It also reduces the complexity of the data, so the anomalies and noise are reduced. This reduction in noise can be very useful for supervised learning algorithms.

**Invariance and disentangling**

The problem with representation learning is that it’s very difficult to get representations that can solve a given problem. Luckily, deep learning and deep neural networks started to prove very much goal-oriented and efficient.

The idea that deep neural networks can build complex concepts out of simple concepts lies at the core of deep learning [Ian Goodfellow, “Deep learning”].

**So where does the idea of representation learning fit in?**

Attempts to explain the success of deep learning usually point to two ingredients: **representation learning** and **optimization**.

Deep learning is often seen as a black box, where a finite number of functions are used to find parameters that yield good generalization. This is achieved by optimization where the algorithm tries to correct itself by evaluating the model output with the ground truth.

This process is repeated until the loss function reaches a minimum, ideally the global minimum, although in practice optimization often settles in a good local minimum. Most deep networks are heavily over-parameterized, which suggests that they should overfit, that is, perform well on training data but poorly on new data. In practice they often generalize well anyway. Recent work suggests that this is related to properties of the loss landscape and to the implicit regularization performed by stochastic gradient descent (SGD), but the overall picture is still hazy [Zhang et al., 2017].

Representation learning, on the other hand, focuses on the properties of the representation learned by the layers of the network (the activations) while remaining largely agnostic to the particular optimization process used [Emergence of Invariance and Disentanglement in Deep Representations, 2018].

Two major factors that usually occur in any data distribution are **variance** and **entanglement**. These two factors need to be eliminated to get a good representation from the data. Variance in the data can also be considered the sensitivity, and these sensitivities can turn the outcome upside down. Any model that we build has to be robust to variance, i.e. it has to be invariant because this can greatly harm the outcome of a deep learning model.

Entanglement is the way a vector in the data is connected or correlated to other vectors in the data. These connections make the data very complex and hard to decipher. What we are supposed to do is look for variables where the relationship is simple. It’s an easy way to transform high dimensional data into low dimensional data, or to transform high-dimensional data in a way that can be easily separated.

From the previous section, we learned that representation learning extracts abstract patterns that make sense of the data. The success of deep learning, in turn, is often ascribed to the ability of deep networks to learn representations that are invariant (insensitive) to nuisances such as translations, rotations, and occlusions, and that are also “disentangled”, meaning they separate the underlying factors in the high-dimensional space of data [Bengio, 2009]. It therefore remains important to learn how to simplify a complex arrangement of data by building models that are invariant and disentangled.

“If neither the architecture nor the loss function explicitly enforce invariance and disentangling, how can these properties emerge consistently in deep networks trained by simple generic optimization?”

It turns out that we can answer this question by showing two things:

- Using classical notions of statistical decision and information theory, we show that invariance in a deep neural network is equivalent to minimality of the representation it computes, and can be achieved by **stacking layers** and injecting noise in the computation, under realistic and empirically validated assumptions.
- Using an Information Decomposition of the empirical loss, we show that overfitting can be reduced by **limiting the information content stored in the weights** [Emergence of Invariance and Disentanglement in Deep Representations, 2018].

**The Information Bottleneck**

Information Bottleneck (IB) was introduced by Tishby et al. (1999). It was introduced with a hypothesis that it can extract relevant information by compressing the amount of information that can traverse the full network, forcing a learned compression of the input data.

This compressed representation not only reduces dimensions but along with it reduces the complexity of the data as well [Deep Learning and the Information Bottleneck Principle, Tishby 2015]. The idea is that a network rids noisy input data of extraneous details as if by squeezing the information through a bottleneck, leaving only the features most relevant to general concepts.
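The trade-off described above is commonly written as a single objective. The block below is a sketch of that standard formulation; the symbols are notation assumed here rather than taken from the text: *X* is the input, *Y* the target, *Z* the compressed representation, *I(·;·)* mutual information, and β > 0 a weight that controls how much predictive information is kept relative to how strongly the input is compressed.

```latex
\min_{p(z \mid x)} \; I(X; Z) \;-\; \beta \, I(Z; Y)
```

A small β favours aggressive compression of the input, while a large β favours keeping everything in *Z* that helps predict *Y*.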

Tishby’s findings have the AI community buzzing. “I believe that the information bottleneck idea could be very important in future deep neural network research”, said Alex Alemi of Google Research, who has already developed new approximation methods for applying an information bottleneck analysis to large deep neural networks. The bottleneck could serve “not only as a theoretical tool for understanding why our neural networks work as well as they do currently but also as a tool for constructing new objectives and architectures of networks,” Alemi said.

**Latent variables**

A latent variable is a random variable that cannot be observed directly, but that underlies how the data is distributed. Latent variables give us a low-dimensional, abstract representation of high-dimensional data.

So why do we need latent variables?

Every machine learning model faces the same basic problem: learning a complicated probability distribution *p(x)*. The problem is **constrained**, because we only have a limited set of high-dimensional data points *x* drawn from this distribution.

For example, to learn the probability distribution over images of cats we need to define a **distribution** that can model complex correlations between all the pixels that form each image. Modelling this distribution directly is a tedious and challenging task, and may even be infeasible in finite time. Instead of modelling *p(x)* directly, we can introduce an (unobserved) latent variable *z* and define a conditional distribution *p(x | z)* for the data, which is called the likelihood. In probabilistic terms, *z* can be interpreted as a continuous random variable. For the example of cat images, *z* could contain a hidden representation of the type of cat, its color, or shape.

Having *z*, we can further introduce a **prior distribution** *p(z)* over the latent variables to compute the joint distribution over observed and latent variables, *p(x, z) = p(x|z) p(z)*.

To obtain the data distribution *p(x)* we need to marginalize over the latent variables:

$$p(x) = \int p(x \mid z)\, p(z)\, dz$$

With the joint and the marginal in hand, we can compute the **posterior distribution** using Bayes’ theorem:

$$p(z \mid x) = \frac{p(x \mid z)\, p(z)}{p(x)}$$

The posterior distribution allows us to infer the latent variables given the observations.

Note that the integral in the marginalization has no analytical solution for most of the data we deal with, so we have to apply some approximate method to infer the posterior. Essentially, models with latent variables describe a **generative process** from which the data is assumed to have been generated. Such a model is known as a **generative model**.

It means that if we want to generate a new data point, we first need to draw a sample *z ~ p(z)* and then use it to sample a new observation *x* from the conditional distribution *p(x|z)*. While doing this we can also assess whether the model provides a good approximation of the data distribution *p(x)*.
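The two-step sampling procedure can be sketched in a few lines of Python. This is only an illustration under assumptions that are not in the text: the prior *p(z)* is taken to be a standard normal, and the likelihood *p(x|z)* a Gaussian whose mean is a hypothetical linear function `weight * z + bias`.

```python
import random


def sample_latent(rng):
    """Draw z from the prior p(z), assumed here to be a standard normal."""
    return rng.gauss(0.0, 1.0)


def sample_observation(z, rng, weight=2.0, bias=1.0, noise_std=0.1):
    """Draw x from p(x | z), assumed Gaussian with mean weight * z + bias."""
    return rng.gauss(weight * z + bias, noise_std)


def generate(n, seed=0):
    """Generate n (z, x) pairs: first sample z, then sample x given z."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        z = sample_latent(rng)
        x = sample_observation(z, rng)
        samples.append((z, x))
    return samples


samples = generate(5000)
mean_x = sum(x for _, x in samples) / len(samples)
print(f"empirical mean of x: {mean_x:.3f}")
```

Under these assumptions the generated *x* values average out near the chosen `bias`, since *z* has zero mean. In a real model the linear map would be replaced by a learned decoder.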

Mathematical models containing latent variables are by definition latent variable models. *These latent variables have much lower dimensions than the observed input vectors*. This yields a **compressed representation of the data**.

Latent variables are basically found at the information bottleneck. It is in this bottleneck that we find the abstract representation of the given high-dimensional input. **The manifold hypothesis** states that high-dimensional data lies on a lower-dimensional manifold.

Now that we know what latent variables are and where they can be found, we can now turn ourselves to various frameworks where representation learning is used.

## Representation learning in various learning frameworks

Current machine and deep learning models are still sensitive to variance and entanglement in the given data (as discussed earlier). In order to improve the accuracy and performance of a model, we need to use representation learning so that the model can produce invariant and disentangled results.

In this section, we will see how representation learning can improve the model’s performance in three learning frameworks – supervised learning, unsupervised learning, and reinforcement learning.

**Supervised Learning**

Supervised learning is when the ML or DL model maps the input X to the output y. The idea behind supervised learning is that the learning algorithm learns the mapping from input to output by optimization, where the algorithm corrects itself by comparing the model output with the ground truth. This process is repeated until the loss function reaches a minimum, ideally the **global minimum**.
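As a minimal sketch of this loop, the example below fits a straight line with plain gradient descent on the mean squared error. The toy data, learning rate, and step count are assumptions chosen for illustration. For this particular (convex) problem the minimum that gradient descent finds is the global one, which is generally not guaranteed for deep networks.

```python
def fit_line(xs, ys, lr=0.05, steps=2000):
    """Fit y = w * x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Compare the model output with the ground truth.
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(2 * e for e in errors) / n
        # Correct the parameters in the direction that lowers the loss.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]  # generated from y = 2x + 1
w, b = fit_line(xs, ys)
print(f"w = {w:.3f}, b = {b:.3f}")
```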

But at times, even though the optimization reaches a minimum on the training data, the model still does not perform well on new data, which is a sign of overfitting. It turns out that a supervised learning model does not necessarily require a huge amount of data to learn the mapping from input to output; what it needs is good learned features. When learned features are passed into a supervised learning algorithm, they can improve the prediction accuracy by up to 17% [Effective Feature Learning with Unsupervised Learning for Improving the Predictive Models in Massive Open Online Courses].

**Unsupervised Learning**

Unsupervised learning is a type of ML where we don’t care about the labels, only about the observations themselves. It is generally not used for classification or regression. Instead, it is used to find underlying patterns, and for clustering, denoising, outlier detection, decomposition of data, and so on.

When working with data x we have to be very careful as to what features *z* we select so that the patterns extracted reflect the real data. It has been observed that more data will not necessarily provide you with good representations. We need to be careful to design a model that is flexible and also expressive so that the features extracted provide vital information.

When we have correct features then tasks like clustering, decomposition, or anomaly detection can be performed with greater certainty.

**Reinforcement Learning**

Reinforcement learning is another type of ML concerned with developing algorithms that should take actions in an environment in order to maximise chances of success.

Both supervised and reinforcement learning learn a mapping between input and output. In supervised learning, the feedback provided to the model is the correct answer for each input, whereas reinforcement learning uses rewards and punishments as signals for positive and negative behavior. The focus is on finding a balance between exploration (of uncharted territory) and exploitation (of current knowledge).
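One simple way to illustrate the exploration/exploitation balance is an epsilon-greedy agent on a multi-armed bandit. The sketch below is an assumption-laden toy, not a method from the text: the reward means, the noise level, and the value of `epsilon` are all made up for the example.

```python
import random


def run_bandit(true_means, epsilon=0.1, steps=5000, seed=0):
    """Epsilon-greedy agent: explore with probability epsilon, else exploit."""
    rng = random.Random(seed)
    counts = [0] * len(true_means)
    estimates = [0.0] * len(true_means)
    for _ in range(steps):
        if rng.random() < epsilon:
            # Exploration: try a random action.
            arm = rng.randrange(len(true_means))
        else:
            # Exploitation: pick the action with the best estimate so far.
            arm = max(range(len(true_means)), key=lambda a: estimates[a])
        reward = rng.gauss(true_means[arm], 1.0)  # noisy reward signal
        counts[arm] += 1
        # Incremental running mean of the rewards seen for this action.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return counts, estimates


counts, estimates = run_bandit([0.0, 0.5, 2.0])
print("pulls per arm:", counts)
```

The agent never sees the correct action directly. It only sees rewards, and it ends up choosing the most rewarding arm most of the time.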

## Architecture

In this section, we will explore deep learning models – specifically, how representations are extracted inside a given type of neural nets.

**Multilayer perceptron or MLP**

The perceptron [Rosenblatt, 1958, 1962] is the simplest neural unit. It combines a series of inputs with weights into a weighted sum and passes that sum through a threshold to produce an output, which is then compared with the ground truth to adjust the weights.
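A minimal sketch of a single perceptron, trained here on the logical AND function. The task, learning rate, and number of epochs are assumptions chosen for illustration.

```python
def predict(weights, bias, inputs):
    """Weighted sum of the inputs followed by a hard threshold."""
    activation = bias + sum(w * x for w, x in zip(weights, inputs))
    return 1 if activation > 0 else 0


def train_perceptron(data, lr=1.0, epochs=20):
    """Perceptron rule: nudge the weights whenever a prediction is wrong."""
    weights = [0.0, 0.0]
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in data:
            # Compare the output with the ground truth.
            error = target - predict(weights, bias, inputs)
            weights = [w + lr * error * x for w, x in zip(weights, inputs)]
            bias += lr * error
    return weights, bias


and_gate = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(and_gate)
print([predict(weights, bias, x) for x, _ in and_gate])
```

AND is linearly separable, so a single unit is enough. A function such as XOR is not, which is what motivates stacking units into layers.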

MLP, or multi-layer perceptron, is a feed-forward neural network built by stacking layers of perceptron units. An MLP consists of at least three layers of nodes: an input layer, one or more hidden layers, and an output layer. MLP, as a very basic artificial neural network, is sometimes referred to as the vanilla neural network.

The hidden nodes apply non-linear activation functions that enable the network to distinguish and separate data that is not linearly separable. This is where the power of an ANN or MLP comes into play, as the universal approximation theorem states that “*neural networks can represent a wide variety of interesting functions when given appropriate weights. On the other hand, they do not provide a construction for the weights, but merely state such construction is possible*”.
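To make the point about non-linear hidden units concrete, here is a tiny MLP that computes XOR, a function no single perceptron can represent. The weights are set by hand purely for illustration (they are an assumption, not learned), which also mirrors the theorem: suitable weights exist, but nothing here tells us how to find them.

```python
def step(x):
    """Hard-threshold activation: the non-linearity of each unit."""
    return 1 if x > 0 else 0


def xor_mlp(a, b):
    """Two hidden units (OR and AND) feeding one output unit."""
    h_or = step(a + b - 0.5)    # fires if at least one input is 1
    h_and = step(a + b - 1.5)   # fires only if both inputs are 1
    # Output fires when OR is on and AND is off, which is exactly XOR.
    return step(h_or - h_and - 0.5)


outputs = [xor_mlp(a, b) for a in (0, 1) for b in (0, 1)]
print(outputs)
```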

**This concept serves as a building block for representation learning and latent variables.** At the very heart of this theorem is our goal: to find the weights that represent the underlying distribution of the whole data, such that when we apply those weights to unseen data, we get results close to those obtained on the original data.

In a nutshell, ANN helps us to extract useful patterns from a given dataset.

**CNNs**

Convolutional Neural Networks (CNNs) are among the most widely used neural nets, especially when it comes to image processing and computer vision. Convolutional networks are networks that use the convolution operation – a specialized kind of linear operation – in place of general matrix multiplication in at least one of their layers [Ian Goodfellow, Deep learning]. CNNs can be seen as regularized versions of multilayer perceptrons, designed to control overfitting. Natural images have many statistical properties that are invariant to translation.

CNNs achieve invariance to translation, and a degree of robustness to small rotations and distortions, through:

- Locally exploiting the spatial features,
- Shared weights,
- Equivariant representations.

They also take a different approach towards regularization: they take advantage of the **hierarchical pattern** in data and assemble more complex patterns using smaller and simpler patterns. CNNs take those properties into account by sharing weights across the network, which is possible because they exploit the **locality of features**.

CNNs are built to exploit the **spatial features** of the input data – their joint probability distributions, spatial distribution and so on – by enforcing a local connectivity pattern between neurons of adjacent layers. The architecture thus ensures that the learned features return the appropriate response to a spatially local input pattern.

Stacking many such layers yields **non-linear filters** that become increasingly global or shared throughout the network. These filters can work with a larger footprint of vector space so that the network first creates **representations** of small parts of the input, by separating them with the help of non-linear filters, then from them assembles representations of larger areas. But keep in mind that not all the representations are important; we have to **prune** them using the information bottleneck.

We also have to keep in mind that, with parameter sharing, a CNN can reduce the number of unique model parameters quite significantly while still increasing the model size. This sharing of parameters gives the layers a property called equivariance to translation, which means that if the input shifts, the output shifts in the same way. Parameter sharing also helps protect the model from overparameterization and overfitting.
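Equivariance is easy to check on a one-dimensional toy example. The sketch below uses a hand-written `conv1d` (strictly a cross-correlation, as in most deep learning libraries) with an arbitrary three-weight kernel. The signal and kernel values are assumptions chosen for illustration.

```python
def conv1d(signal, kernel):
    """Slide one shared kernel over the signal (no padding, stride 1)."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


def shift_right(signal, steps):
    """Translate the signal to the right, filling with zeros on the left."""
    return [0] * steps + signal[: len(signal) - steps]


signal = [0, 0, 1, 2, 1, 0, 0, 0]
kernel = [1, 0, -1]  # the same 3 weights are reused at every position

out = conv1d(signal, kernel)
out_shifted = conv1d(shift_right(signal, 2), kernel)

print(out)          # response to the original signal
print(out_shifted)  # same response, moved two positions to the right
```

Away from the borders, shifting the input by two positions shifts the output by two positions. The layer also needs only `len(kernel)` weights regardless of how long the signal is, which is the parameter saving described above.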

CNNs provide us with the exact tool to extract useful information, more specifically spatial information or features that can later be pruned and customised according to your needs. Hence the use of CNNs became a common practice for building **autoencoders** and **generative models**.

**Autoencoders**

Representations learned by deep networks are observed to be insensitive to complex noise or discrepancies of the data. To a certain extent, this can be attributed to the architecture. For instance, the use of convolutional layers and max-pooling can be shown to yield insensitivity to transformations.

Autoencoders, therefore, are neural networks that can be trained for the task of representation learning. An autoencoder attempts to copy its input to its output through a combination of an encoder and a decoder. Autoencoders are usually trained with backpropagation, like other feed-forward networks, but they can also be trained using recirculation [Hinton and McClelland, 1988], a learning algorithm based on comparing the activations of the network on the original input to the activations on the reconstructed input [Ian Goodfellow, Deep learning].

Traditionally, these were used for the task of dimensionality reduction or feature learning [LeCun, 1987; Bourlard and Kamp, 1988; Hinton and Zemel, 1994], but more recently it has become clear that autoencoders and latent variable models – models which leverage the concept of prior and posterior distributions, like Variational Autoencoders – can be used for building generative models, meaning models which can generate new data. They achieve this by compressing the information in an information bottleneck such that only important features are extracted from the entire dataset, and those extracted features or representations can be used to generate new data.

In a nutshell, an encoder is a function that reduces the input into different representations and a decoder – which is also a function – transforms learned representation from the encoder back to the original format.

We will be discussing four types of autoencoder:

- Undercomplete autoencoder
- Regularised autoencoder
- Sparse autoencoder
- Denoising autoencoder

**Undercomplete autoencoder**

If the data is simple, copying input to output might work well, but what if the data is extremely complex? In that case, our autoencoder will underfit. What we are interested in is preserving useful information. The model should learn the useful features.

In order to obtain such functionality, our model should be a **combination of shallow networks** – for both the encoder and the decoder – with a bottleneck that is **constrained to a smaller dimension than the original input**.

Autoencoders whose code (bottleneck) dimension is smaller than the input dimension are called **undercomplete autoencoders**. By penalizing the network according to the reconstruction error, the model can learn and capture the most salient features.

We know neural networks are capable of learning non-linear functions, and autoencoders such as this one can be thought of as a non-linear PCA.
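The sketch below shows the undercomplete idea in its simplest form: a linear autoencoder that squeezes 2-D points through a single number and is trained on the reconstruction error. It is a toy under stated assumptions: the data, the starting weights, the learning rate, and the step count are all chosen for illustration, and without non-linear activations the model behaves like PCA rather than a non-linear generalisation of it.

```python
def reconstruct(x, enc, dec):
    """Encode a 2-D point into one number, then decode it back to 2-D."""
    code = enc[0] * x[0] + enc[1] * x[1]  # bottleneck: a single scalar
    return code, [code * dec[0], code * dec[1]]


def mean_squared_error(data, enc, dec):
    total = 0.0
    for x in data:
        _, x_hat = reconstruct(x, enc, dec)
        total += sum((x[i] - x_hat[i]) ** 2 for i in range(2))
    return total / len(data)


def train(data, lr=0.02, steps=2000):
    enc, dec = [0.5, 0.5], [0.5, 0.5]  # arbitrary starting weights
    for _ in range(steps):
        g_enc, g_dec = [0.0, 0.0], [0.0, 0.0]
        for x in data:
            code, x_hat = reconstruct(x, enc, dec)
            err = [x_hat[i] - x[i] for i in range(2)]
            # Error signal flowing back through the decoder to the code.
            back = sum(2 * err[i] * dec[i] for i in range(2))
            for i in range(2):
                g_dec[i] += 2 * err[i] * code / len(data)
                g_enc[i] += back * x[i] / len(data)
        enc = [enc[i] - lr * g_enc[i] for i in range(2)]
        dec = [dec[i] - lr * g_dec[i] for i in range(2)]
    return enc, dec


# 2-D points that really live on a 1-D line (y = 2x).
data = [(t, 2 * t) for t in (-1.0, -0.5, 0.0, 0.5, 1.0)]
loss_before = mean_squared_error(data, [0.5, 0.5], [0.5, 0.5])
enc, dec = train(data)
loss_after = mean_squared_error(data, enc, dec)
print(f"reconstruction error: {loss_before:.4f} -> {loss_after:.6f}")
```

Because the data has only one true degree of freedom, a one-dimensional code is enough to reconstruct it almost perfectly.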

**Regularised autoencoder**

Undercomplete autoencoder can learn salient features when the bottleneck is constrained to have a smaller dimension than the original input. But we must be aware of the capacity of the encoder and decoder models. It has been observed that these autoencoders fail to learn anything useful if both encoder and decoder are given too much capacity [Ian Goodfellow, Deep learning].

Overcomplete autoencoders, whose bottleneck has the same or an even greater dimension than the input, face a similar issue. They are prone to copying the input to the output rather than learning important features.

Ideally, we could train any architecture of autoencoder successfully, choosing the bottleneck dimension and the capacity of the encoder and decoder based on the complexity of distribution.

**Regularised autoencoders** give us the ability to do so. They use a loss function that enables this model to learn useful features, which includes:

- Sparsity of the representation,
- Smallness of the derivative of the representation,
- Robustness to noise or to missing inputs.

One of the special features of regularized autoencoders is that they can be nonlinear and overcomplete, but still learn something useful about the data distribution.

**Sparse autoencoder**

It has been observed that when representations are learnt in a way that encourages sparsity, improved performance is obtained on classification tasks. These methods involve combinations of activation functions, sampling steps and different kinds of penalties [Alireza Makhzani, Brendan Frey — k-Sparse Autoencoders].

A sparse autoencoder is a type of model that has been regularised to respond to unique statistical features. An undercomplete autoencoder will use the entire network for every observation, whereas a sparse autoencoder is forced to selectively activate regions of the network depending on the input data. This limits the network’s capacity to memorise the input data, and since only some of the regions are activated for any given input, the network is pushed to learn useful information and features.

Essentially, there are two ways by which we can impose this sparsity constraint. These terms are:

- L1 regularisation
- KL-divergence

Both involve measuring the hidden layer activations for each training batch, and adding a term to the loss function in order to penalize excessive activations.
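Both penalty terms can be written as small functions of the hidden activations. The sketch below is illustrative only: the function names, the `weight` coefficients, and the target sparsity of 0.05 are assumptions, and the KL version assumes activations in the range (0, 1), as produced by a sigmoid.

```python
import math


def l1_penalty(activations, weight=1e-3):
    """L1 term: grows with the total magnitude of the hidden activations."""
    return weight * sum(abs(a) for a in activations)


def kl_sparsity_penalty(mean_activations, target=0.05, weight=1.0):
    """KL term: compares each unit's average activation over a batch
    with a small target activation level."""
    total = 0.0
    for rho_hat in mean_activations:
        # Clamp to avoid log(0) for units that are always on or always off.
        rho_hat = min(max(rho_hat, 1e-8), 1 - 1e-8)
        total += target * math.log(target / rho_hat)
        total += (1 - target) * math.log((1 - target) / (1 - rho_hat))
    return weight * total


print(l1_penalty([1.0, -2.0, 0.5], weight=1.0))
print(kl_sparsity_penalty([0.05, 0.05]))  # units already at the target level
print(kl_sparsity_penalty([0.5, 0.5]))    # units that fire too often
```

Either value would be added to the reconstruction loss, so units that are active more often than necessary increase the total loss.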

**Denoising autoencoder**

Another approach towards developing a generalised autoencoder is to create a new dataset, say X′, from X, where X′ is a corrupted version of X. With this approach, we build a model that is able to generalise from slightly corrupted input data while still using the uncorrupted data as the target output.

Essentially, our model isn’t able to simply develop a mapping which memorizes the training data because our input and target output are no longer the same. Rather, the model learns a vector field for mapping the input data towards a lower-dimensional manifold. If this manifold accurately describes the natural data, we’ve effectively “canceled out” the added noise.
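Building the training pairs for a denoising autoencoder is straightforward. The sketch below shows one possible corruption scheme, chosen here as an assumption: Gaussian noise on most values plus occasional values zeroed out. The function names and noise parameters are hypothetical.

```python
import random


def corrupt(x, rng, noise_std=0.1, drop_prob=0.2):
    """Return a corrupted copy of x: some values zeroed, the rest jittered."""
    noisy = []
    for value in x:
        if rng.random() < drop_prob:
            noisy.append(0.0)  # simulate a missing input
        else:
            noisy.append(value + rng.gauss(0.0, noise_std))
    return noisy


def make_denoising_pairs(dataset, seed=0):
    """Pair each corrupted input with its clean original as the target."""
    rng = random.Random(seed)
    return [(corrupt(x, rng), x) for x in dataset]


dataset = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
pairs = make_denoising_pairs(dataset)
for noisy, clean in pairs:
    print(noisy, "->", clean)
```

The autoencoder is then trained to map the first element of each pair to the second, so memorising the input is no longer a valid solution.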

**Generative models**

By definition generative modeling is an unsupervised learning task in machine learning that involves automatically discovering and learning the representations or patterns in input data in such a way that the model can be used to **generate** new examples. All these models represent probability distributions over multiple variables in some manner. The distributions that the generative model generates are high-dimensional. For example, in the classical deep learning methodology like classification and regression, we model a one-dimensional output, whereas in generative modelling we model high-dimensional output.

Essentially, we model the **joint distribution** of the data and we don’t have any labels.

Some of the uses of generative models are:

- Density estimation and outlier detection
- Data compression
- Mapping from one domain to another
  - Language translation, text-to-speech
- Planning in model-based reinforcement learning
- Representation learning
- Understanding the data

Types of generative models:

- Autoregressive models
  - RNN & Transformer language models, NADE, PixelCNN, WaveNet
- Latent variable models
  - Tractable: e.g. Invertible / flow-based models (RealNVP, Glow, etc.)
  - Intractable: e.g. Variational Autoencoders
- Implicit models
  - Generative Adversarial Networks (GANs) and variants

**Boltzmann machines**

A Boltzmann machine is a neural network of symmetrically connected nodes that make their own decisions whether to activate or to stay idle. Boltzmann machines were originally introduced as a general approach to learning arbitrary probability distributions over binary vectors [Fahlman et al., 1983; Ackley et al., 1985; Hinton et al., 1984; Hinton and Sejnowski, 1986].

Boltzmann machines use a straightforward stochastic learning algorithm to find features that represent complex patterns in the input data. They only have input (visible) and hidden nodes. There is no output node in this model, which makes it non-deterministic: it doesn’t depend on any type of output.

The image presents ten nodes, all of them inter-connected. They are often referred to as **states**. The red ones represent hidden nodes (h), and blue ones are for visible nodes (v).

Boltzmann Machines have their input nodes connected to one another, and that is what makes them fundamentally different. All these nodes exchange information among themselves and self-generate subsequent data, which is why they are called deep generative models.

Here, visible nodes are what we measure and hidden nodes are what we don’t measure. When we input data, these nodes learn the *parameters, patterns and correlations in the data*. The model is then ready to monitor and flag abnormal behavior based on what it has learnt.

While this learning algorithm is quite slow in networks with many layers of feature detectors, it is fast in networks with a single layer of feature detectors, called **restricted Boltzmann machines** (RBMs).

RBMs are two-layered artificial neural networks with generative capabilities. They can learn a probability distribution over their set of inputs. RBMs were first introduced by Paul Smolensky in 1986 under the name Harmonium and rose to prominence after Geoffrey Hinton and collaborators developed fast learning algorithms for them. They can be used for dimensionality reduction, classification, regression, collaborative filtering, feature learning, and topic modeling.

RBMs are a special class of boltzmann machines and they are restricted in terms of the connections between the visible and the hidden units. This makes it easy to implement them when compared to Boltzmann Machines.

As stated earlier, they are a two-layered neural network (one being the visible layer and the other one being the hidden layer), and these two layers are connected by a fully **bipartite graph**. This means that every node in the visible layer is connected to every node in the hidden layer but no two nodes in the same group are connected to each other. This restriction allows for more efficient training algorithms than what is available for the general class of Boltzmann machines, in particular, the gradient-based contrastive divergence algorithm.
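The bipartite structure shows up directly in the standard RBM energy function and in the conditional distribution of the hidden units. The sketch below uses the usual textbook form for binary units. The variable names (`W` for weights, `a` and `b` for visible and hidden biases) and the numbers are assumptions for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def energy(v, h, W, a, b):
    """E(v, h) = -(a.v + b.h + v^T W h); note there are no v-v or h-h terms."""
    visible_term = sum(a[i] * v[i] for i in range(len(v)))
    hidden_term = sum(b[j] * h[j] for j in range(len(h)))
    interaction = sum(
        v[i] * W[i][j] * h[j] for i in range(len(v)) for j in range(len(h))
    )
    return -(visible_term + hidden_term + interaction)


def hidden_probabilities(v, W, b):
    """p(h_j = 1 | v) for every hidden unit. Because hidden units are not
    connected to each other, each one can be computed independently."""
    return [
        sigmoid(b[j] + sum(v[i] * W[i][j] for i in range(len(v))))
        for j in range(len(b))
    ]


v = [1, 0]              # two visible units
h = [1]                 # one hidden unit
W = [[2.0], [-1.0]]     # W[i][j] links visible unit i to hidden unit j
a = [0.5, 0.5]          # visible biases
b = [-1.0]              # hidden biases

e = energy(v, h, W, a, b)
p_h = hidden_probabilities(v, W, b)
print(e, p_h)
```

This conditional independence is what lets contrastive divergence sample a whole layer in one step instead of one unit at a time.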

A stack of RBMs can be trained layer by layer and then fine-tuned through gradient descent and back-propagation. Such a network is called a **Deep Belief Network**. Although RBMs are occasionally used, most people in the deep-learning community have started replacing them with Generative Adversarial Networks or Variational Autoencoders.

**Invertible models**

Invertible models (also known as normalizing flows) are a class of **tractable** generative models. The key idea is to approximate the data distribution by transforming the prior distribution using an invertible function. Invertible models generate observations by applying an **invertible** and **differentiable** transformation *f_θ(z)* to samples from the prior.

To interpret a representation or prior, we need to pin meaning to parts of the feature encoding. That is, we have to disentangle the high-dimensional feature vector into multiple multidimensional factors. This disentangled mapping should be bijective so that disentangled features must translate back to the original representation.

This essentially means that there is no ambiguity as to which representation generated the given observation, because the function is bijective.

This also means that we can compute the latent variables by inverting the function and applying it to the observed data, and hence exactly recover the latent variables that generated a given observation. This makes the mapping between representations and observations one-to-one and **fully deterministic**.
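The simplest possible invertible model is a one-dimensional affine map, which is enough to show both the exact inversion and the exact density computation. In the sketch below the prior is assumed to be a standard normal and `scale` and `shift` play the role of the parameters θ. All names and values are illustrative assumptions.

```python
import math


def standard_normal_logpdf(z):
    """Log-density of the assumed prior p(z) = N(0, 1)."""
    return -0.5 * (z * z + math.log(2 * math.pi))


def forward(z, scale, shift):
    """Generate an observation: x = f(z) = scale * z + shift."""
    return scale * z + shift


def inverse(x, scale, shift):
    """Recover the latent variable exactly: z = f^{-1}(x)."""
    return (x - shift) / scale


def log_density(x, scale, shift):
    """Change of variables: log p(x) = log p(z) - log |df/dz|."""
    z = inverse(x, scale, shift)
    return standard_normal_logpdf(z) - math.log(abs(scale))


x = forward(0.7, scale=2.0, shift=1.0)
z_recovered = inverse(x, scale=2.0, shift=1.0)
print(x, z_recovered, log_density(1.0, scale=2.0, shift=1.0))
```

Real flow-based models such as RealNVP or Glow chain many invertible layers, but each layer follows the same pattern: a forward map, an exact inverse, and a log-determinant correction.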

Limitations of invertible models:

- The latent space and observation must have the same dimensionality
- The latent variables have to be continuous
- Observations have to be continuous or quantized
- Expressive models like this require a lot of layers, so they are computation and memory expensive
- Rigid Structure: lack of flexibility in model design

**Variational inference**

Our generative model must be more than just a black box that generates samples or makes predictions. We want to find the meaning of the data we’re dealing with, something that is more intrinsic and interpretable. We want to structure the model in a way that captures that interpretability, and not just to trace the latent variables back to observed data to find the correlation among them. We want our model to be flexible to the new data that comes in.

One limitation of tractable models is that they are fully deterministic: each observation maps to exactly one latent representation. This also means that the model is rigid. If the new data fed into it contains an outlier, it might not perform well, which is a major problem with real-world data.

We need to build models that are flexible enough to address this kind of problem. Instead of finding the exact tractable latent variables we want to find the latent variables which lie under an approximate inference distribution.

This type of model falls under the category of **intractable models**.

In an intractable model, the posterior distribution *p(z|x)* is approximated, either by variational inference, as we like to do in neural networks, or by Markov Chain Monte Carlo (MCMC) methods, as is often done in probabilistic graphical models. MCMC tends to be computationally expensive because its samples only approach the exact posterior after many steps, whereas variational inference approximates the posterior with a tractable distribution using an optimization technique [Ian Goodfellow, Deep learning].

We approximate the exact posterior p(z|x) with a variational posterior q (z|x). are the **variational parameters** which we will optimize over to fit the variational posterior to the exact posterior. The variational posterior is sampled from the standard normal distribution of mean and standard deviation being 0 and 1 respectively. The sample is randomly picked thus making the model intractable.

The variational inference model consists of two important components:

- tractable components, which are differentiable and continuous making the model deterministic,
- intractable components, which make the model flexible and probabilistic.

Because of this, we can use a technique known as the **reparameterization trick**: we randomly sample ε from a unit Gaussian, scale it by the latent distribution’s standard deviation σ, and shift it by the latent distribution’s mean μ, giving z = μ + σ · ε. Since μ and σ are deterministic outputs of the model, we can use backpropagation to optimise the parameters and bring the variational posterior close to the exact posterior.

This approach makes the model flexible and we can use any architecture to generate samples.
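The reparameterization trick can be sketched in a few lines of NumPy. This is an illustration only, assuming σ denotes the standard deviation; the values of `mu` and `sigma` are made up, and in a real model they would be produced by a network:

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, sigma, eps):
    """Turn unit-Gaussian noise eps into a sample from N(mu, sigma^2)."""
    return mu + sigma * eps

# Illustrative values; in practice these are outputs of the model.
mu = np.array([1.0, -2.0])
sigma = np.array([0.5, 2.0])

# All randomness lives in eps, which does not depend on mu or sigma.
eps = rng.standard_normal((20000, 2))
z = reparameterize(mu, sigma, eps)

sample_mean = z.mean(axis=0)
sample_std = z.std(axis=0)
print(sample_mean, sample_std)
```

Because `z` is a deterministic function of `mu` and `sigma` once `eps` is fixed, the derivatives are simply dz/dμ = 1 and dz/dσ = ε, which is what lets backpropagation pass through the sampling step.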

**KL Divergence**

Kullback-Leibler divergence or relative entropy is a measure of the difference between two distributions. Very often in Probability and Statistics we’ll replace observed data or a complex distribution with a simpler, approximating distribution. KL Divergence helps us to measure just how much information we lose when we choose an approximation.

The general idea of VI is to take an approximation *q(z)* from a tractable family of distributions, and then make this approximation as close as possible to the true posterior *p(z|x)*. This is usually done by minimizing the Kullback-Leibler (KL) divergence between both distributions, defined as:

$$
\mathrm{KL}\big(q(z)\,\|\,p(z \mid x)\big) = -\,\mathbb{E}_{q(z)}\left[\log \frac{p(z \mid x)}{q(z)}\right]
$$

This reduces inference to an optimization problem [5]. The more similar *q(z)* and *p(z|x)*, the smaller the KL divergence. Note that this quantity is not a distance in a mathematical sense as it is not symmetric if we swap the distributions. Moreover, swapping the distributions in our case would mean we need to take the expectations with respect to *p(z|x)*, which is assumed to be intractable.
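The asymmetry is easy to check numerically. The sketch below uses two made-up discrete distributions and a small helper, `kl_divergence`, that is written here for illustration and assumes every probability is strictly positive:

```python
import numpy as np

def kl_divergence(q, p):
    """KL(q || p) for discrete distributions with strictly positive entries."""
    q = np.asarray(q, dtype=float)
    p = np.asarray(p, dtype=float)
    return float(np.sum(q * np.log(q / p)))

# Two illustrative distributions over three outcomes.
q = [0.5, 0.4, 0.1]
p = [0.8, 0.1, 0.1]

kl_qp = kl_divergence(q, p)   # expectation taken under q
kl_pq = kl_divergence(p, q)   # expectation taken under p

print(kl_qp, kl_pq)
```

The two values differ, and the divergence of a distribution from itself is zero, which matches the description of KL as a measure of difference that is not a true distance.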

Now, the KL divergence still has the intractable posterior *p(z|x)* in the numerator inside the logarithm.

We know the formula for the posterior distribution:

$$
p(z \mid x) = \frac{p(x, z)}{p(x)}
$$

Hence, we can rewrite:

$$
\mathrm{KL}\big(q(z)\,\|\,p(z \mid x)\big) = -\,\mathbb{E}_{q(z)}\left[\log \frac{p(x, z)}{q(z)}\right] + \log p(x) = -F(q) + \log p(x)
$$

The marginal likelihood *log p(x)* can be taken out of the expectation as it does not depend on *z*. The quantity *F(q)* is the so-called Evidence Lower Bound (ELBO). Because the KL divergence is always ≥ 0, *F(q)* is a lower bound on the log evidence *log p(x)*. The closer the ELBO is to the marginal likelihood, the closer the variational approximation will be to the true posterior distribution. The complex inference problem is reduced to a simpler optimization problem.
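To make the relationship concrete, here is a small numerical check on a toy model with three discrete latent states. All the probabilities are invented for illustration, and the ELBO is taken to be E_q[log p(x, z) − log q(z)], the standard definition, which is assumed here:

```python
import numpy as np

# Toy model with a discrete latent variable z in {0, 1, 2}.
prior = np.array([0.5, 0.3, 0.2])        # p(z)
likelihood = np.array([0.9, 0.2, 0.1])   # p(x | z) for one fixed observation x

joint = prior * likelihood               # p(x, z)
evidence = joint.sum()                   # p(x)
posterior = joint / evidence             # p(z | x), tractable only because z is tiny

q = np.array([0.6, 0.3, 0.1])            # an arbitrary variational approximation

elbo = np.sum(q * np.log(joint / q))     # F(q)
kl = np.sum(q * np.log(q / posterior))   # KL(q || p(z | x))
log_evidence = np.log(evidence)

# Choosing q equal to the true posterior makes the bound tight.
elbo_at_posterior = np.sum(posterior * np.log(joint / posterior))

print(elbo, kl, log_evidence)
```

In this sketch `elbo + kl` equals `log_evidence`, so raising the ELBO is the same as shrinking the KL divergence, even though the KL term itself is never needed during optimization.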

**Variational Autoencoders**

A Variational Autoencoder, or VAE [Kingma, 2013; Rezende et al., 2014], is a generative model with continuous latent variables that uses learned approximate inference [Ian Goodfellow, Deep learning]. It is a modified version of an autoencoder.

Autoencoders learn to generate compact representations and reconstruct their inputs well, but aside from a few applications such as denoising, they are fairly limited.

The fundamental problem with autoencoders is that the latent space into which they encode their inputs may not be continuous or allow easy interpolation.

Variational Autoencoders on the other hand have one fundamentally unique property that separates them from vanilla autoencoders, and it is this property that makes them so useful for generative modeling: *their latent spaces are, by design, continuous, allowing easy random sampling and interpolation.*

They achieve this by making the encoder output two vectors rather than a single latent vector: a vector of means, μ, and a vector of standard deviations, σ (these are the quantities the model optimizes). Since we’re assuming that our prior follows a normal distribution, these two vectors describe the mean and standard deviation of the latent state distributions.

Intuitively, the mean vector controls where the encoding of an input should be centered around, while the standard deviation controls the “area”, how much from the mean the encoding can vary.

By constructing our encoder model to output a range of possible values from which we’ll randomly sample to feed into our decoder model, we’re essentially enforcing a continuous, smooth latent space representation. For any sampling of the latent distributions, we’re expecting our decoder model to be able to accurately reconstruct the input. Thus, values which are nearby to one another in latent space should correspond with very similar reconstructions.
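A forward pass through such a model can be sketched with NumPy. Everything here is a simplifying assumption made for illustration: the encoder and decoder are single linear maps with random, untrained weights, and the encoder predicts the log-variance instead of σ directly (a common practical choice, not something stated above):

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, latent_dim = 8, 2

# Hypothetical untrained weights for a linear encoder and decoder.
W_mu = rng.normal(scale=0.1, size=(latent_dim, input_dim))
W_logvar = rng.normal(scale=0.1, size=(latent_dim, input_dim))
W_dec = rng.normal(scale=0.1, size=(input_dim, latent_dim))

def encode(x):
    """Return the mean and log-variance of the latent distribution for x."""
    return W_mu @ x, W_logvar @ x

def sample_latent(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps from a unit Gaussian."""
    sigma = np.exp(0.5 * log_var)
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps

def decode(z):
    """Map a latent sample back to input space."""
    return W_dec @ z

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL(N(mu, sigma^2) || N(0, 1)), summed over latent dimensions."""
    return 0.5 * np.sum(mu**2 + np.exp(log_var) - 1.0 - log_var)

x = rng.standard_normal(input_dim)
mu, log_var = encode(x)
z = sample_latent(mu, log_var, rng)
x_hat = decode(z)
kl = kl_to_standard_normal(mu, log_var)

print(z.shape, x_hat.shape, kl)
```

During training, a reconstruction error between `x` and `x_hat` would be combined with the `kl` term, which pulls each latent distribution toward the standard normal prior and helps keep the latent space smooth.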

What we ideally want are encodings, all of which are as close as possible to each other while still being distinct, allowing smooth interpolation, and enabling the construction of new samples.

This sampling process requires some extra attention.

When training the model, we need to compute the gradient of the final output loss with respect to each parameter in the network using backpropagation. We cannot do this through a random sampling step. VAEs get around this with the **reparameterization trick**. With this technique, the randomness is moved into an auxiliary noise variable, so we can optimize the parameters of the distribution while still being able to sample randomly from it.

The variational autoencoder achieves some great results in generating samples and is among the state-of-the-art approaches to generative modelling. Its main drawback is that the samples generated from the model tend to be blurry at times. One possible cause is an intrinsic effect of maximum likelihood, which minimises D_KL(p_data || p_model). This essentially means that the model will assign high probability to the points that occur in the training set, but also to other points, which may include blurry images [Ian Goodfellow, Deep learning].

**GANs**

Generative adversarial networks, or GANs [Goodfellow et al., 2014], are another type of generative model that leverages a differentiable approach.

Generative adversarial networks are based on game theory: two networks, the generator and the discriminator, compete against each other. The generator G is a directed latent variable model that deterministically generates samples x from z, and the discriminator D is a function whose job is to **distinguish samples from the real dataset and the generator**.

The generator network directly produces samples, while the discriminator network, its adversary, attempts to distinguish between samples drawn from the training data and samples drawn from the generator by returning the probability that x is a real training example rather than a fake sample drawn from the model.

GANs can be compared with reinforcement learning, where the generator receives a reward signal from the discriminator letting it know whether the generated data is accurate or not.

During training, the generator tries to become better at generating real-looking images, while the discriminator trains to better classify those images as fake. The process reaches equilibrium at a point when the discriminator can no longer distinguish real images from fakes.

The generator model takes a fixed-length random vector as input and generates a sample. The vector is drawn randomly from a Gaussian distribution and is used to seed the generative process. After training, points in this multidimensional vector space will correspond to points in the real data, forming a **compressed representation** of the data distribution as latent variables. This process is differentiable and continuous.

At the core, the model discovers underlying features like any other generative model, and then those features can be exploited in different ways to produce new samples of images.

In the case of GANs, the generator model applies meaning to points in a chosen latent space, such that new points drawn from the latent space can be provided to the generator model as input and used to generate new and different output examples.

The discriminator model on the other hand takes an example from the domain as input (real or generated) and predicts a binary class label of real or fake (generated). The real example comes from the training dataset. The generated examples are output by the generator model.

The discriminator is a normal classification model.

During training, the discriminator should recognise real images, so *D(x)* should be close to 1, and it should also recognise the fake images generated from z, so *D(G(z))* should be close to 0.

This is captured by the minimax objective of Goodfellow et al. (2014):

$$
\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\log D(x)\big] + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big]
$$

From this equation, the discriminator wants to maximise the objective while the generator simultaneously wants to minimise it.

As mentioned before, the GAN is a differentiable model, which means that the generator’s parameters can be updated using gradient descent. Meanwhile, the discriminator maximizes the objective using gradient ascent.
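To see how the two players pull in opposite directions, the sketch below estimates the standard GAN objective E[log D(x)] + E[log(1 − D(G(z)))] from Goodfellow et al. (2014) on hand-picked discriminator outputs. No networks are trained here; the probabilities are invented purely to illustrate the two extremes:

```python
import numpy as np

def value_function(d_real, d_fake):
    """Estimate V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))] from D's outputs."""
    d_real = np.asarray(d_real, dtype=float)   # D(x) on real samples
    d_fake = np.asarray(d_fake, dtype=float)   # D(G(z)) on generated samples
    return float(np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake)))

# A confident discriminator: D(x) near 1 and D(G(z)) near 0.
confident = value_function([0.9, 0.95, 0.85], [0.1, 0.05, 0.2])

# A fooled discriminator: it outputs 0.5 for everything.
fooled = value_function([0.5, 0.5, 0.5], [0.5, 0.5, 0.5])

print(confident, fooled)
```

The confident discriminator yields the larger value, which is what the discriminator's gradient ascent pushes toward. The fooled case gives −log 4, the value at the equilibrium where real and generated samples can no longer be told apart.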

After the training process, the discriminator model is discarded as we are interested in the generator.

**{XXX}2vec models**

So far we have been focusing on unstructured data and how representation can be extracted from them. But, performing machine learning on structured data is complicated by the fact that such data does not have a vectorial form. Multiple approaches have emerged to construct vectorial representations of structured data, from kernel and distance approaches to recurrent, recursive, and convolutional neural networks. One of these processes is known as **embedding**.

Embedding is the process of learning a new set of vectors from the input data, so that each original item is expressed through a learned vector. In the context of neural networks, embeddings are *low-dimensional, learned continuous vector representations of discrete variables*.

Neural network embeddings are useful because they can reduce the dimensionality of categorical variables and meaningfully represent categories in the transformed space.
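As a rough sketch of what an embedding layer does, the example below maps four made-up categories to 2-dimensional vectors. The embedding matrix `E` is random here; in a real network it would be learned during training:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative categorical variable with four possible values.
categories = ["red", "green", "blue", "yellow"]
index = {name: i for i, name in enumerate(categories)}

embedding_dim = 2
# One row per category; random here, learned in practice.
E = rng.normal(size=(len(categories), embedding_dim))

def one_hot(i, n):
    """Return a length-n vector with a 1 at position i."""
    v = np.zeros(n)
    v[i] = 1.0
    return v

def embed(name):
    """Look up the embedding vector for a category."""
    return E[index[name]]

via_one_hot = one_hot(index["blue"], len(categories)) @ E
via_lookup = embed("blue")

print(np.allclose(via_one_hot, via_lookup))
```

Multiplying a one-hot vector by `E` selects a single row, so an embedding is equivalent to a linear layer on one-hot input, just stored as a lookup table with far fewer dimensions than categories in realistic settings.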

As mentioned earlier, traditional machine learning has mostly focused on the question of how to solve problems like classification or regression for deterministic problems, with manually engineered data representations [Bengio et al., 2013]. By contrast, representation learning focuses on the challenge of obtaining a vectorial representation in the first place, such that prior tasks become easy to solve [Bengio et al., 2013]. This type of approach is helpful for processing structured data, i.e. sequences, trees, and graphs, where vectorial representations are not immediately available [Hamilton et al., 2017b].

Two domains I’d like to mention where embedding is used quite often are language (natural language processing) and recommendation systems.

The embedding techniques include:

- Matrix Factorisation
- Word2Vec
- xxx2vec

**Matrix factorisation** decomposes a large matrix, such as a user-item interaction matrix, into low-dimensional latent vectors, including item vectors, user vectors, and user preference vectors. It is primarily used in recommendation systems.
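One simple way to illustrate the idea is a truncated SVD of a small, fully observed rating matrix. This is a stand-in chosen for brevity: the ratings are invented, and production recommenders usually learn the factors with methods that cope with missing entries:

```python
import numpy as np

# Hypothetical ratings: rows are users, columns are items.
R = np.array([[5.0, 4.0, 1.0, 1.0],
              [4.0, 5.0, 1.0, 2.0],
              [1.0, 1.0, 5.0, 4.0],
              [2.0, 1.0, 4.0, 5.0]])

k = 2  # number of latent factors to keep

U, s, Vt = np.linalg.svd(R, full_matrices=False)

# Split the singular values evenly between the user and item factors.
user_vectors = U[:, :k] * np.sqrt(s[:k])               # shape (users, k)
item_vectors = (np.sqrt(s[:k])[:, None] * Vt[:k]).T    # shape (items, k)

# A predicted rating is the dot product of a user vector and an item vector.
R_approx = user_vectors @ item_vectors.T
rank_k_error = np.linalg.norm(R - R_approx)

print(np.round(R_approx, 2))
```

Each user and each item is now described by only `k` numbers, and users with similar tastes end up with similar vectors.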

**word2vec** will be familiar to people who work in natural language processing. As the name implies, word2vec encodes words into vectors. For example, the word “diarrhea” might be encoded as [0.4442, 0.11345].

Word2Vec is a model for learning semantic knowledge in an unsupervised manner from a large text corpus, and it is widely used in natural language processing (NLP). Each input word is one-hot encoded and fed into a neural network, which learns an embedding space (latent space) in which semantically similar words lie close together. The resulting vectors characterize the semantic information of words.

In Word2Vec, there are two main models: Skip-Gram and Continuous Bag of Words (CBOW). Intuitively, Skip-Gram predicts the context given an input word, while CBOW predicts the input word given the context.
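The difference between the two models shows up in how training examples are built from a sentence. The helper functions below are written for this illustration and are not part of any word2vec library; they only generate the (input, target) pairs and do not train anything:

```python
def context_window(tokens, i, window):
    """Return the words within `window` positions of tokens[i], excluding it."""
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return left + right

def skipgram_pairs(tokens, window=1):
    """Skip-Gram: the centre word is the input, each context word is a target."""
    return [(tokens[i], c)
            for i in range(len(tokens))
            for c in context_window(tokens, i, window)]

def cbow_pairs(tokens, window=1):
    """CBOW: the context words are the input, the centre word is the target."""
    return [(context_window(tokens, i, window), tokens[i])
            for i in range(len(tokens))]

tokens = "the cat sat".split()
sg = skipgram_pairs(tokens)
cb = cbow_pairs(tokens)

print(sg)
print(cb)
```

With a window of 1, the centre word "cat" produces two Skip-Gram pairs, one for each neighbour, but only a single CBOW example whose input is both neighbours together.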

There are many variants of 2vec also known as xxx2vec:

- Node2vec
- Struc2vec
- Metapath2vec

These variants come under the heading **Representation learning on graph structure**. These algorithms are designed to preserve the information in the graph when it is mapped into the vector space. The neighbor relation is important here: nodes that are close in the network should have vectors that are close to each other, so related nodes cluster together. If we are able to extract vital features or representations in this way, our subsequent tasks become easier and more accurate.
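One common way such methods capture neighbor relations is to sample random walks over the graph and treat each walk like a sentence for a Skip-Gram style model. The sketch below uses plain uniform random walks on a tiny made-up graph; node2vec itself uses biased walks controlled by two parameters, which are left out here for simplicity:

```python
import random

# Hypothetical undirected graph stored as an adjacency list.
graph = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["a", "b", "d"],
    "d": ["c"],
}

def random_walk(graph, start, length, rng):
    """Walk `length` nodes from `start`, picking a uniform random neighbor each step."""
    walk = [start]
    while len(walk) < length:
        neighbours = graph[walk[-1]]
        if not neighbours:
            break  # dead end: stop the walk early
        walk.append(rng.choice(neighbours))
    return walk

rng = random.Random(0)
# Three walks of five nodes starting from every node in the graph.
walks = [random_walk(graph, node, 5, rng) for node in graph for _ in range(3)]

print(walks[0])
```

Nodes that appear together in many walks are treated like words that share a context, so they end up with nearby vectors.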

## Challenges to future progress

Generative modelling is an intuitive method for generating samples when there is a lack of data to work with. Most of the time this happens with medical scans or other data that is very private and should not be made accessible to the public. Using generative models, we can create samples of data, and those can be evaluated by experts in the respective fields. Since all deep learning models rely heavily on the **representation** of the input data, we have to make sure that:

- **Data should be as clean as it can be.** For instance, we must make sure the data is carefully curated by experts in the field, so that the representation extracted is highly effective and generates good samples.
- **Architecture of the model should not be overwhelming or underwhelming.** **For complex data use an expressive model to avoid underfitting.** Similarly, for simple data use a simple model to avoid overfitting.
- Since generative models are not so perfect in terms of producing ultra high definition samples, **different models should be combined to take advantage of their complementary strengths.**

**I hope you learned something new from this article. Thank you for reading!**