Background

Key concepts used in the tutorial

Directed Causal Graphical Models

A directed causal graphical model represents causality in the form of a directed acyclic graph (DAG). This approach assumes that we model the components of the data generating process as a discrete set of variables; each individual variable may be discrete or continuous, and their collective state represents a possible state of the overall data generating process. Further, we assume variables are causes or effects of other variables, and that these cause-effect relationships reflect true causality in the data generating process. The variables in the DAG have a joint probability distribution, and each directed edge induces a conditional probability distribution of the child node given its parent nodes.
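
Formally, the joint distribution over the variables in the DAG factorizes into a product of these conditionals, one per node given its parents:

$$
P(X_1, \dots, X_n) = \prod_{i=1}^{n} P\big(X_i \mid \mathrm{Pa}(X_i)\big)
$$

where $\mathrm{Pa}(X_i)$ denotes the parents of $X_i$ in the graph.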

Variational Autoencoder

An autoencoder is a deep neural network architecture designed to learn a representation of a set of data, typically in an unsupervised manner, by training the network to ignore signal noise. To generate 2D images, we need a generative model that captures the distribution from which the images are generated. Variational autoencoders are directed probabilistic graphical models whose posteriors are approximated by a neural network with an autoencoder-like architecture. The autoencoder architecture comprises an encoder unit, which compresses the large input space into a latent representation, usually of lower dimension than the input space, and a decoder unit, which reconstructs the input space from that latent representation.
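
A minimal sketch of such an encoder/decoder pair is shown below; PyTorch, the layer sizes, and the flattened 28×28 image input are illustrative assumptions rather than details from this tutorial.

```python
import torch.nn as nn

class Autoencoder(nn.Module):
    """Compress a flattened image into a low-dimensional latent vector and reconstruct it."""

    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        # Encoder: reduces the large input space to the latent (bottleneck) domain.
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256),
            nn.ReLU(),
            nn.Linear(256, latent_dim),
        )
        # Decoder: reconstructs the input space from the latent representation.
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256),
            nn.ReLU(),
            nn.Linear(256, input_dim),
            nn.Sigmoid(),  # pixel intensities in [0, 1]
        )

    def forward(self, x):
        z = self.encoder(x)      # latent representation
        return self.decoder(z)   # reconstruction of x
```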

After the encoder, there is a bottleneck layer that holds the latent representation of the data. Usually, this layer is modeled as a standard normal distribution. However, backpropagating through this layer is difficult because sampling from it is stochastic.

Hence, we use the reparameterization trick: we introduce a new random variable ε so that the latent variable z can be rewritten as a deterministic function of the encoder outputs and ε, which allows backpropagation to flow through deterministic nodes. In a variational autoencoder setup, the learned latent distribution can be sampled and, together with the decoder, used to generate new data points, in our case, images.
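
A sketch of the reparameterization step, assuming the encoder outputs a mean and a log-variance for each latent dimension (again written in PyTorch for illustration):

```python
import torch

def reparameterize(mu, logvar):
    """Return z = mu + sigma * eps with eps ~ N(0, I).

    The randomness is isolated in eps, so gradients can flow
    through the deterministic nodes mu and logvar."""
    std = torch.exp(0.5 * logvar)  # sigma = exp(log(sigma^2) / 2)
    eps = torch.randn_like(std)    # noise sampled outside the computation graph
    return mu + eps * std
```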

Statistical Motivation of Variational Inference

This section covers the statistical motivation behind variational autoencoders and how they capture the distribution of the data we are trying to model. Suppose there exists some latent variable z which generates an observation x. In our case, x is the image and can be observed, whereas the latent variable z cannot.

Hence, we need to infer z from x, which amounts to computing the conditional distribution of z given x, p(z|x).
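
By Bayes' rule, this posterior can be written as

$$
p(z \mid x) = \frac{p(x \mid z)\, p(z)}{p(x)}, \qquad p(x) = \int p(x \mid z)\, p(z)\, dz
$$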

Unfortunately, computing the marginal p(x) in the denominator turns out to be intractable, since the integral over z generally has no closed form.

Hence, we use variational inference to approximate the posterior. We do this by approximating p(z|x) with another, tractable distribution q(z|x). If we choose the parameters of q(z|x) so that it is very similar to p(z|x), we can use it to perform approximate inference in place of the intractable posterior. The Kullback-Leibler (KL) divergence measures the difference between two probability distributions, so to make q(z|x) similar to p(z|x) we minimize the KL divergence between them.
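
Concretely, the quantity we minimize is

$$
D_{KL}\big(q(z \mid x) \,\|\, p(z \mid x)\big) = \mathbb{E}_{q(z \mid x)}\left[\log \frac{q(z \mid x)}{p(z \mid x)}\right]
$$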

Minimizing this KL divergence is equivalent to maximizing the following objective.
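
This objective is the evidence lower bound (ELBO):

$$
\mathcal{L} = \mathbb{E}_{q(z \mid x)}\big[\log p(x \mid z)\big] - D_{KL}\big(q(z \mid x)\,\|\,p(z)\big)
$$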

The first term represents the reconstruction likelihood, and the second term ensures that the learned distribution q is similar to the prior distribution p(z).

We can use q to infer the possible latent state that was used to generate an observation. We can construct this model as a neural network architecture in which the encoder learns a mapping from x to z and the decoder learns the mapping from z back to x. The loss function for this network consists of two terms: one penalizes reconstruction error, and the other encourages the learned distribution q(z|x) to be similar to the prior distribution p(z), which we assume follows a unit Gaussian for each dimension j of the latent space.
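
A rough sketch of this loss in code, assuming a PyTorch model whose encoder outputs mu and logvar and whose decoder outputs per-pixel probabilities in [0, 1] (names and reduction choices are illustrative):

```python
import torch
import torch.nn.functional as F

def vae_loss(x, x_recon, mu, logvar):
    """Reconstruction term plus KL term against the unit Gaussian prior p(z) = N(0, I)."""
    # Term 1: reconstruction error between the decoder output and the original image.
    recon = F.binary_cross_entropy(x_recon, x, reduction="sum")
    # Term 2: KL(q(z|x) || N(0, I)) in closed form, summed over latent dimensions j.
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl
```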
