Zehui Chen

02 Aug 2019

# Revisiting Variational Autoencoder

Given examples $X$ distributed according to some unknown distribution $P_{gt}(X)$, the goal is to learn a model $P$ which we can sample from, such that $P$ is as similar as possible to $P_{gt}$.

## Drawbacks of existing methods

• require strong assumptions about the structure in the data;
• make severe approximations, leading to suboptimal models;
• rely on computationally expensive inference procedures such as Markov Chain Monte Carlo.

## Maximum Likelihood Estimation

Before starting on VAEs, we would like to revisit the relationship between maximum likelihood and KL divergence. Suppose we are interested in estimating a true distribution $p^*(x)$, for which we introduce a set of candidate distributions $\mathcal{P}_x$. Though we don't have access to the true distribution $p^*(x)$ itself, we do have access to a finite set of samples drawn from it; we denote uniform sampling from this finite dataset as $\hat{p}(x)$. The goal of maximum likelihood estimation is to find the best $p \in \mathcal{P}_x$ to approximate $p^*(x)$, as measured by the Kullback-Leibler divergence:

$\min \limits_{p\in\mathcal{P}_x} D(\hat{p}||p) = \min \limits_{p\in\mathcal{P}_x}\mathbb{E}_{\hat{p}(x)} \displaystyle \left[\ln\frac{\hat{p}(x)}{p(x)}\right] \\ \equiv \max \limits_{p\in \mathcal{P}_x}\mathbb{E}_{\hat{p}(x)}[\ln p(x)]$

From the above, we can see that minimizing the KL divergence is equivalent to maximizing the log-likelihood, since the $\mathbb{E}_{\hat{p}(x)}[\ln \hat{p}(x)]$ term does not depend on $p$ and can be dropped.
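As a quick numerical sanity check, here is a minimal NumPy sketch (assuming, purely for illustration, that $\mathcal{P}_x$ is the family of unit-variance Gaussians $\mathcal{N}(\mu, 1)$; all names are hypothetical): the candidate that maximizes the average log-likelihood over the finite samples is the same one we would pick by minimizing the KL divergence above.

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(loc=1.5, scale=1.0, size=10_000)  # finite samples standing in for p*(x)

def avg_log_likelihood(mu, x):
    # Monte Carlo estimate of E_{p_hat(x)}[ln p(x)] for the candidate N(mu, 1)
    return np.mean(-0.5 * np.log(2 * np.pi) - 0.5 * (x - mu) ** 2)

candidates = np.linspace(-2.0, 4.0, 601)
best = max(candidates, key=lambda mu: avg_log_likelihood(mu, samples))
print(best, samples.mean())  # both are close to the true mean 1.5
```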

## Goal

Maximize the probability of each $X$ in the training set under the entire generative process:

$P(X) = \int P(X|z; \theta)P(z)dz$

• In terms of the choice of prior over $z$, a Gaussian distribution is fine, since any distribution in $d$ dimensions can be generated by taking a set of $d$ normally distributed variables and mapping them through a sufficiently complicated function.

• The second problem is how to maximize $P(X)$. An intuitive idea is to sample points $z_i$ from $P(z)$ and then approximate $P(X)$ with

$P(X) \approx \frac{1}{n}\sum_{i=1}^{n}P(X|z_i)$

However, this is computationally very expensive: in high-dimensional spaces, an extremely large number of samples of $z$ is needed before the estimate becomes accurate. The VAE alters this sampling procedure.
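To see why, here is a minimal NumPy sketch of the naive estimator (a toy, hypothetical decoder stands in for $P(X\lvert z)$): most $z_i$ drawn from the prior contribute almost nothing, so the estimate only stabilizes once $n$ is very large, and the problem gets far worse as the dimensionality grows.

```python
import numpy as np

rng = np.random.default_rng(0)

def p_x_given_z(x, zs):
    # toy decoder: P(X|z) is a unit-variance Gaussian centered at tanh(z)
    sq = np.sum((x - np.tanh(zs)) ** 2, axis=1)
    return np.exp(-0.5 * sq) / (2 * np.pi)  # 2-D Gaussian density

x = np.array([0.9, -0.9])
for n in [100, 10_000, 1_000_000]:
    zs = rng.normal(size=(n, 2))         # z_i ~ P(z) = N(0, I)
    print(n, p_x_given_z(x, zs).mean())  # (1/n) * sum_i P(X|z_i)
```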

## Most Interesting Part

### Setting the objective

For most $z$, $P(X\lvert z)$ contributes almost nothing to our estimate of $P(X)$. The key idea of the VAE is to sample values of $z$ that are likely to have produced $X$ and compute $P(X)$ just from those. More specifically, we need a new function $Q(z\lvert X)$ which takes a value of $X$ and gives us a distribution over $z$. The space of $z$ values that are likely under $Q$ is much smaller than the space of all $z$'s that are likely under the prior $P(z)$, so the expectation becomes much cheaper to estimate. Our task therefore becomes to compute $E_{z\sim Q}P(X\lvert z)$ and relate it to $P(X)$.

The relationship between $E_{z\sim Q}P(X\lvert z)$ and $P(X)$ can be derived starting from the KL divergence, denoted $\mathcal{D}$, between $Q(z)$ and the posterior $P(z\lvert X)$:

$\mathcal{D}[Q(z)||P(z|X)] = E_{z\sim Q}[\log Q(z) - \log P(z|X)]$

By applying Bayes' rule to $P(z\lvert X)$, i.e. $\log P(z|X) = \log P(X|z) + \log P(z) - \log P(X)$, and noting that $\log P(X)$ does not depend on $z$ and so moves outside the expectation:

$\mathcal{D}[Q(z)||P(z|X)] = E_{z\sim Q}[\log Q(z) - \log P(X|z) - \log P(z)] + \log P(X)$

Negating both sides and rearranging (the $\log Q(z) - \log P(z)$ terms combine into a second KL divergence), we get:

$\log P(X) - \mathcal{D}[Q(z)||P(z|X)] = E_{z\sim Q}[\log P(X|z)] - \mathcal{D}[Q(z)||P(z)]$

Since we are interested in inferring $P(X)$, it makes sense to construct a $Q$ which does depend on $X$, and in particular one which makes $\mathcal{D}[Q(z\lvert X)\lvert \lvert P(z\lvert X)]$ small:

$\log P(X) - \mathcal{D}[Q(z|X)||P(z|X)] = E_{z\sim Q}[\log P(X|z)] - \mathcal{D}[Q(z|X)||P(z)]$

We have now reached the core of the variational autoencoder. The left-hand side contains the quantity we want to maximize, $\log P(X)$, minus an error term that becomes small if $Q$ is high-capacity. The right-hand side is something we can optimize through stochastic gradient descent. Let's have a deeper look at the right-hand side.

$E_{z\sim Q}[\log P(X\lvert z)]$ says that $X$ should be well reconstructed from a $z$ sampled from the distribution $Q$, while $\mathcal{D}[Q(z\lvert X)\lvert \lvert P(z)]$ forces $Q$ to stay similar to the prior $P(z)$. If we put these two constraints into an autoencoder framework, it is quite natural to view $P$ as the decoder and $Q$ as the encoder.
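Under this view, one training example contributes the two terms below. A minimal NumPy sketch, assuming a toy linear encoder and decoder with a fixed encoder variance (all names, shapes, and constants are hypothetical; the closed form of the KL term is derived in the next section):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=4)                 # one data point X
W_enc = 0.5 * rng.normal(size=(2, 4))  # toy "encoder" weights
W_dec = 0.5 * rng.normal(size=(4, 2))  # toy "decoder" weights

# encoder Q(z|X): parameters of a Gaussian over z
mu = W_enc @ x
var = 0.01 * np.ones(2)                # fixed variance, for simplicity

# sample z ~ Q(z|X), then decode it into a reconstruction of X
z = mu + np.sqrt(var) * rng.normal(size=2)
x_hat = W_dec @ z

recon = -0.5 * np.sum((x - x_hat) ** 2)  # ~ log P(X|z) for a Gaussian decoder, up to a constant
kl = 0.5 * (var.sum() + mu @ mu - 2 - np.log(var).sum())  # D[Q(z|X)||N(0,I)], k = 2
print(recon - kl)                        # one-sample estimate of the objective to maximize
```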

### Optimizing the objective

First, what form will $Q(z\lvert X)$ take? The usual choice is to say that $Q(z\lvert X)$ follows a Gaussian distribution $\mathcal{N}(z\lvert \mu(X,\theta), \Sigma(X, \theta))$. Since we have already assumed that $P(z)$ is a standard Gaussian, the last term $\mathcal{D}[Q(z\lvert X)\lvert\lvert P(z)]$ is now a KL divergence between two Gaussian distributions, which has a closed form:

$\mathcal{D}[\mathcal{N}(\mu(X), \Sigma(X))||\mathcal{N}(0, I)] \\=\frac{1}{2}\left({\rm tr}(\Sigma(X))+\mu(X)^T\mu(X) - k - \log\det(\Sigma(X))\right)$

where $k$ is the dimensionality of the distribution.
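For the common diagonal-covariance case, this closed form is only a few lines of NumPy. A minimal sketch (the function name is hypothetical; `var` holds the diagonal of $\Sigma(X)$):

```python
import numpy as np

def kl_to_standard_normal(mu, var):
    # D[N(mu, diag(var)) || N(0, I)]:
    # tr(Sigma) = var.sum(), mu^T mu = mu @ mu,
    # k = len(mu), log det(Sigma) = log(var).sum()
    return 0.5 * (var.sum() + mu @ mu - len(mu) - np.log(var).sum())

print(kl_to_standard_normal(np.zeros(3), np.ones(3)))                     # 0.0: identical Gaussians
print(kl_to_standard_normal(np.array([1.0, 0.0]), np.array([0.5, 2.0])))  # 0.75
```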

Suppose we are doing stochastic gradient descent over different values of $X$ sampled from a dataset $D$; the full equation can then be rewritten as:

$E_{X\sim D}[\log P(X) - \mathcal{D}[Q(z\lvert X)\lvert \lvert P(z \lvert X)]] = \\ E_{X\sim D}[E_{z\sim Q}[\log P(X\lvert z)] - \mathcal{D}[Q(z\lvert X)\lvert \lvert P(z)]]$

If we try to compute the gradient of the right-hand side, we run into a problem: the expectation over $Q(z\lvert X)$ depends on the parameters of $Q$ through the sampling of $z$. We can sample $z$ for forward propagation, but sampling is a non-continuous operation with no gradient, so errors cannot be backpropagated through it. To tackle this issue, the "reparameterization trick" was proposed: move the sampling to an input layer. Given $\mu(X)$ and $\Sigma(X)$, we can sample from $\mathcal{N}(\mu(X), \Sigma(X))$ by first sampling $\epsilon \sim \mathcal{N}(0, I)$, then computing $z = \mu(X)+\Sigma^{1/2}(X)\cdot \epsilon$.
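A minimal NumPy sketch of the trick (names hypothetical, a 2-D latent, and a Cholesky factor used as the matrix square root): both routes produce the same distribution, but in the second one $z$ is a deterministic, differentiable function of $\mu(X)$ and $\Sigma(X)$, so gradients can flow back through it.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 0.5]])
L = np.linalg.cholesky(Sigma)  # a valid square root: Sigma = L @ L.T

# direct sampling: gradients cannot flow into mu and Sigma
direct = rng.multivariate_normal(mu, Sigma, size=100_000)

# reparameterized: z = mu + Sigma^{1/2} eps with eps ~ N(0, I)
eps = rng.normal(size=(100_000, 2))
reparam = mu + eps @ L.T

print(direct.mean(axis=0), reparam.mean(axis=0))  # both close to mu
print(np.cov(direct.T), np.cov(reparam.T))        # both close to Sigma
```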

Figure: A training-time variational autoencoder implemented as a feedforward neural network, where $P(X\lvert z)$ is Gaussian. Left is without the “reparameterization trick”, and right is with it. Red shows sampling operations that are non-differentiable. Blue shows loss layers. The feedforward behavior of these networks is identical, but backpropagation can be applied only to the right network.

