Revisiting Variational Autoencoder
Given examples X distributed according to some unknown distribution $P_{gt}(X)$, the goal is to learn a model $P$ which we can sample from, such that $P$ is as similiar as possible to $P_{gt}$.
Drawbacks of existed methods
 require strong assumptions about the structure in the data
 make severe approximations, leading to suboptimal models
 rely on computationally expensive inference procedures like Markov Chain Monte Carlo.
Maximum Likelihood Estimation
Beforer starting VAE, we would like to revisit the relationship between maximum likelihood and KL divergence. So, if we are interested in estimating a true distribution $p(x)$, we introduce a set of candidate distributions $\mathcal{P}_x$. Hence, though we don’t have an access to the true distirbution $p^*(x)$, we do have the access to finite samples from $\mathcal{P}_x$. We denote uniform sampling from this finite dataset as $\hat{p}(x)$. The goal of maximum likelihood estimation is to find the best $p \in \mathcal{P}_x$ to approximate $p^(x)$ as measured by the KullbackLeibler divergence,
\[\min \limits_{p\in\mathcal{P}_x} D(\hat{p}p_g) = \min \limits_{p\in\mathcal{P}_x}\mathbb{E}_{\hat{p}(x)} \displaystyle \left[{\rm In}\frac{\hat{p}(x)}{p(x)}\right] \\ \equiv \max_{p\in \mathcal{P}_x}\mathbb{E}_{\hat{p}(x)}[{\rm In} p(x)]\]From the above, we can find that mimizing the KL divergence is equalvalent to maximizing the log likelihood.
Goal
Maximize the probability of each X in the training set under the entire generative process,
\[P(X) = \int P(Xz; \theta)P(z)dz\]
In terms of the shape of $z$, Gaussian distribution is ok since any distribution in $d$ dimension can be generated by taking a set of $d$ variables that are normally distributed and mapping them through a sufficiently complicated function.

The second problem is to maximize $P(X)$. An intuitive idea is sample points $z_i$ from $P(z)$ and then to approximate $P(X)$ with
\[\sum_{i=0}^{n}P(Xz_i)\]However, this is highly compuationally. VAE alters the sampling procedure.
Most Interesting Part
Setting the objective
For most $z$, $P(X\lvert z)$ provide no information to our estimate of $P(X)$. The key idea of VAE is to sample values of $z$ that are likely to have produces $X$ and compute $P(X)$ just from those. More specially, we need a new function $Q(z\lvert X)$ which can take a value of $X$ and give us a distribution over $z$. Since the space of $z$ values that are likely under $Q$ is much smaller than the space of all $z$ s that are under the prior $P(z)$. Therefore, our task becomes to compute $E_{z \in Q}P(X\lvert z)$ and relate it with $P(X)$.
The relationship between $E_{z\in Q}P(X\lvert z)$ and $P(X)$ can be measured by KL divergence or $\mathcal{D}$.
\[\mathcal{D}[Q(z)P(zX)] = E_{z\in Q}[{\rm log}Q(z)  {\rm log}P(zX)]\]By applying Bayes rule to $P(z\lvert X)$:
\[\mathcal{D}[Q(z)P(zX)] = E_{z\in Q}[{\rm log}Q(z)  {\rm log}P(Xz)  {\rm log}P(z)] + {\rm log}P(X)\]Negating both sides, rearranging, we can get:
\[{\rm log}P(X)  \mathcal{D}[Q(z)P(zX)] = E_{z\in Q}[{\rm log}P(Xz)]  \mathcal{D}[Q(z)P(z)]\]Since we are interested in infering $P(X)$, it makes sense to construct a $Q$ which does depend on $X$, and in particular, one which makes $\mathcal{D}[Q(z)\lvert \lvert P(z\lvert X)] $ small:
\[{\rm log}P(X)  \mathcal{D}[Q(zX)P(zX)] = E_{z\in Q}[{\rm log}P(Xz)]  \mathcal{D}[Q(zX)P(z)]\]Until now, we reach the core of the variational autoencoder. The left term have the quantity we want to maximize: ${\rm log}P(X)$ plus an error term. The right hand is something that we can optimize through stochastic gradient descent. Let have a deeper look at the rigth hand term.
$E_{z\in Q}[{\rm log}P(X\lvert z)]$ enforces that we want the conditional X can be well reconstructed with sampled $z$, besides, $z$ was generated or sampled from distribution $Q$. And $\mathcal{D}[Q(z\lvert X)\lvert \lvert P(z)]$ is to force the Q to be similar to P. If we put these constraints into an autoencoder framework, it is quite natural to view $P$ as decoder and $Q$ as encoder.
Optimizing the objective
Firstly, what the form that $Q(z\lvert X)$ will take? The usual choice is to say that $Q(z\lvert X)$ follows an Gaussian distribution $\mathcal{N}(z\lvert \mu(X,\theta), \sum(X, \theta))$. Since we have already assumed that $P(z)$ can be viewed as an Gaussian distribution, the last term $\mathcal{D}[Q(z\lvert X)\lvert\lvert P(z)]$ is now a KLdivergence between two Gaussian distribution.
\[\mathcal{D}[\mathcal{N}(\mu(X), \Sigma(X))\mathcal{N}(0, I)] \\=\frac{1}{2}(tr(\Sigma(X))+(\mu(X))^T(\mu(X))  k  {\rm log}det(\Sigma(X)))\]where $k$ is the dimensionally of the distribution.
Suppose we are doing stochastic gradient descent over different values of $X$ sampled from a dataset $D$, the full equation here can be rewrited as:
\[E_{X\in D}[{\rm log}P(X)  \mathcal{D}[Q(z\lvert X)\lvert \lvert P(z \lvert X)]] = \\ E_{X\in D}[E_{z\in Q}[{\rm logP(X\lvert z)}]  \mathcal{D}[Q(z\lvert X)\lvert \lvert P(Z)]]\]If we compute the gradient of the right term, $Q(z\lvert X)$ is untractable. Though we may sample $z$ for forward propagation but this sampling can not propogate back, which is a noncontinuous operation and has no gradient. To tackle with this issue, the solution, called “reparameterization trick” was proposed to move the sampling into an input layer. Given $\mu(X)$ and $\Sigma(X)$, we can sample from $\mathcal{N}(\mu(X), \Sigma(X))$ by first sampling $\epsilon$ from $\mathcal{N}(0, I)$, then computing $z = \mu(X)+\Sigma^{1/2}(X)\cdot \epsilon$.
Reference
[1] Tutorial on Variational Autoencoders(https://arxiv.org/abs/1606.05908)
[2] DENSITY ESTIMATION: VARIATIONAL AUTOENCODERS (http://ruishu.io/2018/03/14/vae/)
End of Post
at 08:52