I am trying to understand VAEs in depth by implementing one myself, and I am having difficulty back-propagating the loss from the decoder input layer to the encoder output layer.
My encoder network outputs 8 pairs (sigma, mu), which I then combine with the output of a stochastic sampler to produce the 8 input values (z) for the decoder network:
decoder_in = sigma * N(0,I) + mu
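In code, this sampling step looks roughly like this (a NumPy sketch; the placeholder arrays and all names are mine, standing in for my encoder's actual outputs):

    import numpy as np

    latent_dim = 8
    # placeholders standing in for the encoder's outputs
    mu = np.zeros(latent_dim)
    sigma = np.ones(latent_dim)

    epsilon = np.random.randn(latent_dim)  # one sample from N(0, I), kept for the backward pass
    decoder_in = sigma * epsilon + mu      # the z vector fed into the decoder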
Then I run forward propagation through the decoder network, compute the MSE reconstruction loss, and back-propagate the gradients up to the decoder input layer.
Here I am completely stuck, since I cannot find a comprehensible explanation of how to back-propagate the loss from the decoder input layer to the encoder output layer.
My best idea was to store the samples drawn from N(0,I) as epsilon and use them like this:
L(sigma) = epsilon * dLz(decoder_in)
L(mu) = 1.0 * dLz(decoder_in)
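In code, this idea looks roughly like this (again a NumPy sketch with placeholder values; dL_dz stands for the gradient of the reconstruction loss with respect to decoder_in):

    import numpy as np

    latent_dim = 8
    epsilon = np.random.randn(latent_dim)  # the N(0, I) sample stored during the forward pass

    # gradient of the MSE loss w.r.t. decoder_in, obtained by back-propagating
    # through the decoder (placeholder values here)
    dL_dz = np.ones(latent_dim)

    # chain rule through decoder_in = sigma * epsilon + mu
    grad_sigma = epsilon * dL_dz  # d(decoder_in)/d(sigma) = epsilon
    grad_mu = 1.0 * dL_dz         # d(decoder_in)/d(mu) = 1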
It kind of works, but in the long run the sigma components of the encoded distribution vectors tend to regress to zero, so my VAE effectively degenerates into a plain AE.
Also, I still have no idea how to integrate the KL loss into this scheme. Should I add it to the encoder loss, or somehow combine it with the decoder's MSE loss?
A Variational Autoencoder is an explicit generative model used to generate new sample data resembling past data. VAEs learn a mapping to latent variables that explain the training data and capture its underlying distribution.
The variational autoencoder addresses the non-regularized latent space of a plain autoencoder and provides generative capability over the entire latent space. The encoder in a plain AE outputs latent vectors directly.
The ELBO is a lower bound on the logarithm of the marginal likelihood log p(x; θ) and is constructed by introducing an extra distribution q(z|x). The closer q(z|x) and the posterior p(z|x; θ) are, the tighter the bound. Both the EM algorithm and the VAE iteratively optimize the ELBO.
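In standard notation (my own rendering, not from the text above), the identity behind this statement is:

log p(x; θ) = ELBO(q) + KL( q(z|x) || p(z|x; θ) )

Since the KL term is non-negative, the ELBO never exceeds log p(x; θ), and the bound becomes tight exactly when q(z|x) matches the posterior.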
An autoencoder is made of two connected neural networks: an encoder model and a decoder model. Its goal is to find a way to encode the input (for example, celebrity face images) into a compressed form (the latent space) such that the reconstructed version is as close as possible to the input.
To alleviate the issues present in a vanilla autoencoder (chiefly its non-regularized latent space), we turn to variational autoencoders. The first change a VAE introduces is that instead of mapping each input data point directly to a latent variable, it maps each input data point to a multivariate normal distribution.
In this article, we are going to learn about the “reparameterization” trick that makes Variational Autoencoders (VAE) an eligible candidate for Backpropagation. First, we will discuss Autoencoders briefly and the problems that come with their vanilla variants. Then we will jump straight to the crux of the article — the “reparameterization” trick.
The function of the decoder is to generate an output from the latent vector that is very close to the input. Usually, in training autoencoders, we build these components together instead of building them independently.
General autoencoders are trained using a reconstruction loss, which measures the difference between the reconstructed and original image. Variational autoencoders are trained mostly the same way, except that the bottleneck vector is sampled from a learned normal distribution, which regularizes the latent space and reduces overfitting.
The VAE does not use the reconstruction error alone as its cost objective; if you use only that, the model just turns back into an autoencoder. Instead, the VAE uses the variational lower bound, together with a couple of neat tricks that make it easy to compute.
Referring to the original “Auto-Encoding Variational Bayes” paper:
The variational lower bound objective is (eq 10):
1/2 * sum_j( 1 + log(sigma_j^2) - mu_j^2 - sigma_j^2 ) + log p(x|z)
where j indexes the d latent dimensions, mu and sigma are the outputs of the encoding neural network used to shift and scale the standard normal samples, and z is the encoded sample. p(x|z) is just the decoder's probability of generating back the input x.
Every term in this expression is differentiable with respect to mu and sigma (thanks to the reparameterization), so the objective can be optimized with gradient descent or any other gradient-based optimizer you find in TensorFlow.
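For illustration, a minimal TensorFlow sketch of this objective might look as follows (my own code, not from the original answer, with toy one-layer networks standing in for the real encoder and decoder). Because z is built from mu, sigma and an external epsilon inside the GradientTape, autodiff pushes the gradients back into the encoder automatically:

    import tensorflow as tf

    latent_dim = 8
    encoder = tf.keras.Sequential([tf.keras.layers.Dense(2 * latent_dim)])  # outputs [mu, log(sigma^2)]
    decoder = tf.keras.Sequential([tf.keras.layers.Dense(784)])             # toy decoder

    x = tf.random.normal([32, 784])  # stand-in for a batch of inputs

    with tf.GradientTape() as tape:
        mu, log_var = tf.split(encoder(x), 2, axis=1)
        sigma = tf.exp(0.5 * log_var)
        epsilon = tf.random.normal(tf.shape(mu))
        z = mu + sigma * epsilon                     # reparameterization trick
        x_hat = decoder(z)

        # squared error as a stand-in for -log p(x|z)
        recon = tf.reduce_sum(tf.square(x - x_hat), axis=1)
        # KL(N(mu, sigma^2) || N(0, I)), i.e. the negated first part of eq 10
        kl = -0.5 * tf.reduce_sum(1 + log_var - tf.square(mu) - tf.exp(log_var), axis=1)
        loss = tf.reduce_mean(recon + kl)            # negative of the lower bound

    grads = tape.gradient(loss, encoder.trainable_variables + decoder.trainable_variables)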
From what I understand, the solution should look like this:
L(sigma) = epsilon * dLz(decoder_in) - 0.5 * 2 / sigma + 0.5 * 2 * sigma
L(mu) = 1.0 * dLz(decoder_in) + 0.5 * 2 * mu
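As a concrete sketch of that combined backward step (my own NumPy code, with placeholder values in place of the real decoder gradient):

    import numpy as np

    latent_dim = 8
    mu = np.zeros(latent_dim)
    sigma = np.ones(latent_dim)
    epsilon = np.random.randn(latent_dim)  # stored from the forward pass

    # gradient of the reconstruction loss w.r.t. decoder_in (placeholder here;
    # in the real model it comes from back-propagating MSE through the decoder)
    dL_dz = np.ones(latent_dim)

    # KL(N(mu, sigma^2) || N(0, 1)) = 0.5 * sum(mu^2 + sigma^2 - log(sigma^2) - 1)
    grad_sigma = epsilon * dL_dz + (sigma - 1.0 / sigma)  # reconstruction + d(KL)/d(sigma)
    grad_mu = 1.0 * dL_dz + mu                            # reconstruction + d(KL)/d(mu)

With the KL gradients included, sigma is pulled toward 1 instead of collapsing to 0. This also answers the KL question above: the KL term is simply added to the reconstruction loss in one total objective, and since it depends only on mu and sigma, its gradients flow into the encoder outputs alone.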