I am running the example code for a Bayesian neural network implemented using TensorFlow Probability.
My question is about the implementation of the ELBO loss used for variational inference. The ELBO loss is the sum of two terms, neg_log_likelihood and kl, as implemented in the code. I have difficulty understanding the implementation of the kl term.
Here is how the model is defined:
with tf.name_scope("bayesian_neural_net", values=[images]):
  neural_net = tf.keras.Sequential()
  for units in FLAGS.layer_sizes:
    layer = tfp.layers.DenseFlipout(units, activation=FLAGS.activation)
    neural_net.add(layer)
  neural_net.add(tfp.layers.DenseFlipout(10))
  logits = neural_net(images)
  labels_distribution = tfd.Categorical(logits=logits)
Here is how the kl term is defined:
kl = sum(neural_net.losses) / mnist_data.train.num_examples
I am not sure what neural_net.losses returns here, since there is no loss function defined for neural_net. Clearly, neural_net.losses returns some values, but I don't know what they mean.
My guess is the L2 norm, but I am not sure. If that is the case, we are still missing something. According to Appendix B of the VAE paper, the authors derived the KL term when the prior is standard normal: it turns out to be close to an L2 norm of the variational parameters, except for additional log-variance terms and a constant. Any comments on this?
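For reference, writing $\mu_j$ and $\sigma_j$ for the variational mean and standard deviation of each weight (my notation), the closed-form KL from Appendix B, for a diagonal Gaussian posterior and a standard normal prior, is

$$ \mathrm{KL}\big(\mathcal{N}(\mu, \operatorname{diag}(\sigma^2)) \,\big\|\, \mathcal{N}(0, I)\big) = \frac{1}{2}\sum_j \left(\mu_j^2 + \sigma_j^2 - \log\sigma_j^2 - 1\right), $$

i.e. the squared L2 norm of the means plus the extra variance, log-variance, and constant terms mentioned above.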
The losses attribute of a TensorFlow Keras Layer represents side-effect computation such as regularizer penalties. Unlike regularizer penalties on specific TensorFlow variables, here the losses represent the KL divergence computation. Check out the implementation here as well as the docstring's example:
We illustrate a Bayesian neural network with variational inference, assuming a dataset of features and labels.

import tensorflow_probability as tfp

model = tf.keras.Sequential([
    tfp.layers.DenseFlipout(512, activation=tf.nn.relu),
    tfp.layers.DenseFlipout(10),
])
logits = model(features)
neg_log_likelihood = tf.nn.softmax_cross_entropy_with_logits(
    labels=labels, logits=logits)
kl = sum(model.losses)
loss = neg_log_likelihood + kl
train_op = tf.train.AdamOptimizer().minimize(loss)
It uses the Flipout gradient estimator to minimize the Kullback-Leibler divergence up to a constant, also known as the negative Evidence Lower Bound. It consists of the sum of two terms: the expected negative log-likelihood, which we approximate via Monte Carlo; and the KL divergence, which is added via regularizer terms which are arguments to the layer.
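To connect this back to the question's code: each DenseFlipout layer registers one such KL tensor in neural_net.losses (one per weight posterior, with the default kernel prior/posterior arguments), and the sum is divided by the number of training examples so that it is on the same per-example scale as the minibatch-averaged negative log-likelihood. A rough sketch, reusing the names from the question's snippet (labels is assumed to be the integer class labels for the batch):

# Sketch only: inspect the KL terms the Flipout layers registered and combine
# them with the data term, as in the question's code.
print(neural_net.losses)  # list of tensors, one KL(q(W) || p(W)) per DenseFlipout layer

neg_log_likelihood = -tf.reduce_mean(labels_distribution.log_prob(labels))
kl = sum(neural_net.losses) / mnist_data.train.num_examples  # rescale KL to per-example
elbo_loss = neg_log_likelihood + kl  # negative ELBO, up to a constant
train_op = tf.train.AdamOptimizer().minimize(elbo_loss)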