Keras - Variational Autoencoder NaN loss

I'm trying to use the Variational Autoencoder implementation that I found among the Keras examples (https://github.com/keras-team/keras/blob/master/examples/variational_autoencoder.py).

I just refactored the code so that it is easier to use from a Jupyter notebook (my code: https://github.com/matbell/Autoencoders/blob/master/models/vae.py).

However, when I try to fit the model on my data I get the following output:

Autoencoders/models/vae.py:69: UserWarning: Output "dense_5" missing from loss dictionary. We assume this was done on purpose, and we will not be expecting any data to be passed to "dense_5" during training.
self.vae.compile(optimizer='rmsprop')

Train on 15474 samples, validate on 3869 samples
Epoch 1/50
15474/15474 [==============================] - 1s 76us/step - loss: nan - val_loss: nan
Epoch 2/50
15474/15474 [==============================] - 1s 65us/step - loss: nan - val_loss: nan
Epoch 3/50
15474/15474 [==============================] - 1s 69us/step - loss: nan - val_loss: nan
Epoch 4/50
15474/15474 [==============================] - 1s 62us/step - loss: nan - val_loss: nan

and the loss stays NaN for all the training epochs.

I'm not an expert in deep learning and neural networks, so maybe I'm missing something.

This is the input data, where data and labels are two pandas.DataFrames:

In: data.shape
Out: (19343, 87)

In: label.shape
Out: (19343, 1)

And this is how I use the Vae class (from my code) in a Jupyter notebook:

INPUT_SIZE = len(data.columns)
X_train, X_test, y_train, y_test = train_test_split(data, labels, test_size = 0.2)

vae = Vae(INPUT_SIZE, intermediate_dim=32)
vae.fit(X_train, X_test)

Thanks for any help!

asked Apr 03 '18 by Mattia Campana

1 Answer

You might want to initialize your log_var dense layer to zeros. I was having problems with this myself (slightly different code, but effectively doing the same thing), and it turned out that, however small the variance weights were initialized to, they would explode within just a few rounds of SGD.

The random correlations between epsilon ~ N(0, 1) and the reconstruction error are enough to gently nudge the weights away from zero.

Edit - also, the exponential wrapping the log-variance really helps the gradients explode. Initializing the weights to zero gives an initial variance of one, because of the exponential. Initializing them to a very negative value, while giving an initial variance close to zero, makes the gradient enormous on the very first steps. Zero gave me the best results.
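
For reference, here is a minimal sketch of what that looks like with the Keras functional API. This is not the asker's Vae class; the input and intermediate sizes are taken from the question, and the layer names, latent size, and everything else are assumptions:

# Minimal encoder sketch with a zero-initialized log-variance layer (assumed names/sizes).
from keras import backend as K
from keras.layers import Input, Dense, Lambda
from keras.models import Model

original_dim = 87        # number of input features (data.shape[1] in the question)
intermediate_dim = 32
latent_dim = 2            # assumption, not from the question

x = Input(shape=(original_dim,))
h = Dense(intermediate_dim, activation='relu')(x)

z_mean = Dense(latent_dim)(h)
# Zero-initialized kernel and bias => z_log_var starts at 0, so the initial
# variance is exp(0) = 1, which keeps the sampled noise and the KL term tame.
z_log_var = Dense(latent_dim,
                  kernel_initializer='zeros',
                  bias_initializer='zeros')(h)

def sampling(args):
    # Reparameterization trick: z = mu + sigma * epsilon, epsilon ~ N(0, 1)
    z_mean, z_log_var = args
    epsilon = K.random_normal(shape=(K.shape(z_mean)[0], latent_dim))
    return z_mean + K.exp(0.5 * z_log_var) * epsilon

z = Lambda(sampling)([z_mean, z_log_var])
encoder = Model(x, [z_mean, z_log_var, z])

The only change relevant to the NaN issue is the two 'zeros' initializers on the z_log_var layer; the rest mirrors the standard Keras VAE example.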

answered Sep 26 '22 by Monstah