 

NaN loss when training regression network

I have a data matrix in "one-hot encoding" (all ones and zeros) with 260,000 rows and 35 columns. I am using Keras to train a simple neural network to predict a continuous variable. The code to make the network is the following:

from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout
from keras.optimizers import SGD
from keras.callbacks import EarlyStopping

model = Sequential()
model.add(Dense(1024, input_shape=(n_train,)))
model.add(Activation('relu'))
model.add(Dropout(0.1))

model.add(Dense(512))
model.add(Activation('relu'))
model.add(Dropout(0.1))

model.add(Dense(256))
model.add(Activation('relu'))
model.add(Dropout(0.1))
model.add(Dense(1))

sgd = SGD(lr=0.01, nesterov=True)
#rms = RMSprop()
#model.compile(loss='categorical_crossentropy', optimizer=rms, metrics=['accuracy'])
model.compile(loss='mean_absolute_error', optimizer=sgd)
model.fit(X_train, Y_train, batch_size=32, nb_epoch=3, verbose=1,
          validation_data=(X_test, Y_test),
          callbacks=[EarlyStopping(monitor='val_loss', patience=4)])

However, during the training process I see the loss decrease nicely, but in the middle of the second epoch it goes to nan:

Train on 260000 samples, validate on 64905 samples
Epoch 1/3
260000/260000 [==============================] - 254s - loss: 16.2775 - val_loss: 13.4925
Epoch 2/3
 88448/260000 [=========>....................] - ETA: 161s - loss: nan

I tried using RMSProp instead of SGD, tanh instead of relu, and training with and without dropout, all to no avail. I tried a smaller model, i.e. with only one hidden layer, and got the same issue (it becomes nan at a different point). However, it does work with fewer features, i.e. if there are only 5 columns, and gives quite good predictions. There seems to be some kind of overflow, but I can't imagine why; the loss is not unreasonably large at all.

Python version 2.7.11, running on a Linux machine, CPU only. I tested it with the latest version of Theano and I also get NaNs, so I tried going back to Theano 0.8.2 and have the same problem. The latest version of Keras has the same problem, and so does the 0.3.2 version.

Asked May 14 '16 by The_Anomaly

People also ask

What causes loss NaN?

A faulty loss function. Sometimes the computation of the loss in the loss layers causes NaNs to appear, for example, feeding an InfogainLoss layer with non-normalized values, or using a custom loss layer with bugs.

Why is my validation loss NaN?

Values that have not been calculated at a specific iteration are represented by NaN, so you need to check the iterations that are multiples of your validation frequency; those should have a value different from NaN.

How do neural networks reduce training losses?

Solutions to this are to decrease your network size or to increase dropout; for example, you could try a dropout of 0.5 and so on. If your training and validation losses are about equal, then your model is underfitting; increase the size of your model (either the number of layers or the raw number of neurons per layer).

Can training loss be 0?

The objective of this work is to make the training loss float around a small constant value so that training loss never approaches zero.


4 Answers

The answer by 1'' is quite good. However, all of the fixes seem to address the issue indirectly rather than directly. I would recommend using gradient clipping, which will clip any gradients that are above a certain value.

In Keras you can use clipnorm=1 (see https://keras.io/optimizers/) to simply clip all gradients with a norm above 1.
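For instance, with the SGD optimizer from the question (a minimal sketch; the learning rate and other arguments are simply carried over from the question's code):

from keras.optimizers import SGD

# clipnorm rescales any gradient whose L2 norm exceeds 1 back down to norm 1
sgd = SGD(lr=0.01, nesterov=True, clipnorm=1.)
model.compile(loss='mean_absolute_error', optimizer=sgd)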

Answered Oct 19 '22 by pir


Regression with neural networks is hard to get working because the output is unbounded, so you are especially prone to the exploding gradients problem (the likely cause of the nans).

Historically, one key solution to exploding gradients was to reduce the learning rate, but with the advent of per-parameter adaptive learning rate algorithms like Adam, you no longer need to set a learning rate to get good performance. There is very little reason to use SGD with momentum anymore unless you're a neural network fiend and know how to tune the learning schedule.

Here are some things you could potentially try:

  1. Normalize your outputs by quantile normalizing or z-scoring. To be rigorous, compute this transformation on the training data, not on the entire dataset. For example, with quantile normalization, if an example is in the 60th percentile of the training set, it gets a value of 0.6. (You can also shift the quantile-normalized values down by 0.5 so that the 0th percentile is -0.5 and the 100th percentile is +0.5.) See the sketch after this list for a z-scoring example.

  2. Add regularization, either by increasing the dropout rate or adding L1 and L2 penalties to the weights. L1 regularization is analogous to feature selection, and since you said that reducing the number of features to 5 gives good performance, L1 may help too.

  3. If these still don't help, reduce the size of your network. This is not always the best idea since it can harm performance, but in your case you have a large number of first-layer neurons (1024) relative to input features (35) so it may help.

  4. Increase the batch size from 32 to 128. 128 is fairly standard and could potentially increase the stability of the optimization.
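Putting points 1 and 4 together with the Adam suggestion, a minimal sketch could look like the following (it reuses the variable names from the question; y_mean and y_std are illustrative names, and the old nb_epoch argument from the question's Keras version is assumed):

from keras.optimizers import Adam

# Z-score the targets using statistics computed on the training set only
y_mean = Y_train.mean()
y_std = Y_train.std()
Y_train_z = (Y_train - y_mean) / y_std
Y_test_z = (Y_test - y_mean) / y_std

model.compile(loss='mean_absolute_error', optimizer=Adam())
model.fit(X_train, Y_train_z, batch_size=128, nb_epoch=3, verbose=1,
          validation_data=(X_test, Y_test_z))

# Predictions come back on the z-scored scale, so undo the transform:
# predictions = model.predict(X_test) * y_std + y_mean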

Answered Oct 18 '22 by 1''


I faced the same problem before. I searched and found this question and its answers. All the tricks mentioned above are important for training a deep neural network. I tried them all, but still got NaN.

I also found this issue: https://github.com/fchollet/keras/issues/2134. I cite the author's summary as follows:

I wanted to point this out so that it's archived for others who may experience this problem in the future. I was running into my loss function suddenly returning a nan after it got so far into the training process. I checked the relus, the optimizer, the loss function, my dropout in accordance with the relus, the size of my network and the shape of the network. I was still getting loss that eventually turned into a nan and I was getting quite frustrated.

Then it dawned on me. I may have some bad input. It turns out, one of the images that I was handing to my CNN (and doing mean normalization on) was nothing but 0's. I wasn't checking for this case when I subtracted the mean and normalized by the std deviation and thus I ended up with an exemplar matrix which was nothing but nan's. Once I fixed my normalization function, my network now trains perfectly.

I agree with the above viewpoint: the network is sensitive to its input. In my case, I use the log value of a density estimation as input. Its absolute value can be very large, which may result in NaN after several gradient steps. I think an input check is necessary: first, make sure the input does not include -inf or inf, or numbers that are extremely large in absolute value.
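A minimal sketch of such a check with NumPy (the check_inputs helper and the 1e6 threshold are just illustrative choices, not from the original answer):

import numpy as np

def check_inputs(X, name='X', max_abs=1e6):
    """Warn about non-finite or extremely large entries in X."""
    finite = np.isfinite(X)
    if not finite.all():
        print('%s contains NaN or +/-inf values' % name)
    # Very large magnitudes can overflow or destabilize the gradients
    if np.abs(X[finite]).max() > max_abs:
        print('%s contains values larger than %g in absolute value' % (name, max_abs))

check_inputs(X_train, 'X_train')
check_inputs(Y_train, 'Y_train')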

Answered Oct 18 '22 by HenryZhao


I faced the same problem when using an LSTM. The problem was that my data had some NaN values after standardization, so we should check the model's input data after standardization; the following will show whether you have NaN values:

import numpy as np

# True means there is at least one NaN in the array
print(np.any(np.isnan(X_test)))
print(np.any(np.isnan(y_test)))

You can solve this by adding a small value (0.000001) to the standard deviation, like this:

def standardize(train, test):
    # Compute the statistics on the training set only
    mean = np.mean(train, axis=0)
    # The small constant keeps constant columns (std == 0) from producing NaN
    std = np.std(train, axis=0) + 0.000001

    X_train = (train - mean) / std
    X_test = (test - mean) / std
    return X_train, X_test
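
For example (the raw-array names here are illustrative):

X_train, X_test = standardize(train_raw, test_raw)
print(np.any(np.isnan(X_train)), np.any(np.isnan(X_test)))  # both should now be False
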
Answered Oct 19 '22 by javac