I've noticed that a frequent occurrence during training is NAN
s being introduced.
Often times it seems to be introduced by weights in inner-product/fully-connected or convolution layers blowing up.
Is this occurring because the gradient computation is blowing up? Or is it because of weight initialization (if so, why does weight initialization have this effect)? Or is it likely caused by the nature of the input data?
The overarching question here is simply: What is the most common reason for NANs to occurring during training? And secondly, what are some methods for combatting this (and why do they work)?
This answer is not about a cause for nan
s, but rather proposes a way to help debug it.
You can have this python layer:
class checkFiniteLayer(caffe.Layer):
def setup(self, bottom, top):
self.prefix = self.param_str
def reshape(self, bottom, top):
pass
def forward(self, bottom, top):
for i in xrange(len(bottom)):
isbad = np.sum(1-np.isfinite(bottom[i].data[...]))
if isbad>0:
raise Exception("checkFiniteLayer: %s forward pass bottom %d has %.2f%% non-finite elements" %
(self.prefix,i,100*float(isbad)/bottom[i].count))
def backward(self, top, propagate_down, bottom):
for i in xrange(len(top)):
if not propagate_down[i]:
continue
isf = np.sum(1-np.isfinite(top[i].diff[...]))
if isf>0:
raise Exception("checkFiniteLayer: %s backward pass top %d has %.2f%% non-finite elements" %
(self.prefix,i,100*float(isf)/top[i].count))
Adding this layer into your train_val.prototxt
at certain points you suspect may cause trouble:
layer {
type: "Python"
name: "check_loss"
bottom: "fc2"
top: "fc2" # "in-place" layer
python_param {
module: "/path/to/python/file/check_finite_layer.py" # must be in $PYTHONPATH
layer: "checkFiniteLayer"
param_str: "prefix-check_loss" # string for printouts
}
}
In my case, not setting the bias in the convolution/deconvolution layers was the cause.
Solution: add the following to the convolution layer parameters.
bias_filler {
type: "constant"
value: 0
}
learning_rate is high and should be decreased The accuracy in the RNN code was nan, with select the low value for learning rate it fixes
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With