When training runs into trouble (nans, a loss that does not converge, etc.), it is sometimes useful to look at a more verbose training log by setting debug_info: true in the 'solver.prototxt' file.
The training log then looks something like:
I1109 ...] [Forward] Layer data, top blob data data: 0.343971
I1109 ...] [Forward] Layer conv1, top blob conv1 data: 0.0645037
I1109 ...] [Forward] Layer conv1, param blob 0 data: 0.00899114
I1109 ...] [Forward] Layer conv1, param blob 1 data: 0
I1109 ...] [Forward] Layer relu1, top blob conv1 data: 0.0337982
I1109 ...] [Forward] Layer conv2, top blob conv2 data: 0.0249297
I1109 ...] [Forward] Layer conv2, param blob 0 data: 0.00875855
I1109 ...] [Forward] Layer conv2, param blob 1 data: 0
I1109 ...] [Forward] Layer relu2, top blob conv2 data: 0.0128249
. . .
I1109 ...] [Forward] Layer fc1, top blob fc1 data: 0.00728743
I1109 ...] [Forward] Layer fc1, param blob 0 data: 0.00876866
I1109 ...] [Forward] Layer fc1, param blob 1 data: 0
I1109 ...] [Forward] Layer loss, top blob loss data: 2031.85
I1109 ...] [Backward] Layer loss, bottom blob fc1 diff: 0.124506
I1109 ...] [Backward] Layer fc1, bottom blob conv6 diff: 0.00107067
I1109 ...] [Backward] Layer fc1, param blob 0 diff: 0.483772
I1109 ...] [Backward] Layer fc1, param blob 1 diff: 4079.72
. . .
I1109 ...] [Backward] Layer conv2, bottom blob conv1 diff: 5.99449e-06
I1109 ...] [Backward] Layer conv2, param blob 0 diff: 0.00661093
I1109 ...] [Backward] Layer conv2, param blob 1 diff: 0.10995
I1109 ...] [Backward] Layer relu1, bottom blob conv1 diff: 2.87345e-06
I1109 ...] [Backward] Layer conv1, param blob 0 diff: 0.0220984
I1109 ...] [Backward] Layer conv1, param blob 1 diff: 0.0429201
E1109 ...] [Backward] All net params (data, diff): L1 norm = (2711.42, 7086.66); L2 norm = (6.11659, 4085.07)
What does it mean?
At first glance you can see that this log section is divided into two parts: [Forward] and [Backward]. Recall that neural network training is done via forward-backward propagation:
A training example (batch) is fed to the net and a forward pass outputs the current prediction.
Based on this prediction a loss is computed.
The loss is then differentiated, and the gradient is computed and propagated backward using the chain rule.
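To make these three steps concrete, here is a minimal sketch on a toy one-weight model. It has nothing to do with Caffe's internals; the model, the squared-error loss, the numbers and the learning rate `lr` are all assumptions chosen only for illustration.

```python
# Toy model: prediction = w * x + b, loss = 0.5 * (prediction - target)^2.
# Every name and number here is illustrative, not taken from Caffe.

def forward(w, b, x):
    """Forward pass: compute the current prediction."""
    return w * x + b

def loss_fn(pred, target):
    """Squared-error loss for a single example."""
    return 0.5 * (pred - target) ** 2

def backward(w, b, x, target):
    """Backward pass: chain rule from the loss back to the parameters."""
    pred = forward(w, b, x)
    dloss_dpred = pred - target     # d(loss)/d(pred)
    dloss_dw = dloss_dpred * x      # d(pred)/d(w) = x
    dloss_db = dloss_dpred * 1.0    # d(pred)/d(b) = 1
    return dloss_dw, dloss_db

w, b = 0.5, 0.0
x, target = 2.0, 3.0
lr = 0.1  # hypothetical learning rate

loss_before = loss_fn(forward(w, b, x), target)
grad_w, grad_b = backward(w, b, x, target)

# One plain gradient-descent update using the gradients.
w -= lr * grad_w
b -= lr * grad_b
loss_after = loss_fn(forward(w, b, x), target)

print(loss_before, loss_after)
```

If the gradients are sane, the loss after the update should be lower than before. The debug log lets you check exactly this chain, layer by layer, in a real net.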
Caffe Blob data structure
Just a quick recap. Caffe uses the Blob data structure to store data, weights, parameters, etc. For this discussion it is important to note that a Blob has two "parts": data and diff. The values of the Blob are stored in the data part. The diff part is used to store element-wise gradients for the backpropagation step.
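As a rough mental model, a Blob can be pictured as two arrays of the same shape. The class below is a hypothetical stand-in written for this explanation (`ToyBlob` and its helper names are made up, and this is not Caffe's actual Blob API); it only shows how data and diff sit side by side and how a single summary number can be computed from either part.

```python
import math

class ToyBlob:
    """Hypothetical stand-in for a Blob: values plus same-shaped gradients."""

    def __init__(self, values):
        self.data = list(values)             # the values themselves
        self.diff = [0.0] * len(self.data)   # element-wise gradients

    @staticmethod
    def mean_abs(xs):
        """Mean absolute value of the elements."""
        return sum(abs(x) for x in xs) / len(xs)

    @staticmethod
    def l1(xs):
        """L1 norm: sum of absolute values."""
        return sum(abs(x) for x in xs)

    @staticmethod
    def l2(xs):
        """L2 norm: square root of the sum of squares."""
        return math.sqrt(sum(x * x for x in xs))

weights = ToyBlob([0.01, -0.02, 0.03])
weights.diff = [0.5, -0.5, 0.0]   # pretend a backward pass filled this in

print(ToyBlob.mean_abs(weights.data))   # one number summarizing data
print(ToyBlob.l1(weights.diff), ToyBlob.l2(weights.diff))
```

Each log line is a summary of this kind, computed on either the data part or the diff part of one blob.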
Forward pass
You will see all the layers from bottom to top listed in this part of the log. For each layer you'll see:
I1109 ...] [Forward] Layer conv1, top blob conv1 data: 0.0645037
I1109 ...] [Forward] Layer conv1, param blob 0 data: 0.00899114
I1109 ...] [Forward] Layer conv1, param blob 1 data: 0
Layer "conv1" is a convolution layer that has 2 param blobs: the filters and the bias. Consequently, the log has three lines. The filter blob (param blob 0) has data
I1109 ...] [Forward] Layer conv1, param blob 0 data: 0.00899114
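That is, the current magnitude of the convolution filter weights is 0.00899. Caffe computes this figure as the mean absolute value of the blob's elements (sum of absolute values divided by element count), not as an L2 norm.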
The current bias (param blob 1):
I1109 ...] [Forward] Layer conv1, param blob 1 data: 0
meaning that currently the bias is set to 0.
Last but not least, the "conv1" layer has an output ("top") named "conv1" (how original...). The magnitude of the output, reported as the mean absolute value of its elements, is
I1109 ...] [Forward] Layer conv1, top blob conv1 data: 0.0645037
Note that all values for the [Forward] pass are reported on the data part of the Blobs in question, each as a mean absolute value rather than a true L2 norm.
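Since every per-blob line has the same shape, it is easy to pull the numbers out with a regular expression. The sketch below assumes the line format shown in the log above; the pattern and the name `parse_debug_line` are my own, so adjust them if your Caffe build prints something slightly different.

```python
import re

# Pattern inferred from the log lines quoted above (an assumption, not an
# official format specification).
LINE_RE = re.compile(
    r"\[(?P<phase>Forward|Backward)\] Layer (?P<layer>\S+), "
    r"(?P<kind>top|bottom|param) blob (?P<blob>\S+) "
    r"(?P<part>data|diff): (?P<value>\S+)"
)

def parse_debug_line(line):
    """Return a dict describing one debug_info line, or None if no match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    rec = m.groupdict()
    rec["value"] = float(rec["value"])   # float() also accepts "nan" and "inf"
    return rec

sample = [
    "I1109 ...] [Forward] Layer conv1, top blob conv1 data: 0.0645037",
    "I1109 ...] [Forward] Layer conv1, param blob 0 data: 0.00899114",
    "I1109 ...] [Forward] Layer conv1, param blob 1 data: 0",
]
records = [parse_debug_line(line) for line in sample]
for rec in records:
    print(rec["layer"], rec["kind"], rec["blob"], rec["part"], rec["value"])
```

With the lines parsed into records you can, for example, collect the value of one blob over many iterations and plot it.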
Loss and gradient
At the end of the [Forward] pass comes the loss layer:
I1109 ...] [Forward] Layer loss, top blob loss data: 2031.85
I1109 ...] [Backward] Layer loss, bottom blob fc1 diff: 0.124506
In this example the batch loss is 2031.85. The gradient of the loss w.r.t. fc1 is computed and passed to the diff part of the fc1 Blob. The magnitude (mean absolute value) of the gradient is 0.1245.
Backward pass
All the rest of the layers are listed in this part, top to bottom. You can see that the magnitudes reported now are of the diff part of the Blobs (params and layers' inputs).
Finally
The last log line of this iteration:
[Backward] All net params (data, diff): L1 norm = (2711.42, 7086.66); L2 norm = (6.11659, 4085.07)
reports the total L1 and L2 magnitudes of both data and gradients.
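This summary line can be parsed the same way as the per-blob lines. The sketch below assumes the exact wording shown above; `SUMMARY_RE` and `parse_summary` are names made up for this example.

```python
import re

# Pattern inferred from the summary line quoted above (an assumption).
SUMMARY_RE = re.compile(
    r"All net params \(data, diff\): "
    r"L1 norm = \((?P<l1_data>[^,]+), (?P<l1_diff>[^)]+)\); "
    r"L2 norm = \((?P<l2_data>[^,]+), (?P<l2_diff>[^)]+)\)"
)

def parse_summary(line):
    """Return the four totals as floats, or None if the line does not match."""
    m = SUMMARY_RE.search(line)
    if m is None:
        return None
    return {name: float(text) for name, text in m.groupdict().items()}

line = ("[Backward] All net params (data, diff): "
        "L1 norm = (2711.42, 7086.66); L2 norm = (6.11659, 4085.07)")
summary = parse_summary(line)

# How large are the gradients compared with the parameters themselves?
ratio = summary["l2_diff"] / summary["l2_data"]
print(summary, ratio)
```

In the example log the total gradient L2 norm is hundreds of times larger than the total parameter L2 norm, which is the kind of imbalance worth a closer look.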
What should I look for?
If you have nans in your loss, see at what point your data or diff turns into nan: at which layer? at which iteration?
Look at the gradient magnitudes; they should be reasonable. If you start to see values with e+8, your data/gradients are starting to blow up. Decrease your learning rate!
See that the diffs are not zero. Zero diffs mean no gradients = no updates = no learning. If you started from random weights, consider generating random weights with higher variance.
Look for activations (rather than gradients) going to zero. If you are using "ReLU", this means your inputs/weights lead you to regions where the ReLU gates are "not active", leading to "dead neurons". Consider normalizing your inputs to have zero mean, adding "BatchNorm" layers, or setting negative_slope in ReLU.
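The checks in this list can be roughly automated by scanning the log. The sketch below is one possible way to do it, assuming the line format shown earlier. The function name `check_line` and the `blowup` threshold of 1e8 (taken from the "e+8" rule of thumb above) are my own choices, not anything Caffe provides.

```python
import math
import re

# Pattern inferred from the log lines quoted earlier; adjust it if your
# Caffe build prints something different.
VALUE_RE = re.compile(
    r"\[(Forward|Backward)\] Layer (\S+), "
    r"(top|bottom|param) blob (\S+) (data|diff): (\S+)"
)

def check_line(line, blowup=1e8):
    """Return a list of warnings for one debug_info log line."""
    m = VALUE_RE.search(line)
    if m is None:
        return []
    _phase, layer, kind, blob, part, raw = m.groups()
    try:
        value = float(raw)
    except ValueError:
        return []
    where = f"{layer} {kind} blob {blob} {part}"
    if math.isnan(value) or math.isinf(value):
        return [f"{where}: non-finite value"]
    if abs(value) >= blowup:
        return [f"{where}: very large magnitude ({value:g})"]
    if value == 0.0 and part == "diff":
        return [f"{where}: zero gradient, no learning"]
    if value == 0.0 and kind == "top":
        return [f"{where}: activations are all zero"]
    return []

lines = [
    # Copied from the log above: a zero bias is fine and is not flagged.
    "I1109 ...] [Forward] Layer conv1, param blob 1 data: 0",
    # Hand-made inputs in the same format, NOT real Caffe output.
    "I1109 ...] [Forward] Layer loss, top blob loss data: nan",
    "I1109 ...] [Backward] Layer fc1, param blob 0 diff: 3.2e+08",
    "I1109 ...] [Backward] Layer conv1, param blob 0 diff: 0",
]
for line in lines:
    for warning in check_line(line):
        print(warning)
```

Run it over the whole log and note the first iteration and layer at which a warning appears; that is usually where to start digging.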