I am trying to understand how the dimensions in a convolutional neural network behave. In the figure below the input is a 28-by-28 matrix with 1 channel. Then there are 32 5-by-5 filters (with stride 2 in height and width). So I understand that the result is 14-by-14-by-32. But then in the next convolutional layer we have 64 5-by-5 filters (again with stride 2). So why is the result 7-by-7-by-64 and not 7-by-7-by-32*64? Aren't we applying each one of the 64 filters to each one of the 32 channels?
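One filter sums over all the channels of the previous layer. This means that a 5x5 filter in the second layer spans all 32 input channels, so each output value is a weighted sum of 32*5*5 values. The weights are not shared across channels: each of the 32 channels has its own 5x5 slice of weights. What is shared is the filter itself, which is reused at every spatial position. Each filter therefore produces a single output channel, and since there are 64 such filters the result is 7-by-7-by-64. A better explanation with images can be found here: http://cs231n.github.io/convolutional-networks/.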
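To check the shapes concretely, here is a minimal NumPy sketch of a naive strided convolution. It is only an illustration, not how any framework implements it: the function name conv2d is made up for this example, the weights are random, and the symmetric zero padding of 2 is an assumption chosen so that stride 2 halves the spatial size (frameworks may pad differently, e.g. asymmetrically).

```python
import numpy as np

def conv2d(x, w, stride=2, pad=2):
    """Naive 2D convolution (really cross-correlation), illustrative only.

    x: input of shape (H, W, C_in)
    w: filters of shape (k, k, C_in, C_out)
    returns: output of shape (H_out, W_out, C_out)
    """
    k, _, c_in, c_out = w.shape
    assert x.shape[2] == c_in, "filter depth must equal input depth"
    # Zero-pad height and width only (assumed symmetric padding).
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))
    h_out = (xp.shape[0] - k) // stride + 1
    w_out = (xp.shape[1] - k) // stride + 1
    out = np.zeros((h_out, w_out, c_out))
    for i in range(h_out):
        for j in range(w_out):
            # Patch covers k x k pixels across ALL input channels.
            patch = xp[i * stride:i * stride + k, j * stride:j * stride + k, :]
            # Each filter collapses the whole (k, k, C_in) patch to one number,
            # so the output depth is C_out, not C_in * C_out.
            out[i, j, :] = np.tensordot(patch, w, axes=([0, 1, 2], [0, 1, 2]))
    return out

rng = np.random.default_rng(0)
image = rng.standard_normal((28, 28, 1))   # 28x28 input, 1 channel
w1 = rng.standard_normal((5, 5, 1, 32))    # 32 filters, each 5x5x1
w2 = rng.standard_normal((5, 5, 32, 64))   # 64 filters, each 5x5x32

a1 = conv2d(image, w1)
a2 = conv2d(a1, w2)
print(a1.shape)  # (14, 14, 32)
print(a2.shape)  # (7, 7, 64)
```

The line with np.tensordot is the important one: it sums over height, width and the 32 input channels at once, which is why the 32 does not show up in the output depth.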
The depth is usually given implicitly. For example, many images are considered to have depth 3 (for the three color channels in each pixel). Then by a 5x5 filter we mean a 5x5x3 filter. In your case the 5x5 filter is really a 5x5x32 filter.
Depth one is usually explicitly stated (as in "5x5x1 filter").
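A quick way to see the implicit depth is to count the weights. The sketch below is only arithmetic in Python; the helper name conv_param_count is hypothetical, and it assumes one bias per filter, which the original question does not specify.

```python
def conv_param_count(kernel, in_channels, out_channels, bias=True):
    """Number of trainable parameters in one conv layer (illustrative helper).

    Each of the out_channels filters has kernel * kernel * in_channels weights,
    plus one bias if bias=True (an assumption, not stated in the question).
    """
    weights = kernel * kernel * in_channels * out_channels
    return weights + (out_channels if bias else 0)

# First layer: 32 filters of size 5x5x1.
layer1 = conv_param_count(5, 1, 32)
# Second layer: 64 filters of size 5x5x32.
layer2 = conv_param_count(5, 32, 64)

print(layer1)  # 832   = 5*5*1*32 + 32
print(layer2)  # 51264 = 5*5*32*64 + 64
```

So the 32 input channels do appear in the second layer, but in the size of each filter (5x5x32), not in the number of output channels.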