My question is about the number of feature maps after each convolution layer. Based on my studies, each convolution layer produces exactly as many feature maps as the number of filters we choose for it. But in some tutorials I encountered architectures that don't seem to follow this. For example, in this example:
The output of the first convolution layer after pooling is 6 feature maps (red line). Using those feature maps, the next convolution layer produces 16 new feature maps (green line). But how? Each of the previous layer's feature maps should create 1, 2, 3... new feature maps, so we should not get 16 new feature maps in the next layer. How did this happen? Is my assumption not right?
Block1_conv1 actually contains 64 feature maps, since we have 64 filters in that layer, but we are only visualizing the first 8 per layer in this figure. There are some interesting observations about the feature maps as we progress through the layers.
A CNN is composed of two basic parts: feature extraction and classification. Feature extraction consists of several convolution layers, each followed by max-pooling and an activation function.
The output of the model will be feature maps, which are the intermediate representations of all layers after the first layer; they depend on the model we have used for training. Load the input image whose feature maps we want to view, to understand which features were prominent in classifying the image.
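As a rough sketch of how one might pull those feature maps out (my own illustration, assuming a Keras VGG16-style model since the layer name block1_conv1 comes from it; the image path is a placeholder):

```python
# Sketch: extract and inspect feature maps from an intermediate layer.
# Assumes TensorFlow/Keras and a VGG16-style model; 'cat.jpg' is a placeholder path.
import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.preprocessing import image
from tensorflow.keras.models import Model

model = VGG16(weights="imagenet", include_top=False)

# Sub-model whose output is the feature maps of block1_conv1 (64 filters).
feat_model = Model(inputs=model.input,
                   outputs=model.get_layer("block1_conv1").output)

img = image.load_img("cat.jpg", target_size=(224, 224))   # placeholder image
x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))

maps = feat_model.predict(x)
print(maps.shape)   # (1, 224, 224, 64) -> 64 feature maps, one per filter
```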
Formula for the output volume of a convolution layer: K feature maps, each of spatial size (W − F + 2P)/S + 1, where W is the input volume size, F is the receptive field size of the conv layer neurons, S is the stride with which they are applied, P is the amount of zero padding used on the border, and K is the number of filters (the depth of the conv layer).
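As a small illustration (my own helper, not part of the original post), this formula can be evaluated directly and reproduces the sizes in the example architecture:

```python
# Sketch: compute conv output size from the formula above (assumes square inputs/filters).
def conv_output_size(W, F, S=1, P=0, K=1):
    spatial = (W - F + 2 * P) // S + 1    # spatial size along each dimension
    return (K, spatial, spatial)          # K feature maps of spatial x spatial

print(conv_output_size(W=32, F=5, K=6))   # (6, 28, 28) -> first conv layer
print(conv_output_size(W=14, F=5, K=16))  # (16, 10, 10) -> second conv layer
```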
I too had this confusion for a while and only after some digging did the mist clear away.
Difference between convolution with 1 channel and convolution with multiple channels
This is where my understanding was going wrong. I will make an attempt to explain this difference. I am no expert, so please bear with me.
Convolution operation with a single channel
When we think of a simple grayscale 32x32 image and a convolution operation, we are applying 1 or more convolution matrices in the first layer.
As per your example, each of these convolution matrices of dimension 5x5 produces a 28x28 matrix as output. Why 28x28? Because you can slide a 5-pixel-square window into 32-5+1=28 positions, assuming stride=1 and padding=0.
In such a scenario, each of your convolution matrices has 5x5=25 trainable weights + 1 trainable bias. You can have as many convolution kernels as you like. But each of the kernels would be 2-dimensional, and each of the kernels would produce an output matrix of dimension 28x28, which is then fed to the MAXPOOL layer.
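A quick sketch (my own, in PyTorch, which the original post doesn't use) confirms these shapes for the single-channel case:

```python
# Sketch: single-channel 32x32 input, 6 kernels of 5x5.
import torch
import torch.nn as nn

conv1 = nn.Conv2d(in_channels=1, out_channels=6, kernel_size=5, stride=1, padding=0)

x = torch.randn(1, 1, 32, 32)        # (batch, channels, height, width)
print(conv1(x).shape)                # torch.Size([1, 6, 28, 28]) -> 6 feature maps of 28x28

# Each kernel spans one input channel: 5*5 = 25 weights + 1 bias.
print(conv1.weight.shape, conv1.bias.shape)   # torch.Size([6, 1, 5, 5]) torch.Size([6])
```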
Convolutional operation with multiple channels
What if the image had been an RGB 32x32 picture? As per popular literature, the image should be treated as comprising 3 channels, and a convolution operation should be carried out on each of these channels. I must admit that I hastily drew some misleading conclusions. I was under the impression that we should use three independent 5x5 convolution matrices - 1 for each channel. I was wrong.
When you have 3 channels, each of your convolution matrices should be of dimension 3x5x5 - think of this as a single unit comprising a 5x5 matrix stacked 3 times. Therefore you have 5x5x3=75 trainable weights + 1 trainable bias.
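The same sketch (again my own PyTorch illustration) for the 3-channel case shows that each kernel is one 3x5x5 block, not three separate 5x5 kernels:

```python
# Sketch: RGB 32x32 input, 6 kernels, each spanning all 3 channels.
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=6, kernel_size=5)

print(conv.weight.shape)              # torch.Size([6, 3, 5, 5])
print(conv.weight[0].numel() + 1)     # 5*5*3 + 1 = 76 parameters per kernel (75 weights + 1 bias)

out = conv(torch.randn(1, 3, 32, 32))
print(out.shape)                      # torch.Size([1, 6, 28, 28]) -> still one 2-D map per kernel
```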
What happens in the second convolutional layer?
In your example, I found it easier to visualize the 6 feature maps produced by the first CONV1+MAXPOOL1 layer as 6 channels. Therefore, applying the same RGB logic as before, any convolution kernel that we apply in the second CONV2 layer should have dimension 6x5x5. Why 6? Because CONV1+MAXPOOL1 has produced 6 feature maps. Why 5x5? In your example you have chosen a window dimension of 5x5. Theoretically, I could have chosen 3x3, in which case the kernel dimension would be 6x3x3.
Therefore, in the current example, if you have N2 convolution matrices in the CONV2 layer, then each of these N2 kernels will be a matrix of size 6x5x5. In the current example N2=16, and the convolution of a kernel of dimension 6x5x5 with an input of 6 channels x 14x14 will produce N2 matrices, each of dimension 10x10. Why 10? 10=14-5+1 (stride=1, padding=0).
You now have N2=16 matrices lined up for the MAXPOOL2 layer.
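Putting that second layer into the same kind of sketch (my own PyTorch illustration of the LeNet-style example, assuming every output map is connected to all 6 input maps):

```python
# Sketch: the second conv layer - 16 kernels, each of size 6x5x5.
import torch
import torch.nn as nn

conv2 = nn.Conv2d(in_channels=6, out_channels=16, kernel_size=5)

print(conv2.weight.shape)         # torch.Size([16, 6, 5, 5]) -> 16 kernels, each 6x5x5

x = torch.randn(1, 6, 14, 14)     # 6 feature maps of 14x14 from CONV1+MAXPOOL1
print(conv2(x).shape)             # torch.Size([1, 16, 10, 10]) -> 16 feature maps of 10x10
```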
Reference: LeNet architecture
http://deeplearning.net/tutorial/lenet.html
Notice the encircled region. You can see that in the second convolution layer, the operation is shown to span across each of the 4 feature maps that were produced by the first layer.
Reference: Andrew Ng lectures
https://youtu.be/bXJx7y51cl0
Reference: What does convolution arithmetic with multiple channels look like?
I found another SO question which nicely describes this: How a Convolutional Neural Net handles channels
Take note that in the referenced example, the information in the 3 channels is squashed into a 2-dimensional matrix. This is why your 6 feature maps from the CONV1+MAXPOOL1 layer no longer appear to contribute to the dimension of the first fully connected layer.
The second convolution works as follows:
NOTE: Number of channels of the input matrix and number of channels in each filter must match in order to be able to perform element-wise multiplication.
So the main difference between the first and second convolutions is that the number of channels in the input matrix for the first convolution is 1, so we will use 6 filters where each filter has only one channel (depth of the matrix).
For the second convolution the input matrix has 6 channels (feature maps), so each filter for this convolution must have 6 channels as well. For example, each of the 16 filters will have the shape 6@3x3.
The result of a convolution step for a single filter of shape 6@3x3 will be a single channel of shape WxH (width, height). After applying all 16 filters (where each of them has shape 6@3x3) we will get 16 channels, where each channel is the result of the convolution of a single filter.
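A minimal NumPy sketch of that step (my own illustration, with arbitrary sizes - a 14x14 input and 3x3 filters as in the answer above): each filter multiplies element-wise across all 6 channels of a patch and sums to a single number, so each filter yields exactly one output map.

```python
# Sketch: 6-channel input convolved with 16 filters of shape (6, 3, 3).
import numpy as np

x = np.random.randn(6, 14, 14)          # 6 input feature maps (channels)
filters = np.random.randn(16, 6, 3, 3)  # 16 filters, each spanning all 6 channels

H_out, W_out = 14 - 3 + 1, 14 - 3 + 1   # stride=1, padding=0
out = np.zeros((16, H_out, W_out))

for k in range(16):                      # one output channel per filter
    for i in range(H_out):
        for j in range(W_out):
            patch = x[:, i:i+3, j:j+3]                  # 6x3x3 patch across all channels
            out[k, i, j] = np.sum(patch * filters[k])   # element-wise multiply, then sum

print(out.shape)   # (16, 12, 12) -> 16 feature maps, one per filter
```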
Let me give you a basic idea of how convolution works with a simple example:
If we have an input image of size 32x32 with 3 channels, and if we choose to use a 5x5 filter, the dimensions of the filter would be 5x5x3 (implicitly).
So now the convolution takes place between the corresponding channels of the input image and the filter, producing 3 channel responses, which are added together to produce a single output map for the given 5x5x3 filter.
To answer your question, in the given architecture, they are using 6 filters for the input image, then using 16 filters of size y x y x 6, where y is the height and width of the chosen filter.
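To make the "added to produce a single output" part concrete, here is a small SciPy/NumPy sketch (my own, with random values): each channel is correlated with its own 5x5 slice of the filter, and the 3 results are summed into one 28x28 map.

```python
# Sketch: per-channel correlation, then summation into a single output map.
import numpy as np
from scipy.signal import correlate2d

img = np.random.randn(3, 32, 32)      # RGB image: 3 channels
w = np.random.randn(3, 5, 5)          # one 5x5x3 filter (one 5x5 slice per channel)

per_channel = [correlate2d(img[c], w[c], mode="valid") for c in range(3)]
single_map = sum(per_channel)         # the 3 channel responses are added together

print(single_map.shape)               # (28, 28) -> one output map per filter
```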