 

Number of feature maps produced after each convolution layer in CNNs

My question is about the number of feature maps after each convolution layer. Based on my studies, each convolution layer produces exactly as many feature maps as the number of filters we choose. But in some tutorials I encountered architectures that don't seem to follow this. For example:

[Figure: LeNet-style CNN architecture diagram, with the first pooling output marked by a red line and the second convolution output by a green line]

The output of the first convolution layer, after pooling, is 6 feature maps (red line). Using those feature maps, the next convolution layer produces 16 new feature maps (green line). But how? Each of the previous layer's feature maps should create 1, 2, 3, ... new feature maps, so we should not end up with exactly 16 new feature maps in the next layer. How did this happen? Is my assumption wrong?

asked May 16 '18 by Fcoder


People also ask

How many feature maps does CNN have?

Block1_conv1 actually contains 64 feature maps, since we have 64 filters in that layer. But we are only visualizing the first 8 per layer in this figure. There are some interesting observations about the feature maps as we progress through the layers.

How many features are extracted in CNN?

A CNN is composed of two basic parts: feature extraction and classification. Feature extraction includes several convolution layers followed by max-pooling and an activation function.

Which layer of CNN gives feature map?

The output of the model will be feature maps, which are an intermediate representation for all layers after the first layer. This is based on the model we have used for training. Load the input image for which we want to view the Feature map to understand which features were prominent to classify the image.

How does CNN calculate feature map?

Formula for the spatial size of the output volume: (W−F+2P)/S+1 along each spatial dimension, with K output channels, where W is the input volume size, F the receptive field size of the conv layer neurons, S the stride with which they are applied, P the amount of zero padding used on the border, and K the depth (number of filters) of the conv layer.
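
As a quick sanity check (my own addition, not part of the quoted snippet), here is that formula as a small Python function, applied to the layer sizes discussed in the question:

    def conv_output_size(w, f, s=1, p=0):
        """One spatial dimension of the conv output: (W - F + 2P) / S + 1."""
        return (w - f + 2 * p) // s + 1

    # First conv layer from the question: 32x32 input, 5x5 filter
    print(conv_output_size(32, 5))   # 28
    # Second conv layer: 14x14 input, 5x5 filter
    print(conv_output_size(14, 5))   # 10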


3 Answers

I too had this confusion for a while and only after some digging did the mist clear away.

Difference between convolution with 1 channel and convolution with multiple channels

This is where my understanding was going wrong. I will make an attempt to explain this difference. I am no expert, so please bear with me.

Convolution operation with a single channel

When we think of a simple grayscale 32x32 image and a convolution operation, we are applying 1 or more convolution matrices in the first layer.

As per your example, each of these convolution matrices of dimension 5x5 produces a 28x28 matrix as an output. Why 28x28? Because you can slide a 5x5 window into 32-5+1=28 positions along each axis, assuming stride=1 and padding=0.


In such a scenario, each of your convolution matrices has 5x5=25 trainable weights + 1 trainable bias. You can have as many convolution kernels as you like. But each of the kernels is 2-dimensional, and each of them produces an output matrix of dimension 28x28, which is then fed to the MAXPOOL layer.
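
If it helps, here is a quick shape check of this single-channel case. I am using PyTorch purely for illustration; the question itself is framework-agnostic:

    import torch
    import torch.nn as nn

    # 6 independent 5x5 kernels applied to a single-channel 32x32 image
    conv1 = nn.Conv2d(in_channels=1, out_channels=6, kernel_size=5)

    x = torch.randn(1, 1, 32, 32)     # batch of one grayscale image
    print(conv1(x).shape)             # torch.Size([1, 6, 28, 28])

    # Each kernel has 5*5 = 25 weights + 1 bias, and there are 6 kernels:
    print(sum(p.numel() for p in conv1.parameters()))   # 6*(25+1) = 156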

Convolution operation with multiple channels

What if the image had been an RGB 32x32 picture? As per popular literature, the image should be treated as comprising 3 channels, and a convolution operation should be carried out on each of these channels. I must admit that I hastily drew some misleading conclusions. I was under the impression that we should use three independent 5x5 convolution matrices, one for each channel. I was wrong.

When you have 3 channels, each of your convolution matrices should be of dimension 3x5x5: think of this as a single unit comprising a 5x5 matrix stacked 3 times. Therefore you have 5x5x3=75 trainable weights + 1 trainable bias.
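
Again as an illustrative sketch (PyTorch is my choice here, not something the original question specifies), you can verify the 3x5x5 kernel shape and the 75+1 parameter count:

    import torch.nn as nn

    # A single kernel spanning all 3 RGB channels: shape 3x5x5
    conv_rgb = nn.Conv2d(in_channels=3, out_channels=1, kernel_size=5)
    print(conv_rgb.weight.shape)                             # torch.Size([1, 3, 5, 5])
    print(conv_rgb.weight.numel() + conv_rgb.bias.numel())   # 75 + 1 = 76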

What happens in the second convolutional layer?


In your example, I found it easier to visualize the 6 feature maps produced by the CONV1+MAXPOOL1 stage as 6 channels. Therefore, applying the same RGB logic as before, any convolution kernel that we apply in the second CONV2 layer should have dimension 6x5x5. Why 6? Because CONV1+MAXPOOL1 produced 6 feature maps. Why 5x5? In your example a window dimension of 5x5 was chosen. Theoretically, I could have chosen 3x3, in which case the kernel dimension would be 6x3x3.

Therefore, in the current example, if you have N2 convolution matrices in the CONV2 layer, then each of these N2 kernels will be a matrix of size 6x5x5. In the current example N2=16, and the convolution of a kernel of dimension 6x5x5 over an input of 6 channels of 14x14 will produce N2 matrices, each of dimension 10x10. Why 10? 10=14-5+1 (stride=1, padding=0).

You now have N2=16 matrices lined up for the MAXPOOL2 layer.
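
The whole CONV2 step can be verified the same way. This is again a PyTorch sketch of the shapes described above, not code from the original architecture:

    import torch
    import torch.nn as nn

    # CONV2: 16 kernels, each of shape 6x5x5, spanning all 6 input feature maps
    conv2 = nn.Conv2d(in_channels=6, out_channels=16, kernel_size=5)

    fmaps = torch.randn(1, 6, 14, 14)   # output of CONV1 + MAXPOOL1
    print(conv2(fmaps).shape)           # torch.Size([1, 16, 10, 10])
    print(conv2.weight.shape)           # torch.Size([16, 6, 5, 5])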

Reference: LeNet architecture

http://deeplearning.net/tutorial/lenet.html

[Figure: diagram from the LeNet tutorial] Notice the encircled region. You can see that in the second convolution layer, the operation is shown to span across each of the 4 feature maps that were produced by the first layer.

Reference: Andrew Ng lectures

https://youtu.be/bXJx7y51cl0

Reference: What does convolution arithmetic with multiple channels look like?

I found another Stack Overflow question which nicely describes this: How a Convolutional Neural Net handles channels.


Take note that in the referenced example, the information in the 3 channels is squashed into a single 2-dimensional matrix. This is why your 6 feature maps from the CONV1+MAXPOOL1 layer no longer appear to contribute individually to the dimension of the first fully connected layer.

answered Oct 27 '22 by Sau001


The second convolution works as follows:

  • input matrix shape: 6@14x14 (6 channels, the result of applying 6 filters in the previous (first) convolution step)
  • the 6@14x14 input matrix will be convolved using 16 filters (each filter must have 6 channels in order to match the number of channels in the input matrix)
  • with the 5x5 filters used in this architecture, this results in a 16@10x10 output matrix (reduced to 16@5x5 by the subsequent max-pooling)

NOTE: The number of channels of the input matrix and the number of channels in each filter must match in order to be able to perform the element-wise multiplication.

So the main difference between the first and second convolutions is that the number of channels in the input matrix for the first convolution is 1, so we use 6 filters where each filter has only one channel (the depth of the matrix).

For the second convolution, the input matrix has 6 channels (feature maps), so each filter for this convolution must have 6 channels as well. For example, each of the 16 filters could have the shape 6@3x3.

The result of a convolution step for a single filter of shape 6@3x3 will be a single channel of shape WxH (width, height). After applying all 16 filters (each of shape 6@3x3), we will get 16 channels, where each channel is the result of the convolution of a single filter.
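
To make the element-wise multiplication concrete, here is a minimal NumPy sketch (my own illustration, not from this answer) of what one 6-channel filter does at a single spatial position:

    import numpy as np

    # One 6@3x3 input patch and one 6@3x3 filter: channel counts must match.
    patch  = np.random.randn(6, 3, 3)
    kernel = np.random.randn(6, 3, 3)

    # Element-wise multiply across all 6 channels, then sum everything:
    # the filter produces a SINGLE number per spatial position, i.e. one channel.
    out_value = np.sum(patch * kernel)
    print(out_value)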

answered Oct 27 '22 by MaxU - stop WAR against UA


Let me give you a basic idea of how convolution works with a simple example:

  • If we have an input image of size 32x32 with 3 channels, and if we choose to use a 5x5 filter, the dimensions of the filter would implicitly be 5x5x3.

  • So now the convolution takes place between the corresponding channels of the input image and the filter, producing 3 per-channel results, which are added to produce a single output for the given 5x5x3 filter.

  • Hence, whenever we say the depth, we mean the number of filters we wish to apply. If we want a depth of 6, it means we are using 6 filters of size 5x5x3, producing only 6 layers of output.

To answer your question: in the given architecture, they are using 6 filters on the input image, then 16 filters of size y x y x 6, where y is the height and width of the chosen filter.
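
Putting the whole architecture together, here is a minimal sketch (using PyTorch as an assumed framework, with the 3-channel input from this answer's example):

    import torch
    import torch.nn as nn

    layers = nn.Sequential(
        nn.Conv2d(3, 6, kernel_size=5),    # 6 filters, each of size 5x5x3
        nn.MaxPool2d(2),                   # 28x28 -> 14x14
        nn.Conv2d(6, 16, kernel_size=5),   # 16 filters, each of size 5x5x6
    )
    print(layers(torch.randn(1, 3, 32, 32)).shape)   # torch.Size([1, 16, 10, 10])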

answered Oct 27 '22 by Suji Grena