In the diagram (architecture) below, how was the (fully-connected) dense layer of 4096 units derived from the last max-pool layer (on the right) of dimensions 256x13x13? Instead of 4096, shouldn't it be 256*13*13 = 43264?
Max pooling is a type of operation that is typically added to CNNs following individual convolutional layers. When added to a model, max pooling reduces the dimensionality of images by reducing the number of pixels in the output from the previous convolutional layer.
The convolutional layer is followed by a max-pooling layer, whose output size is calculated the same way as a conv layer's. For example, with an input of size 28, a kernel size of (2,2), and a stride of 2, the output size is (28-2)/2 + 1 = 14.
To calculate it, we have to start with the size of the input image and work out the size of each convolutional layer in turn. In the simple case (stride 1, no padding), the output size of a CNN layer is input_size - (filter_size - 1). For example, if the input image size is (50,50) and the filter is (3,3), then 50 - (3-1) = 48.
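Here is a minimal Python sketch of that size formula (the function name is illustrative only; with stride s and no padding the general form is (input - kernel) // s + 1, which reduces to input - (filter - 1) when s = 1):

```python
# Minimal sketch of the output-size formula for a conv or pooling layer
# with no padding. Reproduces the two worked examples above.
def output_size(input_size, kernel_size, stride=1):
    return (input_size - kernel_size) // stride + 1

print(output_size(50, 3))            # conv example: 50 - (3 - 1) = 48
print(output_size(28, 2, stride=2))  # max-pool example: (28 - 2)/2 + 1 = 14
```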
Pooling layers are used to reduce the dimensions of the feature maps. Thus, it reduces the number of parameters to learn and the amount of computation performed in the network. The pooling layer summarises the features present in a region of the feature map generated by a convolution layer.
If I understand correctly, you're asking why the 4096x1x1 layer is so much smaller.
That's because it's a fully-connected layer. Every neuron from the last max-pooling layer (256*13*13 = 43264 neurons) is connected to every neuron of the fully-connected layer.
This is an example of an all-to-all connected neural network: one layer can be bigger than the next, but that doesn't mean they can't connect.
There is no "conversion" of the last max-pooling layer: all the neurons in the max-pooling layer are simply connected to all 4096 neurons in the next layer. The 'dense' operation just means computing a weighted sum over all these connections (4096 * 43264 weights) and adding each neuron's bias to produce the next output.
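As a rough NumPy sketch (random values, shapes matching the numbers above), the dense operation is just a matrix-vector product plus a bias:

```python
import numpy as np

# Sketch of what the 'dense' operation computes, with the shapes from the
# discussion above (43264 inputs -> 4096 outputs). Note: W alone holds
# 4096 * 43264 ~ 177M weights (~0.7 GB in float32).
rng = np.random.default_rng(0)
x = rng.standard_normal(43264, dtype=np.float32)          # flattened 256*13*13 input
W = rng.standard_normal((4096, 43264), dtype=np.float32)  # one weight per connection
b = rng.standard_normal(4096, dtype=np.float32)           # one bias per output neuron

y = W @ x + b   # weighted sum over all connections, plus bias
print(y.shape)  # (4096,)
```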
It's connected the same way as in an MLP.
But why 4096? There is no particular reasoning; it's just a choice. It could have been 8000, it could have been 20; it just depends on what works best for the network.
You are right that the last convolutional layer has 256 x 13 x 13 = 43264 neurons. However, it is followed by a max-pooling layer with pool_size = 3 and stride = 2, which produces an output of size 256 x 6 x 6.
You connect this to a fully-connected layer. In order to do that, you first have to flatten the output, which will take the shape 256 x 6 x 6 = 9216 x 1. To map 9216 neurons to 4096 neurons, we introduce a 9216 x 4096 weight matrix as the weight of the dense/fully-connected layer. Therefore, w^T * x = [9216 x 4096]^T * [9216 x 1] = [4096 x 1]. In short, each of the 9216 neurons will be connected to all 4096 neurons. That is why the layer is called a dense or a fully-connected layer.
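For a concrete picture, here is a minimal PyTorch sketch of just this pool -> flatten -> dense step (not the full AlexNet; it assumes the standard pool_size = 3, stride = 2 described above):

```python
import torch
import torch.nn as nn

# Last conv output: batch of 1, 256 channels, 13x13 spatial dims.
x = torch.randn(1, 256, 13, 13)

pool = nn.MaxPool2d(kernel_size=3, stride=2)  # (13 - 3)/2 + 1 = 6
fc = nn.Linear(256 * 6 * 6, 4096)             # 9216 -> 4096

x = pool(x)                          # -> (1, 256, 6, 6)
x = torch.flatten(x, start_dim=1)    # -> (1, 9216)
x = fc(x)                            # -> (1, 4096)
print(x.shape)                       # torch.Size([1, 4096])
```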
As others have said above, there is no hard rule that this should be 4096. The dense layer just has to have enough neurons to capture the variability of the entire dataset. The dataset under consideration, ImageNet-1K, is quite difficult and has 1000 categories, so 4096 neurons do not seem like too many to start with.