I can't give the correct number of parameters of AlexNet or VGG Net.
For example, to calculate the number of parameters of a conv3-256
layer of VGG Net, the answer is 0.59M = (3*3)*(256*256), that is (kernel size) * (product of both number of channels in the joint layers), however in that way, I can't get the 138M
parameters.
So could you please show me where is wrong with my calculation, or show me the right calculation procedure?
In a CNN, each layer has two kinds of parameters : weights and biases.
Conv2D Layers By applying this formula to the first Conv2D layer (i.e., conv2d ), we can calculate the number of parameters using 32 * (1 * 3 * 3 + 1) = 320, which is consistent with the model summary. The input channel number is 1, because the input data shape is 28 x 28 x 1 and the number 1 is the input channel.
If there is an image of resolution 27x27, and we want 3 filters (this is our choice). Then each filter will have 81 neurons (9x9 2D grid of neurons). Each of these neurons would have 9 connections (corresponding to the 3x3 receptive field).
If you refer to VGG Net with 16-layer (table 1, column D) then 138M
refers to the total number of parameters of this network, i.e including all convolutional layers, but also the fully connected ones.
Looking at the 3rd convolutional stage composed of 3 x conv3-256
layers:
The convolution kernel is 3x3 for each of these layers. In terms of parameters this gives:
As explained above you have to do that for all layers, but also the fully-connected ones, and sum these values to obtain the final 138M number.
-
UPDATE: the breakdown among layers give:
conv3-64 x 2 : 38,720 conv3-128 x 2 : 221,440 conv3-256 x 3 : 1,475,328 conv3-512 x 3 : 5,899,776 conv3-512 x 3 : 7,079,424 fc1 : 102,764,544 fc2 : 16,781,312 fc3 : 4,097,000 TOTAL : 138,357,544
In particular for the fully-connected layers (fc):
fc1 (x): (512x7x7)x4,096 (weights) + 4,096 (biases) fc2 : 4,096x4,096 (weights) + 4,096 (biases) fc3 : 4,096x1,000 (weights) + 1,000 (biases)
(x) see section 3.2 of the article: the fully-connected layers are first converted to convolutional layers (the first FC layer to a 7 × 7 conv. layer, the last two FC layers to 1 × 1 conv. layers).
Details about fc1
As precised above the spatial resolution right before feeding the fully-connected layers is 7x7 pixels. This is because this VGG Net uses spatial padding before convolutions, as detailed within section 2.1 of the paper:
[...] the spatial padding of conv. layer input is such that the spatial resolution is preserved after convolution, i.e. the padding is 1 pixel for 3×3 conv. layers.
With such a padding, and working with a 224x224 pixels input image, the resolution decreases as follow along the layers: 112x112, 56x56, 28x28, 14x14 and 7x7 after the last convolution/pooling stage which has 512 feature maps.
This gives a feature vector passed to fc1
with dimension: 512x7x7.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With