Why does the global average pooling work in ResNet?

Question

I recently started a classification project using a very shallow ResNet. The model has just 10 conv. layers, followed by a global average pooling layer before the softmax layer.

The performance is as good as I expected: 93% (yeah, it is ok).

However, for some reasons, I need to replace the global average pooling layer.

I have tried the following ways:

(The input to this layer has shape [-1, 128, 1, 32], in TensorFlow format.)

Global max pooling layer, but got 85% accuracy.

Exponential moving average, but got 12% (almost didn't work).

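 import tensorflow as tf

 # `input` is the feature map with shape [-1, 128, 1, 32]
 split_list = tf.split(input, 128, axis=1)  # 128 slices of shape [-1, 1, 1, 32]
 avg_pool = split_list[0]
 beta = 0.5
 # Exponential moving average along axis 1
 for i in range(1, 128):
     avg_pool = beta * split_list[i] + (1 - beta) * avg_pool
 avg_pool = tf.reshape(avg_pool, [-1, 32])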

Split the input into 4 parts, average-pool each part, and finally concatenate them, but got 75%.

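 split_shape = [32, 32, 32, 32]
 split_list = tf.split(input, split_shape, axis=1)  # four parts of shape [-1, 32, 1, 32]
 for i in range(len(split_shape)):
     # NOTE: this call is global max pooling, not the average pooling described in the text
     split_list[i] = tf.keras.layers.GlobalMaxPooling2D()(split_list[i])
 avg_pool = tf.concat(split_list, axis=1)  # shape [-1, 128]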

Average over the last channel, [-1, 128, 1, 32] --> [-1, 128]. This didn't work.
Use a conv. layer with 1 kernel. The output shape is then [-1, 128, 1, 1]. This didn't work either, 25% or so.

I am pretty confused about why global average pooling works so well. Is there any other way to replace it?

Admin · Accepted Answer

Global average pooling has the following advantages over the paradigm of fully connected final layers:

It removes a large number of trainable parameters from the model. Fully connected (dense) layers have many parameters: a 7 x 7 x 64 CNN output flattened and fed into a 500-node dense layer yields about 1.57 million weights (3,136 x 500 = 1,568,000) that need to be trained. Removing these layers speeds up the training of your model.
Eliminating these trainable parameters also reduces the tendency to over-fit, which in fully connected layers has to be managed with dropout.
The authors of the paper that introduced global average pooling (Lin et al., "Network in Network") argue that removing the fully connected classification layers forces the feature maps to be more closely related to the classification categories, so that each feature map becomes a kind of "category confidence map".
Finally, the authors also argue that averaging over the feature maps makes the model more robust to spatial translations in the data. In other words, as long as the requisite feature is activated somewhere in the feature map, it will still be "picked up" by the averaging operation.
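
The parameter count and the translation argument above can be checked with a small NumPy sketch. It assumes a channels-last layout, as in the question's [-1, 128, 1, 32] shape. The helper `global_avg_pool` and the toy arrays are made up for illustration and are not from the original post. In TensorFlow the same reduction would be `tf.reduce_mean(x, axis=[1, 2])`.

```python
import numpy as np

# Weights needed when a 7 x 7 x 64 feature volume is flattened and fed
# into a 500-unit dense layer (biases ignored).
flatten_dense_weights = 7 * 7 * 64 * 500


def global_avg_pool(x):
    """Average over the spatial axes of a channels-last batch (N, H, W, C).

    The pooling itself has no trainable weights.
    """
    return x.mean(axis=(1, 2))


# Two toy feature maps with the question's shape [-1, 128, 1, 32]:
# the same activation in channel 3, but at different spatial positions.
a = np.zeros((1, 128, 1, 32), dtype=np.float32)
b = np.zeros_like(a)
a[0, 10, 0, 3] = 1.0
b[0, 90, 0, 3] = 1.0

pooled_a = global_avg_pool(a)  # shape (1, 32)
pooled_b = global_avg_pool(b)  # shape (1, 32)

# Flattening keeps the position, so a dense layer sees two different inputs.
flat_a = a.reshape(1, -1)
flat_b = b.reshape(1, -1)

print("dense weights after flatten:", flatten_dense_weights)
print("pooled shape:", pooled_a.shape)
print("pooled outputs identical:", np.array_equal(pooled_a, pooled_b))
print("flattened inputs identical:", np.array_equal(flat_a, flat_b))
```

The pooled vectors are identical for both inputs, while the flattened vectors differ. This is the sense in which averaging discards the position of an activation and keeps only its presence and strength.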

Tags: python, tensorflow, deep-learning, conv-neural-network, resnet

Yong Wang
