I'm reading a paper about using a CNN (convolutional neural network) for object detection:
Rich feature hierarchies for accurate object detection and semantic segmentation
Here is a quote about the receptive field:
The pool5 feature map is 6x6x256 = 9216 dimensional. Ignoring boundary effects, each pool5 unit has a receptive field of 195x195 pixels in the original 227x227 pixel input. A central pool5 unit has a nearly global view, while one near the edge has a smaller, clipped support.
My questions are:
1) What is meant by "receptive field" here?
2) How do you compute the receptive field of a unit?
3) Is there a library or tool to compute it?
In a deep learning context, the receptive field (RF) is defined as the size of the region in the input that produces a given feature [3]. In other words, it measures how large an input region (patch) an output feature (of any layer) is associated with.
If all strides are 1, then the receptive field is simply 1 + Σ_l (k_l − 1), i.e. the sum of (k_l − 1) over all layers, plus 1, which is simple to see.
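As a quick sanity check, here is a minimal sketch of that stride-1 formula in Python (the kernel sizes are just an example I picked):

```python
# Receptive field of a stack of stride-1 convolutions:
# RF = 1 + sum over layers of (k - 1).
def receptive_field_stride1(kernel_sizes):
    return 1 + sum(k - 1 for k in kernel_sizes)

# Example: three stacked 3x3 convolutions -> RF = 1 + 3*(3 - 1) = 7
print(receptive_field_stride1([3, 3, 3]))  # 7
```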
1) It is the size of the region of input pixels that impacts the output of the last convolution.
2) For each convolution and pooling operation, compute the size of the output. Now find the input size that results in an output size of 1x1. That input size is the receptive field.
3) You don't need to use a library to do it. For every 2x2 pooling, the output size is reduced by half along each dimension. For strided convolutions, you also divide the size of each dimension by the stride. You may have to shave off some of each dimension, depending on whether you use padding for your convolutions. The simplest case is to use padding = floor(kernel_size / 2), so that a convolution does not cause any extra change in the output size. A worked example in Python is sketched below.
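To make point 2) concrete, here is a minimal Python sketch using the standard recursive bookkeeping: the receptive field grows by (k − 1) times the cumulative stride at each layer, which reduces to the 1 + Σ (k_l − 1) formula above when all strides are 1. The layer list uses the well-known AlexNet kernel sizes and strides up to pool5, and the result matches the 195x195 figure quoted from the paper:

```python
# Each layer is (kernel_size, stride). Padding does not change the
# receptive-field size, only which input pixels fall inside it.
ALEXNET_TO_POOL5 = [
    (11, 4),  # conv1
    (3, 2),   # pool1
    (5, 1),   # conv2
    (3, 2),   # pool2
    (3, 1),   # conv3
    (3, 1),   # conv4
    (3, 1),   # conv5
    (3, 2),   # pool5
]

def receptive_field(layers):
    rf = 1    # receptive field of one unit in the current layer
    jump = 1  # distance in input pixels between adjacent units
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf

print(receptive_field(ALEXNET_TO_POOL5))  # 195
```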