Receptive Fields on ConvNets (Receptive Field size confusion)

Tags:

machine-learning

neural-network

deep-learning

computer-vision

conv-neural-network

I was reading this from a paper: "Rather than using relatively large receptive fields in the first conv. layers we use very small 3 × 3 receptive fields throughout the whole net, which are convolved with the input at every pixel (with stride 1). It is easy to see that a stack of two 3 × 3 conv. layers (without spatial pooling in between) has an effective receptive field of 5 × 5; three such layers have a 7 × 7 effective receptive field."

How do they end up with a receptive field of 7x7?

This is how I understand it: Suppose that we have one image that is 100x100.

1st layer: zero-pad the image and convolve it with the 3x3 filter, output another 100x100 filtered image.

2nd layer: zero-pad the previous filtered image and convolve it with another 3x3 filter, output another 100x100 filtered image.

3rd layer: zero-pad the previous filtered image and convolve it with another 3x3 filter, output the final 100x100 filtered image.

What am I missing here?

asked May 10 '16 11:05

Sprk

1 Answer

Here's one way to think of it. Consider the following small image, with each pixel numbered as such:

00 01 02 03 04 05 06
10 11 12 13 14 15 16
20 21 22 23 24 25 26
30 31 32 33 34 35 36
40 41 42 43 44 45 46
50 51 52 53 54 55 56
60 61 62 63 64 65 66

Now consider pixel 33 at the center. After the first 3x3 convolution, the value generated at pixel 33 incorporates the values of pixels 22, 23, 24, 32, 33, 34, 42, 43, and 44. But notice that each of those pixels likewise incorporates its own surrounding pixels' values.

With the next 3x3 convolution, pixel 33 again incorporates the values of its surrounding pixels, but now the values of those pixels already incorporate their own surrounding pixels from the original image. In effect, the value of pixel 33 is governed by values reaching out to a 5x5 "square of influence" (pixels 11 through 55), you could say.
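To make the "square of influence" concrete, here is a minimal sketch that tracks which pixels of the original 7x7 image can affect pixel 33 after a given number of 3x3 convolutions. The function name influencing_pixels and its parameters are made up for illustration; it assumes stride 1 and ignores the actual filter weights, since only the reach matters here.

```python
def influencing_pixels(center, num_layers, size=7):
    """Return the set of (row, col) input pixels that can affect `center`
    after `num_layers` stacked 3x3 convolutions (stride 1) on a size x size image."""
    pixels = {center}
    for _ in range(num_layers):
        # Each layer lets every pixel already in the set pull in its 3x3 neighborhood.
        pixels = {
            (r + dr, c + dc)
            for (r, c) in pixels
            for dr in (-1, 0, 1)
            for dc in (-1, 0, 1)
            if 0 <= r + dr < size and 0 <= c + dc < size  # stay inside the image
        }
    return pixels


# Pixels of the original image that influence pixel 33 after two 3x3 convolutions,
# printed with the same row/column labels as the grid above.
two_layers = influencing_pixels((3, 3), 2)
for row in sorted({r for r, _ in two_layers}):
    cols = sorted(c for r, c in two_layers if r == row)
    print(" ".join(f"{row}{c}" for c in cols))
```

Running it prints the 5x5 block from 11 to 55; passing 1 or 3 as the number of layers gives the 3x3 and 7x7 blocks respectively.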

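Each additional 3x3 convolution stretches the effective receptive field by another pixel in each direction, so a third layer extends the 5x5 square to 7x7. Note that this is independent of the output size: with zero-padding every layer still outputs a 100x100 image, but each output pixel depends on a larger patch of the original input.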
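The growth rule can also be written as a small formula. The sketch below assumes identical square kernels, stride 1, and no pooling or dilation between layers (the setting described in the quoted paper); the function name receptive_field is just an illustrative choice.

```python
def receptive_field(num_layers, kernel_size=3):
    """Side length of the effective receptive field of `num_layers` stacked
    kernel_size x kernel_size convolutions, assuming stride 1 and no pooling."""
    # A single pixel sees only itself (side 1); each layer adds
    # (kernel_size - 1) pixels in total, i.e. one per side for a 3x3 kernel.
    return 1 + num_layers * (kernel_size - 1)


for n in (1, 2, 3):
    side = receptive_field(n)
    print(f"{n} layer(s) of 3x3 -> {side}x{side} receptive field")
```

For one, two, and three layers this gives 3x3, 5x5, and 7x7, matching the figures in the quote.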

I hope that didn't just make it more confusing...

answered Oct 06 '22 01:10

Aenimated1
