I have been going through the paper, Multi-Scale Context Aggregation by Dilated Convolutions.
In it, they propose using dilated convolutions to aggregate global context instead of max-pooling/downsampling, since pooling shrinks the feature maps while dilated convolutions do not.
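To make the contrast concrete, here is a minimal sketch of my own (in PyTorch, not the paper's code): a 2×2 max-pool halves the spatial size, while a 3×3 convolution with dilation 2 and matching padding keeps the size yet covers a 5×5 window.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 64, 64)  # dummy input: one 64x64 RGB image

pool = nn.MaxPool2d(kernel_size=2, stride=2)
print(pool(x).shape)  # torch.Size([1, 3, 32, 32]) -> resolution halved

# a 3x3 kernel with dilation=2 spans a 5x5 window; padding=2 preserves the size
dilated = nn.Conv2d(3, 3, kernel_size=3, dilation=2, padding=2)
print(dilated(x).shape)  # torch.Size([1, 3, 64, 64]) -> resolution preserved
```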
My first question is: they modify VGG-16 and remove the last two max-pooling layers, but they leave the other three max-pooling layers in. Why did they not remove all of the max-pooling layers? Computational efficiency? Won't this still result in a smaller output? How do they expand it back to the original size, with bilinear interpolation?
My second question is: They note in the paper:
"We also remove the padding of the intermediate feature maps. Intermediate padding was used in the original classification network, but is neither necessary nor justified in dense prediction."
Why would that be the case? If you don't pad, won't you further reduce the size of the final output, especially given that dilated convolutions can have very large receptive fields?
Answering your first question: I think you are correct, the output is 1/8th of the original size and they use interpolation to upsample it back to the original resolution. You can find the evidence in the source code available here. In the file test.py, function test_image, the default zoom is set to 8 (line 103). More evidence can be found in the file train.py, where again the default zoom is set to True and they use an upsampling layer.
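The upsampling step itself can be illustrated with a small sketch (in PyTorch for brevity; the repository itself is Caffe-based), assuming a 1/8-resolution score map that is bilinearly interpolated back to the input size:

```python
import torch
import torch.nn.functional as F

# score map at 1/8 of a 512x512 input, with 21 class channels (e.g. Pascal VOC)
scores = torch.randn(1, 21, 64, 64)

# bilinear upsampling by a factor of 8, analogous to the "zoom" step
upsampled = F.interpolate(scores, scale_factor=8, mode='bilinear', align_corners=False)
print(upsampled.shape)  # torch.Size([1, 21, 512, 512])
```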
And since they are already reducing the size, they don't need to use padding just to retain it. The reason I think padding is not needed in their case is that segmentation is dense prediction, so introducing pixels of our own at the borders doesn't intuitively make sense. But the best way to settle this would be to test the network both with and without intermediate padding. See the rough size check below.
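To see how much an unpadded layer shrinks the map, here is a rough check (again my own PyTorch illustration, not the paper's code): a 3×3 convolution with dilation d and no padding trims 2·d pixels from each spatial dimension.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 66, 66)  # dummy intermediate feature map

for d in (1, 2, 4):
    conv = nn.Conv2d(64, 64, kernel_size=3, dilation=d, padding=0)
    # each unpadded 3x3 conv with dilation d loses 2*d pixels per dimension:
    # d=1 -> 64x64, d=2 -> 62x62, d=4 -> 58x58
    print(d, conv(x).shape[-2:])
```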