The animation is from here. I am wondering why the dilated convolution is claimed to preserve resolution. Apparently the input in blue is 7x7 and the output in green is 3x3.
EDIT:
One way to work around the resolution loss is to pad the input with roughly half the size of the current receptive field, but
Its advantage is that the dilated convolution can first capture intrinsical sequence information by expanding the field of convolution kernel without increasing the parameter amount of the model.
Dilated Convolutions are a type of convolution that “inflate” the kernel by inserting holes between the kernel elements. An additional parameter (dilation rate) indicates how much the kernel is widened. There are usually spaces inserted between kernel elements.
Atrous convolution is an alternative for the down sampling layer. It increases the receptive field whilst maintains the spatial dimension of feature maps.
Convolution Arithmetic. Transposed Convolution (Deconvolution, checkerboard artifacts) Dilated Convolution (Atrous Convolution) Separable Convolution (Spatially Separable Convolution, Depthwise Convolution)
This is indeed a dilated convolution with a 5x5 filter. If you imagine the blue part of the animation as a 3x3 image that's 0 padded, it preserves resolution.
With regard to your edit, the emphasis is really in this statement in the post you linked: dilated convolutions support exponential expansion of the receptive field without loss of resolution or coverage
Padding is done to preserve resolution. That is correct.
What we really want here is to expand the size of the receptive field. In the post you linked, with 3 3x3 dilated convolutions at an increasing dilation, we already achieve a receptive field of 15x15 in the feature maps.
To achieve the equivalent with 3x3 convolutions and no loss of coverage and no loss of resolution, we can do it with a stride of 3 (4 would result in loss of coverage) and extremely heavy padding (to the extent where it's like what you said, convolution with mostly padded zeroes). However, we will need 4 3x3 convolutions with stride 3 instead of 3 to achieve a receptive field of 15x15.
On top of that, the normal convolutions would have even more convolutions that don't make sense, compared to the dilated convolution case.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With