 

What's the use of dilated convolutions?

I refer to Multi-Scale Context Aggregation by Dilated Convolutions.

  • A 2x2 kernel would have holes inserted so that it effectively becomes a 3x3 kernel.
  • A 3x3 kernel would have holes inserted so that it effectively becomes a 5x5 kernel.
  • The above assumes a one-pixel gap between weights (i.e., a dilation rate of 2), of course.

I can clearly see that this lets you effectively use 4 parameters while having a 3x3 receptive field, or 9 parameters while having a 5x5 receptive field.

Is the point of dilated convolutions simply to save on parameters while reaping the benefit of a larger receptive field, and thus save memory and computation?
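
For concreteness, here is a minimal sketch of what I mean (I'm using PyTorch's dilation argument purely for illustration; the question isn't framework-specific):

```python
import torch
import torch.nn as nn

# A 3x3 kernel with dilation=2 spans a 5x5 window but still has only 9 weights.
conv = nn.Conv2d(1, 1, kernel_size=3, dilation=2, padding=2, bias=False)

print(sum(p.numel() for p in conv.parameters()))   # 9 parameters
x = torch.randn(1, 1, 32, 32)
print(conv(x).shape)                               # torch.Size([1, 1, 32, 32]): resolution preserved
```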

asked Dec 16 '16 by jkschin



2 Answers

TLDR

  1. Dilated convolutions generally improve performance (see the better semantic segmentation results in Multi-Scale Context Aggregation by Dilated Convolutions).
  2. More importantly, the architecture is based on the fact that dilated convolutions support exponential expansion of the receptive field without loss of resolution or coverage.
  3. They allow a larger receptive field at the same computation and memory cost, while also preserving resolution.
  4. Pooling and strided convolutions are similar concepts, but both reduce the resolution (see the shape comparison right after this list).
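
As a quick sanity check of points 3 and 4, here is a minimal shape comparison (PyTorch is my choice here purely for illustration):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 32, 32)

pool    = nn.MaxPool2d(kernel_size=2)                            # 32x32 -> 16x16
strided = nn.Conv2d(1, 1, kernel_size=3, stride=2, padding=1)    # 32x32 -> 16x16
dilated = nn.Conv2d(1, 1, kernel_size=3, dilation=2, padding=2)  # 32x32 -> 32x32

print(pool(x).shape)     # torch.Size([1, 1, 16, 16])
print(strided(x).shape)  # torch.Size([1, 1, 16, 16])
print(dilated(x).shape)  # torch.Size([1, 1, 32, 32]) - resolution preserved
```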

@Rahul referenced WaveNet, which puts it very succinctly in Section 2.1, Dilated Causal Convolutions. It is also worth looking at Multi-Scale Context Aggregation by Dilated Convolutions. I break it down further here:

Screenshot from https://arxiv.org/pdf/1511.07122.pdf

  • Figure (a) is a 1-dilated 3x3 convolution filter. In other words, it's a standard 3x3 convolution filter.
  • Figure (b) is a 2-dilated 3x3 convolution filter. The red dots are where the weights are; everywhere else is 0. In other words, it's a 5x5 convolution filter with 9 non-zero weights and zeros everywhere else, as mentioned in the question (the short numerical check after this list verifies the equivalence). The receptive field in this case is 7x7, because each unit in the previous layer's output itself has a 3x3 receptive field. The highlighted portions in blue show the receptive field and NOT the convolution filter (you could view it as a convolution filter if you wanted to, but that's not helpful).
  • Figure (c) is a 4-dilated 3x3 convolution filter. It's a 9x9 convolution filter with 9 non-zero weights and zeros everywhere else. From (b), each unit now has a 7x7 receptive field, and hence you can see a 7x7 blue portion around each red dot.
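
If it helps, the 2-dilated case in (b) can be checked numerically; a minimal sketch (again, PyTorch is just my choice for illustration):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(1, 1, 16, 16)
w = torch.randn(1, 1, 3, 3)                    # the 9 actual weights

# (1) 2-dilated 3x3 convolution
y_dilated = F.conv2d(x, w, dilation=2)

# (2) the same 9 weights scattered into a 5x5 kernel, zeros everywhere else
w_big = torch.zeros(1, 1, 5, 5)
w_big[:, :, ::2, ::2] = w
y_big = F.conv2d(x, w_big)

print(torch.allclose(y_dilated, y_big))        # True: identical outputs
```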

To draw an explicit contrast, consider this:

  • If we use 3 successive layers of 3x3 convolution filters with stride 1, the effective receptive field at the end is only 7x7. However, at the same computation and memory cost, dilated convolutions achieve 15x15. Both operations preserve resolution (the short calculation after this list reproduces these numbers).
  • If we use 3 successive layers of 3x3 convolution filters whose stride grows exponentially at exactly the same rate as the dilations in the paper, we also get a 15x15 receptive field at the end, but we eventually lose coverage as the stride gets larger. This loss of coverage means the effective receptive field is no longer the contiguous region shown above; some parts are not covered.
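
The receptive-field numbers above can be reproduced with a short recurrence: for stride-1 layers, each layer adds (kernel_size - 1) * dilation to the receptive field.

```python
def receptive_field(kernel_sizes, dilations):
    """Effective receptive field of a stack of stride-1 convolution layers."""
    rf = 1
    for k, d in zip(kernel_sizes, dilations):
        rf += (k - 1) * d
    return rf

print(receptive_field([3, 3, 3], [1, 1, 1]))   # 7  -> plain 3x3 convolutions
print(receptive_field([3, 3, 3], [1, 2, 4]))   # 15 -> dilations as in the paper
```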
answered Oct 08 '22 by jkschin

In addition to the benefits you already mentioned, such as a larger receptive field, efficient computation, and lower memory consumption, dilated causal convolutions also have the following benefits:

  • They preserve the resolution/dimensions of the data at the output layer. This is because the layers are dilated instead of pooled, hence the name dilated causal convolutions.
  • They maintain the ordering of the data. For example, in 1D dilated causal convolutions, where the prediction of each output depends only on previous inputs, the causal structure of the convolution preserves that ordering (see the sketch after this list).
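
As a rough illustration of the ordering point (a sketch only; WaveNet itself stacks many such layers with gating), causality comes from padding only on the left, so each output position sees only current and past inputs while the output length matches the input length:

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 1, 20)                      # (batch, channels, time)
w = torch.randn(1, 1, 2)                       # kernel size 2, as in WaveNet
dilation = 4

# Left-pad by (kernel_size - 1) * dilation so no output depends on future samples.
pad = (w.shape[-1] - 1) * dilation
y = F.conv1d(F.pad(x, (pad, 0)), w, dilation=dilation)

print(y.shape)                                 # torch.Size([1, 1, 20]): same length as the input
```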

I'd refer you to the excellent WaveNet paper, which applies dilated causal convolutions to raw audio waveforms to generate speech and music, and even to recognize speech from the raw waveform.

I hope you find this answer helpful.

answered Oct 08 '22 by Rahul