I'm reading the Inception paper by Szegedy et al. (https://arxiv.org/abs/1512.00567) and I'm having trouble understanding how they reduce the amount of computation by replacing a single 5x5 filter with two layers of 3x3 filters (section 3.1).
In particular, this passage:
If we would naively slide a network without reusing the computation between neighboring grid tiles, we would increase the computational cost. Sliding this network can be represented by two 3x3 convolutional layers which reuses the activations between adjacent tiles.
I don't understand how we can reuse those activations.
I have been working through this confusion as well, seemingly every time I need to revisit the Inception papers.
The proper setting for the comparison is a toy input image that is 5x5 in shape. To produce a 5x5 output image with a 5x5 convolution, you need to pad the original image with 2 extra pixels on the top, bottom and sides, and then proceed with your usual 5x5 convolution. There are 25 weight parameters for the convolution filter, and every pixel of the output requires a weighted sum of 25 items of the input.
Now, instead of the 5x5 filter, we will do two stages. First we will pad the original image with 1 extra pixel along the top, bottom and sides, to make it conformable for a standard 3x3 convolution at every point.
This produces an intermediate image, the same shape as the input because of the padding, where each pixel is the result of a 3x3 convolution (so each pixel requires a weighted sum of 9 items).
Now we repeat this for a final-stage 3x3 convolution, starting from the intermediate image produced by the first 3x3 convolution. Again, we pad by 1 pixel on the top, bottom and sides, and each item of the output is a weighted sum of 9 items of its input.
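To make this concrete, here is a minimal sketch (assuming NumPy and SciPy are available; this is my own illustration, not code from the paper). It pushes a single impulse pixel through two padded 3x3 convolutions and checks that its influence spreads over exactly a 5x5 patch, the same receptive field a single padded 5x5 convolution covers:

```python
import numpy as np
from scipy.signal import convolve2d

impulse = np.zeros((9, 9))
impulse[4, 4] = 1.0                 # a single "lit" input pixel

k3 = np.ones((3, 3))                # arbitrary 3x3 filter; values don't matter here
stage1 = convolve2d(impulse, k3, mode='same')  # pad by 1, output stays 9x9
stage2 = convolve2d(stage1, k3, mode='same')   # pad by 1 again, still 9x9

rows, cols = np.nonzero(stage2)
print(rows.min(), rows.max(), cols.min(), cols.max())  # 2 6 2 6, i.e. a 5x5 patch
```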
The diagram you provided in your question demonstrates how this aggregates the same span of spatial information as a 5x5 convolution, just calculated by a different pair of weighted sums for the two 3x3 convolutions. To be clear, the calculation is not the same, because the two 9-d filters of coefficients are not constrained to be learned to match the 25-d filter of coefficients. The weights are likely different, but sequentially the two 3x3 convolutions can span the same extent of the original image as the 5x5 convolution could.
In the end, we can see that each unit of the output in the 5x5 case required 25 multiply-add ops. Each unit of the final output in the sequential 3x3 case required first 9 multiply-adds to make the units of the first 3x3 convolution and then 9 multiply-adds to make the units of the final output.
The specific comment about "activation sharing" refers to the fact that you only calculate the values of the intermediate 3x3 convolution one time. You spend 9 ops for each of those units, but once they are created, you only spend 9 more ops to get to the final output cells. You don't repeat the effort of creating the first 3x3 convolution over and over for each unit of the final output.
This is why it's not counted as requiring 81 operations per each output unit. When you bump to the next (i, j) position of the final 3x3 convolution output, you are re-using a bunch of pixels of the intermediate 3x3 convolution, so you're only doing 9 more ops to get to the final output each time.
The number of ops for a 5x5 padded convolution of a 5x5 input is 25 * 25.
The number of ops for the first 3x3 padded convolution is 25 * 9, and from there you add the cost of another padded 3x3 convolution, so overall it becomes 25 * 9 + 25 * 9 = 25 * 18.
This is how they get the ratio (25 * 25) / (25 * 18) = 25/18, which is the inverse of the (9+9)/25 = 18/25 reduction quoted in the paper.
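The arithmetic is small enough to restate in a few lines of plain Python (my own sanity check, not from the paper), including the naive recompute-everything cost described above:

```python
pixels = 5 * 5                        # output pixels for the padded 5x5 input

cost_5x5   = pixels * 25              # one 5x5 conv: 25 multiply-adds per output
cost_reuse = pixels * 9 + pixels * 9  # two 3x3 convs, intermediate computed once
cost_naive = pixels * (9 * 9 + 9)     # recomputing the intermediate per output

print(cost_5x5, cost_reuse, cost_naive)  # 625 450 2250
print(cost_5x5 / cost_reuse)             # 1.388..., i.e. the 25/18 ratio
```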
As it happens, this is also the reduction in the total number of parameters.
I think the key is that the original diagram (from the paper and from your question) does a really bad job of indicating that you would first pay the standard 3x3 convolution cost to create the intermediate set of pixels across the whole original 5x5 input (with padding), and only then run the second 3x3 convolution across that intermediate result (this is what they mean by re-using activations).
The picture makes it seem like, for every individual final output pixel, you would slide the first 3x3 convolution around all the right spots to build its entire 3x3 intermediate patch, compute the 9-op weighted sum each time (81 ops overall), and then compute the final 9-op weighted sum to get a single pixel of output; then go back to the original, bump the convolutions over by 1 spot, and repeat. But this is not correct, and would not "re-use" the intermediate convolution layer; rather, it would separately re-compute it for each unit of the final output layer.
Overall though, I agree this is super non-trivial and hard to think about. The paper really glosses over it and assumes way too much context is already in the mind of the reader.
So first of all, the author states this:
This way, we end up with a net (9+9) / 25 × reduction of computation, resulting in a relative gain of 28% by this factorization.
And he is right: for one 5x5 filter, you have to use 25 (5*5) individual weights. For two 3x3 filters, you have to use 9 + 9 (3*3 + 3*3) individual weights. So using two 3x3 filters requires fewer parameters. However, you are right that this doesn't automatically mean it requires less computation: at first glance, using two 3x3 filters seems to require more operations.
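A quick sketch of that parameter arithmetic (my own illustration of the paper's (9+9)/25 figure):

```python
params_5x5     = 5 * 5               # 25 weights
params_two_3x3 = 3 * 3 + 3 * 3       # 18 weights
print(params_two_3x3 / params_5x5)   # 0.72, i.e. a 28% relative reduction
```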
Let's compare the number of operations for the two options for a given n*n input, using valid (unpadded) convolutions. Each step of the walkthrough lists the output size of a convolution, which is (n - filtersize + 1)^2, and its corresponding number of operations. Let's start with a 5x5 input:
1. The single 5x5 filter: (5 - 5 + 1)^2 = 1x1 output. So 1*1*25 = 25 operations
2. The first 3x3 filter: (5 - 3 + 1)^2 = 3x3 output. So 3*3*9 = 81 operations
3. The second 3x3 filter: (3 - 3 + 1)^2 = 1x1 output. So 1*1*9 = 9 operations
So 25 vs 90 operations. Using a single 5x5 filter is best for a 5x5 input.
Next, the 6x6 input:
1. The single 5x5 filter: (6 - 5 + 1)^2 = 2x2 output. So 2*2*25 = 100 operations
2. The first 3x3 filter: (6 - 3 + 1)^2 = 4x4 output. So 4*4*9 = 144 operations
3. The second 3x3 filter: (4 - 3 + 1)^2 = 2x2 output. So 2*2*9 = 36 operations
So 100 vs 180 operations. Using a single 5x5 filter is best for a 6x6 input.
Let's jump ahead to an 8x8 input:
1. The single 5x5 filter: (8 - 5 + 1)^2 = 4x4 output. So 4*4*25 = 400 operations
2. The first 3x3 filter: (8 - 3 + 1)^2 = 6x6 output. So 6*6*9 = 324 operations
3. The second 3x3 filter: (6 - 3 + 1)^2 = 4x4 output. So 4*4*9 = 144 operations
So 400 vs 468 operations. Using a single 5x5 filter is best for an 8x8 input.
Notice the pattern? Given an input size of n*n, the number of operations for a single 5x5 filter follows this formula:
(n - 4)*(n - 4) * 25
And for two 3x3 filters:
(n - 2)*(n - 2) * 9 + (n - 4)*(n - 4) * 9
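These formulas are easy to sanity-check in plain Python (my own sketch, assuming the valid, unpadded convolutions used in the walkthrough above):

```python
def ops_5x5(n):
    # single 5x5 filter on an n x n input, no padding
    return (n - 4) * (n - 4) * 25

def ops_two_3x3(n):
    # first 3x3 on the n x n input, second 3x3 on the (n-2) x (n-2) intermediate
    return (n - 2) * (n - 2) * 9 + (n - 4) * (n - 4) * 9

for n in (5, 6, 8, 10, 11):
    print(n, ops_5x5(n), ops_two_3x3(n))
# 5    25   90
# 6   100  180
# 8   400  468
# 10  900  900   <- the two costs meet here
# 11 1225 1170   <- two 3x3 filters are cheaper from here on
```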
So let's plot these:
[Plot of both formulas as a function of n; the two curves intersect at n = 10.]
They intersect! The two costs are equal at n = 10 (900 operations each), and from n = 11 onwards two 3x3 filters require fewer operations.
Conclusion: it is effective to use two 3x3 filters for inputs larger than n = 10. Additionally, regardless of n, fewer parameters need to be tuned for two 3x3 filters in comparison with a single 5x5 filter.
The article is kind of weird though; it makes using two 3x3 filters over one 5x5 filter feel 'obvious' for some reason:
This setup clearly reduces the parameter count by sharing the weights between adjacent tiles.
it seems natural to exploit translation invariance again and replace the fully connected component by a two layer convolutional architecture
If we would naively slide