I am reading through the residual learning paper, and I have a question. What is the "linear projection" mentioned in section 3.2? It probably looks simple once you get it, but I can't grasp the idea...
Can someone provide a simple example?
First up, it's important to understand what x, y and F are and why they need any projection at all. I'll try to explain in simple terms, but a basic understanding of ConvNets is required.
x is the input data (called a tensor) of the layer; in the case of ConvNets its rank is 4. You can think of it as a 4-dimensional array. F is usually a conv layer (conv+relu+batchnorm in this paper), and y combines the two together (forming the output channel). The result of F is also of rank 4, and most of its dimensions are the same as in x, except for one. That's exactly what the transformation has to patch.
For example, the shape of x might be (64, 32, 32, 3), where 64 is the batch size, 32x32 is the image size and 3 stands for the (R, G, B) color channels. F(x) might be (64, 32, 32, 16): the batch size never changes and, for simplicity, a ResNet conv layer doesn't change the image size either, but it will likely use a different number of filters - here 16.
So, in order for y = F(x) + x to be a valid operation, x must be "reshaped" from (64, 32, 32, 3) to (64, 32, 32, 16).
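Here's a quick numpy sketch of that shape mismatch (the shapes are the ones from the example above; the random tensors just stand in for real activations):

import numpy as np

x = np.random.rand(64, 32, 32, 3)    # layer input: batch of 64 RGB 32x32 images
Fx = np.random.rand(64, 32, 32, 16)  # stand-in for F(x): same spatial size, 16 filters

# y = F(x) + x is only defined if the shapes match;
# here the channel dimensions (16 vs 3) are incompatible.
try:
    y = Fx + x
except ValueError as e:
    print(e)  # operands could not be broadcast together ...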
I'd like to stress that "reshaping" here is not what numpy.reshape does.
Instead, the fourth dimension of x (the 3 channels) is padded with 13 zeros, like this:
pad(x=[1, 2, 3], padding=[7, 6]) = [0, 0, 0, 0, 0, 0, 0, 1, 2, 3, 0, 0, 0, 0, 0, 0]
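This toy 1-D example is easy to reproduce with numpy's pad (just an illustration of the line above, not the actual shortcut code):

import numpy as np

x = np.array([1, 2, 3])
# pad_width=(7, 6): 7 zeros before and 6 zeros after -> length 16
padded = np.pad(x, pad_width=(7, 6), mode='constant', constant_values=0)
print(padded)  # [0 0 0 0 0 0 0 1 2 3 0 0 0 0 0 0]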
If you think about it, this is a projection of a 3-dimensional vector onto 16 dimensions. In other words, we start to think of our vector as being the same, just with 13 more dimensions out there. None of the other dimensions of x are changed.
Here's the link to the code in TensorFlow that does this.
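Putting it together, here's a minimal numpy sketch of this zero-padding shortcut (how the 13 zeros are split between "before" and "after" is an implementation detail; here I simply pad them all after the existing channels):

import numpy as np

x = np.random.rand(64, 32, 32, 3)    # layer input
Fx = np.random.rand(64, 32, 32, 16)  # stand-in for F(x)

# Pad only the last (channel) dimension: 3 -> 16,
# leaving the batch and spatial dimensions untouched.
x_padded = np.pad(x, pad_width=[(0, 0), (0, 0), (0, 0), (0, 13)],
                  mode='constant', constant_values=0)

y = Fx + x_padded   # now a valid element-wise addition
print(y.shape)      # (64, 32, 32, 16)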