 

How do filters run across an RGB image, in first layer of a CNN?

I was looking at this printout of layers. I realized it shows input/output shapes, but nothing about how the RGB channels are dealt with.

(screenshot: a Keras model summary listing the network's layers)

If you look at block1_conv1, it says "Conv2D". But if the input is 224 x 224 x 3, then that's not 2D.

But my bigger, broader question is: how are 3-channel inputs treated throughout the course of training a model like this (I think it's VGG16)? Are the RGB channels combined (summed, or concatenated) at some point? When and where? Does that require some unique filter for that purpose? Or does the model run across the different channel/color representations separately from end to end?

asked Jun 21 '20 by Monica Heddneck

2 Answers

The "2D" part of a 2D convolution does not refer to the dimension of the convolution input, nor the dimension of the filter itself, but rather the space in which the filter is allowed to move (2 directions only). Another way of thinking about this is that each of the RGB channels gets its own 2D-array filter, applied separately, and the outputs are added at the end.

does the model run across the different channel/color representations separately from end to end?

Effectively it does this across each channel separately. For example, the first Conv2D layer takes in each of the 3 224x224 channels separately and applies a different 2D-array filter to each one. But this isn't end-to-end across all model layers, only within a layer during the convolution step.
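To make the "own 2D filter per channel" idea concrete, here is a minimal NumPy sketch (not how any library actually implements it) of one hypothetical Conv2D filter applied to a toy 3-channel input. Note the loops only move the filter along height and width:

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.standard_normal((5, 5, 3))    # H x W x RGB (toy input)
kernel = rng.standard_normal((3, 3, 3))   # one Conv2D filter: a 2D slice per channel

# Each channel is filtered by its own 2D slice of the kernel; the filter
# slides along height and width only (2 directions), never along channels.
maps = np.zeros((3, 3, 3))                # one intermediate map per channel
for c in range(3):
    for i in range(3):
        for j in range(3):
            maps[i, j, c] = np.sum(image[i:i+3, j:j+3, c] * kernel[:, :, c])
print(maps.shape)  # (3, 3, 3): three per-channel maps, not yet combined
```

The three per-channel maps are then combined, which is exactly the question answered next.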

But, you might ask: if there are 64 convolution filters and 3 channels, why are there not 3*64 = 192 channels in the Conv2D output? This prompts your question:

Are the RGB channels combined (summed, or concatenated) at some point?

The answer is: yes. After the convolution filter is applied to each layer separately, the values for each of the three channels are added, and then a bias too if you've specified that. See the diagram below (from Dive Into Deep Learning, under CC BY-SA 4.0):

Dive Into Deep Learning Diagram

The reason for this (that the separate channel outputs are added) is that there aren't actually 3 separate 2D-array filters per channel; technically there's just one 3D-array filter which only moves in two directions. You could think of it a bit like a hamburger: there's one 2D-array filter for one channel (the bun), another for the next channel (the lettuce), etc., but all the layers are stacked up and function as a whole, so the per-channel products are summed in one step. There's no need for a special weight or filter to do the combining, since the weights are already present during the convolution step (a separate combining weight would just be multiplying two fitted parameters together, which may as well be one).
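The hamburger picture can be checked numerically: summing three separate per-channel 2D filterings gives exactly the same result as applying one stacked 3D filter in a single step. A small self-contained NumPy sketch (toy shapes, hypothetical random weights):

```python
import numpy as np

rng = np.random.default_rng(1)
image = rng.standard_normal((5, 5, 3))    # H x W x RGB
kernel = rng.standard_normal((3, 3, 3))   # the whole "hamburger": one 3D filter

# View 1: three 2D filters applied per channel, outputs added.
sum_of_maps = np.zeros((3, 3))
for c in range(3):
    for i in range(3):
        for j in range(3):
            sum_of_maps[i, j] += np.sum(image[i:i+3, j:j+3, c] * kernel[:, :, c])

# View 2: one 3D filter moved in two directions only.
one_3d_filter = np.zeros((3, 3))
for i in range(3):
    for j in range(3):
        one_3d_filter[i, j] = np.sum(image[i:i+3, j:j+3, :] * kernel)

print(np.allclose(sum_of_maps, one_3d_filter))  # True: the two views are identical
```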

answered Sep 25 '22 by nathan.j.mcdougall


2D convolutions?

This is a good resource.

If you look at block1_conv1, it says "Conv2D". But if the input is 224 x 224 x 3, then that's not 2D.

You misunderstand the meaning of 2D convolution. The 2D convolution moves across your input along the height and width of each channel. It never gets moved in the 3rd dimension. Because it moves in this plane, it is a 2D convolution. The actual filter is 3D, yes.


But we effectively have 3 images (red, green and blue channels). How are they combined into a single output channel?

"By the end of the ConvNet architecture we will reduce the full image into a single vector of class scores." — Stanford CS231n's website

The output of each filter slice, convolved with its respective input channel, is summed element-wise across channels. A bias is then added to all elements, if you want one. You might not. So if you have 3 input channels and 1 filter, you get an X-Y-1 output. If you have 2 (or 1, or 3, or 4, or 1000) input channels and 15 filters, you just get X-Y-15. X and Y are the output height and width, which depend on the other parameters of the convolution (kernel size, stride, padding).
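The output-shape rule (one channel per filter, regardless of input depth) can be sketched in plain NumPy with toy shapes; the 15-filter, 3-channel numbers below are just illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
image = rng.standard_normal((6, 6, 3))        # H x W x C_in
filters = rng.standard_normal((15, 3, 3, 3))  # 15 filters, each kH x kW x C_in
bias = rng.standard_normal(15)                # one optional bias per filter

out = np.zeros((4, 4, 15))                    # X x Y x num_filters
for f in range(15):
    for i in range(4):
        for j in range(4):
            out[i, j, f] = np.sum(image[i:i+3, j:j+3, :] * filters[f]) + bias[f]
print(out.shape)  # (4, 4, 15): output depth = number of filters, not 3 * 15
```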


3D convolutions

3D convolutions are different, in that each filter is 3D and moves across different channels, so each weight is used in all channels.
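For contrast, here is a minimal NumPy sketch of a single-filter 3D convolution over a toy one-channel volume. The only change from the 2D case is a third loop: the kernel now also slides along the depth axis, so the same weights are reused at every depth position:

```python
import numpy as np

rng = np.random.default_rng(3)
volume = rng.standard_normal((5, 5, 5))   # D x H x W (toy one-channel volume)
kernel = rng.standard_normal((3, 3, 3))

# The kernel slides along all three axes, so the output is 3D too.
out = np.zeros((3, 3, 3))
for d in range(3):
    for i in range(3):
        for j in range(3):
            out[d, i, j] = np.sum(volume[d:d+3, i:i+3, j:j+3] * kernel)
print(out.shape)  # (3, 3, 3), where the 2D case would give a (3, 3) map
```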


What are 1x1 convolutions then?

It's a trick used to reduce the computational cost of a network by reducing the number of channels (the depth of the input) without doing any spatial filtering. This is useful because fewer channels means fewer weights need to be computed in whatever uses this output as input, such as bigger (3x3 or 5x5) convolutions.
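A quick back-of-the-envelope check of the savings, using hypothetical channel counts (256 in, a 1x1 bottleneck down to 64, then 256 out):

```python
# Weight count for a 3x3 conv straight through: kH * kW * C_in * C_out.
c_in, c_mid, c_out, k = 256, 64, 256, 3

direct = k * k * c_in * c_out            # 3x3 conv directly on 256 channels
bottleneck = (1 * 1 * c_in * c_mid       # 1x1 conv: 256 -> 64 channels
              + k * k * c_mid * c_out)   # 3x3 conv on only 64 channels

print(direct, bottleneck)  # 589824 vs 163840: roughly 3.6x fewer weights
```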

answered Sep 26 '22 by Ben Butterworth