I read about convolutional neural networks here and then started playing with torch7. I am confused about the convolutional layer of a CNN.
From the tutorial,
The neurons in a layer will only be connected to a small region of the layer before it, instead of all of the neurons in a fully-connected manner.
For example, suppose that the input volume has size [32x32x3], (e.g. an RGB CIFAR-10 image). If the receptive field is of size 5x5, then each neuron in the Conv Layer will have weights to a [5x5x3] region in the input volume, for a total of 5*5*3 = 75 weights.
If the input volume is [32x32x3], the CONV layer will compute the output of neurons that are connected to local regions in the input, each computing a dot product between their weights and the region they are connected to in the input volume. This may result in a volume such as [32x32x12].
I started playing with what a CONV layer might do to an image, in torch7. Here is my implementation:
require 'image'
require 'nn'
i = image.lena()
model = nn.Sequential()
model:add(nn.SpatialConvolutionMM(3, 10, 5, 5)) -- 3 input planes, 10 output planes, 5x5 filter
res = model:forward(i)
itorch.image(res)
print(#i)
print(#res)
Output:
3
512
512
[torch.LongStorage of size 3]
10
508
508
[torch.LongStorage of size 3]
Now let's see the structure of a CNN:
So, my questions are:
Is the convolution done like this: say we take a 32x32x3 image and a 5x5 filter; the 5x5 filter then passes over the whole 32x32 image and produces the convolved images? Okay, so sliding the 5x5 filter across the whole image we get one image; if there are 10 output layers, we get 10 images (as you can see from the output). How do we get these? (See the image above for clarification if required.)
What is the number of neurons in the conv layer? Is it the number of output layers? In the code I've written above, model:add(nn.SpatialConvolutionMM(3, 10, 5, 5)), is it 10 (the number of output layers)?
If so, then point number 2 does not make any sense, because according to it, "If the receptive field is of size 5x5, then each neuron in the Conv Layer will have weights to a [5x5x3] region in the input volume, for a total of 5*5*3 = 75 weights."
So what are the weights here? I am greatly confused by this. In the model defined in torch there is no weight, so how do the weights come into play here?
Can someone explain what is going on?
Is the convolution done like this: say we take a 32x32x3 image and a 5x5 filter; the 5x5 filter then passes over the whole 32x32 image and produces the convolved images?
For a 32x32x3 input image, a 5x5 filter iterates over the image and, at each position, looks at the 5x5 neighborhood around that pixel. That neighborhood contains 5*5*3 = 75 values. Below is an example image for a 3x3 filter over a single input channel, i.e. one with a neighborhood of 3*3*1 values (source).
For each individual neighbor the filter has one parameter (aka weight), so 75 parameters. To calculate a single output value (the value at pixel x, y) it reads those neighbor values, multiplies each one by the respective parameter/weight, and adds them all up at the end (see discrete convolution). The optimal weights have to be learned during training.
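A quick way to convince yourself of this is to compute one output value by hand and compare it with what the layer produces. Below is a minimal sketch (not from the original post); it assumes nn.SpatialConvolution, whose weight tensor is laid out as nOutputPlane x nInputPlane x kH x kW, rather than the SpatialConvolutionMM used above:
require 'nn'
input = torch.rand(3, 32, 32)              -- a random 3x32x32 "RGB" input
conv = nn.SpatialConvolution(3, 1, 5, 5)   -- 3 input planes, 1 filter, 5x5 kernel
out = conv:forward(input)                  -- 1x28x28 output plane
patch = input[{ {}, {1, 5}, {1, 5} }]      -- the 3x5x5 neighborhood of output pixel (1,1)
-- multiply every neighbor by its weight, sum up, add the bias
-- (torch's spatial convolution is a cross-correlation, so no kernel flipping is needed)
manual = torch.cmul(patch, conv.weight[1]):sum() + conv.bias[1]
print(manual, out[1][1][1])                -- both numbers should match up to rounding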
So one filter will iterate over the image and generate a new output, pixel by pixel. If you have multiple filters (i.e. the second parameter in SpatialConvolutionMM is >1) you get multiple outputs ("planes" in torch).
Okay, so sliding the 5x5 filter across the whole image we get one image; if there are 10 output layers, we get 10 images (as you can see from the output). How do we get these? (See the image above for clarification if required.)
Each output plane gets generated by its own filter. Each filter has its own parameters (5*5*3 parameters in your example). The process for multiple filters is exactly the same as it is for one.
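You can see this directly in torch by looking at the layer's weight tensor; a short sketch (again assuming nn.SpatialConvolution for its 4D weight layout):
require 'nn'
conv = nn.SpatialConvolution(3, 10, 5, 5)  -- 10 filters over 3 input planes
print(#conv.weight)                        -- 10x3x5x5: one 3x5x5 filter per output plane
print(#conv.bias)                          -- 10: one bias per output plane
print(#conv.weight[3])                     -- 3x5x5, i.e. the 5*5*3 = 75 weights of filter 3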
What is the number of neurons in the conv layer? Is it the number of output layers? In the code I've written above, model:add(nn.SpatialConvolutionMM(3, 10, 5, 5)), is it 10 (the number of output layers)?
You should call them weights or parameters; "neurons" doesn't really fit for convolutional layers. The number of parameters is, as described, 5*5*3 = 75 per filter in your example. As you have 10 filters ("output planes") you have 750 parameters in total. If you add a second layer to your network with model:add(nn.SpatialConvolutionMM(10, 10, 5, 5)) you would have an additional 5*5*10 = 250 parameters per filter and 250*10 = 2500 in total. Notice how that number can grow quickly (512 filters/output planes in one layer operating on 256 input planes is not uncommon).
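If you want torch to do the counting for you, here is a rough sketch (note that each filter also carries one bias, which the counts above leave out):
require 'nn'
model = nn.Sequential()
model:add(nn.SpatialConvolutionMM(3, 10, 5, 5))   -- 10 * (5*5*3)  = 750 weights + 10 biases
model:add(nn.SpatialConvolutionMM(10, 10, 5, 5))  -- 10 * (5*5*10) = 2500 weights + 10 biases
params = model:getParameters()                    -- all learnable parameters, flattened
print(params:nElement())                          -- 3270 = 750 + 10 + 2500 + 10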
For further reading you should look at http://neuralnetworksanddeeplearning.com/chap6.html. Scroll down to the section "Introducing convolutional networks". Under "Local receptive fields" there are visualizations which will probably help you understand what a filter does (one was shown above).
Disclaimer: The information I have provided below is mostly extracted from the following papers: "Information Processing in the Cat's Visual Cortex", "Gradient-Based Learning Applied to Document Recognition", "Neocognitron", and "Receptive Fields in the Cat's Visual Cortex".
Is the convolution done like this: say we take a 32x32x3 image and a 5x5 filter; the 5x5 filter then passes over the whole 32x32 image and produces the convolved images?
Yes, a 5x5 filter passes over the whole image, creating a 28x28 feature map (not an RGB image; the three input channels are combined into a single output plane). Each unit in the so-called "feature map" receives 5x5x3 inputs connected to a 5x5 area in the input image (this 5x5 neighborhood is called the unit's "local receptive field"). Receptive fields of neighboring (adjacent) units in a feature map are centered on neighboring (adjacent) units in the previous layer.
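The spatial size follows from the "valid" convolution formula outW = (inW - kW) + 1 with stride 1 and no padding, which is also why the 512x512 lena image in the question comes out as 508x508. A small sketch (not from the original answer):
require 'nn'
conv = nn.SpatialConvolution(3, 10, 5, 5)
print(#conv:forward(torch.rand(3, 32, 32)))    -- 10x28x28:   (32-5)+1 = 28
print(#conv:forward(torch.rand(3, 512, 512)))  -- 10x508x508: (512-5)+1 = 508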
Okay, so sliding the 5x5 filter across the whole image we get one image; if there are 10 output layers, we get 10 images (as you can see from the output). How do we get these? (See the image above for clarification if required.)
Note that units in a feature map share the same set of weights and perform the same operation on different parts of the image (that is, if you shift the original image, the output of the feature map also shifts by the same amount). In other words, for every feature map you constrain the set of weights to be the same for every unit; you only have 5x5x3 unknown weights.
Because of this constraint, and since we want to extract as much information as possible from the image, we add more feature maps: having multiple feature maps lets us extract multiple features at every pixel.
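To get a feel for what weight sharing buys you, here is some back-of-the-envelope arithmetic (plain Lua, nothing Torch7-specific) comparing one shared 5x5x3 filter with a hypothetical fully connected map of the same 28x28 size:
shared = 5 * 5 * 3 + 1                -- one 5x5x3 kernel plus a bias, reused at every position: 76
unshared = (32 * 32 * 3) * (28 * 28)  -- a fully connected 28x28 map would need ~2.4 million weights
print(shared, unshared)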
Unfortunately I am not familiar with Torch7.