What is the advantage of using multiple filters of the same size in convolutional networks in deep learning?
For example: we use 6 filters of size [5,5] at the first layer to scan the image data, which is a matrix of size [28,28]. The question is why we do not use only a single filter of size [5,5] but 6 or more of them. In the end they will scan the exact same pixels. I can see that the random initial weights might be different, but the model will adjust them during training anyway.
So, specifically, what is the main advantage and purpose of using multiple filters of the same shape in convnets?
Learning a single filter specific to a machine learning task is a powerful technique. Yet convolutional neural networks achieve much more in practice: they do not learn a single filter, but rather learn multiple filters in parallel for a given input. For example, it is common for a convolutional layer to learn from 32 to 512 filters in parallel.
The answer there specified three convolutional layers with different numbers and sizes of filters. Again, in the question "number of feature maps in convolutional neural networks", you can see from the picture that the first layer produces feature maps of size 28×28×6 (from 6 filters) and the second convolutional layer produces feature maps of size 10×10×16 (from 16 filters).
This means that if a convolutional layer has 32 filters and the input is a three-channel (e.g., RGB) image, these 32 filters are not just two-dimensional but three-dimensional, with a separate set of weights for each of the three channels. Yet each filter still produces a single feature map.
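To make the shapes concrete, here is a minimal sketch, assuming PyTorch is available; the layer sizes are chosen only for illustration:

    # A conv layer with 32 filters on a 3-channel (RGB) input.
    import torch
    import torch.nn as nn

    conv = nn.Conv2d(in_channels=3, out_channels=32, kernel_size=5)

    # Each of the 32 filters carries weights for all 3 input channels:
    print(conv.weight.shape)       # torch.Size([32, 3, 5, 5])

    # Yet each filter produces a single feature map:
    x = torch.randn(1, 3, 28, 28)  # one RGB image
    print(conv(x).shape)           # torch.Size([1, 32, 24, 24])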
First, the kernel shapes are the same merely to speed up computation: this allows the convolution to be applied in a batch, for example by using an im2col transformation followed by a matrix multiplication, and it also makes it convenient to store all the weights in one multidimensional array. Mathematically, though, one could imagine using several filters of different shapes.
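To illustrate, here is a rough NumPy sketch of the im2col trick; conv2d_im2col is a hypothetical helper written for this answer, not a library function:

    # Convolution as im2col + a single matrix multiplication: unroll every
    # patch into a column, then apply all same-shaped filters in one matmul.
    import numpy as np

    def conv2d_im2col(image, filters):
        """image: (H, W); filters: (n_filters, k, k). 'Valid' cross-correlation,
        as used by deep learning frameworks."""
        n_f, k, _ = filters.shape
        H, W = image.shape
        out_h, out_w = H - k + 1, W - k + 1

        # im2col: each k x k patch becomes one column.
        cols = np.empty((k * k, out_h * out_w))
        for i in range(out_h):
            for j in range(out_w):
                cols[:, i * out_w + j] = image[i:i + k, j:j + k].ravel()

        # All filters are applied at once with a single matrix product.
        out = filters.reshape(n_f, k * k) @ cols
        return out.reshape(n_f, out_h, out_w)

    feature_maps = conv2d_im2col(np.random.rand(28, 28), np.random.rand(6, 5, 5))
    print(feature_maps.shape)  # (6, 24, 24)

This collapses into a single matrix product only because all six filters share the same 5x5 shape, which is exactly the speed-up mentioned above.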
Some architectures, such as the Inception network, use this idea and apply convolutional layers with different kernel sizes in parallel, stacking the resulting feature maps at the end. This turned out to be very useful.
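A minimal sketch of that idea, assuming PyTorch; MiniInception and its branch widths are invented for illustration and are far smaller than real Inception blocks:

    # Convolutions with different kernel sizes run in parallel; their
    # feature maps are stacked along the channel dimension at the end.
    import torch
    import torch.nn as nn

    class MiniInception(nn.Module):
        def __init__(self, in_ch):
            super().__init__()
            self.b1 = nn.Conv2d(in_ch, 8, kernel_size=1)
            self.b3 = nn.Conv2d(in_ch, 8, kernel_size=3, padding=1)
            self.b5 = nn.Conv2d(in_ch, 8, kernel_size=5, padding=2)

        def forward(self, x):
            # Padding keeps the spatial size equal on every branch,
            # so the maps can be concatenated.
            return torch.cat([self.b1(x), self.b3(x), self.b5(x)], dim=1)

    x = torch.randn(1, 3, 28, 28)
    print(MiniInception(3)(x).shape)  # torch.Size([1, 24, 28, 28])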
Because each filter is going to learn exactly one pattern that excites it, e.g., a Gabor-like vertical line. A single filter cannot be equally excited by both a horizontal and a vertical line, so one such filter is not enough to recognize an object.
For example, in order to recognize a cat, a neural network might need to recognize the eyes, the tail, and so on, all of which are composed of different lines and edges. The network can be confident about the object in the image only if it can recognize a whole variety of different shapes and patterns. This is true even for a simple dataset like MNIST.
A simple analogy: imagine a feed-forward network with one hidden layer. Each neuron in the hidden layer is connected to every input feature, so at the start they are all symmetric. But after some training, different neurons learn different high-level features that are useful for making a correct prediction.
There's a catch: if the network is initialized with zeros, all neurons compute the same output and receive the same gradient, so it suffers from a symmetry problem and in general won't converge to a good solution. It is therefore essential to create asymmetry among the neurons from the very beginning and let different neurons get excited differently by the same input data. This in turn leads to different gradients being applied to the weights, usually increasing the asymmetry even more. That's why different neurons end up trained differently.
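A small NumPy sketch of the symmetry problem; the tiny network, loss, and constants are invented purely for illustration:

    # If all hidden weights start identical, every hidden neuron computes
    # the same output and receives the same gradient, so they can never
    # differentiate; random initialization breaks this symmetry.
    import numpy as np

    x = np.random.rand(4)                # one input example
    W1 = np.full((3, 4), 0.1)            # 3 hidden neurons, identical weights
    w2 = np.full(3, 0.1)                 # identical output weights

    h = np.tanh(W1 @ x)                  # all 3 hidden activations are equal
    y = w2 @ h                           # scalar output
    grad_y = y - 1.0                     # gradient of squared loss vs. target 1
    grad_h = grad_y * w2 * (1 - h ** 2)  # backprop through tanh
    grad_W1 = np.outer(grad_h, x)

    print(grad_W1)  # all 3 rows identical: every neuron gets the same update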
It's important to mention another issue that is still possible with random initialization, called co-adaptation: different neurons learn to adapt to and depend on each other. This problem has been addressed by the dropout technique and later by batch normalization, essentially by adding noise to the training process in various ways. Taken together, neurons become much more likely to learn different latent representations of the data.
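In PyTorch, for example, dropout is just a layer inserted into the model; the architecture below is made up for illustration:

    # Dropout adds noise during training, discouraging neurons
    # from co-adapting to one another.
    import torch
    import torch.nn as nn

    model = nn.Sequential(
        nn.Linear(784, 256),
        nn.ReLU(),
        nn.Dropout(p=0.5),   # randomly zeroes half the activations each step
        nn.Linear(256, 10),
    )

    model.train()                     # dropout is active in training mode
    out = model(torch.randn(1, 784))
    model.eval()                      # and disabled at evaluation time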
I highly recommend reading Stanford's CS231n notes to gain better intuition about convolutional neural networks.
Zeiler and Fergus (https://arxiv.org/pdf/1311.2901.pdf) have good visualizations showing how kernels respond to different parts of an image.
Each kernel convolves over the image, so all the kernels (potentially) see all the pixels. Each of your 6 filters will "learn" a different feature. In the first layer, some will typically learn line-like features (horizontal, vertical, diagonal) and some will learn colour blobs. In the next layer, these get combined: pixels into edges, edges into shapes.
It might help to look up Prewitt filters (https://en.m.wikipedia.org/wiki/Prewitt_operator). A Prewitt filter is a single 3x3 kernel that convolves over the whole image and gives a feature map showing horizontal (or vertical) edges. You need one filter for horizontal edges and a different filter for vertical edges, but you can combine them to detect both, as the sketch below shows. In a neural network the kernel values are learned from data, but the feature maps at each layer are still produced by convolving the kernels over the input.
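A minimal sketch, assuming NumPy and SciPy are available; the toy image is invented for illustration:

    # Two fixed Prewitt kernels, one per edge orientation, illustrating
    # why a single filter cannot capture both kinds of edges.
    import numpy as np
    from scipy.ndimage import convolve

    prewitt_x = np.array([[1, 0, -1],
                          [1, 0, -1],
                          [1, 0, -1]])  # responds to vertical edges
    prewitt_y = prewitt_x.T             # responds to horizontal edges

    image = np.zeros((8, 8))
    image[:, 4:] = 1.0                  # a vertical edge down the middle

    gx = convolve(image, prewitt_x)     # strong response to this image
    gy = convolve(image, prewitt_y)     # (near-)zero response
    magnitude = np.hypot(gx, gy)        # combining both orientations
    print(np.abs(gx).max(), np.abs(gy).max(), magnitude.max())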