I'm looking at the InceptionV3 (GoogLeNet) architecture and cannot understand why we need conv1x1 layers.
I know how convolution works, but I only see a benefit when the patch size is greater than 1.
The 1x1 convolution addresses this by offering filter-wise pooling: it acts as a projection layer that pools (or projects) information across channels, enabling dimensionality reduction by reducing the number of filters while retaining the important, feature-related information.
In other words, a 1x1 conv is used to reduce the number of channels while introducing non-linearity. "1x1 convolution" simply means the filter has spatial size 1x1 (so each filter holds a single weight per input channel, as opposed to a matrix like a 3x3 filter). This 1x1 filter convolves over the entire input, pixel by pixel.
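Here is a minimal sketch of that idea (using PyTorch, which the answers don't specify, and sizes chosen arbitrarily by me): a 1x1 convolution that projects 256 channels down to 64, followed by a ReLU for the added non-linearity.

import torch
import torch.nn as nn

x = torch.randn(1, 256, 32, 32)          # (batch, channels, height, width)
proj = nn.Sequential(
    nn.Conv2d(256, 64, kernel_size=1),   # 1x1 conv: mixes channels, not space
    nn.ReLU(),                           # the non-linearity mentioned above
)
print(proj(x).shape)                     # torch.Size([1, 64, 32, 32])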
You can think of a 1x1xD convolution as a dimensionality reduction technique when it's placed somewhere in a network. If you have an input volume of 100x100x512 and you convolve it with a set of D filters, each of size 1x1x512, you reduce the number of features from 512 to D. The output volume is therefore 100x100xD.
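As a hedged illustration of those numbers (PyTorch again, with D = 64 as an arbitrary choice): D filters of size 1x1x512 turn a 100x100x512 volume into a 100x100xD one.

import torch
import torch.nn as nn

D = 64
x = torch.randn(1, 512, 100, 100)        # 100x100 spatial extent, 512 channels
conv = nn.Conv2d(512, D, kernel_size=1)  # D filters, each of size 1x1x512
print(conv(x).shape)                     # torch.Size([1, 64, 100, 100])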
As you can see, this (1x1x512)xD convolution is mathematically equivalent to a fully connected layer. The main difference is that while an FC layer requires the input to have a fixed size, the convolutional layer accepts any input volume with spatial extent greater than or equal to 100x100.
Because of this equivalence, a 1x1xD convolution can replace any fully connected layer.
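You can check the equivalence numerically. The following sketch (my own construction, in PyTorch) copies the weights of a Linear(512, 10) into a Conv2d with 1x1 kernels and compares the outputs on a 1x1 spatial input.

import torch
import torch.nn as nn

fc = nn.Linear(512, 10)
conv = nn.Conv2d(512, 10, kernel_size=1)
# Reuse the FC weights as conv filters: (10, 512) -> (10, 512, 1, 1).
conv.weight.data = fc.weight.data.view(10, 512, 1, 1)
conv.bias.data = fc.bias.data

x = torch.randn(1, 512)
out_fc = fc(x)                                      # shape (1, 10)
out_conv = conv(x.view(1, 512, 1, 1)).flatten(1)    # shape (1, 10)
print(torch.allclose(out_fc, out_conv, atol=1e-6))  # True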
In addition, 1x1xD convolutions not only reduce the number of features passed to the next layer, they also introduce new parameters and new non-linearity into the network, which can help increase model accuracy.
When the 1x1xD convolution is placed at the end of a classification network, it acts exactly like an FC layer, but instead of thinking of it as a dimensionality reduction technique it's more intuitive to think of it as a layer that outputs a tensor of shape WxHxnum_classes.
The spatial extent of the output tensor (identified by W and H) is dynamic and is determined by the locations of the input image that the network analyzed.
If the network has been defined with an input of 200x200x3 and we feed it an image of exactly this size, the output will be a map with W = H = 1 and depth = num_classes. But if the input image has a spatial extent greater than 200x200, then the convolutional network will analyze different locations of the input image (just like a standard convolution does) and will produce a tensor with W > 1 and H > 1. This is not possible with an FC layer, which constrains the network to accept fixed-size input and produce fixed-size output.
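A toy example of this behaviour (the layer sizes are hypothetical, chosen only so the arithmetic works out; PyTorch as before): the same network, ending in a 1x1 conv head, produces a 1x1 score map for a 200x200 input and an 11x11 map for a 300x300 input.

import torch
import torch.nn as nn

num_classes = 10
net = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=10, stride=10),  # 200x200 -> 20x20
    nn.ReLU(),
    nn.Conv2d(32, 64, kernel_size=20),            # 20x20 -> 1x1
    nn.ReLU(),
    nn.Conv2d(64, num_classes, kernel_size=1),    # 1x1 conv in place of an FC head
)
print(net(torch.randn(1, 3, 200, 200)).shape)  # torch.Size([1, 10, 1, 1])
print(net(torch.randn(1, 3, 300, 300)).shape)  # torch.Size([1, 10, 11, 11])

An FC head in place of that last layer would crash on the 300x300 input instead of producing the 11x11 map of class scores.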
A 1x1 convolution simply maps an input pixel to an output pixel, without looking at anything around it. It is often used to reduce the number of depth channels, since multiplying volumes with extremely large depths is often very slow.
input (256 depth) -> 1x1 convolution (64 depth) -> 4x4 convolution (256 depth)
input (256 depth) -> 4x4 convolution (256 depth)
The bottom one is about 3.7x slower.
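That factor comes from counting multiplications per output pixel (my own back-of-the-envelope arithmetic, ignoring biases):

bottleneck = 256 * 64 + 4 * 4 * 64 * 256  # 1x1 down to 64, then 4x4 back up to 256
direct = 4 * 4 * 256 * 256                # single 4x4 convolution at full depth
print(direct / bottleneck)                # ~3.76, matching the ~3.7x claim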
Theoretically, the neural network can "choose" which input "colors" to look at this way, instead of brute-force multiplying everything.