I noticed that in the `keras` documentation there are several different types of `Conv` layers: `Conv1D`, `Conv2D`, and `Conv3D`.
All of them have parameters like `filters`, `kernel_size`, `strides`, and `padding`, which aren't present in other `keras` layers.
I have seen images like this which "visualize" `Conv` layers, but I don't understand what's going on in the transition from one layer to the next.
How does changing the above parameters and the dimensions of our `Conv` layer affect what happens in the model?
Convolutions - Language Agnostic Basics
To understand how convolutions work in `keras`, we first need a basic understanding of how convolutions work in a language-agnostic setting.
Convolutional layers slide across an input to build an activation map (also called a feature map). The above is an example of a 2D convolution. Notice how at each step, the dark 3 x 3 square slides across the input (blue), and for each new 3 x 3 patch of the input it analyzes, it outputs one value in our output activation map (the blue-green boxes at the top).
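To make the sliding-window operation concrete in a framework-free way, here is a minimal NumPy sketch. The input values, kernel weights, and sizes are made up for illustration; it mirrors the diagram's setup of a 3 x 3 kernel sliding with stride 2 over a zero-padded input:

```python
import numpy as np

inp = np.arange(25, dtype=float).reshape(5, 5)  # toy 5 x 5 input (blue squares)
kernel = np.ones((3, 3))                        # toy 3 x 3 weight matrix (dark square)
padded = np.pad(inp, 1)                         # ring of zeros (white squares)

stride = 2
out_size = (padded.shape[0] - 3) // stride + 1  # output side length
out = np.empty((out_size, out_size))
for i in range(out_size):
    for j in range(out_size):
        # Take the 3 x 3 window at this position, multiply elementwise
        # with the kernel, and sum into one output value.
        window = padded[i*stride:i*stride+3, j*stride:j*stride+3]
        out[i, j] = (window * kernel).sum()
print(out)  # the activation (feature) map
```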
Kernels and filters
The dark square is our `kernel`. The kernel is a matrix of weights that is multiplied with each part of our input. All the results from these multiplications, put together, form our activation map.
Intuitively, the kernel lets us reuse parameters: a weights matrix that detects an eye in one part of the image will also detect it elsewhere, so there's no point in training different parameters for each part of our input when one kernel can sweep across and work everywhere. We can consider each kernel a feature detector for one feature, and its output activation map as a map of how likely that feature is to be present in each part of your input.
A synonym for kernel is `filter`. The parameter `filters` asks for the number of kernels (feature detectors) in that `Conv` layer. This number will also be the size of the last dimension in your output, e.g. `filters=10` will result in an output shape of `(???, 10)`. This is because the output of each `Conv` layer is a set of activation maps, and there will be `filters` activation maps.
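For example, here is a minimal sketch (assuming TensorFlow 2.x with the bundled `tf.keras`; the input shape is made up) showing that `filters` sets the last dimension of the output:

```python
import tensorflow as tf

# One toy 28 x 28 RGB image: (batch, rows, cols, channels).
x = tf.random.normal((1, 28, 28, 3))

conv = tf.keras.layers.Conv2D(filters=10, kernel_size=(3, 3))
y = conv(x)
print(y.shape)  # (1, 26, 26, 10) - the last dimension equals `filters`
```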
Kernel Size
The `kernel_size` is, well, the size of each kernel. We discussed earlier that each kernel consists of a weights matrix that is tuned to detect certain features better and better. `kernel_size` dictates the size of the filter mask; in English, how much "input" is processed during each convolution. For example, our above diagram processes a 3 x 3 chunk of the input each time, so it has a `kernel_size` of `(3, 3)`. We can also call the above operation a "3x3 convolution".
Larger kernel sizes are almost unconstrained in the features they can represent, while smaller ones are restricted to specific low-level features. Note, though, that multiple layers with small kernel sizes can emulate the effect of a larger kernel size, as the sketch below shows.
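Here is a small sketch of that emulation point (TensorFlow 2.x assumed; the filter counts are arbitrary): two stacked 3 x 3 convolutions see the same 5 x 5 receptive field as a single 5 x 5 convolution, and produce the same output shape:

```python
import tensorflow as tf

x = tf.random.normal((1, 32, 32, 3))  # one toy 32 x 32 RGB image

# One 5x5 convolution vs. two stacked 3x3 convolutions:
y_large = tf.keras.layers.Conv2D(16, kernel_size=(5, 5))(x)
y_small = tf.keras.layers.Conv2D(16, kernel_size=(3, 3))(
    tf.keras.layers.Conv2D(16, kernel_size=(3, 3))(x)
)
print(y_large.shape)  # (1, 28, 28, 16)
print(y_small.shape)  # (1, 28, 28, 16) - same receptive field per output value
```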
Strides
Notice how our above kernel shifts two units each time. The amount that the kernel "shifts" for each computation is called the stride, so in `keras` speak our `strides=2`. Generally speaking, as we increase `strides`, our model loses more information from one layer to the next, since the kernel skips positions and the resulting activation map has "gaps".
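A quick sketch of the effect (TensorFlow 2.x assumed; the numbers are illustrative): doubling the stride roughly halves each spatial dimension of the activation map.

```python
import tensorflow as tf

x = tf.random.normal((1, 28, 28, 3))  # one toy 28 x 28 RGB image

y1 = tf.keras.layers.Conv2D(8, (3, 3), strides=1)(x)
y2 = tf.keras.layers.Conv2D(8, (3, 3), strides=2)(x)
print(y1.shape)  # (1, 26, 26, 8)
print(y2.shape)  # (1, 13, 13, 8) - the kernel skips every other position
```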
Padding
Going back to the above diagram, notice the ring of white squares surrounding our input. This is our `padding`. Without padding, each time we pass our input through a `Conv` layer, the shape of our result gets smaller and smaller. To avoid this we pad our input with a ring of zeros, which serves a few purposes:

- Preserving information around the edges. From our diagram, notice how each corner white square only goes through the convolution once, while center squares go through it four times. Adding padding alleviates this problem: squares originally on the edge get convolved more times.
- Controlling the shape of our output. We can make shapes easier to work with by keeping the output of each `Conv` layer the same shape as its input, and we can build deeper models when our shape doesn't get smaller each time we use a `Conv` layer.
`keras` provides three different types of padding. The explanations in the docs are very straightforward, so they are copied / paraphrased here. They are passed in with `padding=...`, e.g. `padding="valid"`. A quick comparison of the three modes follows the list below.
- `valid`: no padding.
- `same`: pad the input such that the output has the same length as the original input.
- `causal`: results in causal (dilated) convolutions. Normally, as in the above diagram, the "center" of the kernel maps to the value in the output activation map; with causal convolutions the right edge is used instead. This is useful for temporal data, where you don't want to use future data to model present data.
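Here is the comparison promised above, a minimal sketch (TensorFlow 2.x assumed; the layer sizes are arbitrary) of how each mode affects the output length of a `Conv1D`:

```python
import tensorflow as tf

x = tf.random.normal((1, 10, 4))  # (batch, steps, channels)

for pad in ("valid", "same", "causal"):
    y = tf.keras.layers.Conv1D(8, kernel_size=3, padding=pad)(x)
    print(pad, y.shape)
# valid  (1, 8, 8)  - no padding, so the output shrinks
# same   (1, 10, 8) - zero-padded so the length is preserved
# causal (1, 10, 8) - padded on the left only, so each output step
#                     sees only current and past inputs
```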
Conv1D, Conv2D, and Conv3D
Intuitively, the operations that occur in these layers remain the same. Each kernel still slides across your input, each filter outputs an activation map for its own feature, and the padding is still applied.
The difference is the number of dimensions that are convolved. For example, in `Conv1D` a 1D kernel slides across one axis. In `Conv2D` a 2D kernel slides across two axes.
It is very important to note that the D in an X-D `Conv` layer doesn't denote the number of dimensions of the input, but rather the number of axes that the kernel slides across.
For example, in the above diagram, even though the input is 3D (an image with RGB channels), this is an example of a `Conv2D` layer. This is because there are two spatial dimensions, `(rows, cols)`, and the filter only slides along those two dimensions. You can consider this as being convolutional in the spatial dimensions and fully connected in the channels dimension.
The output for each filter is also two dimensional, because each filter slides in two dimensions. As a result, you can also think of an N-D `Conv` layer as each filter outputting an N-D activation map.
You can see the same thing with `Conv1D` (pictured above). While the input is two dimensional, the filter only slides along one axis, making this a 1D convolution.
In `keras`, this means that `ConvND` requires each sample to have `N+1` dimensions: `N` dimensions for the filter to slide across, plus one additional `channels` dimension.
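A sketch of the expected shapes (TensorFlow 2.x assumed; the sizes are made up):

```python
import tensorflow as tf

x1 = tf.random.normal((1, 100, 3))         # Conv1D: (batch, steps, channels)
x2 = tf.random.normal((1, 32, 32, 3))      # Conv2D: (batch, rows, cols, channels)
x3 = tf.random.normal((1, 16, 32, 32, 3))  # Conv3D: (batch, depth, rows, cols, channels)

print(tf.keras.layers.Conv1D(8, 3)(x1).shape)          # (1, 98, 8)
print(tf.keras.layers.Conv2D(8, (3, 3))(x2).shape)     # (1, 30, 30, 8)
print(tf.keras.layers.Conv3D(8, (3, 3, 3))(x3).shape)  # (1, 14, 30, 30, 8)
```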
TLDR - Keras wrap up
- `filters`: the number of different kernels in the layer. Each kernel detects and outputs an activation map for one specific feature, making this the last value in the output shape, e.g. `Conv1D` outputs `(batch, steps, filters)`.
- `kernel_size`: determines the dimensions of each kernel / filter / feature detector, and therefore how much of the input is used to calculate each value in the output. A larger size means detecting more complex, less constrained features, but is more prone to overfitting.
- `strides`: how many units the kernel moves to take the next convolution. Bigger strides mean more information loss.
- `padding`: either `"valid"`, `"causal"`, or `"same"`. Determines if and how to pad the input with zeros.
- 1D vs 2D vs 3D: denotes the number of axes that the kernel slides across. An N-D `Conv` layer outputs an N-D activation map for each filter, but requires an N+1-dimensional input for each sample, composed of `N` dimensions to slide across plus one additional `channels` dimension.
References:
- Intuitive understanding of 1D, 2D, and 3D Convolutions in Convolutional Neural Networks
- https://keras.io/layers/convolutional/
- http://cs231n.github.io/convolutional-networks/