
Understanding the output shape of a Conv2D layer in Keras

I do not understand why the channel dimension is not included in the output dimension of a Conv2D layer in Keras.

I have the following model

from tensorflow.keras.layers import Input, Conv2D, Flatten, Dense
from tensorflow.keras.models import Model

def create_model():
    image = Input(shape=(128,128,3))  # 128x128 RGB input

    x = Conv2D(24, kernel_size=(8,8), strides=(2,2), activation='relu', name='conv_1')(image)
    x = Conv2D(24, kernel_size=(8,8), strides=(2,2), activation='relu', name='conv_2')(x)
    x = Conv2D(24, kernel_size=(8,8), strides=(2,2), activation='relu', name='conv_3')(x)
    flatten = Flatten(name='flatten')(x)

    output = Dense(1, activation='relu', name='output')(flatten)
    model = Model(inputs=image, outputs=output)  # 'inputs'/'outputs' in current Keras
    return model

model = create_model()
model.summary()

The model summary is given in the figure at the end of my question. The input layer takes RGB images with width = 128 and height = 128. The first Conv2D layer tells me the output dimension is (None, 61, 61, 24). I have used a kernel size of (8, 8), a stride of (2, 2) and no padding. The values 61 = floor((128 - 8 + 2*0)/2) + 1 and 24 (the number of kernels/filters) make sense.

But why isn't the dimension for the different channels included in the output dimension? As far as I can see, the parameters for the 24 filters on each of the channels are included in the number of parameters. So I would expect the output dimension to be (None, 61, 61, 24, 3) or (None, 61, 61, 24 * 3). Is this just a strange notation in Keras, or am I confused about something else?

[Figure: model.summary() output for the model above]
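For reference, a reconstruction of what the summary shows (not the original screenshot; the output shapes follow from the formula above, with valid padding and stride 2):

Layer (type)         Output Shape          Param #
image (InputLayer)   (None, 128, 128, 3)   0
conv_1 (Conv2D)      (None, 61, 61, 24)    4632
conv_2 (Conv2D)      (None, 27, 27, 24)    36888
conv_3 (Conv2D)      (None, 10, 10, 24)    36888
flatten (Flatten)    (None, 2400)          0
output (Dense)       (None, 1)             2401
Total params: 80,809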

asked Mar 31 '19 by MachineLearner

2 Answers

This question is asked in various forms all over the internet and has a simple answer which is often missed or confused:

SIMPLE ANSWER: The Keras Conv2D layer, given a multi-channel input (e.g. a color image), will apply the filter across ALL the color channels and sum the results, producing the equivalent of a monochrome convolved output image.
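A minimal NumPy sketch of that channel sum (purely illustrative; all names and values here are made up):

import numpy as np

rng = np.random.default_rng(0)
patch = rng.standard_normal((3, 3, 3))   # a 3x3 pixel region across 3 color channels
kernel = rng.standard_normal((3, 3, 3))  # ONE filter spans all 3 channels
bias = 0.1

# One output value: the element-wise product summed over height, width
# AND channels, plus the bias, giving a single monochrome "pixel".
out_value = float(np.sum(patch * kernel) + bias)
print(out_value)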

An example, from the CIFAR-10 CNN example on keras.io:

(1) You're training with the CIFAR image dataset, which is made up of 32x32 color images, i.e. each image is shape (32,32,3) (RGB = 3 channels)

(2) The first layer of your network is a Conv2D layer with 32 filters, each specified as 3x3, so:

Conv2D(32, (3,3), padding='same', input_shape=(32,32,3))

(3) Counter-intuitively, Keras will configure each filter as (3,3,3), i.e. a 3D volume covering the 3x3 pixels PLUS all the color channels. As a minor detail, each filter has an additional weight for a BIAS value, as per normal neural network layer arithmetic.

(4) Convolution proceeds absolutely as normal, except a 3x3x3 VOLUME from the input image is convolved at each step with the 3x3x3 filter, and a single (monochrome) output value (i.e. like a pixel) is produced at each step.

(5) The result is that a Keras Conv2D convolution with a specified (3,3) filter on a (32,32,3) image produces a (32,32) result (here with padding='same'), because the actual filter used is (3,3,3).

(6) In this example, we have also specified 32 filters in the Conv2D layer, so the actual output is (32,32,32) for each input image (i.e. you might think of this as 32 images, one for each filter, each 32x32 monochrome pixels); the sketch after this list checks these shapes.
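A minimal sketch (assuming TensorFlow's bundled Keras) that builds just this layer and inspects the shapes described above:

from tensorflow.keras.layers import Conv2D, Input
from tensorflow.keras.models import Model

inp = Input(shape=(32, 32, 3))
out = Conv2D(32, (3, 3), padding='same')(inp)
m = Model(inputs=inp, outputs=out)

print(m.output_shape)    # (None, 32, 32, 32)
kernel, bias = m.layers[1].get_weights()
print(kernel.shape)      # (3, 3, 3, 32): each of the 32 filters is a 3x3x3 volume
print(bias.shape)        # (32,): one bias per filter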

As a check, you can look at the count of weights (Param #) for the layer produced by model.summary():

Layer (type)         Output Shape         Param #
conv2d_1 (Conv2D)    (None, 32, 32, 32)   896

There are 32 filters, each 3x3x3 (i.e. 27 weights) plus 1 for the bias (i.e. total 28 weights each). And 32 filters x 28 weights each = 896 Parameters.
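The same arithmetic in a couple of lines (purely illustrative):

filters, k, in_channels = 32, 3, 3
weights_per_filter = k * k * in_channels + 1   # 27 kernel weights + 1 bias = 28
print(filters * weights_per_filter)            # 896, matching model.summary()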

answered Sep 21 '22 by Bambam


Each of the (8 x 8) convolutional filters is connected to an (8 x 8) receptive field across all the channels of the image. That is why the first Conv2D layer outputs (61, 61, 24). The different channels are encoded implicitly in the weights of the 24 filters. This means that each filter does not have 8 x 8 = 64 weights, but instead 8 x 8 x (number of channels) = 8 x 8 x 3 = 192 weights.
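You can check this directly (a quick sketch, assuming the model object from the question is still in scope):

kernel, bias = model.get_layer('conv_1').get_weights()
print(kernel.shape)         # (8, 8, 3, 24): 8x8 spatial extent, 3 channels, 24 filters
print(kernel[..., 0].size)  # 192 weights in a single filter
print(bias.shape)           # (24,): one bias per filter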

See this quote from the CS231n course notes, which captions a figure of an input volume and the neurons of the first convolutional layer:

Left: An example input volume in red (e.g. a 32x32x3 CIFAR-10 image), and an example volume of neurons in the first Convolutional layer. Each neuron in the convolutional layer is connected only to a local region in the input volume spatially, but to the full depth (i.e. all color channels). Note, there are multiple neurons (5 in this example) along the depth, all looking at the same region in the input - see discussion of depth columns in the text below. Right: The neurons from the Neural Network chapter remain unchanged: They still compute a dot product of their weights with the input followed by a non-linearity, but their connectivity is now restricted to be local spatially.

answered Sep 18 '22 by MachineLearner