Intuition behind Stacking Multiple Conv2D Layers before Dropout in CNN

Background:

Tagging TensorFlow since Keras runs on top of it, and this is more of a general deep learning question.

I have been working on the Kaggle Digit Recognizer problem and used Keras to train CNN models for the task. The model below is the original CNN structure I used for this competition, and it performed okay.

from tensorflow.keras import layers, models

def build_model1():
    model = models.Sequential()

    model.add(layers.Conv2D(32, (3, 3), padding="Same", activation="relu", input_shape=[28, 28, 1]))
    model.add(layers.MaxPooling2D((2, 2)))
    model.add(layers.Dropout(0.25))

    model.add(layers.Conv2D(64, (3, 3), padding="Same", activation="relu"))
    model.add(layers.MaxPooling2D((2, 2)))
    model.add(layers.Dropout(0.25))

    model.add(layers.Conv2D(64, (3, 3), padding="Same", activation="relu"))
    model.add(layers.MaxPooling2D((2, 2)))
    model.add(layers.Dropout(0.25))

    model.add(layers.Flatten())
    model.add(layers.Dense(64, activation="relu"))
    model.add(layers.Dropout(0.5))
    model.add(layers.Dense(10, activation="softmax"))

    return model

Then I read some other notebooks on Kaggle and borrowed another CNN structure (copied below), which works much better than the one above: it achieved higher accuracy and a lower error rate, and took many more epochs before overfitting the training data.

def build_model2():
    model = models.Sequential()

    model.add(layers.Conv2D(32, (5, 5), padding="Same", activation="relu", input_shape=(28, 28, 1)))
    model.add(layers.Conv2D(32, (5, 5), padding="Same", activation="relu"))
    model.add(layers.MaxPool2D((2, 2)))
    model.add(layers.Dropout(0.25))

    model.add(layers.Conv2D(64, (3, 3), padding="Same", activation="relu"))
    model.add(layers.Conv2D(64, (3, 3), padding="Same", activation="relu"))
    model.add(layers.MaxPool2D(pool_size=(2, 2), strides=(2, 2)))
    model.add(layers.Dropout(0.25))

    model.add(layers.Flatten())
    model.add(layers.Dense(256, activation="relu"))
    model.add(layers.Dropout(0.5))
    model.add(layers.Dense(10, activation="softmax"))

    return model
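
For reference, here is a minimal sketch of how either model might be compiled and trained; the x_train/y_train names and the hyperparameters are illustrative placeholders, not taken from the competition notebooks.

# Minimal training sketch: x_train is assumed to have shape (N, 28, 28, 1) with
# pixel values scaled to [0, 1], and y_train holds integer labels 0-9.
model = build_model2()
model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)
model.fit(x_train, y_train, batch_size=64, epochs=30, validation_split=0.1)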

Question:

Is there any intuition or explanation behind the better performance of the second CNN structure? What is it that makes stacking 2 Conv2D layers better than just using 1 Conv2D layer before max pooling and dropout? Or is there something else that contributes to the result of the second model?

Thank y'all for your time and help.

Asked Oct 01 '17 by Nahua Kang


1 Answer

The main difference between these two approaches is that the latter (two convs per block) has more flexibility in expressing non-linear transformations without losing information. Max pooling removes information from the signal and dropout forces a distributed representation, so both effectively make it harder to propagate information. If, for a given problem, a highly non-linear transformation has to be applied to the raw data, stacking multiple convolutions (with ReLU) will make it easier to learn; that's it. Also note that you are comparing a model with three max poolings against a model with only two, so the second one will potentially lose less information. Another difference is that it has a much bigger fully connected part at the end, while the first one's is tiny (64 neurons with 0.5 dropout means that, on average, only about 32 neurons are active during training; that is a tiny layer!). To sum up:

  1. These architectures differ in many aspects, not just in stacking conv layers.
  2. Stacking conv layers usually leads to less information being lost in processing; see, for example, "all convolutional" architectures. A rough sketch of this point follows below.
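
A minimal sketch of that comparison, assuming the same tensorflow.keras imports as the models above; the block definitions and filter sizes here are illustrative, not taken from either notebook. It contrasts a single conv before pooling with two stacked convs before pooling: the stacked block applies two ReLU non-linearities at full 28x28 resolution before pooling discards anything, at the cost of more parameters.

# Illustrative comparison of one conv block vs. a stacked-conv block before pooling.
# Assumes: from tensorflow.keras import layers, models (as above).
def single_conv_block():
    m = models.Sequential()
    m.add(layers.Conv2D(32, (5, 5), padding="same", activation="relu", input_shape=(28, 28, 1)))
    m.add(layers.MaxPooling2D((2, 2)))
    return m

def stacked_conv_block():
    m = models.Sequential()
    m.add(layers.Conv2D(32, (5, 5), padding="same", activation="relu", input_shape=(28, 28, 1)))
    m.add(layers.Conv2D(32, (5, 5), padding="same", activation="relu"))
    m.add(layers.MaxPooling2D((2, 2)))
    return m

# Both blocks produce a 14x14x32 output, but the stacked block applies two
# non-linear transformations before any information is removed by pooling,
# at the cost of more weights:
print(single_conv_block().count_params())   # 832
print(stacked_conv_block().count_params())  # 26,464 (832 + 25,632)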
Answered Sep 27 '22 by lejlot