
Why is a CNN slower to train than a fully connected MLP in Keras?

I looked at the following examples from Keras:

MLP in MNIST: https://github.com/fchollet/keras/blob/master/examples/mnist_mlp.py

CNN in MNIST: https://github.com/fchollet/keras/blob/master/examples/mnist_cnn.py

I ran both with the Theano backend on CPU. For the MLP I get a mean time of approximately 16 s per epoch with a total of 669,706 parameters:

Layer (type)                 Output Shape              Param #   
=================================================================
dense_33 (Dense)             (None, 512)               401920    
_________________________________________________________________
dropout_16 (Dropout)         (None, 512)               0         
_________________________________________________________________
dense_34 (Dense)             (None, 512)               262656    
_________________________________________________________________
dropout_17 (Dropout)         (None, 512)               0         
_________________________________________________________________
dense_35 (Dense)             (None, 10)                5130      
=================================================================
Total params: 669,706.0
Trainable params: 669,706.0
Non-trainable params: 0.0
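For reference, the summary above corresponds to a model along these lines (a minimal sketch following the linked mnist_mlp.py example; the dropout rate and activations are taken from that example and may differ between Keras versions):

```python
from keras.models import Sequential
from keras.layers import Dense, Dropout

model = Sequential()
model.add(Dense(512, activation='relu', input_shape=(784,)))  # 785 * 512 = 401,920 params
model.add(Dropout(0.2))
model.add(Dense(512, activation='relu'))                      # 513 * 512 = 262,656 params
model.add(Dropout(0.2))
model.add(Dense(10, activation='softmax'))                    # 513 * 10  =   5,130 params
model.compile(loss='categorical_crossentropy', optimizer='rmsprop',
              metrics=['accuracy'])
model.summary()  # Total params: 669,706
```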

For the CNN, I eliminated the last hidden layer from the original code and changed the optimizer to rmsprop to make both cases comparable, leaving the following architecture:

Layer (type)                 Output Shape              Param #   
=================================================================
conv2d_36 (Conv2D)           (None, 26, 26, 32)        320       
_________________________________________________________________
conv2d_37 (Conv2D)           (None, 24, 24, 64)        18496     
_________________________________________________________________
max_pooling2d_17 (MaxPooling (None, 12, 12, 64)        0         
_________________________________________________________________
dropout_22 (Dropout)         (None, 12, 12, 64)        0         
_________________________________________________________________
flatten_17 (Flatten)         (None, 9216)              0         
_________________________________________________________________
dense_40 (Dense)             (None, 10)                92170     
=================================================================
Total params: 110,986.0
Trainable params: 110,986.0
Non-trainable params: 0.0
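Concretely, the modified model looks roughly like this (a sketch of the linked mnist_cnn.py with the Dense(128) hidden layer removed and the optimizer switched to rmsprop; the dropout rate is an assumption carried over from that example):

```python
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Dropout, Flatten, Dense

model = Sequential()
model.add(Conv2D(32, (3, 3), activation='relu',
                 input_shape=(28, 28, 1)))    # (3*3*1 + 1) * 32  =    320 params
model.add(Conv2D(64, (3, 3), activation='relu'))  # (3*3*32 + 1) * 64 = 18,496 params
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Flatten())                              # 12 * 12 * 64 = 9216 values
model.add(Dense(10, activation='softmax'))        # 9217 * 10 = 92,170 params
model.compile(loss='categorical_crossentropy', optimizer='rmsprop',
              metrics=['accuracy'])
model.summary()  # Total params: 110,986
```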

However, the average time here is approximately 340 s per epoch, even though there are six times fewer parameters!

To investigate further, I reduced the number of filters per layer to 4, leaving the following architecture:

Layer (type)                 Output Shape              Param #   
=================================================================
conv2d_38 (Conv2D)           (None, 26, 26, 4)         40        
_________________________________________________________________
conv2d_39 (Conv2D)           (None, 24, 24, 4)         148       
_________________________________________________________________
max_pooling2d_18 (MaxPooling (None, 12, 12, 4)         0         
_________________________________________________________________
dropout_23 (Dropout)         (None, 12, 12, 4)         0         
_________________________________________________________________
flatten_18 (Flatten)         (None, 576)               0         
_________________________________________________________________
dense_41 (Dense)             (None, 10)                5770      
=================================================================
Total params: 5,958.0
Trainable params: 5,958.0
Non-trainable params: 0.0

Now the time is 28 s per epoch, even though there are only roughly 6,000 parameters!

Why is this? Intuitively, I expected the optimization cost to depend only on the number of parameters and on the cost of computing the gradient (which, with the same batch size, I assumed would be similar in both cases).

Can anyone shed some light on this? Thank you.

Asked Mar 30 '17 by Jorge del Val



2 Answers

All the convolutions here use a (3x3) kernel, and the MNIST input is a single-channel (grayscale) 28x28 image. You can verify this from the parameter count of conv2d_36: (3 * 3 * 1 + 1) * 32 = 320.

For conv2d_36 you would have:

  • 26 * 26 * 32 = number of output values (one per output pixel per filter)
  • 3 * 3 * 1 = number of multiplications per output value

So, excluding all the summations (bias + accumulation inside the convolution):

  • For conv2d_36 you would have 1 * 32 * 26 * 26 * 3 * 3 =~ 195k multiplication operations
  • For conv2d_37, similarly 32 * 64 * 24 * 24 * 3 * 3 =~ 10.6M multiplication operations
  • For dense_40, as there is no convolution, it would be equal to 9216 * 10 =~ 92k multiplication operations

When we sum these up, there are ~10.9M single multiplication operations per sample for the second model, the CNN.

On the other hand, for the flattened input and the MLP:

  • For dense_33 there will be 28 * 28 * 512 = 784 * 512 =~ 401k multiplication operations
  • For dense_34 there will be 512 * 512 =~ 262k multiplication operations
  • For dense_35 there will be 512 * 10 =~ 5k multiplication operations

When we sum these up, there are ~669k single multiplication operations per sample for the first model, the MLP (for a dense layer, the multiplication count is simply its weight count).

Hence, just the multiplications of the CNN model are roughly 16 times more numerous than those of the MLP model. Considering the overhead within the layers and other operation costs such as summations and memory copies/accesses, it seems entirely reasonable for the CNN model to be as slow as you observed.
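The arithmetic can be checked with a quick script. The layer shapes are taken from the summaries in the question, and the single input channel follows from conv2d_36's parameter count, 320 = (3 * 3 * 1 + 1) * 32:

```python
def conv2d_mults(in_ch, out_ch, out_h, out_w, k=3):
    """Multiplications in one forward pass of a 'valid' kxk convolution layer."""
    return in_ch * out_ch * out_h * out_w * k * k

def dense_mults(in_units, out_units):
    """For a dense layer, multiplications per sample equal the weight count."""
    return in_units * out_units

# CNN from the question (single-channel 28x28 MNIST input)
cnn = (conv2d_mults(1, 32, 26, 26)      # conv2d_36
       + conv2d_mults(32, 64, 24, 24)   # conv2d_37
       + dense_mults(9216, 10))         # dense_40

# MLP from the question (784 = 28 * 28 flattened pixels)
mlp = (dense_mults(784, 512)            # dense_33
       + dense_mults(512, 512)          # dense_34
       + dense_mults(512, 10))          # dense_35

print(cnn, mlp, round(cnn / mlp, 1))  # 10903680 668672 16.3
```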

Answered Nov 15 '22 by Deniz Beker

The convolution operation is much more computationally expensive than a dense layer's single matrix multiplication. Convolution is the process of adding each element of the image to its local neighbors, weighted by the kernel, so every convolution is essentially a set of nested loops over the image and the kernel. This means that a dense layer needs only a fraction of the time compared to convolutional layers. Wikipedia has an enlightening worked example of the convolution operation.
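To make the nested loops concrete, here is a naive single-channel "valid" convolution written out explicitly (an illustration only; Keras backends use heavily optimized routines, not loops like these, and CNNs technically compute cross-correlation rather than flipped-kernel convolution):

```python
import numpy as np

def naive_conv2d(image, kernel):
    """'Valid' 2D cross-correlation (convolution as used in CNNs) via explicit loops."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # each output pixel is a weighted sum over a kh x kw neighborhood:
            # kh * kw multiplications per output value
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

img = np.arange(16, dtype=float).reshape(4, 4)
k = np.ones((3, 3)) / 9.0             # simple box-blur kernel
print(naive_conv2d(img, k).shape)     # (2, 2)
```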

Answered Nov 15 '22 by emanuele