
Why is a CNN slower to train than a fully connected MLP in Keras?

I looked at the following examples from Keras:

MLP in MNIST: https://github.com/fchollet/keras/blob/master/examples/mnist_mlp.py

CNN in MNIST: https://github.com/fchollet/keras/blob/master/examples/mnist_cnn.py

I ran both with the Theano backend on CPU. For the MLP I get a mean time of approximately 16 s per epoch with a total of 669,706 parameters:

Layer (type)                 Output Shape              Param #   
=================================================================
dense_33 (Dense)             (None, 512)               401920    
_________________________________________________________________
dropout_16 (Dropout)         (None, 512)               0         
_________________________________________________________________
dense_34 (Dense)             (None, 512)               262656    
_________________________________________________________________
dropout_17 (Dropout)         (None, 512)               0         
_________________________________________________________________
dense_35 (Dense)             (None, 10)                5130      
=================================================================
Total params: 669,706.0
Trainable params: 669,706.0
Non-trainable params: 0.0
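For reference, the summary above corresponds to a model along these lines (a minimal sketch following the linked mnist_mlp.py example; the dropout rate and activations are taken from that example and may differ between Keras versions):

```python
from keras.models import Sequential
from keras.layers import Dense, Dropout

model = Sequential()
model.add(Dense(512, activation='relu', input_shape=(784,)))  # 785 * 512 = 401,920 params
model.add(Dropout(0.2))
model.add(Dense(512, activation='relu'))                      # 513 * 512 = 262,656 params
model.add(Dropout(0.2))
model.add(Dense(10, activation='softmax'))                    # 513 * 10  =   5,130 params
model.compile(loss='categorical_crossentropy', optimizer='rmsprop',
              metrics=['accuracy'])
model.summary()  # Total params: 669,706
```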

For the CNN, I eliminated the last hidden layer from the original code and changed the optimizer to rmsprop to make both cases comparable, leaving the following architecture:

Layer (type)                 Output Shape              Param #   
=================================================================
conv2d_36 (Conv2D)           (None, 26, 26, 32)        320       
_________________________________________________________________
conv2d_37 (Conv2D)           (None, 24, 24, 64)        18496     
_________________________________________________________________
max_pooling2d_17 (MaxPooling (None, 12, 12, 64)        0         
_________________________________________________________________
dropout_22 (Dropout)         (None, 12, 12, 64)        0         
_________________________________________________________________
flatten_17 (Flatten)         (None, 9216)              0         
_________________________________________________________________
dense_40 (Dense)             (None, 10)                92170     
=================================================================
Total params: 110,986.0
Trainable params: 110,986.0
Non-trainable params: 0.0
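Concretely, the modified model looks roughly like this (a sketch of the linked mnist_cnn.py with the Dense(128) hidden layer removed and the optimizer switched to rmsprop; the dropout rate is an assumption carried over from that example):

```python
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Dropout, Flatten, Dense

model = Sequential()
model.add(Conv2D(32, (3, 3), activation='relu',
                 input_shape=(28, 28, 1)))    # (3*3*1 + 1) * 32  =    320 params
model.add(Conv2D(64, (3, 3), activation='relu'))  # (3*3*32 + 1) * 64 = 18,496 params
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Flatten())                              # 12 * 12 * 64 = 9216 values
model.add(Dense(10, activation='softmax'))        # 9217 * 10 = 92,170 params
model.compile(loss='categorical_crossentropy', optimizer='rmsprop',
              metrics=['accuracy'])
model.summary()  # Total params: 110,986
```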

However, the average time here is approximately 340 s per epoch, even though there are six times fewer parameters!

To investigate further, I reduced the number of filters per layer to 4, leaving the following architecture:

Layer (type)                 Output Shape              Param #   
=================================================================
conv2d_38 (Conv2D)           (None, 26, 26, 4)         40        
_________________________________________________________________
conv2d_39 (Conv2D)           (None, 24, 24, 4)         148       
_________________________________________________________________
max_pooling2d_18 (MaxPooling (None, 12, 12, 4)         0         
_________________________________________________________________
dropout_23 (Dropout)         (None, 12, 12, 4)         0         
_________________________________________________________________
flatten_18 (Flatten)         (None, 576)               0         
_________________________________________________________________
dense_41 (Dense)             (None, 10)                5770      
=================================================================
Total params: 5,958.0
Trainable params: 5,958.0
Non-trainable params: 0.0

Now the time is 28 s per epoch, even though there are only roughly 6,000 parameters!

Why is this? Intuitively, I expected the optimization cost to depend only on the number of parameters and on the cost of computing the gradient (which, with the same batch size, I assumed would be similar in both cases).

Can anyone shed some light on this? Thank you.

Asked Mar 30 '17 by Jorge del Val



2 Answers

All the convolutions here use a (3x3) kernel, and the MNIST input is a single-channel (grayscale) 28x28 image. You can verify this from the parameter count of conv2d_36: (3 * 3 * 1 + 1) * 32 = 320.

For conv2d_36 you would have:

  • 26 * 26 * 32 = number of output values (one per output pixel per filter)
  • 3 * 3 * 1 = number of multiplications per output value

So, excluding all the summations (bias + accumulation inside the convolution):

  • For conv2d_36 you would have 1 * 32 * 26 * 26 * 3 * 3 =~ 195k multiplication operations
  • For conv2d_37, similarly 32 * 64 * 24 * 24 * 3 * 3 =~ 10.6M multiplication operations
  • For dense_40, as there is no convolution, it would be equal to 9216 * 10 =~ 92k multiplication operations

When we sum these up, there are ~10.9M single multiplication operations per sample for the second model, the CNN.

On the other hand, for the flattened input and the MLP:

  • For dense_33 there will be 28 * 28 * 512 = 784 * 512 =~ 401k multiplication operations
  • For dense_34 there will be 512 * 512 =~ 262k multiplication operations
  • For dense_35 there will be 512 * 10 =~ 5k multiplication operations

When we sum these up, there are ~669k single multiplication operations per sample for the first model, the MLP (for a dense layer, the multiplication count is simply its weight count).

Hence, just the multiplications of the CNN model are roughly 16 times more numerous than those of the MLP model. Considering the overhead within the layers and other operation costs such as summations and memory copies/accesses, it seems entirely reasonable for the CNN model to be as slow as you observed.
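The arithmetic can be checked with a quick script. The layer shapes are taken from the summaries in the question, and the single input channel follows from conv2d_36's parameter count, 320 = (3 * 3 * 1 + 1) * 32:

```python
def conv2d_mults(in_ch, out_ch, out_h, out_w, k=3):
    """Multiplications in one forward pass of a 'valid' kxk convolution layer."""
    return in_ch * out_ch * out_h * out_w * k * k

def dense_mults(in_units, out_units):
    """For a dense layer, multiplications per sample equal the weight count."""
    return in_units * out_units

# CNN from the question (single-channel 28x28 MNIST input)
cnn = (conv2d_mults(1, 32, 26, 26)      # conv2d_36
       + conv2d_mults(32, 64, 24, 24)   # conv2d_37
       + dense_mults(9216, 10))         # dense_40

# MLP from the question (784 = 28 * 28 flattened pixels)
mlp = (dense_mults(784, 512)            # dense_33
       + dense_mults(512, 512)          # dense_34
       + dense_mults(512, 10))          # dense_35

print(cnn, mlp, round(cnn / mlp, 1))  # 10903680 668672 16.3
```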

Answered Nov 15 '22 by Deniz Beker

The convolution operation is much more computationally expensive than a dense layer's single matrix multiplication. Convolution is the process of adding each element of the image to its local neighbors, weighted by the kernel, so every convolution is essentially a set of nested loops over the image and the kernel. This means that a dense layer needs only a fraction of the time compared to convolutional layers. Wikipedia has an enlightening worked example of the convolution operation.
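To make the nested loops concrete, here is a naive single-channel "valid" convolution written out explicitly (an illustration only; Keras backends use heavily optimized routines, not loops like these, and CNNs technically compute cross-correlation rather than flipped-kernel convolution):

```python
import numpy as np

def naive_conv2d(image, kernel):
    """'Valid' 2D cross-correlation (convolution as used in CNNs) via explicit loops."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # each output pixel is a weighted sum over a kh x kw neighborhood:
            # kh * kw multiplications per output value
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

img = np.arange(16, dtype=float).reshape(4, 4)
k = np.ones((3, 3)) / 9.0             # simple box-blur kernel
print(naive_conv2d(img, k).shape)     # (2, 2)
```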

Answered Nov 15 '22 by emanuele