I use this notebook from Kaggle to run LSTM neural network.
I had started training of neural network and I saw that it is too slow. It is almost three times slower than CPU training.
CPU perfomance: 8 min per epoch;GPU perfomance: 26 min per epoch.After this I decided to find answer in this question on Stackoverflow and I applied a CuDNNLSTM (which runs only on GPU) instead of LSTM. 
Hence, GPU perfomance became only 1 min per epoch and accuracy of model decreased on 3%.
Questions:
1) Does somebody know why GPU works slower than CPU in the classic LSTM layer? I do not understand why this happens.
2) Why when I use CuDNNLSTM instead of LSTM, training become much more faster and the accuracy of the model decrease?
P.S.:
My CPU: Intel Core i7-7700 Processor (8M Cache, up to 4.20 GHz)
My GPU: nVidia GeForce GTX 1050 Ti (4 GB)
GPUs are the de-facto standard for LSTM usage and deliver a 6x speedup during training and 140x higher throughput during inference when compared to CPU implementations. cuDNN is a GPU-accelerated deep neural network library that supports training of LSTM recurrent neural networks for sequence learning.
However, the LSTM training method, such as backward propagation through time (BPTT), is really slow. In this paper, by separating the LSTM cell into forward and recurrent substructures, we propose a much simpler and faster training method than the BPTT.
This is mainly due to the sequential computation in the LSTM layer. Remember that LSTM requires sequential input to calculate the hidden layer weights iteratively, in other words, you must wait for the hidden state at time t-1 to calculate the hidden state at time t.
According to the Keras documentation, a CuDNNLSTM is a: Fast LSTM implementation backed by CuDNN. Can only be run on GPU, with the TensorFlow backend. It is my belief that Keras automatically uses the GPU wherever possible.
I had a similar problem today and found two things that may be helpful to others (this is a regression problem on a data set with ~2.1MM rows, running on a machine with 4 P100 GPUs):
Reducing the batch size has increase loss and val loss, so you'll need to make a decision about the trade offs you want to make.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With