Keras shows no improvement in training speed with GPU (partial GPU usage?!)

Question

I am trying to train my model on a GPU instead of a CPU on an AWS p2.xlarge instance from my Jupyter Notebook. I am using the tensorflow-gpu backend (only tensorflow-gpu, not tensorflow, is installed and listed in requirements.txt).

I am not seeing any speed improvement when training on this instance compared to a CPU; in fact, the training time per epoch is almost the same as on my 4-core laptop CPU (the p2.xlarge also has 4 vCPUs, plus a Tesla K80 GPU). I am not sure whether I need to change my code to take advantage of the faster, parallel processing that a GPU can offer. Here is the code for my model:

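# Imports inferred from the names used below (Keras 2.x module layout).
from keras.models import Sequential
from keras.layers import recurrent, core

# X_np is assumed to have shape (samples, timesteps, features);
# y_np is assumed to hold one-hot labels for 3 classes.
model = Sequential()
model.add(recurrent.LSTM(64, input_shape=(X_np.shape[1], X_np.shape[2]),
                         return_sequences=True))
model.add(recurrent.LSTM(64, return_sequences=False))
model.add(core.Dropout(0.1))
model.add(core.Dense(3, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy'])

model.fit(X_np, y_np, epochs=100, validation_split=0.25)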

Interestingly, the GPU runs at only 50%-60% utilization while almost all of its memory is in use every time I check with nvidia-smi during training (both drop to 0% and 1MiB respectively when not training):

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.81                 Driver Version: 384.81                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 00000000:00:1E.0 Off |                    0 |
| N/A   47C    P0    73W / 149W |  10919MiB / 11439MiB |     52%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1665      C   ...ubuntu/aDash/MLenv/bin/python           10906MiB |
+-----------------------------------------------------------------------------+

Here are the GPU-related logs from the Jupyter Notebook, in case they help:

[I 04:21:59.390 NotebookApp] Kernel started: c17bc4d1-fa15-4b0e-b5f0-87f90e56bf65
[I 04:22:02.241 NotebookApp] Adapting to protocol v5.1 for kernel c17bc4d1-fa15-4b0e-b5f0-87f90e56bf65
2017-11-30 04:22:32.403981: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2017-11-30 04:22:33.653681: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:892] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2017-11-30 04:22:33.654041: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties:
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:00:1e.0
totalMemory: 11.17GiB freeMemory: 11.10GiB
2017-11-30 04:22:33.654070: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7)
2017-11-30 04:22:34.014329: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7)
Device mapping:
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7
2017-11-30 04:22:34.015339: I tensorflow/core/common_runtime/direct_session.cc:299] Device mapping:
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7

2017-11-30 04:23:22.426895: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7)

Please suggest what could be the problem. Thanks a ton for looking at this anyways!

Daniel Möller · Accepted Answer

That happens because you're using LSTM layers.

TensorFlow's implementation of LSTM layers is not that fast on GPUs. The reason is probably that recurrent calculations are sequential rather than parallel, and GPUs are built for parallel processing.
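
To illustrate why the time steps resist parallelization, here is a toy recurrence in plain Python. It is not an LSTM and not TensorFlow code; the names `step`, `run_sequence`, `run_batch`, and the weights are made up for the illustration. Each hidden state needs the previous one, so the loop over time has to run in order, whereas the sequences in a batch are independent of each other and can be processed side by side.

```python
import math

def step(h_prev, x_t, w_h=0.5, w_x=1.0):
    """One toy recurrent step: the new state depends on the previous state."""
    return math.tanh(w_h * h_prev + w_x * x_t)

def run_sequence(xs, h0=0.0):
    """Process one sequence; step t cannot start before step t-1 has finished."""
    h = h0
    states = []
    for x_t in xs:  # inherently sequential over time
        h = step(h, x_t)
        states.append(h)
    return states

def run_batch(batch):
    """Sequences in a batch do not depend on each other, so this outer
    loop is the part that parallel hardware can spread out."""
    return [run_sequence(xs) for xs in batch]

states = run_batch([[0.1, 0.2, 0.3], [1.0, 0.0, -1.0]])
print(len(states), len(states[0]))
```

Changing an early input changes every later state of that sequence, which is exactly the dependency that keeps the time loop serial.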

I confirmed this from my own experience:

- I got terrible speed using LSTMs in my model.
- I decided to test the model with all LSTMs removed (leaving a purely convolutional model).
- The resulting speed was simply astonishing!

This article about using GPUs with TensorFlow also confirms it:

http://minimaxir.com/2017/07/cpu-or-gpu/

A possible solution?

You could try the new CuDNNLSTM layer, which seems to be designed specifically for GPUs.

I have never tested it, but you will most probably get better performance with it.

Another thing that I haven't tested (and I'm not sure it was designed for this purpose, though I suspect it was): you can set unroll=True in your LSTM layers. With that, I suspect the recurrent calculations will be transformed into parallel ones.
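
A minimal sketch of how the two suggestions above could be tried on the model from the question. It assumes a standalone Keras 2.x install that ships `CuDNNLSTM` (per its documentation, that layer only runs on a GPU with the TensorFlow backend); the helper name `build_model` and its `variant` argument are made up for this example. In TensorFlow 2.x, the built-in `tf.keras.layers.LSTM` is documented to pick the cuDNN kernel on its own when left at its default arguments, so the separate layer mainly matters for older setups. Note also that the Keras documentation describes `unroll=True` as something that can speed up an RNN at the cost of more memory and only suits short sequences; it does not promise parallel execution.

```python
def build_model(timesteps, features, variant="cudnn"):
    """Build the question's two-layer LSTM classifier in one of three variants.

    variant: "plain" (original layers), "cudnn" (CuDNNLSTM layers), or
    "unroll" (plain LSTM with unroll=True). Hypothetical helper, not a Keras API.
    """
    if variant not in ("plain", "cudnn", "unroll"):
        raise ValueError("unknown variant: %r" % (variant,))

    # Imported lazily so the function can be defined without Keras installed.
    from keras.models import Sequential
    from keras.layers import LSTM, Dense, Dropout

    if variant == "cudnn":
        # Assumption: this Keras version provides CuDNNLSTM.
        from keras.layers import CuDNNLSTM as layer
        extra = {}
    elif variant == "unroll":
        layer, extra = LSTM, {"unroll": True}
    else:
        layer, extra = LSTM, {}

    model = Sequential()
    model.add(layer(64, input_shape=(timesteps, features),
                    return_sequences=True, **extra))
    model.add(layer(64, return_sequences=False, **extra))
    model.add(Dropout(0.1))
    model.add(Dense(3, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='rmsprop',
                  metrics=['accuracy'])
    return model
```

With the question's data this would be called as `build_model(X_np.shape[1], X_np.shape[2], variant="cudnn")`, followed by the same `model.fit` call as before, so the per-epoch times of the three variants can be compared directly.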

mcsim · Answer

Try a larger value for batch_size in model.fit, since the default is 32. Increase it until you get 100% CPU utilization.
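
A quick back-of-the-envelope calculation shows what changing `batch_size` does to the number of batches per epoch. The sample count below is a made-up figure, not taken from the question; only the default of 32 and `validation_split=0.25` come from the thread, and `steps_per_epoch` is a hypothetical helper. Fewer, larger batches mean more work per step, which is the effect this suggestion relies on.

```python
import math

def steps_per_epoch(n_samples, batch_size, validation_split=0.0):
    """Approximate number of batches per epoch for a given batch size."""
    n_train = int(n_samples * (1.0 - validation_split))
    return math.ceil(n_train / batch_size)

n_samples = 100000  # hypothetical dataset size
for batch_size in (32, 256, 1024):
    print(batch_size, steps_per_epoch(n_samples, batch_size, 0.25))
```

For this hypothetical dataset the count drops from 2344 batches per epoch at the default size to 74 at a batch size of 1024.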

Following the suggestion from @dgumo, you can also put your data into /run/shm. This is an in-memory file system, which gives the fastest possible access to the data. Alternatively, make sure that your data resides at least on an SSD, for example in /tmp.
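
One way to stage a dataset file on the in-memory file system before loading it, as a sketch. The helper name `stage_in_memory` is hypothetical; the code assumes a Linux host where `/run/shm` (or `/dev/shm`) is a writable in-memory mount with enough free RAM, and it falls back to the regular temp directory otherwise.

```python
import os
import shutil
import tempfile

def stage_in_memory(src_path, candidates=("/run/shm", "/dev/shm")):
    """Copy src_path into the first writable directory in candidates and
    return the new path; fall back to the system temp dir if none is usable."""
    for directory in candidates:
        if os.path.isdir(directory) and os.access(directory, os.W_OK):
            target_dir = directory
            break
    else:
        target_dir = tempfile.gettempdir()
    dst_path = os.path.join(target_dir, os.path.basename(src_path))
    shutil.copy2(src_path, dst_path)
    return dst_path
```

The returned path can then be passed to whatever loader the training script already uses in place of the original location.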

Tags: tensorflow, keras

Ishaan Sejwal
