I am trying to train my model on a GPU instead of a CPU on an AWS p2.xlarge instance from my Jupyter Notebook. I am using tensorflow-gpu backend (only tensorflow-gpu
was installed and mentioned in requirements.txt
and not tensorflow
).
I am not seeing any speed improvements when training models on these instances compared to using a CPU, infact I am getting training speeds per epoch that is almost same as I am getting on my 4-core laptop CPU (p2.xlarge also has 4 vCPUs with a Tesla K80 GPU). I am not sure if i need to do some changes to my code to accommodate faster/parallel processing that GPU can offer. I am pasting below my code for my model:
model = Sequential()
model.add(recurrent.LSTM(64, input_shape=(X_np.shape[1], X_np.shape[2]),
return_sequences=True))
model.add(recurrent.LSTM(64, return_sequences = False))
model.add(core.Dropout(0.1))
model.add(core.Dense(3, activation='softmax'))
model.compile(loss = 'categorical_crossentropy', optimizer = 'rmsprop', metrics=['accuracy'])
model.fit(X_np, y_np, epochs=100, validation_split=0.25)
Also interestingly the GPU seems to be utilizing between 50%-60% of its processing power and almost all of its memory every time I check for GPU status using nvidia-smi
(but both fall to 0% and 1MiB respectively when not training):
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.81 Driver Version: 384.81 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K80 On | 00000000:00:1E.0 Off | 0 |
| N/A 47C P0 73W / 149W | 10919MiB / 11439MiB | 52% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 1665 C ...ubuntu/aDash/MLenv/bin/python 10906MiB |
+-----------------------------------------------------------------------------+
Also if you'd like to see my logs about using the GPU from Jupyter Notebook:
[I 04:21:59.390 NotebookApp] Kernel started: c17bc4d1-fa15-4b0e-b5f0-87f90e56bf65
[I 04:22:02.241 NotebookApp] Adapting to protocol v5.1 for kernel c17bc4d1-fa15-4b0e-b5f0-87f90e56bf65
2017-11-30 04:22:32.403981: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2017-11-30 04:22:33.653681: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:892] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2017-11-30 04:22:33.654041: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties:
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:00:1e.0
totalMemory: 11.17GiB freeMemory: 11.10GiB
2017-11-30 04:22:33.654070: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7)
2017-11-30 04:22:34.014329: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7)
Device mapping:
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7
2017-11-30 04:22:34.015339: I tensorflow/core/common_runtime/direct_session.cc:299] Device mapping:
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7
2017-11-30 04:23:22.426895: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7)
Please suggest what could be the problem. Thanks a ton for looking at this anyways!
That happens because you're using LSTM layers.
Tensorflow's implementation for LSTM layers is not that great. The reason is probably that recurrent calculations are not parallel calculations, and GPUs are great for parallel processing.
I confirmed that by my own experience:
This article about using GPUs and tensorflow also confirms that:
You may try using the new CuDNNLSTM, which seems prepared specially for using GPUs.
I never tested it, but you'll most probably get a better performance with this.
Another thing that I haven't tested, and I'm not sure it's designed for that reason, but I suspect it is: you can put unroll=True
in your LSTM layers. With that, I suspect the recurrent calculations will be transformed in parallel ones.
Try to use some bigger value for batch_size
in model.fit
, because the default is 32
. Increase it until you get 100% CPU utilization.
Following suggestion from @dgumo, you can also put your data into /run/shm
. This is an in-memory file system, which allows to access data in fastest possible way. Alternatively, you can ensure that your data resides at least on SSD. For example in /tmp
.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With