Using Keras with the TensorFlow backend, I am trying to train an LSTM network, and it is taking much longer to run on a GPU than on a CPU.
I am training the LSTM network using the fit_generator function. One epoch takes ~250 seconds on the CPU but ~900 seconds on the GPU. The packages in my GPU environment include:
keras-applications 1.0.8 py_0 anaconda
keras-base 2.2.4 py36_0 anaconda
keras-gpu 2.2.4 0 anaconda
keras-preprocessing 1.1.0 py_1 anaconda
...
tensorflow 1.13.1 gpu_py36h3991807_0 anaconda
tensorflow-base 1.13.1 gpu_py36h8d69cac_0 anaconda
tensorflow-estimator 1.13.0 py_0 anaconda
tensorflow-gpu 1.13.1 pypi_0 pypi
My CUDA compilation tools are version 9.1.85, and my CUDA and driver versions are:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67 Driver Version: 418.67 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce RTX 2080 On | 00000000:0A:00.0 Off | N/A |
| 0% 39C P8 5W / 225W | 7740MiB / 7952MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce RTX 2080 On | 00000000:42:00.0 Off | N/A |
| 0% 33C P8 19W / 225W | 142MiB / 7951MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 49251 C .../whsu014/.conda/envs/whsuphd/bin/python 7729MiB |
| 1 1354 G /usr/lib/xorg/Xorg 16MiB |
| 1 49251 C .../whsu014/.conda/envs/whsuphd/bin/python 113MiB |
+-----------------------------------------------------------------------------+
When I insert this line of code
tf.Session(config=tf.ConfigProto(log_device_placement=True)):
I see the following in my terminal:
...
ining_1/Adam/Const_10: (Const)/job:localhost/replica:0/task:0/device:GPU:0
training_1/Adam/Const_11: (Const): /job:localhost/replica:0/task:0/device:GPU:0
2019-06-25 11:27:31.720653: I tensorflow/core/common_runtime/placer.cc:1059] training_1/Adam/Const_11: (Const)/job:localhost/replica:0/task:0/device:GPU:0
training_1/Adam/add_15/y: (Const): /job:localhost/replica:0/task:0/device:GPU:0
2019-06-25 11:27:31.720666: I tensorflow/core/common_runtime/placer.cc:1059] training_1/Adam/add_15/y: (Const)/job:localhost/replica:0/task:0/device:GPU:0
...
So it seems that TensorFlow is using the GPU.
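(As a side check, the devices TensorFlow can see can also be listed directly. This is a minimal sketch using the TF 1.13 API from the environment above; it is only for verification and not part of my training code.)

import tensorflow as tf
from tensorflow.python.client import device_lib

# List every device TensorFlow can see (should include both RTX 2080s).
print(device_lib.list_local_devices())

# True when at least one GPU is available and usable by TensorFlow.
print(tf.test.is_gpu_available())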
When I profile the code, these are the first 10 lines on the GPU:
10852017 function calls (10524203 primitive calls) in 184.768 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
16200 173.827 0.011 173.827 0.011 {built-in method _pywrap_tensorflow_internal.TF_SessionRunCallable}
6 0.926 0.154 0.926 0.154 {built-in method _pywrap_tensorflow_internal.TF_SessionMakeCallable}
62 0.813 0.013 0.813 0.013 {built-in method _pywrap_tensorflow_internal.TF_SessionRun_wrapper}
156954 0.414 0.000 0.415 0.000 {built-in method numpy.array}
16200 0.379 0.000 1.042 0.000 training.py:643(_standardize_user_data)
24300 0.338 0.000 0.338 0.000 {method 'partition' of 'numpy.ndarray' objects}
68 0.301 0.004 0.301 0.004 {built-in method _pywrap_tensorflow_internal.ExtendSession}
32458 0.223 0.000 2.122 0.000 tensorflow_backend.py:156(get_session)
3206 0.212 0.000 0.238 0.000 tf_stack.py:31(extract_stack)
76024 0.210 0.000 0.702 0.000 ops.py:5246(get_controller)
...
and these are the first 10 lines on the CPU:
22123473 function calls (21647174 primitive calls) in 60.173 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
16269 42.491 0.003 42.491 0.003 {built-in method tensorflow.python._pywrap_tensorflow_internal.TF_Run}
16269 0.568 0.000 48.964 0.003 session.py:1042(_run)
56 0.532 0.010 0.532 0.010 {built-in method time.sleep}
153641 0.458 0.000 0.460 0.000 {built-in method numpy.core.multiarray.array}
183148/125354 0.447 0.000 1.316 0.000 python_message.py:469(init)
1226659 0.362 0.000 0.364 0.000 {built-in method builtins.getattr}
2302110/2301986 0.339 0.000 0.358 0.000 {built-in method builtins.isinstance}
8 0.285 0.036 0.285 0.036 {built-in method tensorflow.python._pywrap_tensorflow_internal.TF_ExtendGraph}
12150 0.267 0.000 0.271 0.000 callbacks.py:211(on_batch_end)
147026/49078 0.264 0.000 1.429 0.000 python_message.py:1008(ByteSize)
...
This is my code:
import numpy as np
import tensorflow as tf
from keras.models import Sequential
from keras.layers import LSTM, Dense
from keras.callbacks import ModelCheckpoint
from matplotlib import pyplot


def train_generator(x_list, y_list):
    # 0.1 validation split
    train_length = (len(x_list) // 10) * 9
    while True:
        for i in range(train_length):
            train_x = np.array([x_list[i]])
            train_y = np.array([y_list[i]])
            yield train_x, train_y


def val_generator(x_list, y_list):
    # 0.1 validation split
    val_length = len(x_list) // 10
    while True:
        for i in range(-val_length, 0, 1):
            val_x = np.array([x_list[i]])
            val_y = np.array([y_list[i]])
            yield val_x, val_y


with tf.Session(config=tf.ConfigProto(log_device_placement=True)):
    model = Sequential()
    model.add(LSTM(64, return_sequences=False,
                   input_shape=(None, 24)))
    model.add(Dense(1))
    model.compile(loss='mae', optimizer='adam')

    checkpointer = ModelCheckpoint(filepath="weights.hdf5",
                                   monitor='val_loss', verbose=1,
                                   save_best_only=True)

    history = model.fit_generator(generator=train_generator(train_x,
                                                            train_y),
                                  steps_per_epoch=(len(train_x)//10)*9,
                                  epochs=5,
                                  validation_data=val_generator(train_x,
                                                                train_y),
                                  validation_steps=len(train_x)//10,
                                  callbacks=[checkpointer],
                                  verbose=2, shuffle=False)

    # plot history
    pyplot.plot(history.history['loss'], label='train')
    pyplot.plot(history.history['val_loss'], label='validation')
    pyplot.legend()
    pyplot.show()
I expect a significant speed-up when using the GPU for training. How can I fix this? Can someone help me understand what is causing the slowdown? Thank you.
A couple of observations:

Use CuDNNLSTM instead of LSTM to train on the GPU; you will see a considerable increase in speed.

GPUs pay off for highly intensive tasks and computations (very big LSTMs / heavy CNN networks). Nevertheless, for very small MLPs and even small LSTMs you may observe that the network trains equally fast on CPU and GPU, or that in some particular cases (very small networks) the CPU is even faster.
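For example, with the Keras 2.2.4 / TensorFlow 1.13 setup shown above, the layer in the question's model could be swapped roughly like this (a minimal sketch; note that CuDNNLSTM supports only the default tanh/sigmoid activations and runs only on a GPU):

from keras.models import Sequential
from keras.layers import Dense, CuDNNLSTM

model = Sequential()
# Drop-in replacement for the original LSTM(64) layer; it uses the fused
# cuDNN kernel instead of the generic graph implementation.
model.add(CuDNNLSTM(64, return_sequences=False, input_shape=(None, 24)))
model.add(Dense(1))
model.compile(loss='mae', optimizer='adam')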
UPDATE FOR TENSORFLOW >= 2.0

tf.keras.layers.LSTM and tf.keras.layers.GRU default to the cuDNN implementation when a video card is detected, so there is no need to import CuDNNLSTM/CuDNNGRU explicitly.
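As a rough sketch of what that looks like in TF 2.x (assuming the same 64-unit model from the question; the cuDNN kernel is used only when the layer keeps its default arguments, e.g. tanh activation, sigmoid recurrent activation, no recurrent dropout, and unroll=False):

import tensorflow as tf

model = tf.keras.Sequential([
    # With default arguments this layer automatically dispatches to the
    # fused cuDNN kernel when a GPU is available.
    tf.keras.layers.LSTM(64, input_shape=(None, 24)),
    tf.keras.layers.Dense(1),
])
model.compile(loss='mae', optimizer='adam')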