
Tensorflow slower on GPU than on CPU

Using Keras with the TensorFlow backend, I am trying to train an LSTM network, and it is taking much longer to run on a GPU than on a CPU.

I am training the LSTM network using the fit_generator function. An epoch takes ~250 seconds on the CPU but ~900 seconds on the GPU. The packages in my GPU environment include

keras-applications        1.0.8                      py_0    anaconda
keras-base                2.2.4                    py36_0    anaconda
keras-gpu                 2.2.4                         0    anaconda
keras-preprocessing       1.1.0                      py_1    anaconda
...
tensorflow                1.13.1          gpu_py36h3991807_0    anaconda
tensorflow-base           1.13.1          gpu_py36h8d69cac_0    anaconda
tensorflow-estimator      1.13.0                     py_0    anaconda
tensorflow-gpu            1.13.1                   pypi_0    pypi

My CUDA compilation tools are version 9.1.85, and my driver/CUDA versions as reported by nvidia-smi are

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 2080    On   | 00000000:0A:00.0 Off |                  N/A |
|  0%   39C    P8     5W / 225W |   7740MiB /  7952MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 2080    On   | 00000000:42:00.0 Off |                  N/A |
|  0%   33C    P8    19W / 225W |    142MiB /  7951MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     49251      C   .../whsu014/.conda/envs/whsuphd/bin/python  7729MiB |
|    1      1354      G   /usr/lib/xorg/Xorg                            16MiB |
|    1     49251      C   .../whsu014/.conda/envs/whsuphd/bin/python   113MiB |
+-----------------------------------------------------------------------------+

When I insert this line of code

tf.Session(config=tf.ConfigProto(log_device_placement=True)):

I see the below in my terminal

...
ining_1/Adam/Const_10: (Const)/job:localhost/replica:0/task:0/device:GPU:0
training_1/Adam/Const_11: (Const): /job:localhost/replica:0/task:0/device:GPU:0
2019-06-25 11:27:31.720653: I tensorflow/core/common_runtime/placer.cc:1059] training_1/Adam/Const_11: (Const)/job:localhost/replica:0/task:0/device:GPU:0
training_1/Adam/add_15/y: (Const): /job:localhost/replica:0/task:0/device:GPU:0
2019-06-25 11:27:31.720666: I tensorflow/core/common_runtime/placer.cc:1059] training_1/Adam/add_15/y: (Const)/job:localhost/replica:0/task:0/device:GPU:0
...

So it seems that TensorFlow is using the GPU.

When I profile the code, these are the first 10 lines of the profiler output on the GPU

10852017 function calls (10524203 primitive calls) in 184.768 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    16200  173.827    0.011  173.827    0.011 {built-in method _pywrap_tensorflow_internal.TF_SessionRunCallable}
        6    0.926    0.154    0.926    0.154 {built-in method _pywrap_tensorflow_internal.TF_SessionMakeCallable}
       62    0.813    0.013    0.813    0.013 {built-in method _pywrap_tensorflow_internal.TF_SessionRun_wrapper}
   156954    0.414    0.000    0.415    0.000 {built-in method numpy.array}
    16200    0.379    0.000    1.042    0.000 training.py:643(_standardize_user_data)
    24300    0.338    0.000    0.338    0.000 {method 'partition' of 'numpy.ndarray' objects}
       68    0.301    0.004    0.301    0.004 {built-in method _pywrap_tensorflow_internal.ExtendSession}
    32458    0.223    0.000    2.122    0.000 tensorflow_backend.py:156(get_session)
     3206    0.212    0.000    0.238    0.000 tf_stack.py:31(extract_stack)
    76024    0.210    0.000    0.702    0.000 ops.py:5246(get_controller)
...

and these are the first 10 lines on the CPU

22123473 function calls (21647174 primitive calls) in 60.173 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    16269   42.491    0.003   42.491    0.003 {built-in method tensorflow.python._pywrap_tensorflow_internal.TF_Run}
    16269    0.568    0.000   48.964    0.003 session.py:1042(_run)
       56    0.532    0.010    0.532    0.010 {built-in method time.sleep}
   153641    0.458    0.000    0.460    0.000 {built-in method numpy.core.multiarray.array}
183148/125354    0.447    0.000    1.316    0.000 python_message.py:469(init)
  1226659    0.362    0.000    0.364    0.000 {built-in method builtins.getattr}
2302110/2301986    0.339    0.000    0.358    0.000 {built-in method builtins.isinstance}
        8    0.285    0.036    0.285    0.036 {built-in method tensorflow.python._pywrap_tensorflow_internal.TF_ExtendGraph}
    12150    0.267    0.000    0.271    0.000 callbacks.py:211(on_batch_end)
147026/49078    0.264    0.000    1.429    0.000 python_message.py:1008(ByteSize)
...

This is my code.

import numpy as np
import tensorflow as tf
from keras.models import Sequential
from keras.layers import LSTM, Dense
from keras.callbacks import ModelCheckpoint
from matplotlib import pyplot


def train_generator(x_list, y_list):
    # 0.1 validation split: the first 90% of the samples are used for training
    train_length = (len(x_list)//10)*9
    while True:
        for i in range(train_length):
            train_x = np.array([x_list[i]])
            train_y = np.array([y_list[i]])
            yield train_x, train_y

def val_generator(x_list, y_list):
    # 0.1 validation split: the last 10% of the samples are used for validation
    val_length = len(x_list)//10
    while True:
        for i in range(-val_length, 0, 1):
            val_x = np.array([x_list[i]])
            val_y = np.array([y_list[i]])
            yield val_x, val_y


# train_x and train_y (the full lists of input sequences and targets) are prepared earlier (not shown).
with tf.Session(config=tf.ConfigProto(log_device_placement=True)):
    model = Sequential()
    model.add(LSTM(64, return_sequences=False,
                   input_shape=(None, 24)))
    model.add(Dense(1))
    model.compile(loss='mae', optimizer='adam')
    checkpointer = ModelCheckpoint(filepath="weights.hdf5",
                                   monitor='val_loss', verbose=1,
                                   save_best_only=True)

    history = model.fit_generator(generator=train_generator(train_x,
                                                            train_y),
                                  steps_per_epoch=(len(train_x)//10)*9,
                                  epochs=5,
                                  validation_data=val_generator(train_x,
                                                                train_y),
                                  validation_steps=len(train_x)//10,
                                  callbacks=[checkpointer],
                                  verbose=2, shuffle=False)

    # plot history
    pyplot.plot(history.history['loss'], label='train')
    pyplot.plot(history.history['val_loss'], label='validation')
    pyplot.legend()
    pyplot.show()

I expected a significant speed-up when training on the GPU. What is causing the slowdown, and how can I fix it? Thank you.


1 Answer

A couple of observations:

  1. Use CuDNNLSTM instead of LSTM when training on the GPU; you will see a considerable increase in speed (see the sketch after this list).
  2. Sometimes, for very small networks, the overhead of transferring data between CPU and GPU outweighs the parallel computation done on the GPU; in other words, more time is lost moving data than is gained by training on the GPU.
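
For point 1, a minimal sketch of swapping the layer in the model from the question, assuming Keras 2.2.4 where CuDNNLSTM lives in keras.layers (note that CuDNNLSTM runs only on a GPU and fixes the activations to the cuDNN defaults, tanh/sigmoid):

from keras.models import Sequential
from keras.layers import CuDNNLSTM, Dense

model = Sequential()
# Drop-in replacement for LSTM(64, ...); executes as a fused cuDNN kernel on the GPU.
model.add(CuDNNLSTM(64, return_sequences=False, input_shape=(None, 24)))
model.add(Dense(1))
model.compile(loss='mae', optimizer='adam')

The rest of the training code (fit_generator, callbacks) stays the same.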

GPUs pay off for computationally intensive models (very large LSTMs, heavy CNNs). For very small MLPs, and even small LSTMs, you may find that the network trains just as fast on the CPU, or in some particular cases (very small networks) even faster.
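
Related to point 2: the generators in the question yield a single sample per step (batch size 1), so most of the time in the GPU profile is spent inside the 16,200 separate TF_SessionRunCallable calls, i.e. per-step launch and transfer overhead. A minimal sketch of a batched generator that reduces that overhead, assuming the sequences in x_list can be stacked into one array (equal length or already padded); batched_train_generator and batch_size are illustrative names:

import numpy as np

def batched_train_generator(x_list, y_list, batch_size=32):
    # Same 0.1 validation split as in the question, but yield a whole batch
    # per step so each session run does more work per kernel launch.
    train_length = (len(x_list)//10)*9
    while True:
        for start in range(0, train_length, batch_size):
            end = min(start + batch_size, train_length)
            # Only works if the sequences can be stacked (equal length or padded).
            yield np.array(x_list[start:end]), np.array(y_list[start:end])

steps_per_epoch would then shrink to roughly train_length // batch_size.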

UPDATE FOR TENSORFLOW >= 2.0

In TensorFlow >= 2.0, the built-in LSTM/GRU layers default to the cuDNN implementation (what CuDNNLSTM/CuDNNGRU provided before) when a compatible GPU is detected, so you no longer need to import them explicitly.
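
For illustration, a minimal TensorFlow 2.x sketch of the same model; with the default arguments (tanh activation, sigmoid recurrent activation, no recurrent dropout) tf.keras.layers.LSTM dispatches to the fused cuDNN kernel automatically when a GPU is visible:

import tensorflow as tf

model = tf.keras.Sequential([
    # Default arguments -> the cuDNN implementation is used automatically on GPU.
    tf.keras.layers.LSTM(64, input_shape=(None, 24)),
    tf.keras.layers.Dense(1),
])
model.compile(loss='mae', optimizer='adam')

# Sanity-check that a GPU is visible (TF >= 2.1; on 2.0 use
# tf.config.experimental.list_physical_devices instead).
print(tf.config.list_physical_devices('GPU'))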
