Tensorflow: simultaneous prediction on GPU and CPU

I'm working with TensorFlow and I want to speed up the prediction phase of a pre-trained Keras model (I'm not interested in the training phase) by using the CPU and one GPU simultaneously.

I tried to create two different threads that feed two different TensorFlow sessions (one running on the CPU and the other on the GPU). Each thread feeds a fixed number of batches in a loop (e.g., out of 100 batches in total, assign 20 to the CPU and 80 to the GPU, or any other combination of the two) and the results are then combined. It would be better if the split were done automatically.

However, even in this scenario the batches seem to be fed synchronously: even when I send only a few batches to the CPU and compute all the others on the GPU (with the GPU as the bottleneck), the overall prediction time is always higher than in the test using the GPU alone.

I would expect it to be faster because, when only the GPU is working, CPU usage is about 20-30%, so there is spare CPU capacity available to speed up the computation.

I have read many discussions, but they all deal with parallelism across multiple GPUs, not between a GPU and the CPU.

Here is a sample of the code I have written. The tensor_cpu and tensor_gpu objects are loaded from the same Keras model in this way:

with tf.device('/gpu:0'):
    model_gpu = load_model('model1.h5')
    tensor_gpu = model_gpu(x)

with tf.device('/cpu:0'):
    model_cpu = load_model('model1.h5')
    tensor_cpu = model_cpu(x) 

Then the prediction is done as follows:

def predict_on_device(session, predict_tensor, batches):
    for batch in batches:
        session.run(predict_tensor, feed_dict={x: batch})


def split_cpu_gpu(batches, num_batches_cpu, tensor_cpu, tensor_gpu):
    session1 = tf.Session(config=tf.ConfigProto(log_device_placement=True))
    session1.run(tf.global_variables_initializer())
    session2 = tf.Session(config=tf.ConfigProto(log_device_placement=True))
    session2.run(tf.global_variables_initializer())

    coord = tf.train.Coordinator()

    t_cpu = Thread(target=predict_on_device, args=(session1, tensor_cpu, batches[:num_batches_cpu]))
    t_gpu = Thread(target=predict_on_device, args=(session2, tensor_gpu, batches[num_batches_cpu:]))

    t_cpu.start()
    t_gpu.start()

    coord.join([t_cpu, t_gpu])

    session1.close()
    session2.close()

How can I achieve this CPU/GPU parallelization? I think I'm missing something.

Any kind of help would be much appreciated!

asked May 30 '17 by battuzz

People also ask

Can TensorFlow use both CPU and GPU?

If a TensorFlow operation has both CPU and GPU implementations, the GPU device is prioritized by default when the operation is assigned. For example, tf.matmul has both CPU and GPU kernels, and on a system with devices CPU:0 and GPU:0, the GPU:0 device is selected to run tf.matmul unless you explicitly request another device.
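
For illustration, here is a minimal sketch (using the same TF 1.x Session API as the rest of this page; the matrix sizes are arbitrary) that makes the default placement visible with log_device_placement:

import tensorflow as tf

# tf.matmul has both CPU and GPU kernels; with no explicit tf.device block,
# it is placed on GPU:0 by default. log_device_placement prints the chosen
# device for every op when the session runs.
a = tf.random_uniform([1024, 1024])
b = tf.random_uniform([1024, 1024])
c = tf.matmul(a, b)

with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
    sess.run(c)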

Does TensorFlow automatically use multiple GPUs?

If you have more than one GPU, the GPU with the lowest ID is selected by default. However, TensorFlow does not place operations onto multiple GPUs automatically. To spread work across multiple GPUs, you manually specify the device on which each computation node should run.
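
As a hedged sketch of such manual placement (TF 1.x API; it assumes a machine that actually has a second GPU visible as /gpu:1):

import tensorflow as tf

# Each tf.device block pins the ops created inside it to that device.
with tf.device('/gpu:0'):
    y0 = tf.matmul(tf.random_uniform([4096, 4096]), tf.random_uniform([4096, 4096]))

with tf.device('/gpu:1'):
    y1 = tf.matmul(tf.random_uniform([4096, 4096]), tf.random_uniform([4096, 4096]))

# allow_soft_placement lets TensorFlow fall back gracefully if /gpu:1 is missing.
with tf.Session(config=tf.ConfigProto(allow_soft_placement=True)) as sess:
    sess.run([y0, y1])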

Is TensorFlow better on CPU or GPU?

Benchmark studies have found that TensorFlow's performance depends significantly on the CPU for small datasets, while a graphics processing unit (GPU) becomes more important when training on large datasets.

Is there a TensorFlow API to distribute training across multiple GPUs, multiple machines, or TPUs?

tf.distribute.Strategy is a TensorFlow API to distribute training across multiple GPUs, multiple machines, or TPUs. Using this API, you can distribute your existing models and training code with minimal code changes.
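
A minimal sketch of that API (tf.distribute.MirroredStrategy, which belongs to the newer TF 2.x-style Keras workflow rather than the Session API used in this question; the model, layer sizes, and data names are placeholders for illustration only):

import tensorflow as tf

# Variables created inside strategy.scope() are mirrored across all visible GPUs,
# and model.fit() splits each batch across the replicas.
strategy = tf.distribute.MirroredStrategy()

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation='relu', input_shape=(32,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer='adam', loss='mse')

# model.fit(x_train, y_train, batch_size=64)  # x_train/y_train are hypothetical data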


1 Answer

Here's my code that demonstrates how CPU and GPU execution can be done in parallel:

import tensorflow as tf
import numpy as np
from time import time
from threading import Thread

n = 1024 * 8

data_cpu = np.random.uniform(size=[n//16, n]).astype(np.float32)
data_gpu = np.random.uniform(size=[n    , n]).astype(np.float32)

with tf.device('/cpu:0'):
    x = tf.placeholder(name='x', dtype=tf.float32)

def get_var(name):
    return tf.get_variable(name, shape=[n, n])

def op(name):
    w = get_var(name)
    y = x
    for _ in range(8):
        y = tf.matmul(y, w)
    return y

with tf.device('/cpu:0'):
    cpu = op('w_cpu')

with tf.device('/gpu:0'):
    gpu = op('w_gpu')

def f(session, y, data):
    return session.run(y, feed_dict={x : data})


with tf.Session(config=tf.ConfigProto(log_device_placement=True, intra_op_parallelism_threads=8)) as sess:
    sess.run(tf.global_variables_initializer())

    coord = tf.train.Coordinator()

    threads = []

    # comment out 0 or 1 of the following 2 lines:
    threads += [Thread(target=f, args=(sess, cpu, data_cpu))]
    threads += [Thread(target=f, args=(sess, gpu, data_gpu))]

    t0 = time()

    for t in threads:
        t.start()

    coord.join(threads)

    t1 = time()


print(t1 - t0)

The timing results are:

  • CPU thread: 4-5s (will vary by machine, of course).

  • GPU thread: 5s (It does 16x as much work).

  • Both at the same time: 5s

Note that there was no need to have 2 sessions (but that worked for me too).

The reasons you might be seeing different results could be:

  • contention for system resources (GPU execution does consume some host resources, and if the CPU thread crowds them out, that can worsen performance)

  • incorrect timing

  • part of your model can only run on the GPU or only on the CPU (see the sketch after this list)

  • a bottleneck elsewhere

  • some other problem
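
To rule out the third bullet, one quick check (a hedged diagnostic sketch using the same TF 1.x API as above, not something from the original question) is to disable soft placement so TensorFlow fails loudly instead of silently moving ops to another device:

import tensorflow as tf

# With allow_soft_placement=False, TensorFlow raises an error if any op pinned to
# /cpu:0 or /gpu:0 has no kernel for that device, instead of silently relocating it.
# log_device_placement additionally prints where every op actually ran.
config = tf.ConfigProto(allow_soft_placement=False, log_device_placement=True)

with tf.device('/gpu:0'):
    y = tf.matmul(tf.random_uniform([512, 512]), tf.random_uniform([512, 512]))

with tf.Session(config=config) as sess:
    sess.run(y)  # fails here if the placement could not be honored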

answered Oct 21 '22 by MWB