I'm working with TensorFlow and I want to speed up the prediction phase of a pre-trained Keras model (I'm not interested in the training phase) by using the CPU and one GPU simultaneously.
I tried to create two different threads that feed two different TensorFlow sessions (one running on the CPU and the other on the GPU). Each thread feeds a fixed number of batches in a loop (e.g. if we have 100 batches overall, I want to assign 20 batches to the CPU and 80 to the GPU, or any other combination of the two) and then the results are combined. It would be better if the split were done automatically.
However, even in this scenario it seems that the batches are fed synchronously: even when I send only a few batches to the CPU and compute all the others on the GPU (with the GPU as the bottleneck), the overall prediction time is always higher than in the test made using only the GPU.
I would expect it to be faster, because when only the GPU is working the CPU usage is about 20-30%, so there is some CPU capacity available to speed up the computation.
I read a lot of discussions, but they all deal with parallelism across multiple GPUs, not between a GPU and the CPU.
Here is a sample of the code I have written: the tensor_cpu and tensor_gpu objects are loaded from the same Keras model in this way:
with tf.device('/gpu:0'):
    model_gpu = load_model('model1.h5')
    tensor_gpu = model_gpu(x)

with tf.device('/cpu:0'):
    model_cpu = load_model('model1.h5')
    tensor_cpu = model_cpu(x)
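(For completeness: x is the input placeholder the batches are fed into; I define it before loading the models, roughly like this, where the shape is just an example:)

x = tf.placeholder(dtype=tf.float32, shape=(None, 224, 224, 3))  # example shape; the real one depends on the model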
Then the prediction is done as follows:
def predict_on_device(session, predict_tensor, batches):
    for batch in batches:
        session.run(predict_tensor, feed_dict={x: batch})

def split_cpu_gpu(batches, num_batches_cpu, tensor_cpu, tensor_gpu):
    session1 = tf.Session(config=tf.ConfigProto(log_device_placement=True))
    session1.run(tf.global_variables_initializer())
    session2 = tf.Session(config=tf.ConfigProto(log_device_placement=True))
    session2.run(tf.global_variables_initializer())

    coord = tf.train.Coordinator()

    t_cpu = Thread(target=predict_on_device, args=(session1, tensor_cpu, batches[:num_batches_cpu]))
    t_gpu = Thread(target=predict_on_device, args=(session2, tensor_gpu, batches[num_batches_cpu:]))

    t_cpu.start()
    t_gpu.start()

    coord.join([t_cpu, t_gpu])

    session1.close()
    session2.close()
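I then call it roughly like this (the batch shape and the 20/80 split are just an example):

batches = [np.zeros((32, 224, 224, 3), dtype=np.float32) for _ in range(100)]  # dummy data; in practice these are my real input batches
split_cpu_gpu(batches, 20, tensor_cpu, tensor_gpu)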
How can I achieve this CPU/GPU parallelization? I think I'm missing something.
Any kind of help would be much appreciated!
Here's my code that demonstrates how CPU and GPU execution can be done in parallel:
import tensorflow as tf
import numpy as np
from time import time
from threading import Thread

n = 1024 * 8

# The CPU gets 1/16 of the data the GPU gets, so both finish in roughly the same time.
data_cpu = np.random.uniform(size=[n // 16, n]).astype(np.float32)
data_gpu = np.random.uniform(size=[n, n]).astype(np.float32)

with tf.device('/cpu:0'):
    x = tf.placeholder(name='x', dtype=tf.float32)

def get_var(name):
    return tf.get_variable(name, shape=[n, n])

def op(name):
    # Chain of 8 matmuls against a device-local weight matrix.
    w = get_var(name)
    y = x
    for _ in range(8):
        y = tf.matmul(y, w)
    return y

with tf.device('/cpu:0'):
    cpu = op('w_cpu')

with tf.device('/gpu:0'):
    gpu = op('w_gpu')

def f(session, y, data):
    return session.run(y, feed_dict={x: data})

with tf.Session(config=tf.ConfigProto(log_device_placement=True,
                                      intra_op_parallelism_threads=8)) as sess:
    sess.run(tf.global_variables_initializer())

    coord = tf.train.Coordinator()
    threads = []
    # comment out 0 or 1 of the following 2 lines:
    threads += [Thread(target=f, args=(sess, cpu, data_cpu))]
    threads += [Thread(target=f, args=(sess, gpu, data_gpu))]

    t0 = time()
    for t in threads:
        t.start()
    coord.join(threads)
    t1 = time()

    print(t1 - t0)
The timing results are:
CPU thread: 4-5s (will vary by machine, of course).
GPU thread: 5s (it does 16x as much work).
Both at the same time: 5s
Note that there was no need to have 2 sessions (but that worked for me too): a single session can drive both subgraphs, because session.run releases the Python GIL while the graph executes, so the two threads really do run concurrently.
The reasons you might be seeing different results could be:
some contention for system resources (GPU execution does consume some host resources, and if the CPU thread crowds them out, that could worsen performance; one way to limit this is sketched below this list)
incorrect timing
part of your model can only run on the GPU/CPU
a bottleneck elsewhere
some other problem
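As a minimal sketch of how the contention point could be addressed, assuming the two-session setup from the question (the thread counts here are just example values and would need tuning for your machine):

import tensorflow as tf

# Cap the cores available to the CPU-side session so the thread feeding the GPU
# is not starved; the numbers are placeholders, not recommendations.
cpu_config = tf.ConfigProto(intra_op_parallelism_threads=4,
                            inter_op_parallelism_threads=1)
gpu_config = tf.ConfigProto(allow_soft_placement=True)  # fall back to CPU for ops without a GPU kernel

session_cpu = tf.Session(config=cpu_config)
session_gpu = tf.Session(config=gpu_config)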