I'm working with TensorFlow and I want to speed up the prediction phase of a pre-trained Keras model (I'm not interested in the training phase) by using the CPU and one GPU simultaneously.
I tried to create two different threads that feed two different TensorFlow sessions (one running on the CPU and the other on the GPU). Each thread feeds a fixed number of batches in a loop (e.g. if we have 100 batches overall, I want to assign 20 batches to the CPU and 80 to the GPU, or any other combination of the two) and then the results are combined. It would be better if the split were done automatically.
However, even in this scenario it seems that the batches are fed synchronously: even when I send only a few batches to the CPU and compute all the others on the GPU (with the GPU as the bottleneck), the overall prediction time is always higher than in the test made using only the GPU.
I would expect it to be faster, because when only the GPU is working the CPU usage is about 20-30%, so there is some CPU capacity available to speed up the computation.
I read a lot of discussions, but they all deal with parallelism across multiple GPUs, not between a GPU and the CPU.
Here is a sample of the code I have written: the tensor_cpu and tensor_gpu objects are loaded from the same Keras model in this way:
with tf.device('/gpu:0'):
    model_gpu = load_model('model1.h5')
    tensor_gpu = model_gpu(x)

with tf.device('/cpu:0'):
    model_cpu = load_model('model1.h5')
    tensor_cpu = model_cpu(x)
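(For completeness: x is the input placeholder the batches are fed into; I define it before loading the models, roughly like this, where the shape is just an example:)

x = tf.placeholder(dtype=tf.float32, shape=(None, 224, 224, 3))  # example shape; the real one depends on the model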
Then the prediction is done as follows:
def predict_on_device(session, predict_tensor, batches):
    for batch in batches:
        session.run(predict_tensor, feed_dict={x: batch})

def split_cpu_gpu(batches, num_batches_cpu, tensor_cpu, tensor_gpu):
    session1 = tf.Session(config=tf.ConfigProto(log_device_placement=True))
    session1.run(tf.global_variables_initializer())
    session2 = tf.Session(config=tf.ConfigProto(log_device_placement=True))
    session2.run(tf.global_variables_initializer())

    coord = tf.train.Coordinator()

    t_cpu = Thread(target=predict_on_device, args=(session1, tensor_cpu, batches[:num_batches_cpu]))
    t_gpu = Thread(target=predict_on_device, args=(session2, tensor_gpu, batches[num_batches_cpu:]))

    t_cpu.start()
    t_gpu.start()

    coord.join([t_cpu, t_gpu])

    session1.close()
    session2.close()
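I then call it roughly like this (the batch shape and the 20/80 split are just an example):

batches = [np.zeros((32, 224, 224, 3), dtype=np.float32) for _ in range(100)]  # dummy data; in practice these are my real input batches
split_cpu_gpu(batches, 20, tensor_cpu, tensor_gpu)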
How can I achieve this CPU/GPU parallelization? I think I'm missing something.
Any kind of help would be much appreciated!
Here's my code that demonstrates how CPU and GPU execution can be done in parallel:
import tensorflow as tf
import numpy as np
from time import time
from threading import Thread

n = 1024 * 8

# The CPU gets 1/16 of the data the GPU gets, so both finish in roughly the same time.
data_cpu = np.random.uniform(size=[n // 16, n]).astype(np.float32)
data_gpu = np.random.uniform(size=[n, n]).astype(np.float32)

with tf.device('/cpu:0'):
    x = tf.placeholder(name='x', dtype=tf.float32)

def get_var(name):
    return tf.get_variable(name, shape=[n, n])

def op(name):
    # Chain of 8 matmuls against a device-local weight matrix.
    w = get_var(name)
    y = x
    for _ in range(8):
        y = tf.matmul(y, w)
    return y

with tf.device('/cpu:0'):
    cpu = op('w_cpu')

with tf.device('/gpu:0'):
    gpu = op('w_gpu')

def f(session, y, data):
    return session.run(y, feed_dict={x: data})

with tf.Session(config=tf.ConfigProto(log_device_placement=True,
                                      intra_op_parallelism_threads=8)) as sess:
    sess.run(tf.global_variables_initializer())

    coord = tf.train.Coordinator()
    threads = []
    # comment out 0 or 1 of the following 2 lines:
    threads += [Thread(target=f, args=(sess, cpu, data_cpu))]
    threads += [Thread(target=f, args=(sess, gpu, data_gpu))]

    t0 = time()
    for t in threads:
        t.start()
    coord.join(threads)
    t1 = time()

    print(t1 - t0)
The timing results are:
CPU thread: 4-5s (will vary by machine, of course).
GPU thread: 5s (it does 16x as much work).
Both at the same time: 5s
Note that there was no need to have 2 sessions (but that worked for me too): a single session can drive both subgraphs, because session.run releases the Python GIL while the graph executes, so the two threads really do run concurrently.
The reasons you might be seeing different results could be:
some contention for system resources (GPU execution does consume some host resources, and if the CPU thread crowds them out, that could worsen performance; one way to limit this is sketched below this list)
incorrect timing
part of your model can only run on the GPU/CPU
a bottleneck elsewhere
some other problem
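As a minimal sketch of how the contention point could be addressed, assuming the two-session setup from the question (the thread counts here are just example values and would need tuning for your machine):

import tensorflow as tf

# Cap the cores available to the CPU-side session so the thread feeding the GPU
# is not starved; the numbers are placeholders, not recommendations.
cpu_config = tf.ConfigProto(intra_op_parallelism_threads=4,
                            inter_op_parallelism_threads=1)
gpu_config = tf.ConfigProto(allow_soft_placement=True)  # fall back to CPU for ops without a GPU kernel

session_cpu = tf.Session(config=cpu_config)
session_gpu = tf.Session(config=gpu_config)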