I have created 3 virtual GPUs (I have 1 physical GPU) and I am trying to speed up vectorization of images. However, using the code provided below with manual device placement from the official docs (here), I get strange results: training on all GPUs is two times slower than on a single one. I also checked this code (with the virtual device initialization removed) on a machine with 3 physical GPUs; it behaves the same way.
Environment: Python 3.6, Ubuntu 18.04.3, tensorflow-gpu 1.14.0.
Code (this example creates 3 virtual devices, so you can test it on a PC with a single GPU):
import os
import time
import numpy as np
import tensorflow as tf
from PIL import Image
start = time.time()
def load_graph(frozen_graph_filename):
    # We load the protobuf file from the disk and parse it to retrieve the
    # unserialized graph_def
    with tf.gfile.GFile(frozen_graph_filename, "rb") as f:
        graph_def = tf.GraphDef()
        graph_def.ParseFromString(f.read())
    # Then, we import the graph_def into a new Graph and returns it
    with tf.Graph().as_default() as graph:
        # The name var will prefix every op/nodes in your graph
        # Since we load everything in a new graph, this is not needed
        tf.import_graph_def(graph_def, name="")
    return graph
path_to_graph = '/imagenet/' # Path to imagenet folder where graph file is placed
GRAPH = load_graph(os.path.join(path_to_graph, 'classify_image_graph_def.pb'))
# Create Session
config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.9
config.gpu_options.allow_growth = True
session = tf.Session(graph=GRAPH, config=config)
output_dir = '/vectors/'  # where to save the vectors extracted from the images
image_list = ['1.jpg', '2.jpg', '3.jpg']  # list of images to vectorize (tested on 100 and 1000 examples)
selected_list = image_list
# Single GPU vectorization
for image_index, image in enumerate(selected_list):
    with Image.open(image) as f:
        image_data = f.convert('RGB')
        feature_tensor = session.graph.get_tensor_by_name('pool_3:0')
        feature_vector = session.run(feature_tensor, {'DecodeJpeg:0': image_data})
        feature_vector = np.squeeze(feature_vector)
        outfile_name = os.path.basename(image) + ".vc"
        out_path = os.path.join(output_dir, outfile_name)
        # Save vector
        np.savetxt(out_path, feature_vector, delimiter=',')
print(f"Single GPU: {time.time() - start}")
start = time.time()
print("Start calculation on multiple GPU")
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    # Create 3 virtual GPUs with 1GB memory each
    try:
        tf.config.experimental.set_virtual_device_configuration(
            gpus[0],
            [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=1024),
             tf.config.experimental.VirtualDeviceConfiguration(memory_limit=1024),
             tf.config.experimental.VirtualDeviceConfiguration(memory_limit=1024)])
        logical_gpus = tf.config.experimental.list_logical_devices('GPU')
        print(len(gpus), "Physical GPU,", len(logical_gpus), "Logical GPUs")
    except RuntimeError as e:
        # Virtual devices must be set before GPUs have been initialized
        print(e)
print("Create prepared ops")
start1 = time.time()
gpus = logical_gpus # comment this line to use physical GPU devices for calculations
# Assign a chunk of the image list to each GPU
# image_list1, image_list2, image_list3 = image_list[:len(image_list) // 3], \
#                                         image_list[len(image_list) // 3:2 * len(image_list) // 3], \
#                                         image_list[2 * len(image_list) // 3:]
selected_list = image_list  # comment this line if you want to assign a chunk of the list to each GPU manually
output_vectors = []
if gpus:
    # Replicate the computation across the multiple GPUs
    feature_vectors = []
    for gpu in gpus:  # iterating over the virtual GPU devices, not the physical ones
        with tf.device(gpu.name):
            print(f"Assign list of images to {gpu.name.split(':', 4)[-1]}")
            # Try to assign a chunk of the image list to each GPU - takes the same time as a single GPU
            # if gpu.name.split(':', 4)[-1] == "GPU:0":
            #     selected_list = image_list1
            # if gpu.name.split(':', 4)[-1] == "GPU:1":
            #     selected_list = image_list2
            # if gpu.name.split(':', 4)[-1] == "GPU:2":
            #     selected_list = image_list3
            for image_index, image in enumerate(selected_list):
                with Image.open(image) as f:
                    image_data = f.convert('RGB')
                    feature_tensor = session.graph.get_tensor_by_name('pool_3:0')
                    feature_vector = session.run(feature_tensor, {'DecodeJpeg:0': image_data})
                    feature_vectors.append(feature_vector)
print("All images has been assigned to GPU's")
print(f"Time spend on prep ops: {time.time() - start1}")
print("Start calculation on multiple GPU")
start1 = time.time()
for image_index, image in enumerate(image_list):
    feature_vector = np.squeeze(feature_vectors[image_index])
    outfile_name = os.path.basename(image) + ".vc"
    out_path = os.path.join(output_dir, outfile_name)
    # Save vector
    np.savetxt(out_path, feature_vector, delimiter=',')
# Close session
session.close()
print(f"Calc on GPU's spend: {time.time() - start1}")
print(f"All time, spend on multiple GPU: {time.time() - start}")
Example output (from a list with 100 images):
1 Physical GPU, 3 Logical GPUs
Single GPU: 18.76301646232605
Start calculation on multiple GPU
Create prepared ops
Assign list of images to GPU:0
Assign list of images to GPU:1
Assign list of images to GPU:2
All images has been assigned to GPU's
Time spend on prep ops: 18.263537883758545
Start calculation on multiple GPU
Calc on GPU's spend: 11.697082042694092
All time, spend on multiple GPU: 29.960679531097412
What I tried: splitting the list of images into 3 chunks and assigning each chunk to a GPU (see the commented lines of code). This reduces the multi-GPU time to 17 seconds, which is only slightly faster than the single-GPU run of 18 seconds (~5%).
Expected result: the multi-GPU version is faster than the single-GPU version (at least a 1.5x speedup).
My guess as to why this happens: I wrote the calculation in a wrong way.
tf.distribute.Strategy is a TensorFlow API to distribute training across multiple GPUs, multiple machines, or TPUs. Using this API, you can distribute your existing models and training code with minimal code changes.
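For example, a minimal sketch of that API (TF >= 2.0); the small Keras model and the input shapes here are placeholders, not the Inception graph from the question:

import tensorflow as tf

# MirroredStrategy picks up all visible (logical) GPUs by default
strategy = tf.distribute.MirroredStrategy()
print("Number of devices:", strategy.num_replicas_in_sync)

with strategy.scope():
    # Variables created inside the scope are mirrored across the GPUs
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(2048,)),
        tf.keras.layers.Dense(10),
    ])
    model.compile(optimizer="adam",
                  loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))

# model.fit(dataset) would then split each batch across the replicas automatically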
The most likely reason for the underutilization of your GPU is using a batch size that's too small.
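To illustrate the batching point against the question's TF 1.x session code: the sketch below is hypothetical, because it assumes a graph whose input tensor (called 'input_images:0' here, which is not part of classify_image_graph_def.pb) accepts a whole [batch, height, width, 3] array. The 'DecodeJpeg:0' input used in the question only takes one image per call, so the graph would have to be re-exported with a batched input for this to work. It reuses selected_list, session, and output_dir from the question's code.

import os
import numpy as np
from PIL import Image

# Hypothetical batched version of the vectorization loop: one Session.run per
# batch of 32 images instead of one run per image.
batch_size = 32
for i in range(0, len(selected_list), batch_size):
    batch_paths = selected_list[i:i + batch_size]
    batch = np.stack([np.asarray(Image.open(p).convert('RGB').resize((299, 299)))
                      for p in batch_paths])
    vectors = session.run('pool_3:0', {'input_images:0': batch})
    for path, vec in zip(batch_paths, vectors):
        out_path = os.path.join(output_dir, os.path.basename(path) + ".vc")
        np.savetxt(out_path, np.squeeze(vec), delimiter=',')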
There are two basic misunderstandings that are causing your trouble:

1. `with tf.device(...):` applies to the graph nodes created within the scope, not to `Session.run` calls.
2. `Session.run` is a blocking call. Calls don't run in parallel; TensorFlow can only parallelize the contents of a single `Session.run`.
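A tiny TF 1.x sketch of the first point, not taken from the question's code: the device scope pins the ops that are created inside it, and wrapping a later Session.run call in tf.device has no effect on where the already-built graph runs.

import tensorflow as tf

g = tf.Graph()
with g.as_default():
    with tf.device('/GPU:0'):
        a = tf.constant([1.0, 2.0])  # this op is pinned to GPU:0 at graph-construction time
        b = a * 2.0                  # so is this one

with tf.Session(graph=g) as sess:
    with tf.device('/GPU:1'):        # no effect: placement was decided when the graph was built
        print(sess.run(b))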
Modern TF (>= 2.0) can make this much easier. Mainly, you can stop using `tf.Session` and `tf.Graph`. Use `@tf.function` instead; I believe this basic structure will work:
@tf.function
def my_function(inputs, gpus, model):
    results = []
    for input, gpu in zip(inputs, gpus):
        with tf.device(gpu):
            results.append(model(input))
    return results
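Hypothetical usage of that sketch, with a stock Keras model standing in for the frozen Inception graph and random tensors standing in for real image batches (all names here are illustrative):

import tensorflow as tf

model = tf.keras.applications.InceptionV3(include_top=False, pooling='avg')
gpu_names = [g.name for g in tf.config.experimental.list_logical_devices('GPU')]

# One dummy batch of images per logical GPU, shape [batch, 299, 299, 3]
inputs = [tf.random.uniform([8, 299, 299, 3]) for _ in gpu_names]

feature_batches = my_function(inputs, gpu_names, model)
print([f.shape for f in feature_batches])  # one [8, 2048] feature batch per device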
But you will want to try a more realistic test. With just 3 images you're not at all measuring real performance.
Also note:

- The `tf.distribute.Strategy` class may help simplify some of this by separating the device specification from the `@tf.function` that's being run: `strategy.experimental_run_v2(my_function, args=(dataset_inputs,))`
- `tf.data.Dataset` input pipelines will help you overlap loading/preprocessing with model execution (see the sketch after the Session example below).

But if you're really intent on doing this using `tf.Graph` and `tf.Session`, I think you basically need to reorganize your code from this:
# Your code:
# Builds a graph
graph = build_graph()

for gpu in gpus:
    with tf.device(gpu):
        # Calls `session.run` in each device scope.
        session.run(...)
To this:
g = tf.Graph()
with g.as_default():
    results = []
    for gpu in gpus:
        # Build the graph, on each device
        input = iterator.get_next()
        with tf.device(gpu):
            results.append(my_function(input))

# Use a single `Session.run` call
np_result = session.run(results, feed_dict={inputs: my_inputs})
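And a rough TF 2.x sketch of the tf.data point mentioned above: decode and resize images on the CPU while the GPU works on the previous batch. The file list, image size, and model choice are only illustrative.

import tensorflow as tf

def load_image(path):
    img = tf.io.read_file(path)
    img = tf.image.decode_jpeg(img, channels=3)
    img = tf.image.resize(img, [299, 299])
    return tf.keras.applications.inception_v3.preprocess_input(img)

image_list = ['1.jpg', '2.jpg', '3.jpg']
dataset = (tf.data.Dataset.from_tensor_slices(image_list)
           .map(load_image, num_parallel_calls=tf.data.experimental.AUTOTUNE)
           .batch(32)
           .prefetch(tf.data.experimental.AUTOTUNE))

model = tf.keras.applications.InceptionV3(include_top=False, pooling='avg')
for batch in dataset:
    features = model(batch)  # the GPU runs this while tf.data prepares the next batch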