I have created 3 virtual GPUs (I have 1 physical GPU) and I am trying to speed up vectorization of images. However, using the code provided below with manual device placement from the official docs (here), I get strange results: training on all GPUs is two times slower than on a single one. I also checked this code (with the virtual device initialization removed) on a machine with 3 physical GPUs; it behaves the same way.
Environment: Python 3.6, Ubuntu 18.04.3, tensorflow-gpu 1.14.0.
Code (this example creates 3 virtual devices, so you can test it on a PC with a single GPU):
import os
import time
import numpy as np
import tensorflow as tf
from PIL import Image
start = time.time()
def load_graph(frozen_graph_filename):
    # We load the protobuf file from the disk and parse it to retrieve the
    # unserialized graph_def
    with tf.gfile.GFile(frozen_graph_filename, "rb") as f:
        graph_def = tf.GraphDef()
        graph_def.ParseFromString(f.read())
    # Then, we import the graph_def into a new Graph and returns it
    with tf.Graph().as_default() as graph:
        # The name var will prefix every op/nodes in your graph
        # Since we load everything in a new graph, this is not needed
        tf.import_graph_def(graph_def, name="")
    return graph
path_to_graph = '/imagenet/' # Path to imagenet folder where graph file is placed
GRAPH = load_graph(os.path.join(path_to_graph, 'classify_image_graph_def.pb'))
# Create Session
config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.9
config.gpu_options.allow_growth = True
session = tf.Session(graph=GRAPH, config=config)
output_dir = '/vectors/'  # where to save the vectors extracted from the images
image_list = ['1.jpg', '2.jpg', '3.jpg']  # list of images to vectorize (tested on 100 and 1000 examples)
selected_list = image_list
# Single GPU vectorization
for image_index, image in enumerate(selected_list):
    with Image.open(image) as f:
        image_data = f.convert('RGB')
        feature_tensor = session.graph.get_tensor_by_name('pool_3:0')
        feature_vector = session.run(feature_tensor, {'DecodeJpeg:0': image_data})
        feature_vector = np.squeeze(feature_vector)
        outfile_name = os.path.basename(image) + ".vc"
        out_path = os.path.join(output_dir, outfile_name)
        # Save vector
        np.savetxt(out_path, feature_vector, delimiter=',')
print(f"Single GPU: {time.time() - start}")
start = time.time()
print("Start calculation on multiple GPU")
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    # Create 3 virtual GPUs with 1GB memory each
    try:
        tf.config.experimental.set_virtual_device_configuration(
            gpus[0],
            [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=1024),
             tf.config.experimental.VirtualDeviceConfiguration(memory_limit=1024),
             tf.config.experimental.VirtualDeviceConfiguration(memory_limit=1024)])
        logical_gpus = tf.config.experimental.list_logical_devices('GPU')
        print(len(gpus), "Physical GPU,", len(logical_gpus), "Logical GPUs")
    except RuntimeError as e:
        # Virtual devices must be set before GPUs have been initialized
        print(e)
print("Create prepared ops")
start1 = time.time()
gpus = logical_gpus # comment this line to use physical GPU devices for calculations
# Assign a chunk of the image list to each GPU
# image_list1, image_list2, image_list3 = image_list[:len(image_list) // 3], \
#                                         image_list[len(image_list) // 3:2 * len(image_list) // 3], \
#                                         image_list[2 * len(image_list) // 3:]
selected_list = image_list  # comment this line if you want to assign a chunk of the list to each GPU manually
output_vectors = []
if gpus:
    # Replicate the computation across the multiple GPUs
    feature_vectors = []
    for gpu in gpus:  # iterating over the virtual GPU devices, not the physical ones
        with tf.device(gpu.name):
            print(f"Assign list of images to {gpu.name.split(':', 4)[-1]}")
            # Try to assign a chunk of the image list to each GPU - takes the same time as a single GPU
            # if gpu.name.split(':', 4)[-1] == "GPU:0":
            #     selected_list = image_list1
            # if gpu.name.split(':', 4)[-1] == "GPU:1":
            #     selected_list = image_list2
            # if gpu.name.split(':', 4)[-1] == "GPU:2":
            #     selected_list = image_list3
            for image_index, image in enumerate(selected_list):
                with Image.open(image) as f:
                    image_data = f.convert('RGB')
                    feature_tensor = session.graph.get_tensor_by_name('pool_3:0')
                    feature_vector = session.run(feature_tensor, {'DecodeJpeg:0': image_data})
                    feature_vectors.append(feature_vector)
print("All images has been assigned to GPU's")
print(f"Time spend on prep ops: {time.time() - start1}")
print("Start calculation on multiple GPU")
start1 = time.time()
for image_index, image in enumerate(image_list):
    feature_vector = np.squeeze(feature_vectors[image_index])
    outfile_name = os.path.basename(image) + ".vc"
    out_path = os.path.join(output_dir, outfile_name)
    # Save vector
    np.savetxt(out_path, feature_vector, delimiter=',')
# Close session
session.close()
print(f"Calc on GPU's spend: {time.time() - start1}")
print(f"All time, spend on multiple GPU: {time.time() - start}")
Example output (from a list with 100 images):
1 Physical GPU, 3 Logical GPUs
Single GPU: 18.76301646232605
Start calculation on multiple GPU
Create prepared ops
Assign list of images to GPU:0
Assign list of images to GPU:1
Assign list of images to GPU:2
All images has been assigned to GPU's
Time spend on prep ops: 18.263537883758545
Start calculation on multiple GPU
Calc on GPU's spend: 11.697082042694092
All time, spend on multiple GPU: 29.960679531097412
What I tried: splitting the list of images into 3 chunks and assigning each chunk to a GPU (see the commented lines of code). This reduces the multi-GPU time to 17 seconds, which is only slightly faster than the single-GPU run of 18 seconds (~5%).
Expected result: the multi-GPU version is faster than the single-GPU version (at least a 1.5x speedup).
My guess as to why this happens: I wrote the calculation in a wrong way.
tf.distribute.Strategy is a TensorFlow API to distribute training across multiple GPUs, multiple machines, or TPUs. Using this API, you can distribute your existing models and training code with minimal code changes.
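For example, a minimal sketch of that API (TF >= 2.0); the small Keras model and the input shapes here are placeholders, not the Inception graph from the question:

import tensorflow as tf

# MirroredStrategy picks up all visible (logical) GPUs by default
strategy = tf.distribute.MirroredStrategy()
print("Number of devices:", strategy.num_replicas_in_sync)

with strategy.scope():
    # Variables created inside the scope are mirrored across the GPUs
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(2048,)),
        tf.keras.layers.Dense(10),
    ])
    model.compile(optimizer="adam",
                  loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))

# model.fit(dataset) would then split each batch across the replicas automatically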
The most likely reason for the underutilization of your GPU is using a batch size that's too small.
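To illustrate the batching point against the question's TF 1.x session code: the sketch below is hypothetical, because it assumes a graph whose input tensor (called 'input_images:0' here, which is not part of classify_image_graph_def.pb) accepts a whole [batch, height, width, 3] array. The 'DecodeJpeg:0' input used in the question only takes one image per call, so the graph would have to be re-exported with a batched input for this to work. It reuses selected_list, session, and output_dir from the question's code.

import os
import numpy as np
from PIL import Image

# Hypothetical batched version of the vectorization loop: one Session.run per
# batch of 32 images instead of one run per image.
batch_size = 32
for i in range(0, len(selected_list), batch_size):
    batch_paths = selected_list[i:i + batch_size]
    batch = np.stack([np.asarray(Image.open(p).convert('RGB').resize((299, 299)))
                      for p in batch_paths])
    vectors = session.run('pool_3:0', {'input_images:0': batch})
    for path, vec in zip(batch_paths, vectors):
        out_path = os.path.join(output_dir, os.path.basename(path) + ".vc")
        np.savetxt(out_path, np.squeeze(vec), delimiter=',')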
There are two basic misunderstandings that are causing your trouble:

1. `with tf.device(...):` applies to the graph nodes created within the scope, not to `Session.run` calls.
2. `Session.run` is a blocking call. Calls don't run in parallel; TensorFlow can only parallelize the contents of a single `Session.run`.
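A tiny TF 1.x sketch of the first point, not taken from the question's code: the device scope pins the ops that are created inside it, and wrapping a later Session.run call in tf.device has no effect on where the already-built graph runs.

import tensorflow as tf

g = tf.Graph()
with g.as_default():
    with tf.device('/GPU:0'):
        a = tf.constant([1.0, 2.0])  # this op is pinned to GPU:0 at graph-construction time
        b = a * 2.0                  # so is this one

with tf.Session(graph=g) as sess:
    with tf.device('/GPU:1'):        # no effect: placement was decided when the graph was built
        print(sess.run(b))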
Modern TF (>= 2.0) can make this much easier. Mainly, you can stop using `tf.Session` and `tf.Graph`. Use `@tf.function` instead; I believe this basic structure will work:
@tf.function
def my_function(inputs, gpus, model):
    results = []
    for input, gpu in zip(inputs, gpus):
        with tf.device(gpu):
            results.append(model(input))
    return results
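Hypothetical usage of that sketch, with a stock Keras model standing in for the frozen Inception graph and random tensors standing in for real image batches (all names here are illustrative):

import tensorflow as tf

model = tf.keras.applications.InceptionV3(include_top=False, pooling='avg')
gpu_names = [g.name for g in tf.config.experimental.list_logical_devices('GPU')]

# One dummy batch of images per logical GPU, shape [batch, 299, 299, 3]
inputs = [tf.random.uniform([8, 299, 299, 3]) for _ in gpu_names]

feature_batches = my_function(inputs, gpu_names, model)
print([f.shape for f in feature_batches])  # one [8, 2048] feature batch per device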
But you will want to try a more realistic test. With just 3 images you're not at all measuring real performance.
Also note:

- The `tf.distribute.Strategy` class may help simplify some of this by separating the device specification from the `@tf.function` that's being run: `strategy.experimental_run_v2(my_function, args=(dataset_inputs,))`
- `tf.data.Dataset` input pipelines will help you overlap loading/preprocessing with model execution (see the sketch after the Session example below).

But if you're really intent on doing this using `tf.Graph` and `tf.Session`, I think you basically need to reorganize your code from this:
# Your code:
# Builds a graph
graph = build_graph()

for gpu in gpus:
    with tf.device(gpu):
        # Calls `session.run` in each device scope.
        session.run(...)
To this:
g = tf.Graph()
with g.as_default():
    results = []
    for gpu in gpus:
        # Build the graph, on each device
        input = iterator.get_next()
        with tf.device(gpu):
            results.append(my_function(input))

# Use a single `Session.run` call
np_result = session.run(results, feed_dict={inputs: my_inputs})
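And a rough TF 2.x sketch of the tf.data point mentioned above: decode and resize images on the CPU while the GPU works on the previous batch. The file list, image size, and model choice are only illustrative.

import tensorflow as tf

def load_image(path):
    img = tf.io.read_file(path)
    img = tf.image.decode_jpeg(img, channels=3)
    img = tf.image.resize(img, [299, 299])
    return tf.keras.applications.inception_v3.preprocess_input(img)

image_list = ['1.jpg', '2.jpg', '3.jpg']
dataset = (tf.data.Dataset.from_tensor_slices(image_list)
           .map(load_image, num_parallel_calls=tf.data.experimental.AUTOTUNE)
           .batch(32)
           .prefetch(tf.data.experimental.AUTOTUNE))

model = tf.keras.applications.InceptionV3(include_top=False, pooling='avg')
for batch in dataset:
    features = model(batch)  # the GPU runs this while tf.data prepares the next batch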