I would like to know what is considered "best practice" for multi-GPU systems when training networks with TensorFlow.
E.g., one of my networks looks like this:
                       input
                         |
                       (...) <-- convolutional layers
                         |
                     _________
  fully-connected    |       |    fully-connected
  output stream 1 -> |       | <- output stream 2
Does TensorFlow distribute work across multiple GPUs efficiently on its own? Or should I specify myself which GPU TensorFlow should use for a specific operation?
I have not benchmarked it yet; I just started some GPU experiments today. At the moment I have not specified which device to use for the convolutional layers, but I did specify it for the fully-connected layers:
# flattened information of the last convolutional layer
h_pooln_flat = tf.reshape(...)

with tf.device("/gpu:0"):
    # stream 1 stuff

with tf.device("/gpu:1"):
    # stream 2 stuff
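For context, here is roughly what I mean as a self-contained sketch (the placeholder shape, layer sizes and variable names are made up for illustration, not my actual network):

import tensorflow as tf

# stand-in for the output of the last convolutional/pooling layer
h_pooln = tf.placeholder(tf.float32, [None, 7, 7, 64])

# flattened information of the last convolutional layer
flat_size, n_out = 7 * 7 * 64, 10
h_pooln_flat = tf.reshape(h_pooln, [-1, flat_size])

with tf.device("/gpu:0"):
    # output stream 1: fully-connected layer pinned to the first GPU
    W1 = tf.Variable(tf.truncated_normal([flat_size, n_out], stddev=0.1))
    b1 = tf.Variable(tf.zeros([n_out]))
    stream1 = tf.matmul(h_pooln_flat, W1) + b1

with tf.device("/gpu:1"):
    # output stream 2: fully-connected layer pinned to the second GPU
    W2 = tf.Variable(tf.truncated_normal([flat_size, n_out], stddev=0.1))
    b2 = tf.Variable(tf.zeros([n_out]))
    stream2 = tf.matmul(h_pooln_flat, W2) + b2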
Is this a good idea? Or should I leave the resource allocation to TensorFlow?
I guess a single "stream" of convolutional layers cannot be computed in parallel anyway, so it should not matter which device handles the convolution, pooling, etc.?
Any tips to get the best performance?
Currently I am training on one node of a Slurm cluster with 2 GPUs, but potentially I could train on more nodes, so 4, 6 or even 8 GPUs. However, I suspect there would be a lot of overhead with more than 2 GPUs?
EDIT (slow multi-GPU performance): After some tests I am quite astonished: if I let TensorFlow decide what to allocate and remove the device-specific statements, the network trains considerably faster. This was really surprising to me. What could be more effective than having each output stream on its own GPU when there are two GPUs in total? Additionally, it seems (according to the output) that TensorFlow is only using one GPU?!
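(For reference, this is the kind of output I mean: one way to see which device every op actually lands on is to enable placement logging when creating the session. The snippet below is just a sketch of the relevant flags, not my full training script.)

import tensorflow as tf

# log the device each op is placed on, and fall back to an available
# device instead of failing if a manually requested one cannot be used
config = tf.ConfigProto(log_device_placement=True,
                        allow_soft_placement=True)
sess = tf.Session(config=config)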
EDIT 2 (NaN values): After some more tests I found that my manual setup of gpu:0 for stream 1 and gpu:1 for stream 2 is not only slower than letting TensorFlow decide what to use (and according to the piped script output, TensorFlow then uses just one GPU), but sometimes this manual "gpu:0 for stream 1 and gpu:1 for stream 2" solution also just generates NaN values (I do not know why), either immediately or shortly after initialization. Very weird.
Does TensorFlow need some kind of thread locking or manual copying of input data for multiple GPUs?
The logic for default device placement lives in simple_placer.cc.
I may be missing something in the logic, but from this line it seems that it will put all GPU ops on gpu:0.
You can see from the implementation that the placement strategy does not take data-transfer or computation costs into account, so manual placement is often better than automatic placement. For instance, if you have some kind of input pipeline, the default placement usually puts some data-processing ops on the GPU, which makes things slower overall.
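For example (a rough sketch, not taken from your code), pinning the input-reading ops to the CPU keeps them off the GPU so that only the actual math gets placed there:

import tensorflow as tf

with tf.device("/cpu:0"):
    # keep file reading/decoding on the CPU; the file name is hypothetical
    filename_queue = tf.train.string_input_producer(["train.tfrecords"])
    reader = tf.TFRecordReader()
    _, serialized_example = reader.read(filename_queue)
    # ... parsing and preprocessing ops would also go here ...

# the convolutional/fully-connected parts of the graph can then be
# placed on the GPU(s) as usual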
As far as your implementation being slow goes...perhaps there is a gpu:0 -> gpu:1 copy happening somewhere?
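One way to check is to dump a step trace and look for memory copies between devices. A sketch, assuming a sess and a train_op already exist in your script:

import tensorflow as tf
from tensorflow.python.client import timeline

# `sess` and `train_op` are assumed to come from your training script
run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()
sess.run(train_op, options=run_options, run_metadata=run_metadata)

# write a Chrome-trace file (open it at chrome://tracing) that shows
# per-device op timing and any gpu:0 -> gpu:1 memory copies
trace = timeline.Timeline(step_stats=run_metadata.step_stats)
with open("timeline.json", "w") as f:
    f.write(trace.generate_chrome_trace_format())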
Getting multi-GPU setups to work well is very much an open area; let us know what you find!