
TensorFlow: difference between multi-GPU and distributed TensorFlow

I am a little confused about these two concepts.

I have seen some examples of multi-GPU training that don't use clusters or servers in the code.

Are these two different? What is the difference?

Thanks a lot!

asked Jun 09 '16 by xyd


1 Answer

It depends a little on the perspective from which you look at it. In any multi-* setup, either multi-GPU or multi-machine, you need to decide how to split up your computation across the parallel resources. In a single-node, multi-GPU setup, there are two very reasonable choices:

(1) Intra-model parallelism. If a model has long, independent computation paths, then you can split the model across multiple GPUs and have each compute a part of it. This requires careful understanding of the model and the computational dependencies.

(2) Replicated training. Start up multiple copies of the model, train them, and then synchronize their learning (the gradients applied to their weights & biases).
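As a concrete illustration of option (1), intra-model parallelism, here is a minimal sketch that splits a small two-layer network across two GPUs with tf.device. It assumes TensorFlow 1.x with two visible GPUs; the layer sizes and variable names are made up for illustration.

import tensorflow as tf

x = tf.placeholder(tf.float32, shape=[None, 784])

# First half of the network runs on GPU 0 ...
with tf.device('/gpu:0'):
    w1 = tf.Variable(tf.truncated_normal([784, 256], stddev=0.1))
    h1 = tf.nn.relu(tf.matmul(x, w1))

# ... and the second half runs on GPU 1, consuming GPU 0's output.
with tf.device('/gpu:1'):
    w2 = tf.Variable(tf.truncated_normal([256, 10], stddev=0.1))
    logits = tf.matmul(h1, w2)

# allow_soft_placement lets TensorFlow fall back to the CPU if a GPU is missing.
with tf.Session(config=tf.ConfigProto(allow_soft_placement=True)) as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(logits, feed_dict={x: [[0.0] * 784]}))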

Our released Inception model has some good diagrams in the readme that show how both multi-GPU and distributed training work.

But to tl;dr that source: In a multi-GPU setup, it's often best to synchronously update the model by storing the weights on the CPU (well, in its attached DRAM). But in a multi-machine setup, we often use a separate "parameter server" that stores and propagates the weight updates. To scale that to a lot of replicas, you can shard the parameters across multiple parameter servers.
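Here is a minimal sketch of option (2) in its single-machine, synchronous form as just described: the shared weights live on the CPU, each GPU "tower" computes gradients on its own slice of the batch, and the averaged gradients are applied once per step. It assumes TensorFlow 1.x; the model (a single linear layer with a toy loss) is made up for illustration.

import tensorflow as tf

N_GPUS = 2
opt = tf.train.GradientDescentOptimizer(0.01)

# Keep the shared weights on the CPU so every GPU tower reads the same copy.
with tf.device('/cpu:0'):
    w = tf.get_variable('w', [784, 10])

tower_grads = []
for i in range(N_GPUS):
    with tf.device('/gpu:%d' % i):
        # Each tower gets its own shard of the input batch but shares w.
        x = tf.placeholder(tf.float32, [None, 784], name='x_%d' % i)
        loss = tf.reduce_mean(tf.square(tf.matmul(x, w)))
        tower_grads.append(opt.compute_gradients(loss, var_list=[w]))

# Synchronous update: average the per-tower gradients, then apply them once.
avg_grads = []
for grad_and_vars in zip(*tower_grads):
    grads = tf.stack([g for g, _ in grad_and_vars])
    avg_grads.append((tf.reduce_mean(grads, axis=0), grad_and_vars[0][1]))
train_op = opt.apply_gradients(avg_grads)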

With multiple GPUs and parameter servers, you'll find yourself being more careful about device placement, using constructs such as with tf.device('/gpu:1'), or placing weights on the parameter servers using tf.train.replica_device_setter to assign them to /job:ps or /job:worker.
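For the distributed case, a minimal sketch of what that placement looks like (again TensorFlow 1.x; the host:port addresses and shapes are made up for illustration):

import tensorflow as tf

cluster = tf.train.ClusterSpec({
    'ps':     ['ps0.example.com:2222'],
    'worker': ['worker0.example.com:2222', 'worker1.example.com:2222'],
})

# replica_device_setter assigns variables to /job:ps (round-robin if there are
# several parameter servers) and keeps the other ops on the given worker.
with tf.device(tf.train.replica_device_setter(
        worker_device='/job:worker/task:0', cluster=cluster)):
    w = tf.get_variable('w', [784, 10])        # lives on the parameter server
    x = tf.placeholder(tf.float32, [None, 784])
    logits = tf.matmul(x, w)                   # runs on /job:worker/task:0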

In general, training on a bunch of GPUs in a single machine is much more efficient -- it takes more than 16 distributed GPUs to equal the performance of 8 GPUs in a single machine -- but distributed training lets you scale to even larger numbers and harness more CPUs.

answered Jan 01 '23 by dga