
How do I use distributed DNN training in TensorFlow?

Google released TensorFlow today.

I have been poking around in the code, and I don't see anything in the API about training across a cluster of GPU servers.

Does it have distributed training functionality yet?

asked Nov 09 '15 by solvingPuzzles

3 Answers

Updated:

  • Distributed TensorFlow Documentation

  • Distributed TensorFlow Source

The release occurred on 2/26/2016 and was announced by coauthor Derek Murray in the original issue; it uses gRPC for inter-process communication.

Previous:

Before the update above, a distributed implementation of TensorFlow had not been released yet. Support for a distributed implementation was the topic of this issue where coauthor Vijay Vasudevan wrote:

we are working on making a distributed implementation available, it's currently not in the initial release

and Jeff Dean later provided an update:

Our current internal distributed extensions are somewhat entangled with Google internal infrastructure, which is why we released the single-machine version first. The code is not yet in GitHub, because it has dependencies on other parts of the Google code base at the moment, most of which have been trimmed, but there are some remaining ones.

We realize that distributed support is really important, and it's one of the top features we're prioritizing at the moment.

answered by Cosmo Harrigan


It took us a few months, but today marks the release of the initial distributed TensorFlow runtime. This includes support for multiple machines, each with multiple GPUs, with communication provided by gRPC.

The current version includes the necessary backend components so that you can assemble a cluster manually and connect to it from a client program. More details are available in the readme.
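For reference, a minimal sketch of what assembling a cluster manually looks like with the tf.train.ClusterSpec / tf.train.Server API from that initial release (TF 0.8 / 1.x era); the host addresses, job layout, and toy model below are illustrative assumptions, not part of the original answer.

```python
# A sketch of manually assembling a cluster with the tf.train API from the
# initial distributed release (TF 0.8 / 1.x era). Host addresses, the job
# layout, and the toy variables are illustrative placeholders.
import tensorflow as tf

# Describe the cluster: one parameter-server task and two worker tasks.
cluster = tf.train.ClusterSpec({
    "ps": ["ps0.example.com:2222"],
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
})

# On each machine, start an in-process gRPC server for its own task
# (run once per process with the matching job_name / task_index).
server = tf.train.Server(cluster, job_name="worker", task_index=0)

# From a client program, pin ops to devices anywhere in the cluster.
with tf.device("/job:ps/task:0"):
    weights = tf.Variable(tf.zeros([784, 10]), name="weights")

with tf.device("/job:worker/task:0"):
    inputs = tf.placeholder(tf.float32, [None, 784])
    logits = tf.matmul(inputs, weights)

# Connect the client session to a worker over gRPC and run the graph.
with tf.Session(server.target) as sess:
    sess.run(tf.global_variables_initializer())
```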

answered by mrry


Update

As you may have noticed, TensorFlow has supported distributed DNN training for quite some time now. Please refer to its official website for details.
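For anyone landing here today, a minimal sketch of multi-worker training with the TF 2.x tf.distribute API; the Keras model, the toy data, and the single-process fallback are illustrative assumptions (a real multi-worker run also needs TF_CONFIG set on each worker).

```python
# A sketch of multi-worker training with the TF 2.x tf.distribute API.
# The Keras model and the random data are illustrative; a real multi-worker
# run also needs TF_CONFIG set appropriately on every worker process.
import numpy as np
import tensorflow as tf

# Each worker reads its role from the TF_CONFIG environment variable;
# without it, the strategy falls back to a single local worker.
strategy = tf.distribute.MultiWorkerMirroredStrategy()

with strategy.scope():
    # Variables created in this scope are replicated and kept in sync
    # across workers via collective all-reduce.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    )

# Toy in-memory data stands in for a real input pipeline.
x = np.random.rand(256, 784).astype("float32")
y = np.random.randint(0, 10, size=(256,))
model.fit(x, y, epochs=1, batch_size=32)
```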

=========================================================================

Previous

No, it doesn't support distributed training yet, which is a little disappointing. But I don't think it would be difficult to extend from a single machine to multiple machines. Compared to other open-source libraries like Caffe, TF's dataflow graph structure is better suited to cross-machine tasks.
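A rough sketch of that point, using the TF 1.x-era API with illustrative device strings: because TensorFlow builds an explicit dataflow graph, each op can be pinned to a named device, and once the distributed runtime landed those names could simply refer to tasks on other machines.

```python
# Illustrative only: device strings and shapes are placeholders.
import tensorflow as tf

with tf.device("/cpu:0"):          # later: "/job:ps/task:0"
    w = tf.Variable(tf.zeros([784, 10]))

with tf.device("/gpu:0"):          # later: "/job:worker/task:1"
    x = tf.placeholder(tf.float32, [None, 784])
    y = tf.matmul(x, w)
```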

answered by ROBOT AI