Implications of using MPI with TensorFlow

I come from an HPC background and am just starting to learn about machine learning in general and TensorFlow in particular. I was initially surprised to find that distributed TensorFlow communicates over TCP/IP by default, though in hindsight it makes sense given what Google is and the kind of hardware it most commonly uses.

I am interested in experimenting with running TensorFlow in parallel with MPI on a cluster. From my perspective, this should be advantageous because latency should be much lower, since MPI implementations can use Remote Direct Memory Access (RDMA) to move data between machines that do not share memory.

So my question is: why doesn't this approach seem to be more common, given the increasing popularity of TensorFlow and machine learning? Isn't latency a bottleneck? Is there some typical problem being solved that makes this sort of solution impractical? Are there likely to be any meaningful differences between calling TensorFlow functions in parallel and implementing MPI calls inside the TensorFlow library?

Thanks

Cogitator asked Sep 18 '17 15:09



2 Answers

It seems TensorFlow already supports MPI (see https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/mpi). MPI support for TensorFlow was also discussed in the paper at https://arxiv.org/abs/1603.02339

Generally speaking, keep in mind that MPI is best at sending and receiving messages, but not so great at sending notifications and acting upon events. Last but not least, MPI support for multi-threaded applications (e.g. MPI_THREAD_MULTIPLE) has not always been production-ready across MPI implementations. These are two general statements, and I honestly do not know whether they are relevant for TensorFlow.
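On the question of whether latency is the bottleneck, one rough way to reason about it is the usual latency-plus-bandwidth cost model for a single message. The sketch below is only an illustration: the link parameters, message sizes, and all names (`transfer_time`, `latency_share`, `LINKS`, `MESSAGES`) are hypothetical placeholders, not measurements of TCP, gRPC, or any MPI implementation.

```python
# Latency-plus-bandwidth cost model for one point-to-point message.
# Every number below is an assumed placeholder, not a measurement.

def transfer_time(message_bytes, latency_s, bandwidth_bytes_per_s):
    """Estimated seconds to move one message: fixed latency plus size / bandwidth."""
    return latency_s + message_bytes / bandwidth_bytes_per_s


def latency_share(message_bytes, latency_s, bandwidth_bytes_per_s):
    """Fraction of the estimated transfer time spent on latency alone."""
    total = transfer_time(message_bytes, latency_s, bandwidth_bytes_per_s)
    return latency_s / total


# Hypothetical links: same bandwidth, very different latency.
LINKS = {
    "tcp_like": {"latency_s": 50e-6, "bandwidth_bytes_per_s": 1.25e9},
    "rdma_like": {"latency_s": 2e-6, "bandwidth_bytes_per_s": 1.25e9},
}

# Hypothetical message sizes: a small control message and a large tensor.
MESSAGES = {
    "control_1KiB": 1024,
    "tensor_100MiB": 100 * 1024 * 1024,
}


def summarize():
    """Return (link, message, seconds, latency share) for every combination."""
    rows = []
    for link_name, link in LINKS.items():
        for msg_name, size in MESSAGES.items():
            rows.append(
                (link_name, msg_name,
                 transfer_time(size, **link),
                 latency_share(size, **link))
            )
    return rows


if __name__ == "__main__":
    for link_name, msg_name, seconds, share in summarize():
        print(f"{link_name:10s} {msg_name:14s} "
              f"{seconds * 1e3:10.4f} ms  latency share {share:7.2%}")
```

Under these assumed numbers, latency dominates the small message on both links but is a negligible share of the large tensor transfer, so how much a lower-latency transport helps depends heavily on the message sizes a given training job actually exchanges.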

Gilles Gouaillardet answered Nov 07 '22 15:11



According to the documentation in the TensorFlow Git repository, TensorFlow actually uses the gRPC library by default, which is based on the HTTP/2 protocol rather than on raw TCP/IP sockets (HTTP/2 itself runs over TCP). This paper should give you some insight. I hope this information is useful.

Kehe CAI answered Nov 07 '22 14:11
