what's a good ratio of parameter servers to masters in distributed tensorflow?

Suppose I have 10 machines with 2 GPU each and I want to run a distributed TensorFlow cluster. How many parameter servers should I allocate VS masters?

Fra asked Sep 06 '17 05:09

1 Answer

A good heuristic is to allocate the smallest number of parameter servers so that network bandwidth does not become a bottleneck.

For instance, suppose you have 10 million parameters (40 MB at 4 bytes per float32 parameter) and each computation step takes 1 second. Each second a worker then sends a 40 MB update vector and receives a 40 MB parameter vector, so each worker needs 320 Mbps of duplex bandwidth. With 10 workers, a single parameter server would require 3.2 Gbps of bandwidth.

Now suppose your network cards are capable of 1 Gbps full-duplex. To avoid saturating the Ethernet cards, you will need at least 4 parameter server tasks (3.2 Gbps / 1 Gbps, rounded up).
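The heuristic above can be sketched as a small calculation. This is a hypothetical helper, not part of TensorFlow; it just applies the reasoning in the answer (per-worker traffic from parameter count and step time, then the smallest PS pool that keeps each PS NIC under capacity):

```python
import math

def min_parameter_servers(num_params, bytes_per_param, step_time_s,
                          num_workers, nic_gbps):
    """Smallest PS count that keeps per-PS traffic within one NIC's capacity.

    Assumes each worker sends one full update vector and receives one full
    parameter vector per step, and that traffic spreads evenly across PS tasks.
    """
    # Per-worker traffic in each direction, in Gbps.
    per_worker_gbps = (num_params * bytes_per_param * 8) / step_time_s / 1e9
    total_gbps = per_worker_gbps * num_workers
    return math.ceil(total_gbps / nic_gbps)

# The example from the answer: 10M float32 params, 1 s steps,
# 10 workers, 1 Gbps NICs.
print(min_parameter_servers(10_000_000, 4, 1.0, 10, 1.0))  # → 4
```

With the numbers from the answer this reproduces the conclusion: 320 Mbps per worker, 3.2 Gbps in aggregate, hence 4 parameter servers on 1 Gbps NICs.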

Yaroslav Bulatov answered Nov 05 '22 21:11