Suppose I have 10 machines with 2 GPUs each and I want to run a distributed TensorFlow cluster. How many parameter servers should I allocate vs. masters?
A good heuristic is to allocate the smallest number of parameter servers so that network bandwidth does not become a bottleneck.
For instance, suppose you have 10 million parameters and each computation step takes 1 second. At 4 bytes per float32 parameter, each worker sends a 40 MB parameter update vector and receives a parameter vector of the same size every second, so each worker needs 320 Mbps of bandwidth in each direction. With 10 workers and a single parameter server, that server must sustain 3.2 Gbps in each direction.
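Here is that back-of-the-envelope arithmetic as a small Python snippet, assuming float32 parameters; the variable names are just illustrative:

```python
num_params = 10_000_000   # 10 million parameters
bytes_per_param = 4       # float32
step_time_s = 1.0         # one computation step per second
num_workers = 10

update_bytes = num_params * bytes_per_param              # 40 MB per step
per_worker_mbps = update_bytes * 8 / step_time_s / 1e6   # 320 Mbps each way
single_ps_gbps = per_worker_mbps * num_workers / 1e3     # 3.2 Gbps

print(f"Per worker: {per_worker_mbps:.0f} Mbps in each direction")
print(f"Single PS:  {single_ps_gbps:.1f} Gbps in each direction")
```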
Now suppose your network cards are capable of 1 Gbps full-duplex. Since 3.2 Gbps / 1 Gbps = 3.2, you will need at least 4 parameter server tasks to avoid saturating the Ethernet cards, leaving each one carrying about 800 Mbps.
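As a minimal sketch of how you might then describe the cluster with the TF 1.x parameter-server API (the `machineN` hostnames and ports are hypothetical; in your setup, four of the ten machines would double as parameter servers while all ten run worker tasks):

```python
import tensorflow as tf

# Hypothetical host names; replace with your own machines.
cluster = tf.train.ClusterSpec({
    "ps":     ["machine%d:2222" % i for i in range(4)],
    "worker": ["machine%d:2223" % i for i in range(10)],
})

# Each process starts one server for its role, e.g. the first PS task:
server = tf.train.Server(cluster, job_name="ps", task_index=0)
server.join()  # parameter servers block here and just serve variables
```

Worker processes would start the same way with `job_name="worker"` and their own `task_index`, then run the training loop.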