Short version: can't we just store the variables on one of the workers and skip the parameter servers?
Long version: I want to implement synchronous distributed training of a neural network in TensorFlow, with each worker holding a full copy of the model during training.
I've read the distributed TensorFlow tutorial and the code for distributed ImageNet training, and I still don't understand why we need parameter servers.
I see that they are used to store the values of variables, and that replica_device_setter takes care of distributing the variables evenly across the parameter servers (it probably does more than that; I wasn't able to fully understand the code).
The question is: why not use one of the workers to store the variables? Will I achieve that if I use
with tf.device('/job:worker/task:0/cpu:0'):
instead of
with tf.device(tf.train.replica_device_setter(cluster=cluster_spec)):
for the variables? If that works, is there a downside compared to the solution with parameter servers?
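To make it concrete, here is a minimal sketch of the two placements I'm comparing, using the TF 1.x API from the tutorials (the cluster addresses and variable shape are just placeholders):

import tensorflow as tf

cluster_spec = tf.train.ClusterSpec({
    "ps": ["ps0.example.com:2222"],
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
})

# Option A: pin every variable to worker 0 explicitly.
with tf.device("/job:worker/task:0/cpu:0"):
    w_a = tf.get_variable("w_a", shape=[784, 10])

# Option B: let replica_device_setter spread variables over the "ps" job.
with tf.device(tf.train.replica_device_setter(cluster=cluster_spec)):
    w_b = tf.get_variable("w_b", shape=[784, 10])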
A parameter server is a key-value store used for training machine-learning models on a cluster. The values are the parameters of the model (e.g., a neural network), and the keys index those parameters. For example, in a movie recommendation system there may be one key per user and one key per movie.
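As a rough illustration of that key-value abstraction (a toy, single-process sketch, not actual TensorFlow code), a parameter server behaves like a shared dictionary that workers pull values from and push updates to:

class ToyParameterServer:
    def __init__(self, learning_rate=0.1):
        self.params = {}      # key -> parameter value
        self.lr = learning_rate

    def pull(self, key):
        # workers read the current value before computing gradients
        return self.params.get(key, 0.0)

    def push(self, key, gradient):
        # workers send gradients back; the server applies a simple SGD step
        self.params[key] = self.pull(key) - self.lr * gradient

A real parameter server shards the key space across several machines so that no single machine has to hold or serve all of the parameters.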
Advantages: it can train large models with millions or billions of parameters (GPT-3, GPT-2, BERT, et cetera), it offers potentially low latency across the workers, and it has good TensorFlow community support.
tf.distribute.Strategy is a TensorFlow API for distributing training across multiple GPUs, multiple machines, or TPUs. Using this API, you can distribute your existing models and training code with minimal code changes.
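For example, a minimal sketch of synchronous replication with tf.distribute.MirroredStrategy (the layer sizes, optimizer, and loss below are arbitrary illustration choices):

import tensorflow as tf

# Synchronous data parallelism: every GPU keeps a full copy of the model,
# and gradients are combined with all-reduce instead of parameter servers.
strategy = tf.distribute.MirroredStrategy()

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    )

# model.fit(dataset) then runs replicated across the available GPUs.

For the multi-machine case asked about in the question, tf.distribute.MultiWorkerMirroredStrategy gives the same synchronous, all-reduce-based replication across workers without parameter servers.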
Dask is a flexible library for parallel computing in Python which makes scaling out your workflow smooth and simple.
Using a parameter server can give you better network utilization, and it lets you scale your models to more machines.
A concrete example: suppose you have 250M parameters (roughly 1 GB at 4 bytes each), it takes 1 second to compute a gradient on each worker, and there are 10 workers. Without parameter servers, each worker has to send and receive that 1 GB to and from each of the 9 other workers every second, which requires 72 Gbps of full-duplex network capacity per worker; that is not practical.
More realistically, you might have 10 Gbps of network capacity per worker. You prevent the network bottleneck by splitting the parameter server over 8 machines: each worker exchanges 1/8th of the parameters with each parameter-server machine, so its total traffic stays around 8 Gbps.
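A quick sanity check of those numbers (assuming 4-byte float32 parameters):

# Back-of-the-envelope check of the bandwidth figures above.
params = 250e6
grad_bytes = params * 4                       # ~1 GB per gradient (float32)
workers = 10

# All-to-all between workers: each worker exchanges its gradient with 9 peers.
peer_gbps = grad_bytes * (workers - 1) * 8 / 1e9
print(peer_gbps)                              # ~72 Gbps per worker

# With the parameter server split over 8 machines, each worker sends its
# gradient once, 1/8th to each shard, so per-worker traffic is ~8 Gbps.
ps_gbps = grad_bytes * 8 / 1e9
print(ps_gbps)                                # ~8 Gbps per worker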