I've used TensorFlow, but I'm new to distributed TensorFlow for training models. My understanding is that current best practices favor the data-parallel model with asynchronous updates:
"A paper published by the Google Brain team in April 2016 benchmarked various approaches and found that data parallelism with synchronous updates using a few spare replicas was the most efficient, not only converging faster but also producing a better model." -- Chapter 12 of Hands-On Machine Learning with Scikit-Learn and TensorFlow
Now, what confuses me after reading further about this architecture is which component applies the parameter updates: the workers or the parameter server?
In my illustration below, it's clear to me that the workers compute the gradients dJ/dw (the gradient of the loss J with respect to the parameter weights w). But who applies the gradient descent update rule?
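For concreteness, the plain gradient-descent update I have in mind is w ← w − η·dJ/dw, with learning rate η; my question is which process actually performs that assignment to w.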
What's a bit confusing is that this O'Reilly article on Distributed TensorFlow states the following:
In the more centralized architecture, the devices send their output in the form of gradients to the parameter servers. These servers collect and aggregate the gradients. In synchronous training, the parameter servers compute the latest up-to-date version of the model, and send it back to devices. In asynchronous training, parameter servers send gradients to devices that locally compute the new model. In both architectures, the loop repeats until training terminates.
The above paragraph suggests that in asynchronous training:

1. the workers compute the gradients and send them to the parameter server;
2. the parameter server broadcasts the (aggregated) gradients back to the workers;
3. each worker applies the gradients locally to compute its own new copy of the model.

Is my understanding correct? If it is, then that doesn't seem very asynchronous to me, because the workers still have to wait for the parameter server to broadcast the gradients. Any explanation would be appreciated.
Parameter servers are a core part of many machine learning applications. Their role is to store the parameters of a machine learning model (e.g., the weights of a neural network) and to serve them to clients (clients are often workers that process data and compute updates to the parameters).
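As a rough illustration of that division of labor, here is a minimal toy sketch in plain Python (all names are hypothetical, no real framework involved): a parameter server object stores the weights, serves them to workers, and applies the updates the workers push back.

```python
import numpy as np

class ToyParameterServer:
    """Stores the model parameters and applies updates pushed by workers."""

    def __init__(self, num_params, learning_rate=0.1):
        self.weights = np.zeros(num_params)   # single source of truth
        self.learning_rate = learning_rate

    def get_weights(self):
        # Workers pull the current parameters before computing gradients.
        return self.weights.copy()

    def push_gradients(self, grad):
        # The server applies the update rule: w <- w - eta * dJ/dw
        self.weights -= self.learning_rate * grad


def worker_step(server, X, y):
    """One worker step: pull weights, compute the gradient of a squared
    loss for a linear model on this worker's batch, push it back."""
    w = server.get_weights()
    error = X @ w - y
    grad = X.T @ error / len(y)               # dJ/dw for a mean-squared-error loss
    server.push_gradients(grad)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    true_w = np.array([2.0, -3.0])
    ps = ToyParameterServer(num_params=2)
    for _ in range(200):
        X = rng.normal(size=(32, 2))          # each iteration plays the role of one worker batch
        y = X @ true_w
        worker_step(ps, X, y)
    print(ps.weights)                         # approaches [2, -3]
```

Note that in this sketch the server applies the update rule; as the question points out, some asynchronous designs instead send the gradients back so that each worker applies the update to its local copy.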
MirroredStrategy supports synchronous distributed training on multiple GPUs on one machine. It creates one replica per GPU device. Each variable in the model is mirrored across all the replicas. Together, these variables form a single conceptual variable called MirroredVariable.
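For example, a minimal Keras setup with MirroredStrategy looks roughly like the following (the model here is just a placeholder; on a machine without GPUs the strategy simply creates a single replica):

```python
import tensorflow as tf

# Synchronous data parallelism on all local GPUs: each GPU holds a full
# replica of the model, and gradients are aggregated across replicas.
strategy = tf.distribute.MirroredStrategy()
print("Number of replicas:", strategy.num_replicas_in_sync)

with strategy.scope():
    # Variables created here become MirroredVariables (one copy per replica).
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(10,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="sgd", loss="mse")

# model.fit(...) now splits each batch across the replicas and applies the
# same synchronously aggregated update to every copy of the variables.
```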
Parameter server training is a common data-parallel method to scale up model training on multiple machines. A parameter server training cluster consists of workers and parameter servers. Variables are created on parameter servers and they are read and updated by workers in each step.
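In TensorFlow 2 this corresponds to tf.distribute.experimental.ParameterServerStrategy. The sketch below shows only the coordinator ("chief") side, and the host names are placeholders; each worker and ps task would run its own tf.distribute.Server process pointed at the same cluster spec.

```python
import tensorflow as tf

# Placeholder addresses: in a real deployment each task runs as a separate process.
cluster_spec = tf.train.ClusterSpec({
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
    "ps": ["ps0.example.com:2222"],
})
resolver = tf.distribute.cluster_resolver.SimpleClusterResolver(
    cluster_spec, task_type="chief", task_id=0)

# Variables created under this strategy are placed on the "ps" tasks; the
# "worker" tasks read them, compute gradients on their data shards, and
# send the updates back.
strategy = tf.distribute.experimental.ParameterServerStrategy(resolver)

with strategy.scope():
    dense = tf.keras.layers.Dense(1)
    dense.build(input_shape=(None, 10))   # kernel and bias now live on the parameter server

# The coordinator dispatches training steps to the workers asynchronously.
coordinator = tf.distribute.experimental.coordinator.ClusterCoordinator(strategy)
```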
In distributed training, the workload of training a model is split up and shared among multiple worker nodes, which work in parallel to speed up model training.
I realize this was asked in 2018, but let's give it a shot.
In the synchronous case, no worker continues training until every worker has sent its update to the parameter server.
What this means in the asynchronous case is that each worker can be working with slightly different gradients, because it fetches them from the parameter server without waiting for every other worker to push its update first.
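To make that concrete, here is a tiny self-contained sketch (plain Python threads standing in for workers, a shared NumPy array standing in for the parameter server; everything here is hypothetical and not TensorFlow's actual implementation). Each worker pulls the current weights, computes its gradient on its own batch, and pushes an update without waiting for the other workers, so the weights it read may already be stale by the time its update lands:

```python
import threading
import numpy as np

weights = np.zeros(2)            # parameters held by the "parameter server"
lock = threading.Lock()          # protects reads/writes of the shared weights
true_w = np.array([2.0, -3.0])   # target the workers are trying to learn

def async_worker(seed, steps=500, lr=0.05):
    """One asynchronous worker: pull, compute the gradient locally, push."""
    global weights
    rng = np.random.default_rng(seed)
    for _ in range(steps):
        X = rng.normal(size=(16, 2))
        y = X @ true_w
        with lock:
            w = weights.copy()               # pull (possibly stale) parameters
        grad = X.T @ (X @ w - y) / len(y)    # dJ/dw for a linear least-squares loss
        with lock:
            weights -= lr * grad             # push update; no waiting for other workers
        # A synchronous implementation would instead block here until every
        # worker's gradient for this step had been collected and aggregated.

threads = [threading.Thread(target=async_worker, args=(s,)) for s in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(weights)   # converges near [2, -3] even though some gradients were stale
```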