I've used TensorFlow, but I'm new to distributed TensorFlow for training models. My understanding is that current best practices favor the data-parallel model with asynchronous updates:
"A paper published by the Google Brain team in April 2016 benchmarked various approaches and found that data parallelism with synchronous updates using a few spare replicas was the most efficient, not only converging faster but also producing a better model." -- Chapter 12 of Hands-On Machine Learning with Scikit-Learn and TensorFlow
Now, what confuses me after reading further about this architecture is which component applies the parameter updates: the workers or the parameter server?
In my illustration below, it's clear to me that the workers compute the gradients dJ/dw (the gradient of the loss J with respect to the parameter weights w). But who applies the gradient descent update rule?
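For concreteness, the plain gradient-descent update I have in mind is w ← w − η·dJ/dw, with learning rate η; my question is which process actually performs that assignment to w.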
What's a bit confusing is that this O'Reilly article on Distributed TensorFlow states the following:
In the more centralized architecture, the devices send their output in the form of gradients to the parameter servers. These servers collect and aggregate the gradients. In synchronous training, the parameter servers compute the latest up-to-date version of the model, and send it back to devices. In asynchronous training, parameter servers send gradients to devices that locally compute the new model. In both architectures, the loop repeats until training terminates.
The above paragraph suggests that in asynchronous training:

1. the workers compute the gradients and send them to the parameter server;
2. the parameter server broadcasts the (aggregated) gradients back to the workers;
3. each worker applies the gradients locally to compute its own new copy of the model.

Is my understanding correct? If it is, then that doesn't seem very asynchronous to me, because the workers still have to wait for the parameter server to broadcast the gradients. Any explanation would be appreciated.
Parameter servers are a core part of many machine learning applications. Their role is to store the parameters of a machine learning model (e.g., the weights of a neural network) and to serve them to clients (clients are often workers that process data and compute updates to the parameters).
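As a rough illustration of that division of labor, here is a minimal toy sketch in plain Python (all names are hypothetical, no real framework involved): a parameter server object stores the weights, serves them to workers, and applies the updates the workers push back.

```python
import numpy as np

class ToyParameterServer:
    """Stores the model parameters and applies updates pushed by workers."""

    def __init__(self, num_params, learning_rate=0.1):
        self.weights = np.zeros(num_params)   # single source of truth
        self.learning_rate = learning_rate

    def get_weights(self):
        # Workers pull the current parameters before computing gradients.
        return self.weights.copy()

    def push_gradients(self, grad):
        # The server applies the update rule: w <- w - eta * dJ/dw
        self.weights -= self.learning_rate * grad


def worker_step(server, X, y):
    """One worker step: pull weights, compute the gradient of a squared
    loss for a linear model on this worker's batch, push it back."""
    w = server.get_weights()
    error = X @ w - y
    grad = X.T @ error / len(y)               # dJ/dw for a mean-squared-error loss
    server.push_gradients(grad)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    true_w = np.array([2.0, -3.0])
    ps = ToyParameterServer(num_params=2)
    for _ in range(200):
        X = rng.normal(size=(32, 2))          # each iteration plays the role of one worker batch
        y = X @ true_w
        worker_step(ps, X, y)
    print(ps.weights)                         # approaches [2, -3]
```

Note that in this sketch the server applies the update rule; as the question points out, some asynchronous designs instead send the gradients back so that each worker applies the update to its local copy.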
MirroredStrategy supports synchronous distributed training on multiple GPUs on one machine. It creates one replica per GPU device. Each variable in the model is mirrored across all the replicas. Together, these variables form a single conceptual variable called MirroredVariable.
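For example, a minimal Keras setup with MirroredStrategy looks roughly like the following (the model here is just a placeholder; on a machine without GPUs the strategy simply creates a single replica):

```python
import tensorflow as tf

# Synchronous data parallelism on all local GPUs: each GPU holds a full
# replica of the model, and gradients are aggregated across replicas.
strategy = tf.distribute.MirroredStrategy()
print("Number of replicas:", strategy.num_replicas_in_sync)

with strategy.scope():
    # Variables created here become MirroredVariables (one copy per replica).
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(10,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="sgd", loss="mse")

# model.fit(...) now splits each batch across the replicas and applies the
# same synchronously aggregated update to every copy of the variables.
```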
Parameter server training is a common data-parallel method to scale up model training on multiple machines. A parameter server training cluster consists of workers and parameter servers. Variables are created on parameter servers and they are read and updated by workers in each step.
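In TensorFlow 2 this corresponds to tf.distribute.experimental.ParameterServerStrategy. The sketch below shows only the coordinator ("chief") side, and the host names are placeholders; each worker and ps task would run its own tf.distribute.Server process pointed at the same cluster spec.

```python
import tensorflow as tf

# Placeholder addresses: in a real deployment each task runs as a separate process.
cluster_spec = tf.train.ClusterSpec({
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
    "ps": ["ps0.example.com:2222"],
})
resolver = tf.distribute.cluster_resolver.SimpleClusterResolver(
    cluster_spec, task_type="chief", task_id=0)

# Variables created under this strategy are placed on the "ps" tasks; the
# "worker" tasks read them, compute gradients on their data shards, and
# send the updates back.
strategy = tf.distribute.experimental.ParameterServerStrategy(resolver)

with strategy.scope():
    dense = tf.keras.layers.Dense(1)
    dense.build(input_shape=(None, 10))   # kernel and bias now live on the parameter server

# The coordinator dispatches training steps to the workers asynchronously.
coordinator = tf.distribute.experimental.coordinator.ClusterCoordinator(strategy)
```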
In distributed training, the workload of training a model is split up and shared among multiple worker nodes, which work in parallel to speed up model training.
I realize this was asked in 2018, but let's give it a shot.
In the synchronous case, no worker continues training until every worker has sent its update to the parameter server.
What this means in the asynchronous case is that each worker can be working with slightly different gradients, because it fetches them from the parameter server without waiting for every other worker to push its update first.
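To make that concrete, here is a tiny self-contained sketch (plain Python threads standing in for workers, a shared NumPy array standing in for the parameter server; everything here is hypothetical and not TensorFlow's actual implementation). Each worker pulls the current weights, computes its gradient on its own batch, and pushes an update without waiting for the other workers, so the weights it read may already be stale by the time its update lands:

```python
import threading
import numpy as np

weights = np.zeros(2)            # parameters held by the "parameter server"
lock = threading.Lock()          # protects reads/writes of the shared weights
true_w = np.array([2.0, -3.0])   # target the workers are trying to learn

def async_worker(seed, steps=500, lr=0.05):
    """One asynchronous worker: pull, compute the gradient locally, push."""
    global weights
    rng = np.random.default_rng(seed)
    for _ in range(steps):
        X = rng.normal(size=(16, 2))
        y = X @ true_w
        with lock:
            w = weights.copy()               # pull (possibly stale) parameters
        grad = X.T @ (X @ w - y) / len(y)    # dJ/dw for a linear least-squares loss
        with lock:
            weights -= lr * grad             # push update; no waiting for other workers
        # A synchronous implementation would instead block here until every
        # worker's gradient for this step had been collected and aggregated.

threads = [threading.Thread(target=async_worker, args=(s,)) for s in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(weights)   # converges near [2, -3] even though some gradients were stale
```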