Distributed Tensorflow: who applies the parameter update?

I've used TensorFlow but am new to distributed TensorFlow for training models. My understanding is that current best practices favor the data-parallel model with asynchronous updates:

A paper published by the Google Brain team in April 2016 benchmarked various approaches and found that data parallelism with synchronous updates using a few spare replicas was the most efficient, not only converging faster but also producing a better model. -- Chapter 12 of Hands-On Machine Learning with Scikit-Learn and TensorFlow.

Now, my confusion from reading further about this architecture is figuring out which component applies the parameter updates: the workers or the parameter server?

In my illustration below, it's clear to me that the workers compute the gradients dJ/dw (the gradient of the loss J with respect to the parameter weights w). But who applies the gradient descent update rule?

[Illustration: data-parallel setup with workers computing the gradients dJ/dw and a parameter server holding the weights w]

What's a bit confusing is that this O'Reilly article on Distributed TensorFlow states the following:

In the more centralized architecture, the devices send their output in the form of gradients to the parameter servers. These servers collect and aggregate the gradients. In synchronous training, the parameter servers compute the latest up-to-date version of the model, and send it back to devices. In asynchronous training, parameter servers send gradients to devices that locally compute the new model. In both architectures, the loop repeats until training terminates.

The above paragraph suggests that in asynchronous training:

  1. The workers compute gradients and send them to the parameter server.
  2. The parameter server broadcasts the gradients to the workers.
  3. Each worker receives the broadcasted gradients and applies the update rule.

Is my understanding correct? If it is, then that doesn't seem very asynchronous to me because the workers have to wait for the parameter server to broadcast the gradients. Any explanation would be appreciated.

asked Jul 31 '18 by stackoverflowuser2010



1 Answer

I realize this was asked in 2018, but let's give it a shot.

  1. Each worker computes gradients.
  2. When a worker is done computing gradients, it sends them to the parameter server. The parameter server then applies the gradient-descent update rule to the weights it stores.
  3. The worker then receives the new, updated parameters from the parameter server, without waiting for the other workers (see the sketch below).
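
To make the division of labor concrete, here is a minimal, framework-agnostic sketch in plain Python/NumPy (the names ParameterServer, push_gradients, pull_weights and worker_loop are made up for illustration; this is not TensorFlow's actual API). The workers only compute dJ/dw on their shard of the data; the parameter server owns the weights w and applies the gradient-descent step. Because each worker pulls and pushes on its own schedule, no worker ever waits for another.

    # Toy asynchronous parameter-server loop for linear regression.
    # Threads play the role of workers; the names are hypothetical.
    import threading
    import numpy as np

    class ParameterServer:
        """Stores the weights and applies the gradient-descent update."""
        def __init__(self, dim, lr=0.1):
            self.w = np.zeros(dim)
            self.lr = lr
            self._lock = threading.Lock()

        def push_gradients(self, grad):
            # The *server* applies the update rule: w <- w - lr * dJ/dw
            with self._lock:
                self.w -= self.lr * grad

        def pull_weights(self):
            with self._lock:
                return self.w.copy()

    def worker_loop(ps, X, y, steps):
        for _ in range(steps):
            w = ps.pull_weights()                       # fetch current parameters
            grad = 2 * X.T @ (X @ w - y) / len(y)       # worker only computes dJ/dw
            ps.push_gradients(grad)                     # server applies the update

    # Toy data: y = 3*x, two workers sharing one parameter server.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 1))
    y = 3 * X[:, 0]
    ps = ParameterServer(dim=1)
    workers = [threading.Thread(target=worker_loop,
                                args=(ps, X[i::2], y[i::2], 200))
               for i in range(2)]
    for t in workers: t.start()
    for t in workers: t.join()
    print("learned weight:", ps.pull_weights())         # should be close to 3

The lock only protects the shared NumPy array here; in a real deployment the pull and push would be RPCs to a separate parameter-server process.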

In the synchronous case, by contrast, no worker continues training until every worker has sent its gradients to the server and the aggregated update has been applied.

What this means in the asynchronous case is that each worker may be training against a slightly different (and slightly stale) version of the parameters, because it fetches the current parameters without waiting for the other workers to push their gradients to the parameter server.
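
For contrast, here is what a single synchronous step looks like under the same toy setup (again with hypothetical names, not a real API): the server buffers a gradient from every worker, averages them, and only then applies one update, which is exactly the barrier the asynchronous mode removes.

    # Sketch of one *synchronous* step: the server waits for a gradient
    # from every worker, averages them, and applies a single update
    # before anyone moves on.
    import numpy as np

    def synchronous_step(w, worker_shards, lr=0.1):
        """worker_shards: list of (X_i, y_i) tuples, one per worker."""
        grads = []
        for X_i, y_i in worker_shards:                  # every worker must report
            grads.append(2 * X_i.T @ (X_i @ w - y_i) / len(y_i))
        mean_grad = np.mean(grads, axis=0)              # aggregate on the server
        return w - lr * mean_grad                       # one update for everyone

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 1))
    y = 3 * X[:, 0]
    shards = [(X[i::2], y[i::2]) for i in range(2)]     # split data across 2 workers
    w = np.zeros(1)
    for _ in range(200):
        w = synchronous_step(w, shards)
    print("learned weight:", w)                         # should be close to 3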

answered Oct 21 '22 by bjotta