I've read Distributed Tensorflow Doc, and it mentions that in asynchronous training,
each replica of the graph has an independent training loop that executes without coordination.
From what I understand, if we use parameter-server with data parallelism architecture, it means each worker computes gradients and updates its own weights without caring about other workers updates for distributed training Neural Network. As all weights are shared on parameter server (ps), I think ps still has to coordinate (or aggregate) weight updates from all workers in some way. I wonder how does the aggregation work in asynchronous training. Or in more general words, how does asynchronous training work in distributed Tensorflow?
TensorFlow can execute operations synchronously or asynchronously. If asynchronous execution is enabled, operations may return "non-ready" handles.
distribute. Strategy is a TensorFlow API to distribute training across multiple GPUs, multiple machines, or TPUs. Using this API, you can distribute your existing models and training code with minimal code changes.
The TensorFlow runtime parallelizes graph execution across many different dimensions: The individual ops have parallel implementations, using multiple cores in a CPU, or multiple threads in a GPU.
If a TensorFlow operation has both CPU and GPU implementations, TensorFlow will automatically place the operation to run on a GPU device first. If you have more than one GPU, the GPU with the lowest ID will be selected by default. However, TensorFlow does not place operations into multiple GPUs automatically.
When you train asynchronously in Distributed TensorFlow, a particular worker does the following:
The worker reads all of the shared model parameters in parallel from the PS task(s), and copies them to the worker task. These reads are uncoordinated with any concurrent writes, and no locks are acquired: in particular the worker may see partial updates from one or more other workers (e.g. a subset of the updates from another worker may have been applied, or a subset of the elements in a variable may have been updated).
The worker computes gradients locally, based on a batch of input data and the parameter values that it read in step 1.
The worker sends the gradients for each variable to the appropriate PS task, and applies the gradients to their respective variable, using an update rule that is determined by the optimization algorithm (e.g. SGD, SGD with Momentum, Adagrad, Adam, etc.). The update rules typically use (approximately) commutative operations, so they may be applied independently on the updates from each worker, and the state of each variable will be a running aggregate of the sequence of updates received.
In asynchronous training, each update from the worker is applied concurrently, and the updates may be somewhat coordinated if the optional use_locking=True
flag was set when the respective optimizer (e.g. tf.train.GradientDescentOptimizer
) was initialized. Note however that the locking here only provides mutual exclusion for two concurrent updates, and (as noted above) reads do not acquire locks; the locking does not provide atomicity across the entire set of updates.
(By contrast, in synchronous training, a utility like tf.train.SyncReplicasOptimizer
will ensure that all of the workers read the same, up-to-date values for each model parameter; and that all of the updates for a synchronous step are aggregated before they are applied to the underlying variables. To do this, the workers are synchronized by a barrier, which they enter after sending their gradient update, and leave after the aggregated update has been applied to all variables.)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With