I got confused about two concepts, in-graph replication and between-graph replication, when reading the Replicated training section of TensorFlow's official How-to.
That section says:
In-graph replication. In this approach, the client builds a single tf.Graph that contains one set of parameters (in tf.Variable nodes pinned to /job:ps); ...
Does this mean there are multiple tf.Graphs in the between-graph replication approach? If yes, where is the corresponding code in the provided examples?
While there is already a between-graph replication example at the above link, could anyone provide an in-graph replication implementation (pseudocode is fine) and highlight its main differences from between-graph replication?
Thanks in advance!
Thanks a lot for your detailed explanations and gist code, @mrry and @YaroslavBulatov! After reading your responses, I have the following two questions:
There is the following statement in Replicated training:
Between-graph replication. In this approach, there is a separate client for each /job:worker task, typically in the same process as the worker task. Each client builds a similar graph containing the parameters (pinned to /job:ps as before using tf.train.replica_device_setter() to map them deterministically to the same tasks); and a single copy of the compute-intensive part of the model, pinned to the local task in /job:worker.
I have two sub-questions about the phrases "similar graph" and "single copy" in the passage above.
(A) Why do we say each client builds a similar graph, but not the same graph?
I would expect the graph built by each client in the Replicated training example to be the same, because the graph construction code below is shared by all workers:
# Build model...
loss = ...
global_step = tf.Variable(0)
(B) Shouldn't it be multiple copies of the compute-intensive part of the model, since we have multiple workers?
Does the example in Replicated training support training on multiple machines, each of which has multiple GPUs? If not, can we use in-graph replication for training on multiple GPUs within each machine and, at the same time, between-graph replication for cross-machine training? I ask because @mrry indicated that in-graph replication is essentially the same as the approach used in the CIFAR-10 example model for multiple GPUs.
First of all, for some historical context, "in-graph replication" is the first approach that we tried in TensorFlow, and it did not achieve the performance that many users required, so the more complicated "between-graph" approach is the current recommended way to perform distributed training. Higher-level libraries such as tf.learn use the "between-graph" approach for distributed training.
To answer your specific questions:
Does this mean there are multiple tf.Graphs in the between-graph replication approach? If yes, where is the corresponding code in the provided examples?
Yes. The typical between-graph replication setup will use a separate TensorFlow process for each worker replica, and each of these will build a separate tf.Graph for the model. Usually each process uses the global default graph (accessible through tf.get_default_graph()) and it is not created explicitly.
(In principle, you could use a single TensorFlow process with the same tf.Graph and multiple tf.Session objects that share the same underlying graph, as long as you configured the tf.ConfigProto.device_filters option for each session differently, but this is an uncommon setup.)
While there is already a between-graph replication example in above link, could anyone provide an in-graph replication implementation (pseudocode is fine) and highlight its main differences from between-graph replication?
For historical reasons, there are not many examples of in-graph replication (Yaroslav's gist is one exception). A program using in-graph replication will typically include a loop that creates the same graph structure for each worker (e.g. the loop on line 74 of the gist), and use variable sharing between the workers.
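To make the shape of such a program concrete, here is a minimal plain-Python sketch of the placement plan that an in-graph client might produce. It does not call TensorFlow; the function name, node names and parameter names are hypothetical, and the only thing it is meant to show is the loop that adds one replica per worker to a single graph that holds one shared set of parameters.

```python
# Toy model only: plain Python, no TensorFlow calls.
# All names here (build_in_graph_plan, "weights", "replica_0/loss", ...)
# are hypothetical and chosen for illustration.

def build_in_graph_plan(num_workers, param_names, num_ps=1):
    """Return a single 'graph' as a list of (node_name, device) pairs."""
    graph = []

    # One shared set of parameters, pinned to the parameter server tasks.
    for i, name in enumerate(param_names):
        graph.append((name, "/job:ps/task:%d" % (i % num_ps)))

    # The loop that characterises in-graph replication: the same compute
    # structure is created once per worker, all inside the same graph.
    for task in range(num_workers):
        graph.append(("replica_%d/loss" % task, "/job:worker/task:%d" % task))

    return graph


plan = build_in_graph_plan(num_workers=3, param_names=["weights", "biases"])
for node, device in plan:
    print(node, "->", device)
```

In a real program the body of the loop would build the model under a device scope for each worker, and every replica would read and update the same variables.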
The one place where in-graph replication persists is for using multiple devices in a single process (e.g. multiple GPUs). The CIFAR-10 example model for multiple GPUs is an example of this pattern (see the loop over GPU devices here).
(In my opinion, the inconsistency between how multiple workers and multiple devices in a single worker are treated is unfortunate. In-graph replication is simpler to understand than between-graph replication, because it doesn't rely on implicit sharing between the replicas. Higher-level libraries, such as tf.learn and TF-Slim, hide some of these issues, and offer hope that we can offer a better replication scheme in the future.)
Why do we say each client builds a similar graph, but not the same graph?
Because they aren't required to be identical (and there is no integrity check that enforces this). In particular, each worker might create a graph with different explicit device assignments ("/job:worker/task:0", "/job:worker/task:1", etc.). The chief worker might create additional operations that are not created on (or used by) the non-chief workers. However, in most cases, the graphs are logically (i.e. modulo device assignments) the same.
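The idea of "the same modulo device assignments" can be illustrated with a small plain-Python toy, which again does not call TensorFlow. It assumes that parameters are placed round-robin over the ps tasks in creation order, so every client computes the same placement independently. The helper names and the chief-only node are hypothetical.

```python
import re

# Toy model only: plain Python, no TensorFlow calls.
# PARAMS, build_worker_graph, modulo_worker_task and "chief_only_op"
# are hypothetical names used for illustration.
PARAMS = ["weights", "biases", "global_step"]


def build_worker_graph(task_index, num_ps=2, is_chief=False):
    """Return the graph one client builds, as {node_name: device}."""
    worker_device = "/job:worker/task:%d" % task_index
    graph = {}
    # Assumed round-robin placement: deterministic, so every client
    # maps each parameter to the same ps task.
    for i, name in enumerate(PARAMS):
        graph[name] = "/job:ps/task:%d" % (i % num_ps)
    # A single copy of the compute-intensive part, on the local worker.
    graph["loss"] = worker_device
    # The chief may create operations that other workers do not.
    if is_chief:
        graph["chief_only_op"] = worker_device
    return graph


def modulo_worker_task(graph):
    """Erase the worker task index so graphs can be compared logically."""
    return {
        name: re.sub(r"^/job:worker/task:\d+", "/job:worker/task:*", device)
        for name, device in graph.items()
    }


g0 = build_worker_graph(0, is_chief=True)
g1 = build_worker_graph(1)
g2 = build_worker_graph(2)

print(g1 == g2)                                          # literal comparison
print(modulo_worker_task(g1) == modulo_worker_task(g2))  # logical comparison
print(sorted(set(g0) - set(g1)))                         # chief-only nodes
```

The two non-chief graphs differ only in the device of the compute node, while the parameter placement is identical in all three, which is what lets the separate clients share variables on the ps tasks.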
Shouldn't it be multiple copies of the compute-intensive part of the model, since we have multiple workers?
Typically, each worker has a separate graph that contains a single copy of the compute-intensive part of the model. The graph for worker i does not contain the nodes for worker j (assuming i ≠ j). (An exception would be the case where you're using between-graph replication for distributed training, and in-graph replication for using multiple GPUs in each worker. In that case, the graph for a worker would typically contain N copies of the compute-intensive part of the graph, where N is the number of GPUs in that worker.)
Does the example in Replicated training support training on multiple machines, each of which has multiple GPUs?
The example code only covers training on multiple machines, and says nothing about how to train on multiple GPUs in each machine. However, the techniques compose easily. In this part of the example:
# Build model...
loss = ...
...you could add a loop over the GPUs in the local machine, to achieve distributed training across multiple workers, each with multiple GPUs.
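As a rough sketch of what that combination produces, the plain-Python toy below (no TensorFlow calls; the function and node names are hypothetical) lists the nodes that one worker's graph might contain: the shared parameters on the ps tasks, plus one copy of the compute-intensive part per local GPU.

```python
# Toy model only: plain Python, no TensorFlow calls.
# build_hybrid_worker_graph and the "tower_N/loss" names are hypothetical.

def build_hybrid_worker_graph(task_index, num_gpus, param_names, num_ps=1):
    """Return one worker's graph as a list of (node_name, device) pairs."""
    graph = []

    # Between-graph part: every worker declares the same parameters,
    # pinned to the ps tasks.
    for i, name in enumerate(param_names):
        graph.append((name, "/job:ps/task:%d" % (i % num_ps)))

    # In-graph part: loop over the GPUs of this worker only.
    for gpu in range(num_gpus):
        device = "/job:worker/task:%d/gpu:%d" % (task_index, gpu)
        graph.append(("tower_%d/loss" % gpu, device))

    return graph


worker1 = build_hybrid_worker_graph(task_index=1, num_gpus=2,
                                    param_names=["weights", "biases"])
for node, device in worker1:
    print(node, "->", device)
```

Note that the graph for worker 1 contains no nodes placed on any other worker; only the parameters on the ps tasks are common to all of the graphs.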