
Between-graph replication in tensorflow: sessions and variables

I have a question about between-graph replication in distributed TensorFlow, because there are a few points I couldn't get from the tutorials. As I understand the current model:

We have a parameter server, which we simply launch in a separate process and call server.join() on. We have workers, each of which builds a similar computational graph containing parameter nodes linked to the parameter server (through tf.train.replica_device_setter) and computation nodes placed on the worker itself.
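Roughly the setup I have in mind, as a minimal sketch (assuming the TF 1.x distributed API; the job names, ports and shapes here are just placeholders):

    import tensorflow as tf

    # Cluster layout: one parameter server and two workers (placeholder ports).
    cluster = tf.train.ClusterSpec({
        "ps":     ["localhost:2222"],
        "worker": ["localhost:2223", "localhost:2224"],
    })

    # Parameter server process: just serve and block.
    #   server = tf.train.Server(cluster, job_name="ps", task_index=0)
    #   server.join()

    # Worker process (task_index differs per worker):
    server = tf.train.Server(cluster, job_name="worker", task_index=0)

    # Variables created under this device function land on the "ps" job,
    # while the ops themselves stay on this worker.
    with tf.device(tf.train.replica_device_setter(
            worker_device="/job:worker/task:0", cluster=cluster)):
        w = tf.get_variable("w", shape=[10, 10])
        loss = tf.reduce_sum(tf.square(w))
        train_op = tf.train.GradientDescentOptimizer(0.1).minimize(loss)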

What I didn't find:

  1. How do sessions work in this model? In the examples/tutorials this is hidden behind tf.train.Supervisor. Do we have a separate session on each worker, or one huge session that accumulates the graphs from all the workers and the parameter server?

  2. How are global variables initialized on the parameter server? I assume I can initialize them from one of the worker processes (chosen as a "master"), since these parameters are linked to the worker through tf.train.replica_device_setter. Is that correct?

  3. In the following gist:

https://gist.github.com/yaroslavvb/ea1b1bae0a75c4aae593df7eca72d9ca

the global variables are initialized only in the parameter server process, yet all the workers consider them initialized. How is that possible, given that the workers run in different sessions? I could not replicate this in a simpler example.

  4. I have a main session in the core program where I train the model. Part of the training loop is data collection, which in turn requires computation on the TensorFlow cluster. So I need to create this cluster, put the current state of the trained model on the parameter server, collect data from the computation, and then continue with the training loop. How can I: 1) pass the current trained model to the cluster, and 2) extract the collected data from the cluster and pass it back to the main program?

Thanks in advance!

EDIT:

To q. 3: it was answered previously (in "In tensorflow, is variable value the only context information a session stores?") that in the distributed runtime variables are shared between sessions. Does that mean that when I create a session with some "target", all variables will be shared between the sessions that run on the same graph?

dd210 asked Apr 21 '26 01:04

1 Answer

I guess I can try answering these questions myself; at least it may be helpful for other newbies trying to harness distributed TensorFlow, because as of now there is a lack of concise and clear blog posts on the topic.

I hope more knowledgeable people will correct me if needed.

  1. We have separate sessions on all the servers, and these sessions share their resources (variables, queues, and readers), but only in the distributed setting, i.e. when you pass server.target to the tf.Session constructor (see the sketch after this list).

Ref: https://www.tensorflow.org/api_docs/python/client/session_management#Session

  2. Parameter variables are usually initialized in one "master" process. That can be the process in which the parameter server is launched, but it is not strictly necessary to do it in just one process.

  3. Because of p. 1: sessions created against the same cluster share variables, so the workers see the variables that the parameter server process initialized. Replicated :)

  4. Thanks to ideas from @YaroslavBulatov, I came to the following approach, which appears to be the simplest possible:

    • Cluster: one local "calculation server" and N "workers".
    • "Calculation server" keeps all the parameters of global network and performs training steps. All training ops are assigned to it.
    • "Workers" collect data in parallel and then put it in Queue; these data are used by "calculation server" when doing training steps.

So, high-level algorithm:

  1. launch all the units in the cluster
  2. build the computational graph and training ops on the calculation server
  3. build the computational graph on the workers (variables are linked to the calculation server)
  4. collect data with the workers
  5. perform a training step on the calculation server and update the global network

repeat steps 4-5 until convergence :)

For now I coordinate the calculation server and the workers through queues (when to start data collection and when to start a training step), which is definitely not the most elegant solution. Any feedback is very welcome.
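A rough sketch of the queue-based hand-off (again assuming the TF 1.x API; the "calc" job name, shapes and capacity are purely illustrative): a queue is pinned to the calculation server with a shared_name, so the workers and the calculation server all see the same queue.

    import tensorflow as tf

    # Queue living on the calculation server; shared_name makes it visible
    # to every graph/session that builds the same node.
    with tf.device("/job:calc/task:0"):
        data_queue = tf.FIFOQueue(capacity=100, dtypes=[tf.float32],
                                  shapes=[[128]], shared_name="data_queue")

    # On a worker: collect a batch of data and push it to the calculation server.
    collected = tf.random_normal([128])  # stand-in for real data collection
    enqueue_op = data_queue.enqueue(collected)

    # On the calculation server: pull a batch and use it for a training step.
    batch = data_queue.dequeue()
    # loss = build_loss(batch); train_op = optimizer.minimize(loss)  # hypothetical helpers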

dd210 answered Apr 23 '26 20:04