Question about between-graph replication in distributed Tensorflow, because I didn't get few moments from tutorials. As I understood the current model:
We have parameter server which we just launch in separate process and make server.join(). We have workers, each of them builds the similar computational graph which contains parameter nodes linked to parameter server (through tf.train.replica_device_setter) and calculation nodes placed on workers themselves.
What I didn't find:
How sessions are working in this model? Because in examples/tutorials it is hidden behind tf.train.Supervisor. Do we have separate sessions on each worker or just one huge session that accumulate graphs from all the workers and parameter server?
How global variables are initialized on parameter server? I wonder that I can initialize them in one of the worker processes (choose it as a "master"), if I linked these parameters on the worker through tf.train.replica_device_setter. Is that correct?
In the following gist:
https://gist.github.com/yaroslavvb/ea1b1bae0a75c4aae593df7eca72d9ca
global variables initialized just in parameter server process and all the workers consider them initialized. How is that possible, given that they even work in different sessions? I could not replicate it in simpler example.
Thanks in advance!
EDIT:
To q.3: It was answered previously (In tensorflow, is variable value the only context information a session stores?) that in distributed runtime variables are shared between sessions. Does it mean that when I create session with some "target", then all the variables will be shared between those sessions that run on the same graph?
Guess I can try answering these questions by myself, at least it may be helpful for other newbies trying to harness distributed Tensorflow, because as of now there is lack of concise and clear blog posts on that topic.
Hope more knowledgeable people will correct me if needed.
Ref: https://www.tensorflow.org/api_docs/python/client/session_management#Session
Parameter variables usually are initialized in one "master" process. It can be process where parameter server is launched. But it is not strictly necessary to do it in just one process.
Because of p.1. Replicated :)
Thanks to ideas from @YaroslavBulatov, I came to the following approach, which appears to be the simplest possible:
So, high-level algorithm:
repeat 4-5 till convergence :)
As of now I did coordination between calculation server and workers through Queues (when to start collection of data and when to start training step), which is definitely not the most elegant solution. Any feedback is very welcome.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With