TensorFlow: Using Parameter Servers in Distributed Training

It's not totally clear how parameter servers know what to do in distributed TensorFlow training.

For example, in this SO question, the following code is used to configure parameter server and worker tasks:

if FLAGS.job_name == "ps":
    server.join()  # block forever, serving variable reads/writes from the workers
elif FLAGS.job_name == "worker":
    ## some training code

How does server.join() indicate the given task should be a parameter server? Is parameter serving a kind of default behavior for tasks? Is there anything else you can/should tell a parameter serving task to do?

Edit: This SO question addresses part of my question: "The logic there makes sure that Variable objects are assigned evenly to workers that act as parameter servers." But how does a parameter server know it is a parameter server? Is server.join() enough?

asked Dec 07 '22 by theis188
1 Answer

TL;DR: TensorFlow doesn't know anything about "parameter servers", but instead it supports running graphs across multiple devices in different processes. Some of these processes have devices whose names start with "/job:ps", and these hold the variables. The workers drive the training process, and when they run the train_op they will cause work to happen on the "/job:ps" devices, which will update the shared variables.
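For concreteness, here is a minimal sketch of the kind of cluster setup the question's snippet assumes. The hostnames, ports, and flag defaults below are placeholders, not taken from the original code:

import numpy as np
import tensorflow as tf

flags = tf.app.flags
flags.DEFINE_string("job_name", "worker", "Either 'ps' or 'worker'")
flags.DEFINE_integer("task_index", 0, "Index of this task within its job")
FLAGS = flags.FLAGS

# Two PS tasks and one worker task. Every process in the cluster builds the
# same ClusterSpec and then starts a Server for its own (job_name, task_index).
cluster = tf.train.ClusterSpec({
    "ps": ["ps0.example.com:2222", "ps1.example.com:2222"],
    "worker": ["worker0.example.com:2222"],
})
server = tf.train.Server(cluster,
                         job_name=FLAGS.job_name,
                         task_index=FLAGS.task_index)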

The server.join() method simply tells TensorFlow to block and listen for requests until the server shuts down (which currently means it blocks forever, or until you kill the process, since clean shutdown isn't currently implemented).
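Continuing that sketch, the PS branch really is nothing more than the blocking call; the in-process server handles the variable reads and writes issued by the workers:

if FLAGS.job_name == "ps":
    # Keep the gRPC server alive; it services requests for the variables
    # that the workers place on this task's devices.
    server.join()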

In the example in my previous answer, the PS tasks are passive, and everything is controlled by the worker tasks... in ## some training code. If you split your code across multiple devices, TensorFlow will add the appropriate communication, and this extends to devices in different processes. The with tf.device(tf.train.replica_device_setter(...)): block tells TensorFlow to put each variable on a different PS task by setting its device to "/job:ps/task:{i}" (for different values of {i}, chosen in a round-robin fashion).
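A sketch of that placement, continuing from the setup above (the variable names, shapes, loss, and train_op are illustrative stand-ins for ## some training code):

if FLAGS.job_name == "worker":
    # replica_device_setter pins each Variable to a /job:ps task in
    # round-robin order, while ordinary ops stay on this worker's device.
    with tf.device(tf.train.replica_device_setter(
            worker_device="/job:worker/task:%d" % FLAGS.task_index,
            cluster=cluster)):
        x = tf.placeholder(tf.float32, [None, 100])
        y = tf.placeholder(tf.float32, [None, 10])
        w = tf.get_variable("w", [100, 10])   # placed on /job:ps/task:0
        b = tf.get_variable("b", [10])        # placed on /job:ps/task:1
        loss = tf.reduce_mean(tf.square(tf.matmul(x, w) + b - y))
        train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss)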

When you call sess.run(train_op), TensorFlow will run a graph that depends on the variables and includes the operations that update them. The update part of the computation will happen on the "/job:ps" devices, so those devices will act like a parameter server.
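Still in the worker branch of the sketch, the session connects to this worker's own in-process server, and each sess.run(train_op) call sends the variable updates to the /job:ps devices over the network (the random batches are only there to make the snippet self-contained):

    with tf.Session(server.target) as sess:
        sess.run(tf.global_variables_initializer())
        for _ in range(1000):
            batch_x = np.random.rand(32, 100).astype(np.float32)
            batch_y = np.random.rand(32, 10).astype(np.float32)
            sess.run(train_op, feed_dict={x: batch_x, y: batch_y})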

answered Dec 10 '22 by mrry