
Distributed TensorFlow: ValueError "When using replicas, all Variables must have their device set"

I am trying to write a distributed variational autoencoder in TensorFlow in standalone mode.

My cluster consists of 3 machines, named m1, m2 and m3. I am trying to run 1 parameter server (ps) on m1 and 2 worker servers on m2 and m3, following the example trainer program in the distributed TensorFlow documentation. On m3 I get the following error:
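For context, the servers are created along these lines. This is a sketch: the actual clusterSpec construction is not shown below, and the host:port strings are placeholders.

import tensorflow as tf

# Placeholder host:port strings -- the real addresses are on m1, m2 and m3.
clusterSpec = tf.train.ClusterSpec({
    "ps": ["m1:2222"],
    "worker": ["m2:2222", "m3:2222"],
})

# Each process starts one in-process server for its own job and task.
server = tf.train.Server(clusterSpec,
                         job_name=FLAGS.job_name,
                         task_index=FLAGS.task_index)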

Traceback (most recent call last): 
 File "/home/yama/mfs/ZhuSuan/examples/vae.py", line 241, in <module> 
   save_model_secs=600) 
 File "/mfs/yama/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 334, in __init__ 
   self._verify_setup() 
 File "/mfs/yama/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 863, in _verify_setup 
   "their device set: %s" % op) 
ValueError: When using replicas, all Variables must have their device set: name: "Variable"
op: "Variable" 
attr { 
 key: "container" 
 value { 
   s: "" 
 } 
} 
attr { 
 key: "dtype" 
 value { 
   type: DT_INT32 
 } 
} 
attr { 
 key: "shape" 
 value { 
   shape { 
   } 
 } 
} 
attr { 
 key: "shared_name" 
 value { 
   s: "" 
 } 
}

Here is the part of my code that defines the network and the Supervisor:

if FLAGS.job_name == "ps":
    server.join()
elif FLAGS.job_name == "worker":

    #set distributed device
    with tf.device(tf.train.replica_device_setter(
        worker_device="/job:worker/task:%d" % FLAGS.task_index,
        cluster=clusterSpec)):

        # Build the training computation graph
        x = tf.placeholder(tf.float32, shape=(None, x_train.shape[1]))
        optimizer = tf.train.AdamOptimizer(learning_rate=0.001, epsilon=1e-4)
        with tf.variable_scope("model") as scope:
            with pt.defaults_scope(phase=pt.Phase.train):
                train_model = M1(n_z, x_train.shape[1])
                train_vz_mean, train_vz_logstd = q_net(x, n_z)
                train_variational = ReparameterizedNormal(
                    train_vz_mean, train_vz_logstd)
                grads, lower_bound = advi(
                    train_model, x, train_variational, lb_samples, optimizer)
                infer = optimizer.apply_gradients(grads)
        #print(type(lower_bound))

        # Build the evaluation computation graph
        with tf.variable_scope("model", reuse=True) as scope:
            with pt.defaults_scope(phase=pt.Phase.test):
                eval_model = M1(n_z, x_train.shape[1])
                eval_vz_mean, eval_vz_logstd = q_net(x, n_z)
                eval_variational = ReparameterizedNormal(
                    eval_vz_mean, eval_vz_logstd)
                eval_lower_bound = is_loglikelihood(
                    eval_model, x, eval_variational, lb_samples)
                eval_log_likelihood = is_loglikelihood(
                    eval_model, x, eval_variational, ll_samples)

    #saver = tf.train.Saver()
    summary_op = tf.merge_all_summaries()
    global_step = tf.Variable(0)
    init_op = tf.initialize_all_variables()

    # Create a "supervisor", which oversees the training process.
    sv = tf.train.Supervisor(is_chief=(FLAGS.task_index == 0), 
                             logdir=LogDir,
                             init_op=init_op,
                             summary_op=summary_op,
    #                         saver=saver,
                             global_step=global_step,
                             save_model_secs=600)
    print("create sv done")

I think there must be something wrong with my code, but I don't know how to fix it. Any advice? Thanks a lot!

asked Aug 05 '16 by sproblvem


1 Answer

The problem stems from the definition of your global_step variable:

global_step = tf.Variable(0)

This definition is outside the scope of the with tf.device(tf.train.replica_device_setter(...)): block above, so no device is assigned to global_step. In replicated training this is a common source of error, because if different replicas place the variable on different devices they won't share the same value, so TensorFlow includes a sanity check that prevents it.
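You can see the missing placement directly: a variable created outside any device scope reports an empty device string, which is exactly what the Supervisor's _verify_setup() check rejects. A quick illustration (hypothetical snippet, not from the question):

v = tf.Variable(0)
print(v.device)   # "" -- no device assigned, so _verify_setup() complains

with tf.device("/job:ps/task:0"):
    w = tf.Variable(0)
print(w.device)   # "/job:ps/task:0" -- explicitly placed, passes the check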

Fortunately, the solution is simple. You can either define global_step inside the with tf.device(tf.train.replica_device_setter(...)): block above, or add a small with tf.device("/job:ps/task:0"): block as follows:

with tf.device("/job:ps/task:0"):
    global_step = tf.Variable(0, name="global_step")
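Equivalently, the first option keeps global_step inside the replica_device_setter scope, which assigns Variables to the ps job automatically. A sketch reusing the arguments from the question:

with tf.device(tf.train.replica_device_setter(
        worker_device="/job:worker/task:%d" % FLAGS.task_index,
        cluster=clusterSpec)):
    # replica_device_setter places Variables on "/job:ps", so
    # global_step now has an explicit device. trainable=False is the
    # usual convention so the optimizer doesn't update the step counter.
    global_step = tf.Variable(0, name="global_step", trainable=False)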
answered Sep 19 '22 by mrry