I am trying to write a distributed variational autoencoder in TensorFlow in standalone mode.
My cluster consists of 3 machines, named m1, m2 and m3. I am trying to run 1 ps server on m1 and 2 worker servers on m2 and m3, following the example trainer program in the distributed TensorFlow documentation (a rough sketch of the cluster setup is included right after this paragraph). On m3 I got the error message shown below the sketch.
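For context, this is roughly how the ClusterSpec and the server are set up on each machine; the port 2222 and the flag definitions are placeholders standing in for my actual launch script:

import tensorflow as tf

# 1 ps task on m1, 2 worker tasks on m2 and m3 (ports are placeholders)
clusterSpec = tf.train.ClusterSpec({
    "ps": ["m1:2222"],
    "worker": ["m2:2222", "m3:2222"]
})

flags = tf.app.flags
flags.DEFINE_string("job_name", "", "Either 'ps' or 'worker'")
flags.DEFINE_integer("task_index", 0, "Index of the task within its job")
FLAGS = flags.FLAGS

# Each machine starts one in-process server for its own job name and task index
server = tf.train.Server(clusterSpec,
                         job_name=FLAGS.job_name,
                         task_index=FLAGS.task_index)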
Traceback (most recent call last):
  File "/home/yama/mfs/ZhuSuan/examples/vae.py", line 241, in <module>
    save_model_secs=600)
  File "/mfs/yama/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 334, in __init__
    self._verify_setup()
  File "/mfs/yama/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 863, in _verify_setup
    "their device set: %s" % op)
ValueError: When using replicas, all Variables must have their device set: name: "Variable"
op: "Variable"
attr {
  key: "container"
  value {
    s: ""
  }
}
attr {
  key: "dtype"
  value {
    type: DT_INT32
  }
}
attr {
  key: "shape"
  value {
    shape {
    }
  }
}
attr {
  key: "shared_name"
  value {
    s: ""
  }
}
And this is the part of my code that defines the network and the Supervisor:
if FLAGS.job_name == "ps":
    server.join()
elif FLAGS.job_name == "worker":
    # set distributed device
    with tf.device(tf.train.replica_device_setter(
            worker_device="/job:worker/task:%d" % FLAGS.task_index,
            cluster=clusterSpec)):
        # Build the training computation graph
        x = tf.placeholder(tf.float32, shape=(None, x_train.shape[1]))
        optimizer = tf.train.AdamOptimizer(learning_rate=0.001, epsilon=1e-4)
        with tf.variable_scope("model") as scope:
            with pt.defaults_scope(phase=pt.Phase.train):
                train_model = M1(n_z, x_train.shape[1])
                train_vz_mean, train_vz_logstd = q_net(x, n_z)
                train_variational = ReparameterizedNormal(
                    train_vz_mean, train_vz_logstd)
                grads, lower_bound = advi(
                    train_model, x, train_variational, lb_samples, optimizer)
                infer = optimizer.apply_gradients(grads)
                #print(type(lower_bound))

        # Build the evaluation computation graph
        with tf.variable_scope("model", reuse=True) as scope:
            with pt.defaults_scope(phase=pt.Phase.test):
                eval_model = M1(n_z, x_train.shape[1])
                eval_vz_mean, eval_vz_logstd = q_net(x, n_z)
                eval_variational = ReparameterizedNormal(
                    eval_vz_mean, eval_vz_logstd)
                eval_lower_bound = is_loglikelihood(
                    eval_model, x, eval_variational, lb_samples)
                eval_log_likelihood = is_loglikelihood(
                    eval_model, x, eval_variational, ll_samples)

    #saver = tf.train.Saver()
    summary_op = tf.merge_all_summaries()
    global_step = tf.Variable(0)
    init_op = tf.initialize_all_variables()

    # Create a "supervisor", which oversees the training process.
    sv = tf.train.Supervisor(is_chief=(FLAGS.task_index == 0),
                             logdir=LogDir,
                             init_op=init_op,
                             summary_op=summary_op,
                             # saver=saver,
                             global_step=global_step,
                             save_model_secs=600)
    print("create sv done")
I think there must be something wrong with my code, but I don't know how to fix it. Any advice? Thanks a lot!
The problem stems from the definition of your global_step variable:
global_step = tf.Variable(0)
This definition is outside the scope of the with tf.device(tf.train.replica_device_setter(...)): block above, so no device is assigned to global_step. In replicated training this is often a source of error (if different replicas decide to place the variable on different devices, they won't share the same value), so TensorFlow includes a sanity check that prevents it.
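One way to see this, assuming the graph from the question: tf.Variable exposes a device property, and for a variable created outside any device scope it comes back empty, while variables created under the replica_device_setter block get pinned to the ps job.

print(global_step.device)  # "" -- no device was ever assigned to it
# Variables created inside the replica_device_setter block instead report
# a parameter-server device such as "/job:ps/task:0".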
Fortunately, the solution is simple. You can either define global_step inside the with tf.device(tf.train.replica_device_setter(...)): block above, or add a small with tf.device("/job:ps/task:0"): block as follows:
with tf.device("/job:ps/task:0"):
    global_step = tf.Variable(0, name="global_step")
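The first option, for comparison, would look roughly like this, reusing the names from your code (a sketch, not a drop-in patch):

with tf.device(tf.train.replica_device_setter(
        worker_device="/job:worker/task:%d" % FLAGS.task_index,
        cluster=clusterSpec)):
    # ... model and optimizer definitions as in your code ...
    # replica_device_setter places variables on the ps job, so the
    # Supervisor's check passes; trainable=False keeps the optimizer
    # from trying to update the step counter.
    global_step = tf.Variable(0, name="global_step", trainable=False)

Either way, global_step ends up with an explicit device on the parameter server, which is what the Supervisor's _verify_setup() check requires.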