I am following the example here to create a distributed TensorFlow model with one parameter server and n workers. I do not have any GPUs; all work is distributed across CPUs.
In the chief worker, I want to save my variables every few steps, but invoking the saver raises the following exception:
Cannot assign a device to node 'save_1/RestoreV2_21':
Could not satisfy explicit device specification
'/job:ps/task:0/device:CPU:0' because no devices matching that
specification are registered in this process; available devices:
/job:localhost/replica:0/task:0/cpu:0
[[Node: save_1/RestoreV2_21 = RestoreV2[dtypes=[DT_INT32],
_device="/job:ps/task:0/device:CPU:0"](save_1/Const,
save_1/RestoreV2_21/tensor_names, save_1/RestoreV2_21/shape_and_slices)]]
I tried enabling soft placement when creating the server:
server = tf.train.Server(cluster,
                         job_name=self.calib.params['job_name'],
                         task_index=self.calib.params['task_index'],
                         config=tf.ConfigProto(allow_soft_placement=True))
I am using a supervisor:
sv = tf.train.Supervisor(
    is_chief=is_chief,
    ...)
and creating my session as follows:
sess = sv.prepare_or_wait_for_session(server.target)
but I still get exactly the same error.
This line in the error message:
available devices: /job:localhost/replica:0/task:0/cpu:0
...suggests that your tf.Session is not connected to the tf.train.Server you created. In particular, it seems to be a local (or "direct") session that can only access devices in the local process.
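One way to check this diagnosis is to look at the job names in the devices the session reports. The sketch below is plain Python and only parses device strings; the helper names `job_names` and `is_local_only` are hypothetical, and it assumes you can obtain the device names yourself (in TensorFlow 1.x versions that provide it, something like `[d.name for d in sess.list_devices()]` should work).

```python
import re


def job_names(device_names):
    """Return the set of job names found in TensorFlow-style device strings."""
    jobs = set()
    for name in device_names:
        # Device strings look like "/job:<name>/replica:0/task:0/cpu:0".
        match = re.search(r"/job:([^/]+)", name)
        if match:
            jobs.add(match.group(1))
    return jobs


def is_local_only(device_names):
    """True when no device belongs to a job other than 'localhost'."""
    return job_names(device_names) <= {"localhost"}


# Device list copied from the error message in the question.
reported = ["/job:localhost/replica:0/task:0/cpu:0"]
print(is_local_only(reported))
```

If this prints `True` while the graph pins nodes to `/job:ps/...`, the session is very likely a local one rather than one attached to the cluster.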
To fix this problem, when you create your session, pass server.target to the initializer. For example, depending on which API you are using to create the session, you might want to use one of the following:
# Creating a session explicitly.
with tf.Session(server.target) as sess:
    # ...

# Using a `tf.train.Supervisor` called `sv`.
with sv.managed_session(server.target):
    # ...

# Using a `tf.train.MonitoredTrainingSession`.
with tf.train.MonitoredTrainingSession(server.target):
    # ...