
Distributed Tensorflow save fails no device

Tags:

tensorflow

I am following the example here to create a distributed TensorFlow model with a parameter server and n workers. I do not have any GPUs; all work is distributed across CPUs.

In the chief worker, I want to save my variables every few steps, but invoking the saver raises the following exception:

Cannot assign a device to node 'save_1/RestoreV2_21': 
Could not satisfy explicit device specification 
'/job:ps/task:0/device:CPU:0' because no devices matching that 
specification are registered in this process; available devices: 
/job:localhost/replica:0/task:0/cpu:0

[[Node: save_1/RestoreV2_21 = RestoreV2[dtypes=[DT_INT32],
_device="/job:ps/task:0/device:CPU:0"](save_1/Const, 
save_1/RestoreV2_21/tensor_names, save_1/RestoreV2_21/shape_and_slices)]]

I tried enabling soft placement in the server config:

server = tf.train.Server(cluster,
                         job_name=self.calib.params['job_name'],
                         task_index=self.calib.params['task_index'],
                         config=tf.ConfigProto(allow_soft_placement=True))

I am using a supervisor:

sv = tf.train.Supervisor(
                         is_chief=is_chief,
                        ...)

and creating my session as follows:

sess = sv.prepare_or_wait_for_session(server.target)

but I still get exactly the same error.

volatile asked Jun 01 '26 10:06


1 Answer

This line in the error message:

available devices: /job:localhost/replica:0/task:0/cpu:0

...suggests that your tf.Session is not connected to the tf.train.Server you created. In particular, it seems to be a local (or "direct") session that can only access devices in the local process.
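The diagnosis above can be checked mechanically by comparing the job name in the requested device with the job names of the available devices. The following is a minimal sketch that only parses the device strings from the error message; the helper names `job_names` and `looks_like_local_session` are hypothetical and are not part of TensorFlow.

```python
import re

# Hypothetical helpers: they only inspect device strings copied from the
# error message and never call TensorFlow.
DEVICE_JOB_RE = re.compile(r"/job:(?P<job>[^/\s]+)")


def job_names(text):
    """Return the set of job names found in the device strings in `text`."""
    return {m.group("job") for m in DEVICE_JOB_RE.finditer(text)}


def looks_like_local_session(requested_device, available_devices):
    """Return True when all available devices are in the 'localhost' job
    while the requested device belongs to some other (cluster) job."""
    requested = job_names(requested_device)
    available = job_names(available_devices)
    return available == {"localhost"} and not requested <= available


# Device strings taken from the error message in the question.
requested = "/job:ps/task:0/device:CPU:0"
available = "/job:localhost/replica:0/task:0/cpu:0"
print(looks_like_local_session(requested, available))
```

If the check reports a local session, the likely cause is the one described here: the session was created without the server's target, so it never sees the `ps` or `worker` devices.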

To fix this problem, when you create your session, pass server.target to the initializer. For example, depending on which API you are using to create the session, you might want to use one of the following:

# Creating a session explicitly.
with tf.Session(server.target) as sess:
  # ...

# Using a `tf.train.Supervisor` called `sv`.
with sv.managed_session(server.target):
  # ...

# Using a `tf.train.MonitoredTrainingSession`.
with tf.train.MonitoredTrainingSession(server.target):
  # ...

mrry answered Jun 03 '26 21:06


