I'm trying to learn distributed TensorFlow. I tried out a piece of code as explained here:
with tf.device("/cpu:0"):
    W = tf.Variable(tf.zeros([784, 10]))
    b = tf.Variable(tf.zeros([10]))
with tf.device("/cpu:1"):
    y = tf.nn.softmax(tf.matmul(x, W) + b)
    loss = tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(y), reduction_indices=[1]))
Getting the following error:
tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot assign a device for operation 'MatMul': Operation was explicitly assigned to /device:CPU:1 but available devices are [ /job:localhost/replica:0/task:0/cpu:0 ]. Make sure the device specification refers to a valid device. [[Node: MatMul = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/device:CPU:1"](Placeholder, Variable/read)]]
Meaning that TensorFlow does not recognize CPU:1.
I'm running on a RedHat server with 40 CPUs (cat /proc/cpuinfo | grep processor | wc -l).
Any ideas?
If a TensorFlow operation has no corresponding GPU implementation, then the operation falls back to the CPU device. For example, since tf.cast only has a CPU kernel, on a system with devices CPU:0 and GPU:0, the CPU:0 device is selected to run tf.cast.
TensorFlow GPU Operations
By default, if a GPU is available, TensorFlow will use it for all operations. You can control which GPU TensorFlow will use for a given operation, or instruct TensorFlow to use a CPU, even if a GPU is available.
The TensorFlow Session object is multithreaded, so multiple threads can easily use the same session and run ops in parallel.
Following the link in the comment:
Turns out the session should be configured to have device count > 1:
config = tf.ConfigProto(device_count={"CPU": 8})
with tf.Session(config=config) as sess:
    ...
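For anyone who wants to try this end to end, here is a minimal sketch that puts the graph from the question together with the session config. A few things in it are my own assumptions rather than part of the original code: the placeholder shapes for x and y_, the random input batch, the NUM_CPU_DEVICES value of 2 (anything greater than 1 should do), and the use of tensorflow.compat.v1 so the same TF1-style code can also run on a TensorFlow 2.x install.

```python
import numpy as np
import tensorflow.compat.v1 as tf  # TF1-style graph API (assumes TF 1.14+ or 2.x)

tf.disable_v2_behavior()  # build a static graph instead of executing eagerly

NUM_CPU_DEVICES = 2  # assumed value for illustration; must cover every /cpu:N used below

graph = tf.Graph()
with graph.as_default():
    # Assumed input shapes: flattened 28x28 images and 10 one-hot classes.
    x = tf.placeholder(tf.float32, [None, 784])
    y_ = tf.placeholder(tf.float32, [None, 10])

    # Variables live on the first logical CPU device.
    with tf.device("/cpu:0"):
        W = tf.Variable(tf.zeros([784, 10]))
        b = tf.Variable(tf.zeros([10]))

    # The forward pass and the loss are pinned to the second logical CPU device.
    with tf.device("/cpu:1"):
        y = tf.nn.softmax(tf.matmul(x, W) + b)
        loss = tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(y), axis=1))

    init = tf.global_variables_initializer()

# Without this, the session registers only CPU:0 and "/cpu:1" cannot be placed.
config = tf.ConfigProto(device_count={"CPU": NUM_CPU_DEVICES})

with tf.Session(graph=graph, config=config) as sess:
    device_names = [d.name for d in sess.list_devices()]
    sess.run(init)

    # Made-up batch of 4 examples, just to exercise the graph.
    batch_x = np.random.rand(4, 784).astype(np.float32)
    batch_y = np.eye(10, dtype=np.float32)[:4]
    loss_value = sess.run(loss, feed_dict={x: batch_x, y_: batch_y})

print(device_names)
print(loss_value)
```

Note that these are logical devices: the number of CPU devices TensorFlow registers comes from the session config, not from the number of cores the machine reports.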
Kind of shocking that I missed something so basic, and that no one could point out an error that seems so obvious.
Not sure if it's a problem with me or the TensorFlow code samples and documentation. Since it's Google, I'll have to say that it's me.