It would be extremely helpful to have a clear definition of what exactly a device in TensorFlow is. Is a device a single processing unit (so no "real" concurrency is possible)?
You can define as many devices as you want by doing the following:
config = tf.ConfigProto(device_count={"CPU": 2},
                        inter_op_parallelism_threads=2,
                        intra_op_parallelism_threads=1)
sess = tf.Session(config=config)
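For example (a sketch of what I mean), ops can then be pinned to either logical device:

with tf.device("/cpu:0"):
    x = tf.ones((2, 2))       # placed on the first logical CPU device
with tf.device("/cpu:1"):
    y = tf.matmul(x, x)       # placed on the second logical CPU device
print(sess.run(y))            # both run on the same physical processor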
How is it possible that you can define as many devices as you want despite having only one processor with 4 cores?
Too long for a comment (perhaps @mrry or @keveman could give an official definition), but here are some observations:
tf.device("gpu:0")
device can keep its data on in main memory (ie, physical CPU device), so logical device boundary is sometimes violated in practice. This is the HostMemory
annotation you see in ops like integer Add
here. This allows one to do ops like shape manipulation on logical device GPU and avoid crossing logical device boundary (Send/Recv ops) even though the data is not stored on physical device GPU.device_count={"CPU": m}...intra_op_parallelism_threads=n
creates multiple Eigen thread-pools with n
threads each, so you can manually partition your graph to run m
ops in parallel where each op will request n
threads. However you can't run more threads concurrently than you have physical cores so this may be slow.cpu:0
are not pinned specific cores, so they can use whichever cores are availableHere's an example of creating 8 CPU devices and running 2 matmul's in parallel: https://gist.github.com/yaroslavvb/9a5f4a0b613c79152152b35c0bc840b8
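The gist creates the devices with a session config along these lines (a sketch, assuming 8 logical CPU devices with one intra-op thread each):

config = tf.ConfigProto(device_count={"CPU": 8},
                        inter_op_parallelism_threads=8,
                        intra_op_parallelism_threads=1)
sess = tf.Session(config=config)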
The core graph construction looks like this:

with tf.device("cpu:0"):
    a1 = tf.ones((n, n))      # both inputs live on cpu:0
    a2 = tf.ones((n, n))
with tf.device("cpu:1"):
    a3 = tf.matmul(a1, a2)    # independent of a4, so it can run in parallel
with tf.device("cpu:2"):
    a4 = tf.matmul(a1, a2)
with tf.device("cpu:3"):
    a5 = tf.matmul(a3, a4)    # depends on both a3 and a4
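The gist also captures tracing information; the standard TF 1.x recipe for that looks roughly like this (a sketch):

run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE,
                            output_partition_graphs=True)
run_metadata = tf.RunMetadata()
sess.run(a5, options=run_options, run_metadata=run_metadata)
# run_metadata now holds step_stats (used for the timeline below) and
# the per-device partition graphs with their Send/Recv ops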
If you run the gist and look at the run_metadata partition-graphs section that gets printed, you'll see Send/Recv ops added to transfer data between CPU devices, i.e., something like this:
partition_graphs {
  node {
    name: "MatMul_1/_11"
    op: "_Recv"
    device: "/job:localhost/replica:0/task:0/cpu:3"
    attr {
      key: "client_terminated"
      value {
        b: false
      }
    }
    attr {
      key: "recv_device"
      value {
        s: "/job:localhost/replica:0/task:0/cpu:3"
      }
    }
    attr {
      key: "send_device"
      value {
        s: "/job:localhost/replica:0/task:0/cpu:2"
      }
    }
    ...
So you see that there's a Recv op scheduled on cpu:3 (with a matching Send on cpu:2) to transfer data from cpu:2 to cpu:3. Since all CPU devices share memory, this op doesn't actually do anything, but it may do something in the future if TensorFlow becomes NUMA-aware.
Also, you can open timeline.json in a browser under chrome://tracing and see the timing.
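Generating that file from the run metadata uses the timeline helper (a sketch, assuming the run_metadata captured above):

from tensorflow.python.client import timeline

# Convert the collected step stats into Chrome's trace-event format.
tl = timeline.Timeline(run_metadata.step_stats)
with open("timeline.json", "w") as f:
    f.write(tl.generate_chrome_trace_format())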
You can see it runs the two 1024x1024 matrix multiplications in parallel, each taking about 85 ms. That works out to roughly 25 billion floating-point ops per second (2·1024³ ops / 0.085 s), which is decent single-core performance for a two-year-old MacBook.
On the other hand, you can run 6 such matrix multiplications on 6 different CPU devices, and you'll see something like this: I only have 4 physical cores, and 2 of the operations take about 2x longer. Even though they were active on a logical cpu device, there were no available physical cores for the first ~100 ms, so they weren't making any progress.