 

What exactly is a device in TensorFlow?

Tags:

tensorflow

It would be extremely helpful to have a clear definition of what exactly a device in TensorFlow is. Is a device a single processing unit (i.e., no "real" concurrency possible)?

You can define as many devices as you want by doing the following:

config = tf.ConfigProto(device_count={"CPU": 2},
                        inter_op_parallelism_threads=2,
                        intra_op_parallelism_threads=1)
sess = tf.Session(config=config)
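
One way to confirm what this creates is to list the session's devices (a sketch; Session.list_devices is available in newer TF 1.x releases):

for d in sess.list_devices():
    print(d.name)  # expect .../device:CPU:0 and .../device:CPU:1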

How is it possible that you can define as many devices as you want despite having only one processor with 4 cores?

asked Oct 15 '16 by LaLa

1 Answer

Too long for a comment (perhaps @mrry or @keveman could give an official definition), but here are some observations:

  • A logical device in TensorFlow is a computation unit with its own memory.
  • The TensorFlow scheduler adds Send/Recv ops to copy data to the proper device when data crosses device boundaries.
  • A device is logical, so you can have more logical devices than physical devices (cores), and some ops on available "devices" may be scheduled but sit idle until a physical device frees up. For CPU devices, you may have more threads than you have cores, so the OS thread scheduler selects a subset of threads to run at any given moment.
  • An op scheduled on the logical device tf.device("gpu:0") can keep its data in main memory (i.e., on the physical CPU device), so the logical device boundary is sometimes violated in practice. This is the HostMemory annotation you see in ops like integer Add here. It allows one to do ops like shape manipulation on the logical GPU device and avoid crossing the logical device boundary (Send/Recv ops) even though the data is not stored on the physical GPU device.
  • Using device_count={"CPU": m} ... intra_op_parallelism_threads=n creates multiple Eigen thread pools with n threads each, so you can manually partition your graph to run m ops in parallel, where each op will request n threads. However, you can't run more threads concurrently than you have physical cores, so this may be slow. (A minimal sketch of this setup follows the list.)
  • Logical devices like cpu:0 are not pinned to specific cores, so they can use whichever cores are available.
  • You can see the actual parallelism by looking at timelines.
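
As a minimal sketch of the device_count setup above (TF 1.x API; log_device_placement=True prints where each op actually runs):

import tensorflow as tf

# two logical CPU devices backed by one physical CPU
config = tf.ConfigProto(device_count={"CPU": 2},
                        inter_op_parallelism_threads=2,
                        intra_op_parallelism_threads=1,
                        log_device_placement=True)

with tf.device("cpu:0"):
    x = tf.ones((2, 2))
with tf.device("cpu:1"):
    y = tf.matmul(x, x)  # runs on the second logical device

sess = tf.Session(config=config)
print(sess.run(y))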

Here's an example of creating 8 CPU devices and running 2 matmul's in parallel: https://gist.github.com/yaroslavvb/9a5f4a0b613c79152152b35c0bc840b8

The core graph construction looks like this:

with tf.device("cpu:0"):
    a1 = tf.ones((n, n))
    a2 = tf.ones((n, n))
with tf.device("cpu:1"):
    a3 = tf.matmul(a1, a2)
with tf.device("cpu:2"):
    a4 = tf.matmul(a1, a2)
with tf.device("cpu:3"):
    a5 = tf.matmul(a3, a4)
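
For reference, here is roughly how the gist captures the run_metadata and timeline referenced below (a sketch using the TF 1.x tracing API; the device_count of 4 is an assumption matching the cpu:0..cpu:3 devices above):

from tensorflow.python.client import timeline

config = tf.ConfigProto(device_count={"CPU": 4},
                        inter_op_parallelism_threads=4,
                        intra_op_parallelism_threads=1)
run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE,
                            output_partition_graphs=True)
run_metadata = tf.RunMetadata()

sess = tf.Session(config=config)
sess.run(a5.op, options=run_options, run_metadata=run_metadata)

print(run_metadata.partition_graphs)  # per-device graphs, including Send/Recv ops

# dump a Chrome trace viewable under chrome://tracing
with open("timeline.json", "w") as f:
    f.write(timeline.Timeline(run_metadata.step_stats).generate_chrome_trace_format())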

If you run the gist and look at the run_metadata partition graphs section that's printed, you'll see the Send/Recv ops added to transfer data between CPU devices, i.e., something like this:

partition_graphs {
  node {
    name: "MatMul_1/_11"
    op: "_Recv"
    device: "/job:localhost/replica:0/task:0/cpu:3"
    attr {
      key: "client_terminated"
      value {
        b: false
      }
    }
    attr {
      key: "recv_device"
      value {
        s: "/job:localhost/replica:0/task:0/cpu:3"
      }
    }
    attr {
      key: "send_device"
      value {
        s: "/job:localhost/replica:0/task:0/cpu:2"
      }
    }

So you see there's a Send/Recv pair scheduled to transfer data from cpu:2 to cpu:3 (the snippet shows the _Recv side on cpu:3). Since all CPU devices share memory, this op doesn't do anything, but it may do something in the future if TensorFlow becomes NUMA-aware.

Also, you can open timeline.json in a browser under chrome://tracing and see the timing:

[timeline: two 1024x1024 matmuls running in parallel]

You can see it runs two 1024x1024 matrix multiplications in parallel, each taking about 85 ms. At 2·1024³ ≈ 2.1 GFLOPs per matmul, that comes out to roughly 25 GFLOP/s, decent single-core performance for a 2-year-old MacBook.

On the other hand, you can run 6 such matrix multiplications on 6 different CPU devices, and you'll see something like this:

[timeline: six matmuls on six logical CPU devices, four physical cores]

I only have 4 physical cores, and you can see that 2 of the operations take 2x longer: even though they were active on a logical CPU device, there were no available physical cores for the first 100 ms, so they weren't making any progress.

answered Sep 24 '22 by Yaroslav Bulatov