
Measuring the time it takes to move data from RAM to GPU memory in TensorFlow

I would like to perform the following simple experiment.

I am using TensorFlow. I have a large array (5000x5000 float32 elements). How do I measure how long it actually takes to move this array from RAM to GPU memory?

I understand that I could create a very simple computational graph, run it, and measure how long it took. There are two problems with this, though. First, I am worried that the measured time will be dominated by the computation rather than by moving the data from RAM to the GPU. Second, if the computation doesn't involve the big array, TensorFlow will simplify the computational graph so that the array isn't in it, and it won't get moved from RAM to the GPU at all.

asked Apr 04 '18 by Deeplearningmaniac

People also ask

Does TensorFlow automatically use GPU?

If a TensorFlow operation has both CPU and GPU implementations, TensorFlow will place it on a GPU device by default. If you have more than one GPU, the GPU with the lowest ID is selected. However, TensorFlow does not automatically spread operations across multiple GPUs.
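
For example, with the TF 1.x API used in the answer below, you can confirm where operations land by enabling device-placement logging; this is a minimal sketch, not code from the original page:

import tensorflow as tf

# Log each op's device assignment to stderr (TF 1.x)
config = tf.ConfigProto(log_device_placement=True)
with tf.Session(config=config) as sess:
  a = tf.constant([1.0, 2.0, 3.0])
  b = tf.constant([4.0, 5.0, 6.0])
  print(sess.run(a + b))  # the log shows which device ran the add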


1 Answer

The solution is to make a simple benchmark in which the memory transfer dominates. To check that TensorFlow doesn't optimize the transfer away, you can add a tiny operation on the result. The overhead of a tiny operation like fill should be a couple of microseconds, which is insignificant compared to loading 100MB into the GPU (a 5000x5000 float32 array is 5000*5000*4 bytes, roughly 100MB), which takes >5 milliseconds.

import tensorflow as tf

def feed_gpu_tensor():
  # ~100MB float32 array created on the host
  params0 = create_array()
  with tf.device('/gpu:0'):
    params = tf.placeholder(tf.float32)
    # concat with a tiny fill so the H2D transfer can't be optimized away
    result = tf.concat([params, tf.fill([1], 1.0)], axis=0)
  for i in range(args.num_iters):
    with timeit('feed_gpu_tensor'):
      # each iteration times one feed_dict transfer plus the tiny op
      sess.run(result.op, feed_dict={params: params0})
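
Here, create_array, args, and sess are defined in the benchmark script linked below, as is the timeit context manager. As a rough sketch of its shape (an assumption, not the script's exact code), timeit can be as simple as:

import contextlib
import time

@contextlib.contextmanager
def timeit(tag):
  # print the wall-clock time of the enclosed block in milliseconds
  start = time.perf_counter()
  yield
  print('%s: %.2f ms' % (tag, (time.perf_counter() - start) * 1000))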

To run this benchmark, you can do the following:

wget https://raw.githubusercontent.com/diux-dev/cluster/master/yuxin_numpy/tf_numpy_benchmark.py
python tf_numpy_benchmark.py --benchmark=feed_gpu_tensor

I found that on a p3.16xlarge instance, with tcmalloc (loaded through LD_PRELOAD), this 100MB copy takes about 8 milliseconds.

Also, as a sanity check, you can look at timelines. A timeline will contain a MEMCPYH2D op, which is the actual CPU->GPU copy, and you can use it to confirm that the copy dominates your microbenchmark's step run-time.

(screenshot: a timeline trace in which MEMCPYH2D dominates the step)
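
A minimal sketch of capturing such a timeline with the TF 1.x tracing API, reusing sess, result, params, and params0 from the snippet above (the benchmark script may generate its timelines differently):

from tensorflow.python.client import timeline

run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()
sess.run(result.op, feed_dict={params: params0},
         options=run_options, run_metadata=run_metadata)

# Write a Chrome trace; load timeline.json in chrome://tracing and look
# for the MEMCPYH2D op to see how long the host->device copy took.
tl = timeline.Timeline(run_metadata.step_stats)
with open('timeline.json', 'w') as f:
  f.write(tl.generate_chrome_trace_format())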

Related issues:

  • benchmarking D2H and H2D: https://github.com/tensorflow/tensorflow/issues/17204

  • 64-byte aligning input data: https://github.com/tensorflow/tensorflow/issues/17233

answered Sep 21 '22 by Yaroslav Bulatov