I would like to perform the following simple experiment.
I am using Tensorflow. I have a large array (5000x5000 float32 elements). How do I measure how long it actually takes to move this array from RAM to GPU memory?
I understand that I could create some very simple computational graph, run it, and measure how long it took. There are two problems with this, though. First, I am worried that the measured time will be dominated by the computation rather than by moving the data from RAM to GPU. Second, if the computation doesn't involve the big array I mentioned, TensorFlow will prune the computational graph so that the big array isn't in it, and it won't get moved from RAM to the GPU at all.
If a TensorFlow operation has both CPU and GPU implementations, the GPU device is given priority when the operation is placed. If you have more than one GPU, the GPU with the lowest ID is selected by default. Note that TensorFlow does not automatically spread operations across multiple GPUs.
The solution is to construct a simple benchmark in which the memory transfer dominates. To make sure TensorFlow doesn't optimize the transfer away, you can add a tiny operation on the result. The overhead of a tiny op like fill is a couple of microseconds, which is insignificant compared to loading 100MB onto the GPU, which takes more than 5 milliseconds.
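As a back-of-the-envelope sanity check on that figure (my arithmetic sketch, assuming an effective PCIe 3.0 x16 bandwidth of roughly 12 GB/s, which is an assumption, not a measured value):

```python
# 5000 x 5000 float32 elements, 4 bytes each.
size_bytes = 5000 * 5000 * 4            # 100,000,000 bytes, i.e. ~100MB
pcie_bytes_per_sec = 12e9               # assumed effective PCIe 3.0 x16 bandwidth
transfer_ms = size_bytes / pcie_bytes_per_sec * 1e3
print('%d bytes, ~%.1f ms over PCIe' % (size_bytes, transfer_ms))
```

That lands in the same ballpark as the measurements below, which is a useful cross-check that the benchmark is actually bandwidth-bound.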
```python
def feed_gpu_tensor():
    # create_array, args, sess and timeit are defined in
    # tf_numpy_benchmark.py (linked below).
    params0 = create_array()
    with tf.device('/gpu:0'):
        params = tf.placeholder(tf.float32)
        # tiny concat keeps the transfer from being pruned away
        result = tf.concat([params, tf.fill([1], 1.0)], axis=0)
    for i in range(args.num_iters):
        with timeit('feed_gpu_tensor'):
            sess.run(result.op, feed_dict={params: params0})
```
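The `timeit` helper above comes from the benchmark script; a minimal sketch of such a context manager (my assumption of its behavior, not the script's exact implementation) looks like this:

```python
import time
from contextlib import contextmanager

@contextmanager
def timeit(tag):
    # Wall-clock timer that prints the elapsed time of the block it wraps.
    start = time.perf_counter()
    yield
    elapsed_ms = (time.perf_counter() - start) * 1e3
    print('%s: %.3f ms' % (tag, elapsed_ms))
```

Usage: `with timeit('feed_gpu_tensor'): sess.run(...)` prints one timing line per iteration, so you can eyeball warm-up effects and steady-state numbers separately.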
To run this benchmark:
```shell
wget https://github.com/diux-dev/cluster/blob/master/yuxin_numpy/tf_numpy_benchmark.py
python tf_numpy_benchmark.py --benchmark=feed_gpu_tensor
```
I found that on a p3.16xlarge, with tcmalloc (via LD_PRELOAD), this 100MB copy takes about 8 milliseconds.
Also, as a sanity check you can look at timelines. A timeline contains a MEMCPYH2D op, which is the actual CPU->GPU copy; you can use it to confirm that the copy dominates your microbenchmark's step run-time.
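Timelines are dumped in Chrome trace (JSON) format. Here is a hedged sketch of that check against a hypothetical, heavily trimmed trace (real traces contain many more events and fields; the op names and durations below are illustrative, not captured output):

```python
import json

# Hypothetical, trimmed-down Chrome trace; "dur" is in microseconds.
trace_json = json.dumps({"traceEvents": [
    {"name": "MEMCPYH2D", "ph": "X", "dur": 8200},
    {"name": "Fill",      "ph": "X", "dur": 3},
    {"name": "ConcatV2",  "ph": "X", "dur": 12},
]})

durs = {e["name"]: e["dur"] for e in json.loads(trace_json)["traceEvents"]}
h2d_fraction = durs["MEMCPYH2D"] / sum(durs.values())
print("H2D copy is %.1f%% of step time" % (100 * h2d_fraction))
```

If the H2D fraction is close to 100%, the step time is a good proxy for the transfer time; if not, the benchmark is measuring something else too.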
Related issues:
benchmarking D2H and H2D: https://github.com/tensorflow/tensorflow/issues/17204
64-byte aligning input data: https://github.com/tensorflow/tensorflow/issues/17233