I'm training a CNN model with TensorFlow. I only achieve a GPU utilization of about 60% (+- 2-3%), without big drops.
Sun Oct 23 11:34:26 2016
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.57                 Driver Version: 367.57                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1070     Off | 0000:01:00.0     Off |                  N/A |
|  1%   53C    P2    90W / 170W |   7823MiB /  8113MiB |     60%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      3644    C   /usr/bin/python2.7                            7821MiB |
+-----------------------------------------------------------------------------+
Since it's a Pascal card, I am using CUDA 8 with cuDNN 5.1.5. The CPU usage is around 50% (evenly distributed over 8 threads, on an i7 4770K), so CPU performance should not be the bottleneck.
I'm using TensorFlow's binary file format, which I read with tf.TFRecordReader().
I'm creating batches of images like this:
#Uses tf.TFRecordReader() to read single Example
label, image = read_and_decode_single_example(filename_queue=filename_queue)
image = tf.image.decode_jpeg(image.values[0], channels=3)
jpeg = tf.cast(image, tf.float32) / 255.
jpeg.set_shape([66,200,3])
images_batch, labels_batch = tf.train.shuffle_batch(
    [jpeg, label], batch_size=FLAGS.batch_size,
    num_threads=8,
    capacity=2000,          # tried bigger values here, does not change the performance
    min_after_dequeue=1000) # here too
Here is my training loop:
sess = tf.Session()
sess.run(init)
tf.train.start_queue_runners(sess=sess)
for step in xrange(FLAGS.max_steps):
    labels, images = sess.run([labels_batch, images_batch])
    feed_dict = {images_placeholder: images, labels_placeholder: labels}
    _, loss_value = sess.run([train_op, loss],
                             feed_dict=feed_dict)
I don't have much experience with TensorFlow, and I don't know where the bottleneck could be. If you need any more code snippets to help identify the issue, I will provide them.
UPDATE: Bandwidth test results
==5172== NVPROF is profiling process 5172, command: ./bandwidthtest
Device: GeForce GTX 1070
Transfer size (MB): 3960
Pageable transfers
Host to Device bandwidth (GB/s): 7.066359
Device to Host bandwidth (GB/s): 6.850315
Pinned transfers
Host to Device bandwidth (GB/s): 12.038037
Device to Host bandwidth (GB/s): 12.683915
==5172== Profiling application: ./bandwidthtest
==5172== Profiling result:
Time(%) Time Calls Avg Min Max Name
50.03% 933.34ms 2 466.67ms 327.33ms 606.01ms [CUDA memcpy DtoH]
49.97% 932.32ms 2 466.16ms 344.89ms 587.42ms [CUDA memcpy HtoD]
==5172== API calls:
Time(%) Time Calls Avg Min Max Name
46.60% 1.86597s 4 466.49ms 327.36ms 606.15ms cudaMemcpy
35.43% 1.41863s 2 709.31ms 632.94ms 785.69ms cudaMallocHost
17.89% 716.33ms 2 358.17ms 346.14ms 370.19ms cudaFreeHost
0.04% 1.5572ms 1 1.5572ms 1.5572ms 1.5572ms cudaMalloc
0.02% 708.41us 1 708.41us 708.41us 708.41us cudaFree
0.01% 203.58us 1 203.58us 203.58us 203.58us cudaGetDeviceProperties
0.00% 187.55us 1 187.55us 187.55us 187.55us cuDeviceTotalMem
0.00% 162.41us 91 1.7840us 105ns 61.874us cuDeviceGetAttribute
0.00% 79.979us 4 19.994us 1.9580us 73.537us cudaEventSynchronize
0.00% 77.074us 8 9.6340us 1.5860us 28.925us cudaEventRecord
0.00% 19.282us 1 19.282us 19.282us 19.282us cuDeviceGetName
0.00% 17.891us 4 4.4720us 629ns 8.6080us cudaEventDestroy
0.00% 16.348us 4 4.0870us 818ns 8.8600us cudaEventCreate
0.00% 7.3070us 4 1.8260us 1.7040us 2.0680us cudaEventElapsedTime
0.00% 1.6670us 3 555ns 128ns 1.2720us cuDeviceGetCount
0.00% 813ns 3 271ns 142ns 439ns cuDeviceGet
If you're getting less than 80-90% GPU usage on a demanding workload, you most likely have a CPU bottleneck: the CPU has to feed data to the GPU, and the GPU has nothing to work on if the CPU can't send data fast enough. This typically shows up when a powerful graphics card is paired with a low-end CPU.
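In this setup specifically, one way to keep the GPU busier is to wire the tensors returned by tf.train.shuffle_batch directly into the graph instead of pulling each batch back into Python and pushing it in again through feed_dict, which adds an extra host round trip per step. A minimal sketch, assuming the model-construction code (build_model and compute_loss below are hypothetical names) can consume tensors directly:
# Sketch only (TF 1.x queue input); build_model/compute_loss stand in for
# whatever constructs your network and loss in place of the placeholders.
logits = build_model(images_batch)           # images_batch from tf.train.shuffle_batch
loss = compute_loss(logits, labels_batch)
train_op = optimizer.minimize(loss)          # same optimizer you already use

sess = tf.Session()
sess.run(init)
tf.train.start_queue_runners(sess=sess)

for step in xrange(FLAGS.max_steps):
    # A single run dequeues a batch and applies the training step, so the
    # queue-runner threads keep prefetching and decoding while the GPU computes.
    _, loss_value = sess.run([train_op, loss])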
TensorFlow (TF) GPU 1.6 and above requires a CUDA compute capability of 3.5 or higher and AVX instruction support.
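If you want to check what your card reports, one option is below (a sketch; device_lib is a semi-internal module, but it has been importable in both TF 1.x and 2.x). A GTX 1070 is a Pascal card with compute capability 6.1, so it comfortably meets the 3.5 requirement.
from tensorflow.python.client import device_lib

for d in device_lib.list_local_devices():
    if d.device_type == 'GPU':
        # physical_device_desc ends with something like "compute capability: 6.1"
        print(d.physical_device_desc)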
To limit TensorFlow to a specific set of GPUs, use the tf.config.set_visible_devices method. In some cases it is desirable for the process to only allocate a subset of the available memory, or to only grow the memory usage as is needed by the process.
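For example, with the TF 2.x API named above (the code in the question is TF 1.x session/queue style, so treat this as illustrative only):
import tensorflow as tf

gpus = tf.config.list_physical_devices('GPU')
if gpus:
    # Make only the first GPU visible to this process...
    tf.config.set_visible_devices(gpus[0], 'GPU')
    # ...and let TensorFlow grow its memory use as needed instead of reserving
    # nearly all 8 GiB up front, as in the nvidia-smi output above.
    tf.config.experimental.set_memory_growth(gpus[0], True)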
After getting some more experience with TensorFlow, I realized that the GPU usage heavily depends on the network size, batch size and preprocessing. Using a bigger network with more conv layers (ResNet style, for example) increases the GPU usage because more computation is involved and less overhead (relative to the computation) is produced by transferring data etc.
One potential bottleneck is PCI Express bus usage between the CPU and the GPU when loading the images onto the GPU. You can measure it with tools such as nvprof (as in your bandwidth test update) or nvidia-smi dmon.
Another potential bottleneck is disk IO. I don't see anything in your code that would cause it, but it's always a good idea to keep an eye on it. A rough way to check where each step's time goes is sketched below.
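The sketch times the batch fetch and the training step separately, reusing the names from the question's loop. If the fetch dominates, the input pipeline (disk IO, JPEG decoding, host-to-device copies) is the bottleneck; if the training step dominates, the GPU compute is.
import time

fetch_time, train_time = 0.0, 0.0
steps = 100
for step in xrange(steps):
    t0 = time.time()
    labels, images = sess.run([labels_batch, images_batch])
    t1 = time.time()
    _, loss_value = sess.run([train_op, loss],
                             feed_dict={images_placeholder: images,
                                        labels_placeholder: labels})
    t2 = time.time()
    fetch_time += t1 - t0
    train_time += t2 - t1

print("avg fetch: %.4f s, avg train step: %.4f s" % (fetch_time / steps, train_time / steps))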