I'm training a CNN model with TensorFlow. I only achieve a GPU utilization of about 60% (+- 2-3%), without big drops.
Sun Oct 23 11:34:26 2016
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.57                 Driver Version: 367.57                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1070     Off | 0000:01:00.0     Off |                  N/A |
|  1%   53C    P2    90W / 170W |   7823MiB /  8113MiB |     60%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      3644    C   /usr/bin/python2.7                            7821MiB |
+-----------------------------------------------------------------------------+
Since it's a Pascal card, I am using CUDA 8 with cuDNN 5.1.5. The CPU usage is around 50% (evenly distributed over 8 threads, on an i7 4770K), so CPU performance should not be the bottleneck.
I'm using TensorFlow's binary file format, which I read with tf.TFRecordReader().
I'm creating batches of images like this:
#Uses tf.TFRecordReader() to read single Example
label, image = read_and_decode_single_example(filename_queue=filename_queue)
image = tf.image.decode_jpeg(image.values[0], channels=3)
jpeg = tf.cast(image, tf.float32) / 255.
jpeg.set_shape([66,200,3])
images_batch, labels_batch = tf.train.shuffle_batch(
    [jpeg, label], batch_size=FLAGS.batch_size,
    num_threads=8,
    capacity=2000,          # tried bigger values here, does not change the performance
    min_after_dequeue=1000) # here too
Here is my training loop:
sess = tf.Session()
sess.run(init)
tf.train.start_queue_runners(sess=sess)
for step in xrange(FLAGS.max_steps):
    labels, images = sess.run([labels_batch, images_batch])
    feed_dict = {images_placeholder: images, labels_placeholder: labels}
    _, loss_value = sess.run([train_op, loss],
                             feed_dict=feed_dict)
I don't have much experience with TensorFlow, and I don't know where the bottleneck could be. If you need any more code snippets to help identify the issue, I will provide them.
UPDATE: Bandwidth test results
==5172== NVPROF is profiling process 5172, command: ./bandwidthtest
Device: GeForce GTX 1070
Transfer size (MB): 3960
Pageable transfers
Host to Device bandwidth (GB/s): 7.066359
Device to Host bandwidth (GB/s): 6.850315
Pinned transfers
Host to Device bandwidth (GB/s): 12.038037
Device to Host bandwidth (GB/s): 12.683915
==5172== Profiling application: ./bandwidthtest
==5172== Profiling result:
Time(%) Time Calls Avg Min Max Name
50.03% 933.34ms 2 466.67ms 327.33ms 606.01ms [CUDA memcpy DtoH]
49.97% 932.32ms 2 466.16ms 344.89ms 587.42ms [CUDA memcpy HtoD]
==5172== API calls:
Time(%) Time Calls Avg Min Max Name
46.60% 1.86597s 4 466.49ms 327.36ms 606.15ms cudaMemcpy
35.43% 1.41863s 2 709.31ms 632.94ms 785.69ms cudaMallocHost
17.89% 716.33ms 2 358.17ms 346.14ms 370.19ms cudaFreeHost
0.04% 1.5572ms 1 1.5572ms 1.5572ms 1.5572ms cudaMalloc
0.02% 708.41us 1 708.41us 708.41us 708.41us cudaFree
0.01% 203.58us 1 203.58us 203.58us 203.58us cudaGetDeviceProperties
0.00% 187.55us 1 187.55us 187.55us 187.55us cuDeviceTotalMem
0.00% 162.41us 91 1.7840us 105ns 61.874us cuDeviceGetAttribute
0.00% 79.979us 4 19.994us 1.9580us 73.537us cudaEventSynchronize
0.00% 77.074us 8 9.6340us 1.5860us 28.925us cudaEventRecord
0.00% 19.282us 1 19.282us 19.282us 19.282us cuDeviceGetName
0.00% 17.891us 4 4.4720us 629ns 8.6080us cudaEventDestroy
0.00% 16.348us 4 4.0870us 818ns 8.8600us cudaEventCreate
0.00% 7.3070us 4 1.8260us 1.7040us 2.0680us cudaEventElapsedTime
0.00% 1.6670us 3 555ns 128ns 1.2720us cuDeviceGetCount
0.00% 813ns 3 271ns 142ns 439ns cuDeviceGet
If you're getting less than 80-90% GPU usage on a demanding workload, you most likely have a CPU bottleneck: the CPU has to feed data to the GPU, and the GPU has nothing to work on if the CPU can't send data fast enough. This typically shows up when a powerful graphics card is paired with a low-end CPU.
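In this setup specifically, one way to keep the GPU busier is to wire the tensors returned by tf.train.shuffle_batch directly into the graph instead of pulling each batch back into Python and pushing it in again through feed_dict, which adds an extra host round trip per step. A minimal sketch, assuming the model-construction code (build_model and compute_loss below are hypothetical names) can consume tensors directly:
# Sketch only (TF 1.x queue input); build_model/compute_loss stand in for
# whatever constructs your network and loss in place of the placeholders.
logits = build_model(images_batch)           # images_batch from tf.train.shuffle_batch
loss = compute_loss(logits, labels_batch)
train_op = optimizer.minimize(loss)          # same optimizer you already use

sess = tf.Session()
sess.run(init)
tf.train.start_queue_runners(sess=sess)

for step in xrange(FLAGS.max_steps):
    # A single run dequeues a batch and applies the training step, so the
    # queue-runner threads keep prefetching and decoding while the GPU computes.
    _, loss_value = sess.run([train_op, loss])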
TensorFlow (TF) GPU 1.6 and above requires a CUDA compute capability of 3.5 or higher and AVX instruction support.
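If you want to check what your card reports, one option is below (a sketch; device_lib is a semi-internal module, but it has been importable in both TF 1.x and 2.x). A GTX 1070 is a Pascal card with compute capability 6.1, so it comfortably meets the 3.5 requirement.
from tensorflow.python.client import device_lib

for d in device_lib.list_local_devices():
    if d.device_type == 'GPU':
        # physical_device_desc ends with something like "compute capability: 6.1"
        print(d.physical_device_desc)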
To limit TensorFlow to a specific set of GPUs, use the tf.config.set_visible_devices method. In some cases it is desirable for the process to only allocate a subset of the available memory, or to only grow the memory usage as is needed by the process.
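For example, with the TF 2.x API named above (the code in the question is TF 1.x session/queue style, so treat this as illustrative only):
import tensorflow as tf

gpus = tf.config.list_physical_devices('GPU')
if gpus:
    # Make only the first GPU visible to this process...
    tf.config.set_visible_devices(gpus[0], 'GPU')
    # ...and let TensorFlow grow its memory use as needed instead of reserving
    # nearly all 8 GiB up front, as in the nvidia-smi output above.
    tf.config.experimental.set_memory_growth(gpus[0], True)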
After getting some more experience with TensorFlow, I realized that the GPU usage heavily depends on the network size, batch size and preprocessing. Using a bigger network with more conv layers (ResNet style, for example) increases the GPU usage because more computation is involved and less overhead (relative to the computation) is produced by transferring data etc.
One potential bottleneck is PCI Express bus usage between the CPU and the GPU when loading the images onto the GPU. You can measure it with tools such as nvprof (as in your bandwidth test update) or nvidia-smi dmon.
Another potential bottleneck is disk IO. I don't see anything in your code that would cause it, but it's always a good idea to keep an eye on it. A rough way to check where each step's time goes is sketched below.
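The sketch times the batch fetch and the training step separately, reusing the names from the question's loop. If the fetch dominates, the input pipeline (disk IO, JPEG decoding, host-to-device copies) is the bottleneck; if the training step dominates, the GPU compute is.
import time

fetch_time, train_time = 0.0, 0.0
steps = 100
for step in xrange(steps):
    t0 = time.time()
    labels, images = sess.run([labels_batch, images_batch])
    t1 = time.time()
    _, loss_value = sess.run([train_op, loss],
                             feed_dict={images_placeholder: images,
                                        labels_placeholder: labels})
    t2 = time.time()
    fetch_time += t1 - t0
    train_time += t2 - t1

print("avg fetch: %.4f s, avg train step: %.4f s" % (fetch_time / steps, train_time / steps))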