TensorFlow seems not to use GPU

Question

I use TensorFlow on Windows 8 with Python 3.5. I adapted this short example to check whether GPU support (Titan X) works. Unfortunately, the runtime is the same with the GPU (tf.device("/gpu:0")) and without it (tf.device("/cpu:0")). The Windows CPU monitor shows a CPU load of about 100% during the computation in both cases.

This is the code example:

import numpy as np
import tensorflow as tf
import datetime

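# Number of multiplications to perform
n = 100

# Create random large matrices; the size must be an int (the float 1e3
# triggers a VisibleDeprecationWarning, and an error in newer NumPy)
matrix_size = 1000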
A = np.random.rand(matrix_size, matrix_size).astype('float32')
B = np.random.rand(matrix_size, matrix_size).astype('float32')

# List that collects the result tensors
c1 = []

# Define matrix power: each recursion level adds one tf.matmul,
# so matpow(M, n) builds n products and evaluates to M^(n+1)
def matpow(M, n):
    if n < 1:  # base case: no product left, return M itself
        return M
    else:
        return tf.matmul(M, matpow(M, n-1))

with tf.device("/gpu:0"):
    a = tf.constant(A)
    b = tf.constant(B)
    # compute A^(n+1) and B^(n+1) and store the results in c1
    c1.append(matpow(a, n))
    c1.append(matpow(b, n))

    sum = tf.add_n(c1)

t1 = datetime.datetime.now()
with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
    # Runs the op.
    sess.run(sum)
t2 = datetime.datetime.now()

print("computation time: " + str(t2-t1))

And here is the output for the GPU case:

I c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\stream_executor\dso_loader.cc:128] successfully opened CUDA library cublas64_80.dll locally
I c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\stream_executor\dso_loader.cc:128] successfully opened CUDA library cudnn64_5.dll locally
I c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\stream_executor\dso_loader.cc:128] successfully opened CUDA library cufft64_80.dll locally
I c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\stream_executor\dso_loader.cc:128] successfully opened CUDA library nvcuda.dll locally
I c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\stream_executor\dso_loader.cc:128] successfully opened CUDA library curand64_80.dll locally
C:/Users/schlichting/.spyder-py3/temp.py:16: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
  A = np.random.rand(matrix_size, matrix_size).astype('float32')
C:/Users/schlichting/.spyder-py3/temp.py:17: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
  B = np.random.rand(matrix_size, matrix_size).astype('float32')
I c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\core\common_runtime\gpu\gpu_device.cc:885] Found device 0 with properties:
name: GeForce GTX TITAN X
major: 5 minor: 2 memoryClockRate (GHz) 1.076
pciBusID 0000:01:00.0
Total memory: 12.00GiB
Free memory: 2.40GiB
I c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\core\common_runtime\gpu\gpu_device.cc:906] DMA: 0
I c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\core\common_runtime\gpu\gpu_device.cc:916] 0:   Y
I c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\core\common_runtime\gpu\gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX TITAN X, pci bus id: 0000:01:00.0)
D c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\core\common_runtime\direct_session.cc:255] Device mapping:
/job:localhost/replica:0/task:0/gpu:0 -> device: 0, name: GeForce GTX TITAN X, pci bus id: 0000:01:00.0

Device mapping:
/job:localhost/replica:0/task:0/gpu:0 -> device: 0, name: GeForce GTX TITAN X, pci bus id: 0000:01:00.0

I c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\core\common_runtime\simple_placer.cc:827] MatMul_108: (MatMul)/job:localhost/replica:0/task:0/gpu:0
I c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\core\common_runtime\simple_placer.cc:827] MatMul_109: (MatMul)/job:localhost/replica:0/task:0/gpu:0
I c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\core\common_runtime\simple_placer.cc:827] MatMul_110: (MatMul)/job:localhost/replica:0/task:0/gpu:0
I c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\core\common_runtime\simple_placer.cc:827] MatMul_107: (MatMul)/job:localhost/replica:0/task:0/gpu:0
I c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\core\common_runtime\simple_placer.cc:827] MatMul_103: (MatMul)/job:localhost/replica:0/task:0/gpu:0
I c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\core\common_runtime\simple_placer.cc:827] MatMul_104: (MatMul)/job:localhost/replica:0/task:0/gpu:0
I c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\core\common_runtime\simple_placer.cc:827] MatMul_105: (MatMul)/job:localhost/replica:0/task:0/gpu:0
I c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\core\common_runtime\simple_placer.cc:827] MatMul_106: (MatMul)/job:localhost/replica:0/task:0/gpu:0
I c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\core\common_runtime\simple_placer.cc:827] Const_1: (Const)/job:localhost/replica:0/task:0/gpu:0
I c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\core\common_runtime\simple_placer.cc:827] MatMul_100: (MatMul)/job:localhost/replica:0/task:0/gpu:0
I c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\core\common_runtime\simple_placer.cc:827] MatMul_101: (MatMul)/job:localhost/replica:0/task:0/gpu:0
I c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\core\common_runtime\simple_placer.cc:827] MatMul_102: (MatMul)/job:localhost/replica:0/task:0/gpu:0
Const_1: (Const): /job:localhost/replica:0/task:0/gpu:0


MatMul_100: (MatMul): /job:localhost/replica:0/task:0/gpu:0
MatMul_101: (MatMul): /job:localhost/replica:0/task:0/gpu:0
...
MatMul_198: (MatMul): /job:localhost/replica:0/task:0/gpu:0
MatMul_199: (MatMul): /job:localhost/replica:0/task:0/gpu:0
Const: (Const): /job:localhost/replica:0/task:0/gpu:0
MatMul: (MatMul): /job:localhost/replica:0/task:0/gpu:0
MatMul_1: (MatMul): /job:localhost/replica:0/task:0/gpu:0
MatMul_2: (MatMul): /job:localhost/replica:0/task:0/gpu:0
MatMul_3: (MatMul): /job:localhost/replica:0/task:0/gpu:0
...
MatMul_98: (MatMul): /job:localhost/replica:0/task:0/gpu:0
MatMul_99: (MatMul): /job:localhost/replica:0/task:0/gpu:0
AddN: (AddN): /job:localhost/replica:0/task:0/gpu:0
computation time: 0:00:05.066000

In the CPU case the output is the same, with cpu:0 instead of gpu:0, and the computation time doesn't change. Even when I use more operations, e.g. with a runtime of about 1 minute, the GPU and CPU times are equal. Many thanks in advance!
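For reference, the recursive matpow in the question adds one tf.matmul per level of recursion, so matpow(M, n) builds n products and evaluates to M^(n+1) (the base case returns M itself). With n = 100 and two matrices that is 200 products, which matches MatMul through MatMul_199 in the log. The pure-Python sketch below mirrors the same recursion on a small 2 x 2 matrix to make the count visible; matmul_lists and matpow_counted are hypothetical helpers, not TensorFlow code.

```python
from typing import List

Matrix = List[List[float]]


def matmul_lists(x: Matrix, y: Matrix) -> Matrix:
    """Plain triple-loop matrix product (hypothetical stand-in for tf.matmul)."""
    rows, inner, cols = len(x), len(y), len(y[0])
    return [[sum(x[i][p] * y[p][j] for p in range(inner)) for j in range(cols)]
            for i in range(rows)]


def matpow_counted(M: Matrix, n: int, counter: List[int]) -> Matrix:
    """Same recursion as the question's matpow, counting each product."""
    if n < 1:  # base case: return M itself, no product
        return M
    counter[0] += 1
    return matmul_lists(M, matpow_counted(M, n - 1, counter))


# [[1, 1], [0, 1]] raised to the k-th power is [[1, k], [0, 1]],
# so the result shows which power was actually computed.
counter = [0]
result = matpow_counted([[1.0, 1.0], [0.0, 1.0]], 100, counter)
print(counter[0])  # number of products built for n = 100
print(result)      # [[1.0, 101.0], [0.0, 1.0]], i.e. M^(n+1)
```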

sygi · Accepted Answer

According to the log, in particular the device placement, your code does use the GPU; only the running time is the same. My guess is that:

c1.append(matpow(a, n))
c1.append(matpow(b, n))

is the bottleneck in your code, moving big matrices from GPU memory to RAM over and over. Can you try to:

change the matrix size to 1e4 x 1e4

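import datetime
import tensorflow as tf

matrix_size = 10000  # 1e4 as an int, since tensor shapes must be integers

with tf.device("/gpu:0"):
    A = tf.random_normal([matrix_size, matrix_size])
    B = tf.random_normal([matrix_size, matrix_size])
    C = tf.matmul(A, B)
with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
    t1 = datetime.datetime.now()
    sess.run(C)
    t2 = datetime.datetime.now()

print("computation time: " + str(t2 - t1))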
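Before trying the larger size, a quick back-of-the-envelope check shows it should fit on the card: a dense float32 matrix needs 4 bytes per element, so one 10000 x 10000 matrix takes about 0.37 GiB. The sketch below compares A, B and C against the 2.40GiB of free memory reported in the question's log; float32_matrix_gib is a hypothetical helper, and the estimate ignores any extra workspace TensorFlow or cuBLAS may allocate, so treat it as a lower bound.

```python
def float32_matrix_gib(n):
    """Size in GiB of one dense n x n float32 matrix (4 bytes per element)."""
    return n * n * 4 / 2 ** 30


FREE_GIB = 2.40  # "Free memory: 2.40GiB" as reported in the question's log

for n in (1000, 10000):
    total = 3 * float32_matrix_gib(n)  # A, B and the product C
    print("n={}: {:.3f} GiB for A, B and C, fits in free memory: {}".format(
        n, total, total < FREE_GIB))
```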

chasep255 · Answer

Say, for instance, that creating the TensorFlow session takes 4.9 seconds and the actual calculation takes only 0.1 seconds on the CPU, giving a total of 5.0 seconds. Now say that creating the session on the GPU also takes 4.9 seconds, but the calculation takes 0.01 seconds, giving a total of 4.91 seconds. You would hardly see the difference. Creating the session is a one-time overhead at the start of a program, so you should not include it in your timing. Also, TensorFlow does some compilation/optimization when you call sess.run for the first time, which makes the first run even slower.
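The arithmetic of this example can be checked directly. Using the answer's hypothetical numbers (4.9 s of session setup, 0.1 s vs 0.01 s of calculation), the sketch shows that a 10x faster calculation turns into a difference of under 2% in the end-to-end time once setup is included in the measurement.

```python
# Hypothetical timings taken from the example above (seconds).
SETUP_S = 4.9                          # session creation, same on both devices
COMPUTE_S = {"cpu": 0.1, "gpu": 0.01}  # the actual calculation

totals = {dev: SETUP_S + t for dev, t in COMPUTE_S.items()}
compute_speedup = COMPUTE_S["cpu"] / COMPUTE_S["gpu"]
total_speedup = totals["cpu"] / totals["gpu"]

for dev in ("cpu", "gpu"):
    print("{}: {:.2f} s measured".format(dev, totals[dev]))
print("speedup of the calculation alone: {:.1f}x".format(compute_speedup))
print("speedup as measured with setup included: {:.3f}x".format(total_speedup))
```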

Try timing it like this.

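# `sum` is the tensor built in the question's code
with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
    # Warm-up run: absorbs the one-time setup/optimization and is not timed
    sess.run(sum)
    t1 = datetime.datetime.now()
    for _ in range(1000):
        sess.run(sum)
    t2 = datetime.datetime.now()

print("average time per run: " + str((t2 - t1) / 1000))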
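The same pattern (one untimed warm-up call, then averaging over many calls) can be written as a small framework-agnostic helper. This is a sketch using only the standard library; benchmark is a hypothetical name, and in the TensorFlow case you would pass something like `lambda: sess.run(sum)` as fn. It uses time.perf_counter rather than datetime.datetime.now because perf_counter is the standard library's high-resolution clock for measuring short intervals.

```python
import time
from typing import Callable


def benchmark(fn: Callable[[], object], warmup: int = 1, repeats: int = 100) -> float:
    """Return the mean wall-clock seconds per call of fn, excluding warm-up calls.

    Warm-up calls absorb one-time costs (such as the optimization TensorFlow
    does on the first sess.run), so they are executed but not timed.
    """
    if repeats < 1:
        raise ValueError("repeats must be >= 1")
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    return (time.perf_counter() - start) / repeats


# Usage with a stand-in workload; in the question this would be lambda: sess.run(sum).
calls = []
mean_s = benchmark(lambda: calls.append(sum(range(10000))), warmup=2, repeats=50)
print("mean time per call: {:.6f} s".format(mean_s))
```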

If this doesn't fix it, it might also be that your calculation does not offer enough parallelism for the GPU to really beat the CPU. Increasing the matrix size might bring out the difference.
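One way to gauge how much parallel work there is: a product of two n x n matrices needs n^3 multiply-adds, i.e. about 2n^3 floating-point operations, while it only reads and writes about 3n^2 values. The sketch below (hypothetical helpers matmul_flops and flops_per_byte) computes both for the question's workload (200 products at n = 1000) and for the 1e4 size suggested in the other answer. The work per byte grows linearly with n, which is why larger matrices tend to favour the GPU.

```python
def matmul_flops(n):
    """Approximate floating-point operations for one n x n by n x n product (2 n^3)."""
    return 2 * n ** 3


def flops_per_byte(n, itemsize=4):
    """Work per byte touched, counting both float32 inputs and the output once."""
    return matmul_flops(n) / (3 * n * n * itemsize)


question_total = 200 * matmul_flops(1000)  # 2 matrices x 100 products each
print("question workload: {:.1e} FLOPs".format(question_total))
for n in (1000, 10000):
    print("n={}: {:.1e} FLOPs per product, {:.0f} FLOPs/byte".format(
        n, matmul_flops(n), flops_per_byte(n)))
```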

Tags: python, tensorflow, gpu, python-3.5, tensorflow-gpu

user3641158
