Can tensorflow sess.run() really release GIL (global interpreter look) of python?

I want to run multiple train_op in parallel in a TensorFlow session. The answer here says that sess.run() can release the GIL of Python. I tried the example in that answer, but it seems that the GIL is still held. I have 8 GPUs available. When num_threads is 4, it takes 24 seconds; when num_threads is 8, it takes 54 seconds.

Here is the code:

from threading import Thread
import tensorflow as tf
import time

num_threads = 8

a = []
for i in range(num_threads):
    with tf.device('/cpu:0'):
        a.append(tf.get_variable(name='a_%d'%i, shape=[5000, 50, 5, 5, 5, 5], initializer=tf.truncated_normal_initializer()))

b = []
for i in range(num_threads):
    with tf.device('/cpu:0'):
        b.append(tf.get_variable(name='b_%d'%i, shape=[5000, 50, 5, 5, 5, 5], initializer=tf.truncated_normal_initializer()))


train_ops = []
for i in range(num_threads):
    with tf.device('gpu:%d'%i):
        loss = tf.multiply(a[i], b[i], name='loss_%d'%i)
        train_ops.append(tf.train.GradientDescentOptimizer(0.01).minimize(loss))


sess = tf.Session()
sess.run(tf.global_variables_initializer())  # initialize_all_variables is deprecated


def train_function(train_op):
    for i in range(20):
        sess.run(train_op)


train_threads = []
for train_op in train_ops:
    train_threads.append(Thread(target=train_function, args=(train_op,)))

start = time.time()
for t in train_threads:
    t.start()
for t in train_threads:
    t.join()
end = time.time()

print('elapsed time is:', end-start)

My question is whether I implemented the method incorrectly. If this approach cannot release the GIL, then how can the GIL be released?

I know distributed TensorFlow via gRPC can release the GIL, but gRPC is expensive compared to multithreading (like pthreads in C). I want the threads to communicate with each other, and I want to reduce the communication overhead as much as possible. Any answer or hint would be really appreciated!

If there is no way to release the GIL, is it possible to write a C++ extension to do the multithreading? If not, is it possible to use another language that does not have a GIL? Thanks!
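For comparison, one way to sidestep the GIL from pure Python is the standard multiprocessing module: each worker is a separate interpreter process with its own GIL, though arguments and results are pickled between processes, so communication costs more than with shared-memory threads. A minimal sketch (heavy is just an illustrative CPU-bound stand-in, not part of my real code):

```python
# Sketch: CPU-bound work split across processes, each with its own GIL.
from multiprocessing import Pool

def heavy(n):
    # stand-in for a CPU-bound task that Python threads could not overlap
    return sum(i * i for i in range(n))

if __name__ == '__main__':
    with Pool(4) as pool:
        # the four calls run in four separate interpreter processes
        results = pool.map(heavy, [10000] * 4)
    print(results)
```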

asked May 12 '18 by Luochao Wang


1 Answer

TensorFlow does release the GIL, but only while sess.run is executing (see this comment). You are calling sess.run from within code that is itself restricted by the GIL; therefore sess.run is dispatched on each training op sequentially. I believe the GIL release is intended for interactions with tf.py_func.
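To see what a released GIL buys you, here is a minimal plain-Python sketch (no TensorFlow involved; time.sleep merely stands in for a blocking native call that drops the GIL): the four threads overlap almost completely instead of running back to back.

```python
import threading
import time

def fake_native_op():
    # time.sleep releases the GIL while blocking, standing in for a
    # native call that genuinely drops the lock during execution
    time.sleep(0.2)

threads = [threading.Thread(target=fake_native_op) for _ in range(4)]
start = time.time()
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.time() - start
# the four 0.2 s calls overlap: total is ~0.2 s, not ~0.8 s
print('elapsed:', elapsed)
```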

What you are trying to accomplish is already implemented by tensorflow with hardly any extra code. Tensorflow already launches kernels on different devices concurrently.

Your code also has a huge inefficiency: you store the weights on the CPU. This is a major bottleneck. On every iteration the weights are copied to each GPU, and the gradients are copied back to the CPU, where they are updated (i.e. the update happens on the CPU!). As you increase the number of GPUs, you multiply the number of copies, and the CPU update time grows linearly.

I fixed your code to follow the best practices:

import tensorflow as tf
import time

num_threads = 1

n = 5000

a = []
for i in range(num_threads):
    # store each variable on the device that it will be used on
    with tf.device('gpu:%d'%i):
        a.append(tf.get_variable(name='a_%d'%i, shape=[n, 50, 5, 5, 5, 5], initializer=tf.truncated_normal_initializer()))

b = []
for i in range(num_threads):
    with tf.device('gpu:%d'%i):
        b.append(tf.get_variable(name='b_%d'%i, shape=[n, 50, 5, 5, 5, 5], initializer=tf.truncated_normal_initializer()))


train_ops = []
for i in range(num_threads):
    #now when a and b are accessed when the graph is executed
    #the variables will already be in VRAM
    with tf.device('gpu:%d'%i):
        loss = tf.multiply(a[i], b[i], name='loss_%d'%i)
        train_ops.append(tf.train.GradientDescentOptimizer(0.01).minimize(loss))

sess = tf.Session()

sess.run(tf.global_variables_initializer())  # initialize_all_variables is deprecated

#dry run
sess.run(train_ops)

start = time.time()
for i in range(200):
    sess.run(train_ops)
end = time.time()

print('elapsed time is:', end-start)

The runtimes I now get are 3.67962 s and 3.64852 s for the 1- and 2-GPU runs, with 200 iterations instead of 20. I only have access to 2 GPUs so I couldn't test on 4, but the result should be the same.

You can read more about how to use TensorFlow with multiple GPUs on their website. Notice that I also included a dry run. This is required in TensorFlow because the first call to sess.run allocates memory on each GPU. This means the more GPUs you have, the longer the first call takes, so it should be excluded from the timing.
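The dry-run pattern is worth keeping for any benchmark with a one-time setup cost. A minimal, generic sketch (the cached function is illustrative, not TensorFlow-specific): the first call pays the setup, and only the later, steady-state calls are timed.

```python
import functools
import time

@functools.lru_cache(maxsize=None)
def expensive_setup(n):
    # the first call pays a one-time cost, analogous to the first
    # sess.run allocating memory on each GPU; later calls hit the cache
    time.sleep(0.1)
    return n * n

expensive_setup(10)  # dry run: pay the one-time cost up front

start = time.time()
for _ in range(100):
    expensive_setup(10)  # steady-state calls are what we measure
steady = time.time() - start
print('steady-state time for 100 calls:', steady)
```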

answered Oct 09 '22 by McAngus