Can tensorflow sess.run() really release GIL (global interpreter look) of python?

I want to run multiple train_op in parallel in a TensorFlow session. The answer here says that sess.run() can release the GIL of Python. I tried the example in that answer, but it seems that the GIL is still held. I have 8 GPUs available. When num_threads is 4, it takes 24 seconds; when num_threads is 8, it takes 54 seconds.

Here is the code:

from threading import Thread
import tensorflow as tf
import time

num_threads = 8

a = []
for i in range(num_threads):
    with tf.device('/cpu:0'):
        a.append(tf.get_variable(name='a_%d'%i, shape=[5000, 50, 5, 5, 5, 5], initializer=tf.truncated_normal_initializer()))

b = []
for i in range(num_threads):
    with tf.device('/cpu:0'):
        b.append(tf.get_variable(name='b_%d'%i, shape=[5000, 50, 5, 5, 5, 5], initializer=tf.truncated_normal_initializer()))


train_ops = []
for i in range(num_threads):
    with tf.device('gpu:%d'%i):
        loss = tf.multiply(a[i], b[i], name='loss_%d'%i)
        train_ops.append(tf.train.GradientDescentOptimizer(0.01).minimize(loss))


sess = tf.Session()
sess.run(tf.global_variables_initializer())  # initialize_all_variables is deprecated


def train_function(train_op):
    for i in range(20):
        sess.run(train_op)


train_threads = []
for train_op in train_ops:
    train_threads.append(Thread(target=train_function, args=(train_op,)))

start = time.time()
for t in train_threads:
    t.start()
for t in train_threads:
    t.join()
end = time.time()

print('elapsed time is:', end-start)

My question is whether I implemented the method incorrectly. If this approach cannot release the GIL, then how can the GIL be released?

I know distributed TensorFlow via gRPC can release the GIL, but gRPC is expensive compared to multithreading (like pthreads in C). I want the threads to communicate with each other, and I want to reduce the communication overhead as much as possible. Any answer or hint would be really appreciated!

If there is no way to release the GIL, is it possible to write a C++ extension to do the multithreading? If not, is it possible to use another language that does not have a GIL? Thanks!
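For comparison, one way to sidestep the GIL from pure Python is the standard multiprocessing module: each worker is a separate interpreter process with its own GIL, though arguments and results are pickled between processes, so communication costs more than with shared-memory threads. A minimal sketch (heavy is just an illustrative CPU-bound stand-in, not part of my real code):

```python
# Sketch: CPU-bound work split across processes, each with its own GIL.
from multiprocessing import Pool

def heavy(n):
    # stand-in for a CPU-bound task that Python threads could not overlap
    return sum(i * i for i in range(n))

if __name__ == '__main__':
    with Pool(4) as pool:
        # the four calls run in four separate interpreter processes
        results = pool.map(heavy, [10000] * 4)
    print(results)
```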

asked May 12 '18 by Luochao Wang


1 Answer

TensorFlow does release the GIL, but only while sess.run is executing (see this comment). You are calling sess.run from within code that is itself restricted by the GIL; therefore sess.run is dispatched on each training op sequentially. I believe the GIL release is intended for interactions with tf.py_func.
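To see what a released GIL buys you, here is a minimal plain-Python sketch (no TensorFlow involved; time.sleep merely stands in for a blocking native call that drops the GIL): the four threads overlap almost completely instead of running back to back.

```python
import threading
import time

def fake_native_op():
    # time.sleep releases the GIL while blocking, standing in for a
    # native call that genuinely drops the lock during execution
    time.sleep(0.2)

threads = [threading.Thread(target=fake_native_op) for _ in range(4)]
start = time.time()
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.time() - start
# the four 0.2 s calls overlap: total is ~0.2 s, not ~0.8 s
print('elapsed:', elapsed)
```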

What you are trying to accomplish is already implemented by tensorflow with hardly any extra code. Tensorflow already launches kernels on different devices concurrently.

Your code also has a huge inefficiency: you store the weights on the CPU. This is a major bottleneck. On every iteration the weights are copied to each GPU, and the gradients are copied back to the CPU, where they are updated (i.e. the update happens on the CPU!). As you increase the number of GPUs, you multiply the number of copies, and the CPU update time grows linearly.

I fixed your code to follow the best practices:

import tensorflow as tf
import time

num_threads = 1

n = 5000

a = []
for i in range(num_threads):
    # store each variable on the device that it will be used on
    with tf.device('gpu:%d'%i):
        a.append(tf.get_variable(name='a_%d'%i, shape=[n, 50, 5, 5, 5, 5], initializer=tf.truncated_normal_initializer()))

b = []
for i in range(num_threads):
    with tf.device('gpu:%d'%i):
        b.append(tf.get_variable(name='b_%d'%i, shape=[n, 50, 5, 5, 5, 5], initializer=tf.truncated_normal_initializer()))


train_ops = []
for i in range(num_threads):
    #now when a and b are accessed when the graph is executed
    #the variables will already be in VRAM
    with tf.device('gpu:%d'%i):
        loss = tf.multiply(a[i], b[i], name='loss_%d'%i)
        train_ops.append(tf.train.GradientDescentOptimizer(0.01).minimize(loss))

sess = tf.Session()

sess.run(tf.global_variables_initializer())  # initialize_all_variables is deprecated

#dry run
sess.run(train_ops)

start = time.time()
for i in range(200):
    sess.run(train_ops)
end = time.time()

print('elapsed time is:', end-start)

The runtimes I now get are 3.67962 s and 3.64852 s for the 1- and 2-GPU runs, with 200 iterations instead of 20. I only have access to 2 GPUs so I couldn't test on 4, but the result should be the same.

You can read more about how to use TensorFlow with multiple GPUs on their website. Notice that I also included a dry run. This is required in TensorFlow because the first call to sess.run allocates memory on each GPU. This means the more GPUs you have, the longer the first call takes, so it should be excluded from the timing.
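The dry-run pattern is worth keeping for any benchmark with a one-time setup cost. A minimal, generic sketch (the cached function is illustrative, not TensorFlow-specific): the first call pays the setup, and only the later, steady-state calls are timed.

```python
import functools
import time

@functools.lru_cache(maxsize=None)
def expensive_setup(n):
    # the first call pays a one-time cost, analogous to the first
    # sess.run allocating memory on each GPU; later calls hit the cache
    time.sleep(0.1)
    return n * n

expensive_setup(10)  # dry run: pay the one-time cost up front

start = time.time()
for _ in range(100):
    expensive_setup(10)  # steady-state calls are what we measure
steady = time.time() - start
print('steady-state time for 100 calls:', steady)
```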

answered Oct 09 '22 by McAngus