
When, why, and how to call thread.join() in Python?

I have this Python threading code.

import threading

def sum(value):
    # Note: this name shadows the built-in sum(); the thread setup below refers to it.
    total = 0
    for i in range(value+1):
        total += i
    print("I'm done with %d - %d\n" % (value, total))
    return total

r = range(500001, 500000*2, 100)

ts = []
for u in r:
    t = threading.Thread(target=sum, args = (u,))
    ts.append(t)
    t.start()

for t in ts:
    t.join()

When I execute this, hundreds of threads are running.

However, when I move the t.join() right after the t.start(), I have only two threads working.

for u in r:
    t = threading.Thread(target=sum, args = (u,))
    ts.append(t)
    t.start()
    t.join()

I also tested the code without calling t.join() at all, and it seems to work fine.

So when, why, and how should I use thread.join()?

asked Dec 19 '22 18:12 by prosseek



2 Answers

You seem to not understand what Thread.join does. When you call join on a thread, the calling thread blocks until that thread has finished. So with join inside the loop, you wait for each thread to finish before you can start the next one.

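The idea behind join is to wait for other threads before continuing. In your case, you want to wait for all threads to finish at the end of the main program. If you don't, the main thread simply carries on while the workers are still running. Ordinary (non-daemon) threads are still allowed to finish, because the interpreter waits for them before exiting, which is why your version without join appears to work. Daemon threads, on the other hand, are killed when the main program ends. So usually, you should have a loop at the end that joins all created threads, so that nothing after it runs (and the program does not exit) before the workers are done.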
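As a rough illustration of the "start everything, then join everything" pattern, here is a minimal sketch. The `worker` function, the shared `log` list, and the 0.1 second sleep are made up for this example and stand in for real work.

```python
import threading
import time

def worker(name, delay, log):
    # Hypothetical worker: the sleep stands in for real work.
    time.sleep(delay)
    log.append("%s done" % name)

log = []
threads = [
    threading.Thread(target=worker, args=("t%d" % i, 0.1, log))
    for i in range(3)
]

# Start all threads first; start() returns immediately.
for t in threads:
    t.start()
log.append("main: all started")

# Only now wait for them. Each join() blocks until that thread has finished.
for t in threads:
    t.join()
log.append("main: all joined")

print(log)
```

Whatever order the workers finish in, "main: all joined" is always the last entry, because the main thread cannot get past the join loop until every worker is done.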

answered Dec 22 '22 08:12 by poke



Short answer: this one:

for t in ts:
    t.join()

is generally the idiomatic way to wait for a small number of threads. Calling .join means that your main thread waits until the given thread finishes before proceeding in execution. You generally do this after you've started all of the threads.

Longer answer:

len(list(range(500001, 500000*2, 100)))
Out[1]: 5000

You're trying to start 5000 threads at once. It's miraculous your computer is still in one piece!

Your method of .join-ing in the loop that dispatches workers is never going to be able to have more than 2 threads (i.e. only one worker thread) going at once. Your main thread has to wait for each worker thread to finish before moving on to the next one. You've prevented a computer-meltdown, but your code is going to be WAY slower than if you'd just never used threading in the first place!
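To make that concrete, here is a small sketch that counts how many workers are running at the same moment. The helper `peak_concurrency` and its parameters are invented for this illustration, and the sleep stands in for real work.

```python
import threading
import time

def peak_concurrency(join_in_loop, n_threads=5, delay=0.05):
    """Return the highest number of workers that were running simultaneously."""
    lock = threading.Lock()
    state = {"running": 0, "peak": 0}

    def worker():
        with lock:
            state["running"] += 1
            state["peak"] = max(state["peak"], state["running"])
        time.sleep(delay)  # stand-in for real work
        with lock:
            state["running"] -= 1

    threads = []
    for _ in range(n_threads):
        t = threading.Thread(target=worker)
        threads.append(t)
        t.start()
        if join_in_loop:
            t.join()  # blocks here, so the next thread cannot start yet

    # Joining a thread that has already finished returns immediately.
    for t in threads:
        t.join()
    return state["peak"]

print("join inside the loop:", peak_concurrency(join_in_loop=True))
print("join after the loop: ", peak_concurrency(join_in_loop=False))
```

With join inside the loop the peak is always 1 worker. With join after the loop, the workers overlap and the peak is normally the full thread count.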

At this point I'd talk about the GIL, but I'll put that aside for the moment. To limit thread creation to a reasonable number (i.e. more than one, fewer than 5000), what you need is a thread pool. There are various ways to do this. You could roll your own, which is fairly simple with a threading.Semaphore. You could use Python 3.2+'s concurrent.futures package. You could use some 3rd party solution. It's up to you; each is going to have a different API, so I can't really discuss that further.
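For the concurrent.futures route, a minimal sketch could look like the following. The function name `triangular`, the pool size of 8, and the smaller input range are choices made for this example, not something from the question.

```python
from concurrent.futures import ThreadPoolExecutor

def triangular(value):
    # Same computation as the question's function: 0 + 1 + ... + value.
    total = 0
    for i in range(value + 1):
        total += i
    return total

# A smaller range than in the question, to keep the example quick.
values = range(1001, 2000, 100)

# At most 8 worker threads exist at any time; the pool reuses them.
with ThreadPoolExecutor(max_workers=8) as pool:
    # map() returns results in input order; leaving the with-block
    # waits for all submitted work, much like joining every thread.
    results = list(pool.map(triangular, values))

for value, result in zip(values, results):
    print("%d -> %d" % (value, result))
```

Because the pool hands the return values back, there is no need to print inside the worker or to keep your own list of threads to join.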


Obligatory GIL Discussion

CPython programmers have to live with the GIL. The Global Interpreter Lock, in short, means that only one thread can be executing Python bytecode at any one time. This means that on processor-bound tasks (like adding a bunch of numbers), threading will not result in any speed-up. In fact, the overhead involved in setting up and tearing down threads (not to mention context switching) will result in a slowdown. Threading is better positioned to provide gains on I/O-bound tasks, such as retrieving a bunch of URLs.
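The I/O-bound case can be sketched without touching the network. In the example below, `fake_download` is a hypothetical stand-in that just sleeps; like a blocking network read, time.sleep releases the GIL while it waits, so the threads can overlap their waiting.

```python
import threading
import time

def fake_download(delay):
    # Hypothetical stand-in for a blocking network call.
    time.sleep(delay)

delays = [0.2] * 5

# One after another: the waits add up.
start = time.perf_counter()
for d in delays:
    fake_download(d)
serial = time.perf_counter() - start

# One thread per "download": the waits overlap.
start = time.perf_counter()
threads = [threading.Thread(target=fake_download, args=(d,)) for d in delays]
for t in threads:
    t.start()
for t in threads:
    t.join()
threaded = time.perf_counter() - start

print("serial: %.2fs, threaded: %.2fs" % (serial, threaded))
```

The serial run takes about the sum of the delays, while the threaded run takes roughly as long as the single longest delay. Replacing the sleep with a pure-Python number-crunching loop would remove that advantage, for the reason given above.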

multiprocessing and friends sidestep the GIL limitation by, well, using multiple processes. This isn't free: data transfer between processes is expensive, so a lot of care needs to be taken not to write workers that depend on shared state.

answered Dec 22 '22 07:12 by roippi
