I'm trying to understand how CPython's GIL works and what are the differences between GIL in CPython 2.7.x and CPython 3.4.x. I'm using this code for benchmarking:
from __future__ import print_function
import argparse
import resource
import sys
import threading
import time
def countdown(n):
while n > 0:
n -= 1
def get_time():
stats = resource.getrusage(resource.RUSAGE_SELF)
total_cpu_time = stats.ru_utime + stats.ru_stime
return time.time(), total_cpu_time, stats.ru_utime, stats.ru_stime
def get_time_diff(start_time, end_time):
return tuple((end-start) for start, end in zip(start_time, end_time))
def main(total_cycles, max_threads, no_headers=False):
header = ("%4s %8s %8s %8s %8s %8s %8s %8s %8s" %
("#t", "seq_r", "seq_c", "seq_u", "seq_s",
"par_r", "par_c", "par_u", "par_s"))
row_format = ("%(threads)4d "
"%(seq_r)8.2f %(seq_c)8.2f %(seq_u)8.2f %(seq_s)8.2f "
"%(par_r)8.2f %(par_c)8.2f %(par_u)8.2f %(par_s)8.2f")
if not no_headers:
print(header)
for thread_count in range(1, max_threads+1):
# We don't care about a few lost cycles
cycles = total_cycles // thread_count
threads = [threading.Thread(target=countdown, args=(cycles,))
for i in range(thread_count)]
start_time = get_time()
for thread in threads:
thread.start()
thread.join()
end_time = get_time()
sequential = get_time_diff(start_time, end_time)
threads = [threading.Thread(target=countdown, args=(cycles,))
for i in range(thread_count)]
start_time = get_time()
for thread in threads:
thread.start()
for thread in threads:
thread.join()
end_time = get_time()
parallel = get_time_diff(start_time, end_time)
print(row_format % {"threads": thread_count,
"seq_r": sequential[0],
"seq_c": sequential[1],
"seq_u": sequential[2],
"seq_s": sequential[3],
"par_r": parallel[0],
"par_c": parallel[1],
"par_u": parallel[2],
"par_s": parallel[3]})
if __name__ == "__main__":
arg_parser = argparse.ArgumentParser()
arg_parser.add_argument("max_threads", nargs="?",
type=int, default=5)
arg_parser.add_argument("total_cycles", nargs="?",
type=int, default=50000000)
arg_parser.add_argument("--no-headers",
action="store_true")
args = arg_parser.parse_args()
sys.exit(main(args.total_cycles, args.max_threads, args.no_headers))
When running this script on my quad-core i5-2500 machine under Ubuntu 14.04 with Python 2.7.6, I get the following results (_r stands for real time, _c for CPU time, _u for user mode, _s for kernel mode):
#t seq_r seq_c seq_u seq_s par_r par_c par_u par_s
1 1.47 1.47 1.47 0.00 1.46 1.46 1.46 0.00
2 1.74 1.74 1.74 0.00 3.33 5.45 3.52 1.93
3 1.87 1.90 1.90 0.00 3.08 6.42 3.77 2.65
4 1.78 1.83 1.83 0.00 3.73 6.18 3.88 2.30
5 1.73 1.79 1.79 0.00 3.74 6.26 3.87 2.39
Now if I bind all threads to one core, the results are very different:
taskset -c 0 python countdown.py
#t seq_r seq_c seq_u seq_s par_r par_c par_u par_s
1 1.46 1.46 1.46 0.00 1.46 1.46 1.46 0.00
2 1.74 1.74 1.73 0.00 1.69 1.68 1.68 0.00
3 1.47 1.47 1.47 0.00 1.58 1.58 1.54 0.04
4 1.74 1.74 1.74 0.00 2.02 2.02 1.87 0.15
5 1.46 1.46 1.46 0.00 1.91 1.90 1.75 0.15
So the question is: why running this Python code on multiple cores is 1.5x-2x slower by wall clock and 4x-5x slower by CPU clock than running it on a single core?
Asking around and googling produced two hypotheses:
Are there any other reasons? I would like to understand what's going on and to be able to back my understanding with numbers (meaning that if the slowdown is due to cache misses, I want to see and compare the numbers for both cases).
Key Takeaways. Python is NOT a single-threaded language. Python processes typically use a single thread because of the GIL. Despite the GIL, libraries that perform computationally heavy tasks like numpy, scipy and pytorch utilise C-based implementations under the hood, allowing the use of multiple cores.
How many CPUs (or cores) will the Python threading library take advantage of simultaneously? Python threading is restricted to a single CPU at one time. The multiprocessing library will allow you to run code on different processors. Python threading is restricted to a single CPU at one time.
When you're doing computationally intensive calculations with NumPy, you'll want to use all your computer's CPUs. Your computer has 2 or 4 or even more CPU cores, and if you can use them all then your code will run faster.
It's due to GIL thrashing when multiple native threads are competing for the GIL. David Beazley's materials on this subject will tell your everything you want to know.
See info here for a nice graphical representation of what is happening.
Python3.2 introduced changes to the GIL that help solve this problem so you should see improved performance with 3.2 and later.
It should also be noted that the GIL is an implementation detail of the cpython reference implementation of the language. Other implementations like Jython do not have GIL and do not suffer this particular problem.
The rest of D. Beazley's info on the GIL will also be helpful to you.
To specifically answer your question about why performance is so much worse when multiple cores are involved, see slide 29-41 of the Inside the GIL presentation. It goes into a detailed discussion on multicore GIL contention as opposed to multiple threads on a single core. Slide 32 specifically shows that the number of system calls due to thread signaling overhead goes through the roof as you add cores. This is because the threads are now running simulatneously on different cores and which allows them to engage in a true GIL battle. As opposed to multiple threads sharing a single CPU. A good summary bullet from the above presentation is:
With multiple cores, CPU-bound threads get scheduled simultaneously (on different cores) and then have a GIL battle.
The GIL prevents several python threads to run concurrently. That means whenever one thread needs to execute Python bytecode (the internal representation of the code), it will acquire the lock (effectively stopping the other threads on the other cores). For this to work, the CPU needs to flush all cache lines. Otherwise, the active thread would operate on stale data.
When you run the threads on a single CPU, no cache flush is necessary.
This should account for most of the slowdown. If you want to run Python code in parallel, you need to use processes and IPC (sockets, semaphores, memory mapped IO). But that can be slow for different reasons (memory must be copied between processes).
Another approach is move more code in a C library which doesn't hold the GIL while it executes. That would allow to execute more code in parallel.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With