Why does this Python script run 4x slower on multiple cores than on a single core

Tags:

I'm trying to understand how CPython's GIL works and what are the differences between GIL in CPython 2.7.x and CPython 3.4.x. I'm using this code for benchmarking:

from __future__ import print_function

import argparse
import resource
import sys
import threading
import time


def countdown(n):
    while n > 0:
        n -= 1


def get_time():
    stats = resource.getrusage(resource.RUSAGE_SELF)
    total_cpu_time = stats.ru_utime + stats.ru_stime
    return time.time(), total_cpu_time, stats.ru_utime, stats.ru_stime


def get_time_diff(start_time, end_time):
    return tuple((end-start) for start, end in zip(start_time, end_time))


def main(total_cycles, max_threads, no_headers=False):
    header = ("%4s %8s %8s %8s %8s %8s %8s %8s %8s" %
              ("#t", "seq_r", "seq_c", "seq_u", "seq_s",
               "par_r", "par_c", "par_u", "par_s"))
    row_format = ("%(threads)4d "
                  "%(seq_r)8.2f %(seq_c)8.2f %(seq_u)8.2f %(seq_s)8.2f "
                  "%(par_r)8.2f %(par_c)8.2f %(par_u)8.2f %(par_s)8.2f")
    if not no_headers:
        print(header)
    for thread_count in range(1, max_threads+1):
        # We don't care about a few lost cycles
        cycles = total_cycles // thread_count

        threads = [threading.Thread(target=countdown, args=(cycles,))
                   for i in range(thread_count)]

        start_time = get_time()
        for thread in threads:
            thread.start()
            thread.join()
        end_time = get_time()
        sequential = get_time_diff(start_time, end_time)

        threads = [threading.Thread(target=countdown, args=(cycles,))
                   for i in range(thread_count)]
        start_time = get_time()
        for thread in threads:
            thread.start()
        for thread in threads:
            thread.join()
        end_time = get_time()
        parallel = get_time_diff(start_time, end_time)

        print(row_format % {"threads": thread_count,
                            "seq_r": sequential[0],
                            "seq_c": sequential[1],
                            "seq_u": sequential[2],
                            "seq_s": sequential[3],
                            "par_r": parallel[0],
                            "par_c": parallel[1],
                            "par_u": parallel[2],
                            "par_s": parallel[3]})


if __name__ == "__main__":
    arg_parser = argparse.ArgumentParser()
    arg_parser.add_argument("max_threads", nargs="?",
                            type=int, default=5)
    arg_parser.add_argument("total_cycles", nargs="?",
                            type=int, default=50000000)
    arg_parser.add_argument("--no-headers",
                            action="store_true")
    args = arg_parser.parse_args()
    sys.exit(main(args.total_cycles, args.max_threads, args.no_headers))

When running this script on my quad-core i5-2500 machine under Ubuntu 14.04 with Python 2.7.6, I get the following results (_r stands for real time, _c for CPU time, _u for user mode, _s for kernel mode):

  #t    seq_r    seq_c    seq_u    seq_s    par_r    par_c    par_u    par_s
   1     1.47     1.47     1.47     0.00     1.46     1.46     1.46     0.00
   2     1.74     1.74     1.74     0.00     3.33     5.45     3.52     1.93
   3     1.87     1.90     1.90     0.00     3.08     6.42     3.77     2.65
   4     1.78     1.83     1.83     0.00     3.73     6.18     3.88     2.30
   5     1.73     1.79     1.79     0.00     3.74     6.26     3.87     2.39

Now if I bind all threads to one core, the results are very different:

taskset -c 0 python countdown.py 
  #t    seq_r    seq_c    seq_u    seq_s    par_r    par_c    par_u    par_s
   1     1.46     1.46     1.46     0.00     1.46     1.46     1.46     0.00
   2     1.74     1.74     1.73     0.00     1.69     1.68     1.68     0.00
   3     1.47     1.47     1.47     0.00     1.58     1.58     1.54     0.04
   4     1.74     1.74     1.74     0.00     2.02     2.02     1.87     0.15
   5     1.46     1.46     1.46     0.00     1.91     1.90     1.75     0.15

So the question is: why running this Python code on multiple cores is 1.5x-2x slower by wall clock and 4x-5x slower by CPU clock than running it on a single core?

Asking around and googling produced two hypotheses:

When running on multiple cores, a thread can be re-scheduled to run on a different core which means that local cache gets invalidated, hence the slowdown.
The overhead of suspending a thread on one core and activating it on another core is larger than suspending and activating the thread on the same core.

Are there any other reasons? I would like to understand what's going on and to be able to back my understanding with numbers (meaning that if the slowdown is due to cache misses, I want to see and compare the numbers for both cases).

679

asked Jul 13 '15 14:07

user108884

2 Answers

It's due to GIL thrashing when multiple native threads are competing for the GIL. David Beazley's materials on this subject will tell your everything you want to know.

See info here for a nice graphical representation of what is happening.

Python3.2 introduced changes to the GIL that help solve this problem so you should see improved performance with 3.2 and later.

It should also be noted that the GIL is an implementation detail of the cpython reference implementation of the language. Other implementations like Jython do not have GIL and do not suffer this particular problem.

The rest of D. Beazley's info on the GIL will also be helpful to you.

To specifically answer your question about why performance is so much worse when multiple cores are involved, see slide 29-41 of the Inside the GIL presentation. It goes into a detailed discussion on multicore GIL contention as opposed to multiple threads on a single core. Slide 32 specifically shows that the number of system calls due to thread signaling overhead goes through the roof as you add cores. This is because the threads are now running simulatneously on different cores and which allows them to engage in a true GIL battle. As opposed to multiple threads sharing a single CPU. A good summary bullet from the above presentation is:

With multiple cores, CPU-bound threads get scheduled simultaneously (on different cores) and then have a GIL battle.

answered Sep 28 '22 05:09

mshildt

The GIL prevents several python threads to run concurrently. That means whenever one thread needs to execute Python bytecode (the internal representation of the code), it will acquire the lock (effectively stopping the other threads on the other cores). For this to work, the CPU needs to flush all cache lines. Otherwise, the active thread would operate on stale data.

When you run the threads on a single CPU, no cache flush is necessary.

This should account for most of the slowdown. If you want to run Python code in parallel, you need to use processes and IPC (sockets, semaphores, memory mapped IO). But that can be slow for different reasons (memory must be copied between processes).

Another approach is move more code in a C library which doesn't hold the GIL while it executes. That would allow to execute more code in parallel.

answered Sep 28 '22 03:09

Aaron Digulla

Related questions
                            
                                'NoneType' object has no attribute 'unique' BUT I'm not using the 'unique' attribute
                            
                                How to implement a memory intensive python script for test
                            
                                matplotlib fill_between does not cycle through colours
                            
                                How to find a list in docx using python?
                            
                                API key encryption for Github?
                            
                                Setup and teardown functions executed once for all nosetests tests
                            
                                Celery/RabbitMQ unacked messages blocking queue?
                            
                                AttributeError: 'str' object has no attribute 'decode'
                            
                                How to redirect python flask werkzeug logs to a log file when using TimedRotatingHandler?
                            
                                Different login views in Pyramid
                            
                                Can't call result() on futures in tornado
                            
                                How to connect to socket.io using python
                            
                                Sending ASCII Command using PySerial
                            
                                How does Python access OS API functions such as socket()?
                            
                                How can I read successive arrays from a binary file using `np.fromfile`?
                            
                                When it is necessary to close a file and when it is not in python?
                            
                                How to use Event objects in python with a thread?
                            
                                Any pyinstaller detailed example about hidden import for psutil?
                            
                                Python Mechanize log into Facebook cookie error
                            
                                Python Selenium create loop to click through links on a page and press button on each new page

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Why does this Python script run 4x slower on multiple cores than on a single core

Tags:

python

multithreading

cpython

gil

user108884

People also ask

2 Answers

mshildt

Aaron Digulla

Recent Activity

Donate For Us