Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why does this Python script run 4x slower on multiple cores than on a single core

I'm trying to understand how CPython's GIL works and what are the differences between GIL in CPython 2.7.x and CPython 3.4.x. I'm using this code for benchmarking:

from __future__ import print_function

import argparse
import resource
import sys
import threading
import time


def countdown(n):
    while n > 0:
        n -= 1


def get_time():
    stats = resource.getrusage(resource.RUSAGE_SELF)
    total_cpu_time = stats.ru_utime + stats.ru_stime
    return time.time(), total_cpu_time, stats.ru_utime, stats.ru_stime


def get_time_diff(start_time, end_time):
    return tuple((end-start) for start, end in zip(start_time, end_time))


def main(total_cycles, max_threads, no_headers=False):
    header = ("%4s %8s %8s %8s %8s %8s %8s %8s %8s" %
              ("#t", "seq_r", "seq_c", "seq_u", "seq_s",
               "par_r", "par_c", "par_u", "par_s"))
    row_format = ("%(threads)4d "
                  "%(seq_r)8.2f %(seq_c)8.2f %(seq_u)8.2f %(seq_s)8.2f "
                  "%(par_r)8.2f %(par_c)8.2f %(par_u)8.2f %(par_s)8.2f")
    if not no_headers:
        print(header)
    for thread_count in range(1, max_threads+1):
        # We don't care about a few lost cycles
        cycles = total_cycles // thread_count

        threads = [threading.Thread(target=countdown, args=(cycles,))
                   for i in range(thread_count)]

        start_time = get_time()
        for thread in threads:
            thread.start()
            thread.join()
        end_time = get_time()
        sequential = get_time_diff(start_time, end_time)

        threads = [threading.Thread(target=countdown, args=(cycles,))
                   for i in range(thread_count)]
        start_time = get_time()
        for thread in threads:
            thread.start()
        for thread in threads:
            thread.join()
        end_time = get_time()
        parallel = get_time_diff(start_time, end_time)

        print(row_format % {"threads": thread_count,
                            "seq_r": sequential[0],
                            "seq_c": sequential[1],
                            "seq_u": sequential[2],
                            "seq_s": sequential[3],
                            "par_r": parallel[0],
                            "par_c": parallel[1],
                            "par_u": parallel[2],
                            "par_s": parallel[3]})


if __name__ == "__main__":
    arg_parser = argparse.ArgumentParser()
    arg_parser.add_argument("max_threads", nargs="?",
                            type=int, default=5)
    arg_parser.add_argument("total_cycles", nargs="?",
                            type=int, default=50000000)
    arg_parser.add_argument("--no-headers",
                            action="store_true")
    args = arg_parser.parse_args()
    sys.exit(main(args.total_cycles, args.max_threads, args.no_headers))

When running this script on my quad-core i5-2500 machine under Ubuntu 14.04 with Python 2.7.6, I get the following results (_r stands for real time, _c for CPU time, _u for user mode, _s for kernel mode):

  #t    seq_r    seq_c    seq_u    seq_s    par_r    par_c    par_u    par_s
   1     1.47     1.47     1.47     0.00     1.46     1.46     1.46     0.00
   2     1.74     1.74     1.74     0.00     3.33     5.45     3.52     1.93
   3     1.87     1.90     1.90     0.00     3.08     6.42     3.77     2.65
   4     1.78     1.83     1.83     0.00     3.73     6.18     3.88     2.30
   5     1.73     1.79     1.79     0.00     3.74     6.26     3.87     2.39

Now if I bind all threads to one core, the results are very different:

taskset -c 0 python countdown.py 
  #t    seq_r    seq_c    seq_u    seq_s    par_r    par_c    par_u    par_s
   1     1.46     1.46     1.46     0.00     1.46     1.46     1.46     0.00
   2     1.74     1.74     1.73     0.00     1.69     1.68     1.68     0.00
   3     1.47     1.47     1.47     0.00     1.58     1.58     1.54     0.04
   4     1.74     1.74     1.74     0.00     2.02     2.02     1.87     0.15
   5     1.46     1.46     1.46     0.00     1.91     1.90     1.75     0.15

So the question is: why running this Python code on multiple cores is 1.5x-2x slower by wall clock and 4x-5x slower by CPU clock than running it on a single core?

Asking around and googling produced two hypotheses:

  1. When running on multiple cores, a thread can be re-scheduled to run on a different core which means that local cache gets invalidated, hence the slowdown.
  2. The overhead of suspending a thread on one core and activating it on another core is larger than suspending and activating the thread on the same core.

Are there any other reasons? I would like to understand what's going on and to be able to back my understanding with numbers (meaning that if the slowdown is due to cache misses, I want to see and compare the numbers for both cases).

like image 679
user108884 Avatar asked Jul 13 '15 14:07

user108884


People also ask

Can Python take advantage of multiple cores?

Key Takeaways. Python is NOT a single-threaded language. Python processes typically use a single thread because of the GIL. Despite the GIL, libraries that perform computationally heavy tasks like numpy, scipy and pytorch utilise C-based implementations under the hood, allowing the use of multiple cores.

How many CPU cores will Python threading take advantage of simultaneously?

How many CPUs (or cores) will the Python threading library take advantage of simultaneously? Python threading is restricted to a single CPU at one time. The multiprocessing library will allow you to run code on different processors. Python threading is restricted to a single CPU at one time.

Can NumPy use multiple cores?

When you're doing computationally intensive calculations with NumPy, you'll want to use all your computer's CPUs. Your computer has 2 or 4 or even more CPU cores, and if you can use them all then your code will run faster.


2 Answers

It's due to GIL thrashing when multiple native threads are competing for the GIL. David Beazley's materials on this subject will tell your everything you want to know.

See info here for a nice graphical representation of what is happening.

Python3.2 introduced changes to the GIL that help solve this problem so you should see improved performance with 3.2 and later.

It should also be noted that the GIL is an implementation detail of the cpython reference implementation of the language. Other implementations like Jython do not have GIL and do not suffer this particular problem.

The rest of D. Beazley's info on the GIL will also be helpful to you.

To specifically answer your question about why performance is so much worse when multiple cores are involved, see slide 29-41 of the Inside the GIL presentation. It goes into a detailed discussion on multicore GIL contention as opposed to multiple threads on a single core. Slide 32 specifically shows that the number of system calls due to thread signaling overhead goes through the roof as you add cores. This is because the threads are now running simulatneously on different cores and which allows them to engage in a true GIL battle. As opposed to multiple threads sharing a single CPU. A good summary bullet from the above presentation is:

With multiple cores, CPU-bound threads get scheduled simultaneously (on different cores) and then have a GIL battle.

like image 61
mshildt Avatar answered Sep 28 '22 05:09

mshildt


The GIL prevents several python threads to run concurrently. That means whenever one thread needs to execute Python bytecode (the internal representation of the code), it will acquire the lock (effectively stopping the other threads on the other cores). For this to work, the CPU needs to flush all cache lines. Otherwise, the active thread would operate on stale data.

When you run the threads on a single CPU, no cache flush is necessary.

This should account for most of the slowdown. If you want to run Python code in parallel, you need to use processes and IPC (sockets, semaphores, memory mapped IO). But that can be slow for different reasons (memory must be copied between processes).

Another approach is move more code in a C library which doesn't hold the GIL while it executes. That would allow to execute more code in parallel.

like image 21
Aaron Digulla Avatar answered Sep 28 '22 03:09

Aaron Digulla