Python code performance decreases with threading

I've written a working program in Python that basically parses a batch of binary files, extracting data into a data structure. Each file takes around a second to parse, which translates to hours for thousands of files. I've successfully implemented a threaded version of the batch parsing method with an adjustable number of threads. I tested the method on 100 files with a varying number of threads, timing each run. Here are the results (0 threads refers to my original, pre-threading code, 1 threads to the new version run with a single thread spawned).

0 threads: 83.842 seconds
1 threads: 78.777 seconds
2 threads: 105.032 seconds
3 threads: 109.965 seconds
4 threads: 108.956 seconds
5 threads: 109.646 seconds
6 threads: 109.520 seconds
7 threads: 110.457 seconds
8 threads: 111.658 seconds

Though spawning a thread confers a small performance increase over having the main thread do all the work, increasing the number of threads actually decreases performance. I would have expected to see performance increases, at least up to four threads (one for each of my machine's cores). I know threads have associated overhead, but I didn't think this would matter so much with single-digit numbers of threads.

I've heard of the "global interpreter lock", but as I move up to four threads I do see the corresponding number of cores at work--with two threads two cores show activity during parsing, and so on.

I also tested some different versions of the parsing code to see if my program is IO bound. It doesn't seem to be; just reading in the file takes a relatively small proportion of the time, and processing the file is almost all of it. If I skip the IO and process an already-read version of a file, adding a second thread damages performance and a third thread improves it slightly. I'm just wondering why I can't take advantage of my computer's multiple cores to speed things up. Please post any questions or ways I could clarify.

dpitch40 asked Jul 25 '11 19:07


People also ask

Does threading make Python faster?

Multithreading is not always faster than serial execution. Dispatching a CPU-heavy task to multiple threads won't speed up execution; on the contrary, it might degrade overall performance. Imagine it like this: if you have 10 tasks and each takes 10 seconds, serial execution will take 100 seconds in total.

Is threading efficient in Python?

Threading is efficient in CPython, but threads cannot execute Python code concurrently on different processors/cores. This is probably what was meant. It only affects you if you need to do shared-memory concurrency. Some other Python implementations do not have this problem.

Why is Python not good for multithreading?

Python does support multithreading, but the CPython interpreter does not support true multi-core execution via threads. Python does have a threading library, and the GIL does not prevent threading.

Why is Python threading slow?

This is due to the Python GIL being the bottleneck that prevents threads from running completely concurrently. Better CPU utilisation can be achieved by using ProcessPoolExecutor or multiprocessing.Process, which sidestep the GIL by running code in separate processes.


2 Answers

This is sadly how things are in CPython, mainly due to the Global Interpreter Lock (GIL). Python code that's CPU-bound simply doesn't scale across threads (I/O-bound code, on the other hand, might scale to some extent).
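To see the effect in isolation, here is a minimal sketch (not the asker's parser; `count_down`, `time_serial` and `time_threaded` are hypothetical names made up for illustration). It runs the same pure-Python, CPU-bound loop once serially and once split across threads. On a standard GIL-enabled CPython build, the threaded timing is typically no better than the serial one, and is sometimes worse.

```python
import threading
import time


def count_down(n):
    """Pure-Python CPU-bound loop (a stand-in for parsing work)."""
    while n > 0:
        n -= 1


def time_serial(n, tasks):
    """Run `tasks` countdowns one after another; return elapsed seconds."""
    start = time.perf_counter()
    for _ in range(tasks):
        count_down(n)
    return time.perf_counter() - start


def time_threaded(n, tasks):
    """Run `tasks` countdowns, one per thread; return elapsed seconds."""
    threads = [threading.Thread(target=count_down, args=(n,))
               for _ in range(tasks)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start


if __name__ == "__main__":
    n, tasks = 2_000_000, 4  # arbitrary sizes chosen for the demo
    print("serial:   %.3f s" % time_serial(n, tasks))
    print("threaded: %.3f s" % time_threaded(n, tasks))
```

The exact numbers depend on the machine and the Python version, so none are quoted here.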

There is a highly informative presentation by David Beazley where he discusses some of the issues surrounding the GIL. The video can be found here (thanks @Ikke!)

My recommendation would be to use the multiprocessing module instead of multiple threads.
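A rough sketch of what that could look like for a batch of files, assuming each file can be parsed independently and the result can be pickled and sent back to the parent process. `parse_file` here is a hypothetical stand-in (it just sums little-endian 32-bit integers), not the asker's real parser, and `parse_batch` is likewise an invented helper name.

```python
import multiprocessing
import os
import struct
import tempfile


def parse_file(path):
    """Hypothetical parser: read little-endian uint32 values and sum them."""
    with open(path, "rb") as f:
        data = f.read()
    count = len(data) // 4
    values = struct.unpack("<%dI" % count, data[:count * 4])
    return path, sum(values)


def parse_batch(paths, processes=None):
    """Parse files in worker processes, each with its own interpreter and GIL.

    processes=None lets Pool default to the number of CPUs it detects.
    """
    with multiprocessing.Pool(processes=processes) as pool:
        return dict(pool.map(parse_file, paths))


if __name__ == "__main__":
    # Build a few small sample files so the sketch runs on its own.
    with tempfile.TemporaryDirectory() as tmp:
        paths = []
        for i in range(8):
            path = os.path.join(tmp, "sample%d.bin" % i)
            with open(path, "wb") as f:
                f.write(struct.pack("<3I", i, i + 1, i + 2))
            paths.append(path)
        results = parse_batch(paths, processes=4)
        for path in paths:
            print(os.path.basename(path), results[path])
```

The worker function has to live at module level so it can be pickled, and the `if __name__ == "__main__":` guard matters on platforms that start workers by spawning a fresh interpreter. Results travel back to the parent through pickling, so very large parsed structures can eat into the speedup.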

NPE answered Sep 19 '22 19:09


The threading library does not actually utilize multiple cores simultaneously for computation. For CPU-bound work, you should use the multiprocessing library instead.

stefan answered Sep 23 '22 19:09