multiprocess or threading in python?

I have a Python application that grabs a collection of data and, for each piece of data in that collection, performs a task. The task takes some time to complete because there is a delay involved. Because of this delay, I don't want the pieces of data to perform the task sequentially; I want them all to happen in parallel. Should I be using multiprocessing or threading for this operation?

I attempted to use threading but had some trouble; often some of the tasks would never actually fire.

Ryan asked Aug 04 '09

People also ask

What's the difference between Python threading and multiprocessing?

Multiprocessing increases computing power by running work on two or more CPUs, whereas multithreading runs multiple threads of execution within a single process. In other words, multithreading generates concurrent threads from one process, while multiprocessing adds whole processes that can each use their own CPU.

Is Python good for multithreading?

Python does have a threading library, but on the CPython interpreter the Global Interpreter Lock (GIL) prevents true multi-core execution of Python code via threads: only one thread runs Python bytecode at a time. The GIL does not prevent threading itself, and threads still help when the work is I/O-bound.
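
For instance, here is a minimal sketch (the URL list and fetch function are illustrative, not from the question) showing that threads can still speed up I/O-bound work, since a thread releases the GIL while it waits on the network:

    import threading
    import urllib.request

    # Illustrative URLs; each request spends most of its time waiting on the network.
    urls = ["https://example.com/a", "https://example.com/b", "https://example.com/c"]

    def fetch(url):
        # While this thread blocks on I/O it releases the GIL,
        # so the other threads can make progress in the meantime.
        with urllib.request.urlopen(url) as response:
            print(url, len(response.read()))

    threads = [threading.Thread(target=fetch, args=(u,)) for u in urls]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait for all downloads to finish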

When should I use multiprocessing in Python?

With threads, CPU time gets rationed out between the threads of a single process. Multiprocessing is for times when you really do want more than one thing done at any given moment. Suppose your application needs to connect to 6 databases and perform a complex matrix transformation on each dataset.

Is multiprocessing faster than threading?

Multiprocessing outshines threading in cases where the program is CPU-intensive and doesn't have to do any I/O or user interaction. For example, any program that just crunches numbers will see a massive speedup from multiprocessing; in fact, threading will probably slow it down.
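
As a rough sketch of why (the workload below is made up and absolute numbers will vary by machine), you can run the same pure-CPU function through a thread pool and a process pool and compare:

    import time
    from multiprocessing import Pool
    from multiprocessing.dummy import Pool as ThreadPool  # same API, backed by threads

    def crunch(n):
        # Pure number crunching: no I/O, so threads gain nothing under the GIL.
        return sum(i * i for i in range(n))

    if __name__ == "__main__":
        work = [2_000_000] * 8

        start = time.perf_counter()
        with ThreadPool(4) as tp:
            tp.map(crunch, work)
        print("threads:  ", time.perf_counter() - start)

        start = time.perf_counter()
        with Pool(4) as pp:
            pp.map(crunch, work)
        print("processes:", time.perf_counter() - start)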


2 Answers

If you are truly compute bound, using the multiprocessing module is probably the lightest-weight solution (in terms of both memory consumption and implementation difficulty).

If you are I/O bound, using the threading module will usually give you good results. Make sure that you use thread-safe storage (like a Queue) to hand data to your threads, or else hand each of them a single piece of data that is unique to them when they are spawned.
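
As a minimal sketch of the Queue approach (the worker count and process_item function are placeholders, not part of the answer):

    import threading
    import queue

    task_queue = queue.Queue()

    def process_item(item):
        # Stand-in for the real task; replace with your own delay-bound work.
        print("processed", item)

    def worker():
        while True:
            item = task_queue.get()
            if item is None:            # sentinel: no more work
                task_queue.task_done()
                break
            process_item(item)
            task_queue.task_done()

    threads = [threading.Thread(target=worker) for _ in range(4)]
    for t in threads:
        t.start()

    for item in ["a", "b", "c", "d", "e"]:
        task_queue.put(item)

    for _ in threads:
        task_queue.put(None)            # one sentinel per worker
    task_queue.join()                   # block until every item is processed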

PyPy is focused on performance. It has a number of features that can help with compute-bound processing. It also has support for Software Transactional Memory, although that is not yet production quality. The promise is that you can use simpler parallel or concurrent mechanisms than multiprocessing (which has some awkward requirements).

Stackless Python is also a nice idea, though it has portability issues. Unladen Swallow was promising but is now defunct. Pyston is another (unfinished) Python implementation focused on speed; it takes a different approach from PyPy's, which may yield better (or just different) speedups.

Christopher answered Sep 28 '22


With threading, tasks run sequentially, but you have the illusion that they run in parallel. Threads are good when you use them for file or connection I/O, because they are lightweight.

Multiprocessing with a Pool may be the right solution for you, because processes run in true parallel: each process runs on its own CPU (or core), which makes multiprocessing very good for intensive computing.

Setting up multiprocessing can be very easy:

    from multiprocessing import Pool

    def worker(input_item):
        output = do_some_work(input_item)  # do_some_work stands in for your task
        return output

    pool = Pool()  # makes one process for each CPU (or core) of your PC; use Pool(4) to force 4 processes
    list_of_results = pool.map(worker, input_list)  # launches everything automatically
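
One caveat worth adding to this sketch: on platforms where Python starts worker processes with the spawn method (Windows, and macOS by default on current Python versions), the pool setup must live under an if __name__ == "__main__" guard, otherwise each child re-imports the module and tries to create its own pool:

    from multiprocessing import Pool

    def worker(input_item):
        return do_some_work(input_item)  # do_some_work is your own task function

    if __name__ == "__main__":          # required under the spawn start method
        with Pool() as pool:            # closes the pool cleanly when done
            list_of_results = pool.map(worker, input_list)  # input_list is your collection of data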
Davide Muzzarelli answered Sep 28 '22