Python ThreadPool from multiprocessing.pool cannot utilize all CPUs

I have a string processing job in Python, and I wish to speed it up by using a thread pool. The strings have no dependency on each other, and the results will be stored in a MongoDB database.

I wrote my code as follows:

import multiprocessing
from multiprocessing.pool import ThreadPool

def _process(s):
    # Do stuff: pure Python string manipulation.
    # Save the output to a database (PyMongo).
    pass

# string_list holds the input strings to process.
thread_pool_size = multiprocessing.cpu_count()
pool = ThreadPool(thread_pool_size)
for single_string in string_list:
    pool.apply_async(_process, [single_string])
pool.close()
pool.join()

I ran the code on a Linux machine with 8 CPU cores, and it turns out that the maximum CPU usage is only around 130% (read from top) after the job has been running for a few minutes.

Is a thread pool the correct approach here? Is there a better way to do this?

asked Apr 28 '15 by Ivor Zhou

2 Answers

Perhaps _process isn't CPU bound; it might be slowed by the file system or network while writing to the database. You can check whether CPU usage rises when you make the function truly CPU bound, for example:

def _process(s):
    # Busy loop: pure CPU work with no I/O.
    # (xrange is Python 2; use range on Python 3.)
    for i in xrange(100000000):
        j = i * i
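
If top still shows usage stuck near a single core with this purely CPU-bound body, the threads are being serialized by Python's Global Interpreter Lock, which points to the process-based approach described in the other answer.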
answered Sep 22 '22 by 101


You might try using multiple processes instead of multiple threads. Here is a good comparison of both options. As one of the comments there points out, Python cannot run pure-Python code on multiple CPUs across multiple threads (due to the Global Interpreter Lock). So instead of a thread pool you should use a process pool to take full advantage of your machine; a minimal sketch follows.
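
A minimal sketch of that change, reusing the string_list and _process names from the question (the MongoDB handling inside _process is only sketched here). Note that _process must be defined at module top level so it can be pickled and sent to the worker processes:

import multiprocessing

def _process(s):
    # Pure Python string manipulation, then save the output to MongoDB.
    # Create the MongoClient inside the worker: PyMongo clients should
    # not be shared across forked processes.
    pass

if __name__ == '__main__':
    string_list = []  # the strings to process, as in the question
    pool = multiprocessing.Pool(processes=multiprocessing.cpu_count())
    for single_string in string_list:
        pool.apply_async(_process, [single_string])
    pool.close()
    pool.join()

Unlike threads, each worker here is a separate process with its own interpreter and GIL, so CPU-bound work can scale across all 8 cores; the trade-off is that arguments and results are pickled between processes.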

answered Sep 22 '22 by RaJa