
Python multiprocessing queues slower than pool.map

I recently started experimenting with multiprocessing to speed up a task. I created a script that does fuzzy string matching and calculates scores using different algorithms (I wanted to compare different matching techniques). You can find the full source here: https://bitbucket.org/bergonzzi/fuzzy-compare/src. As input it takes 2 files which are combined into pairs (each line of file1 with each line of file2). For each pair, fuzzy match scores are calculated.

I made 3 versions. Running with the sample data provided in my repo (which consists of 697,340 items after being combined into pairs), I have the following timings:

  • Simple single process - 0:00:47
  • Multiprocess using Pool.map() - 0:00:13
  • Multiprocess using Queues (producer/consumer pattern) - 0:01:04

I'm trying to understand why my Pool.map() version is much faster than my Queue version, which is actually slower than the simple single-process one.

My reasoning for even attempting to use Queues is that the Pool.map() version holds on to the results until everything's finished and only writes to a file at the end. This means that for big files it ends up eating a lot of memory. I'm talking about this version (linking to it because it's a lot of code to paste here).
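
To make the memory issue concrete, the Pool.map() version roughly follows this pattern (a stripped-down sketch with dummy data, not the actual code from my repo):

import multiprocessing

def score_pair(pair):
    # placeholder for the real fuzzy matching calculations
    term1, term2 = pair
    return [term1, term2, str(len(term1) + len(term2))]

if __name__ == '__main__':
    list1 = ['foo', 'bar']  # stands in for the lines of file1
    list2 = ['baz', 'qux']  # stands in for the lines of file2

    # every pair is built up front and kept in memory
    pairs = [(x, y) for x in list1 for y in list2]

    pool = multiprocessing.Pool(processes=4)
    results = pool.map(score_pair, pairs)  # blocks until *all* results exist in memory
    pool.close()
    pool.join()

    # only now, at the very end, does anything get written to disk
    with open('out.csv', 'w') as out:
        for result in results:
            out.write(';'.join(result) + '\n')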

To solve this I refactored it into a producer/consumer pattern (or attempted to, at least). Here I first produce jobs by combining both input files and put them in a queue, which the consumers process (calculating fuzzy match scores). Finished jobs are put into an output queue. Then I have a single process grabbing done items from this queue and writing them to a file. This way, in theory, I wouldn't need as much memory since results would be flushed out to disk. It seems to work fine, but it's much slower. I also noticed that the 4 processes I'm spawning don't seem to use 100% CPU when I look at Activity Monitor on Mac OS X (which is not the case with the Pool.map() version).

Another thing I notice is that my producer function seems to fill up the queue properly but the consumer processes seem to wait until the queue is filled up instead of starting to work as soon as the first item arrives. I'm probably doing something wrong there...

For reference here's some of the relevant code for the Queue version (although it's better to look at the full code in the repo linked above).

Here's my producer function:

def combine(list1, list2):
    '''
    Combine every item of list1 with every item of list2,
    normalize the strings and put each pair in the job queue.
    '''
    pname = multiprocessing.current_process().name
    for x in list1:
        for y in list2:
            # slugify is a function to normalize the strings
            term1 = slugify(x.strip(), separator=' ')
            term2 = slugify(y.strip(), separator=' ')
            job_queue.put_nowait([term1, term2])

This is the writer function:

def writer(writer_queue):
    out = open(file_out, 'wb')
    pname = multiprocessing.current_process().name
    out.write(header)
    for match in iter(writer_queue.get, "STOP"):
        print("%s is writing %s") % (pname, str(match))
        line = str(';'.join(match) + '\n')
        out.write(line)
    out.close()

This is the worker function that does the actual calculations (I've stripped out most of the code since it doesn't make a difference here; the full source is in the repo):

def score_it(job_queue, writer_queue):
    '''Calculate scores for pair of words.'''
    pname = multiprocessing.current_process().name

    for pair in iter(job_queue.get_nowait, "STOP"):
        # do all the calculations and put the result into the writer queue
        writer_queue.put(result)

This is how I set up the processes:

# Files
to_match = open(args.file_to_match).readlines()
source_list = open(args.file_to_be_matched).readlines()

workers = 4
job_queue = multiprocessing.Manager().Queue()
writer_queue = multiprocessing.Manager().Queue()
processes = []

print('Start matching with "%s", minimum score of %s and %s workers') % (
    args.algorithm, minscore, workers)

# Fill up job queue
print("Filling up job queue with term pairs...")
c = multiprocessing.Process(target=combine, name="Feeder", args=(to_match, source_list))
c.start()
c.join()

print("Job queue size: %s") % job_queue.qsize()

# Start writer process
w = multiprocessing.Process(target=writer, name="Writer", args=(writer_queue,))
w.start()

for w in xrange(workers):
    p = multiprocessing.Process(target=score_it, args=(job_queue, writer_queue))
    p.start()
    processes.append(p)
    job_queue.put("STOP")

for p in processes:
    p.join()

writer_queue.put("STOP")

I've read quite a bit here about multiprocessing sometimes being slower, and I know this has to do with the overhead of creating and managing new processes. Also, when the job to be done isn't "big" enough, the effect of multiprocessing might not be visible. In this case, though, I think the job is quite big, and the Pool.map() version seems to prove it, since it's much faster.

Am I doing something really wrong in how I manage all these processes and pass the queue objects around? How can this be optimised so that results are written to a file as they're processed, in order to minimise the amount of memory required while it's running?

Thanks!

asked Nov 15 '14 by bergonzzi



1 Answer

I think the issue with your timings is that your Queue version is missing an optimization. You made a comment essentially saying that your job_queue fills up before the worker processes start taking jobs from it. I believe the reason for this is the c.join() you have in the "Fill up job queue" section. This prevents the main process from continuing until the job queue is full. I'd move the c.join() to the end, after the p.join()'s. You'll also need to figure out a way to get your stop flags onto the end of the queue. The combine function might be a good place to put this: something along the lines of adding one stop flag per worker after it has run out of data to combine.
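
For example, the end of combine could look something like this (just a sketch; it assumes the number of workers is passed in to the producer, which the original function doesn't do):

def combine(list1, list2, workers):
    '''
    Combine every item of list1 with every item of list2,
    normalize the strings, put each pair in the job queue,
    and finally add one STOP flag per worker.
    '''
    for x in list1:
        for y in list2:
            # slugify normalizes the strings, as in the original code
            term1 = slugify(x.strip(), separator=' ')
            term2 = slugify(y.strip(), separator=' ')
            job_queue.put([term1, term2])
    # no more data to combine: tell each worker it can exit
    # (workers is an extra argument, not part of the original signature)
    for _ in range(workers):
        job_queue.put("STOP")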

One other thing to note: you're writing over your w variable within the scope of the for loop that kicks off the p processes. As a matter of style/readability, I'd change w to a different variable name. If you're not using it, an underscore works well as a throwaway variable name, i.e.

for w in xrange(workers):

should become

for _ in xrange(workers):

Long story short, if you move the c.join() to the end, you should get more accurate timings. Currently, the only thing that's actually running in parallel is the fuzzy matching of strings. One of the advantages of the producer/consumer pattern is that the consumer processes don't have to wait until the producer is finished, so you end up using less memory.
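
Put together, the startup would look roughly like this (again just a sketch, not tested against your repo; it also assumes the workers switch from job_queue.get_nowait to a blocking job_queue.get, otherwise they'd raise queue.Empty whenever they briefly outrun the producer):

# start the producer, but don't join it yet
c = multiprocessing.Process(target=combine, name="Feeder",
                            args=(to_match, source_list, workers))
c.start()

# start the writer
w = multiprocessing.Process(target=writer, name="Writer", args=(writer_queue,))
w.start()

# start the workers right away so they consume while the producer is still producing
processes = []
for _ in range(workers):
    p = multiprocessing.Process(target=score_it, args=(job_queue, writer_queue))
    p.start()
    processes.append(p)

for p in processes:
    p.join()  # workers exit once they reach their STOP flags

c.join()  # the producer finished long ago; join it last
writer_queue.put("STOP")
w.join()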

answered Oct 12 '22 by Bryan Lott