Multiprocessing module for updating a shared dictionary in Python

I am creating a dictionary as follows:

y = [(1,2),(2,3),(1,2),(5,6)]

counts = {}

for tup in y:
    tup = tuple(sorted(tup))
    if tup in counts:
        counts[tup] += 1
    else:
        counts[tup] = 1

However, my actual y contains about 40 million tuples. Is there a way to use multiprocessing to speed up this process?

Thanks

laila asked Dec 11 '15 10:12



2 Answers

If you want to get the counts ignoring order, use a frozenset with Counter:

from collections import Counter

print(Counter(map(frozenset, y)))

Using the tuples from another answer:

In [9]: len(tuples)
Out[9]: 500000

In [10]: timeit Counter(map(frozenset, tuples))
1 loops, best of 3: 582 ms per loop

Using a frozenset will mean (1, 2) and (2,1) will be considered the same:

In [12]: y = [(1, 2), (2, 3), (1, 2), (5, 6),(2, 1),(6,5)]

In [13]: from collections import Counter

In [14]: print(Counter(map(frozenset, y)))
Counter({frozenset({1, 2}): 3, frozenset({5, 6}): 2, frozenset({2, 3}): 1})

If you apply the same logic using multiprocessing, it will obviously be considerably faster; even without multiprocessing, it beats the multiprocessing approach given in the other answer.
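Combining the two could look something like the following minimal sketch (the helper name count_chunk, the worker count, and the chunking scheme are illustrative choices, not taken from the answer above):

from collections import Counter
from functools import reduce
from multiprocessing import Pool

def count_chunk(chunk):
    ## Count unordered pairs within one chunk.
    return Counter(map(frozenset, chunk))

if __name__ == '__main__':
    y = [(1, 2), (2, 3), (1, 2), (5, 6), (2, 1), (6, 5)] * 100000
    n_procs = 4  ## tune to the number of available cores
    size = max(1, len(y) // n_procs)
    parts = [y[i:i + size] for i in range(0, len(y), size)]
    with Pool(n_procs) as pool:
        partials = pool.map(count_chunk, parts)
    ## Merge the per-chunk Counters into one.
    print(reduce(Counter.__add__, partials))

Keep in mind that pickling 40 million tuples over to worker processes has real overhead, so it is worth benchmarking against the plain single-process Counter before committing to multiprocessing.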

Padraic Cunningham answered Oct 23 '22 05:10

You can follow a MapReduce approach.

from collections import Counter
from functools import reduce
from multiprocessing import Pool

NUM_PROCESSES = 8

y = [(1,2),(2,3),(1,2),(5,6)] * 10

## http://stackoverflow.com/questions/312443/how-do-you-split-a-list-into-evenly-sized-chunks-in-python
def chunks(l, n):
    """Yield successive n-sized chunks from l."""
    for i in range(0, len(l), n):
        yield l[i:i+n]

## map: give each process one roughly equal-sized chunk to count
chunk_size = max(1, len(y) // NUM_PROCESSES)
with Pool(NUM_PROCESSES) as pool:
    partial_counters = pool.map(Counter, chunks(y, chunk_size))

## reduce: merge the partial counts
reduced_counter = reduce(Counter.__add__, partial_counters)

## Result is:
## Counter({(1, 2): 20, (5, 6): 10, (2, 3): 10})

The idea is:

  1. split your input list into chunks
  2. feed each chunk to a separate process that will independently compute the counts
  3. merge all partial counts via a reduction operation.

EDIT: use chunks(list(map(frozenset, y)), chunk_size) to account for unordered pairs (the list(...) is needed because chunks() slices its argument).
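For illustration, the adjusted map step might look like this (reusing chunks, chunk_size, and NUM_PROCESSES from the snippet above; normalized is a name of my choosing):

## Normalize each pair so (1, 2) and (2, 1) are counted together.
normalized = list(map(frozenset, y))
with Pool(NUM_PROCESSES) as pool:
    partial_counters = pool.map(Counter, chunks(normalized, chunk_size))
## Reduce as before; keys are now frozensets, e.g. frozenset({1, 2}).
reduced_counter = reduce(Counter.__add__, partial_counters)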

mrucci answered Oct 23 '22 04:10