I am creating a dictionary as follows:
y = [(1, 2), (2, 3), (1, 2), (5, 6)]
dict = {}
for tup in y:
    tup = tuple(sorted(tup))
    if tup in dict.keys():
        dict[tup] = dict[tup] + 1
    else:
        dict[tup] = 1
However, my actual y contains about 40 million tuples. Is there a way to use multiprocessing to speed this up?
Thanks
If you want to get the counts ignoring order, use a frozenset with Counter:
from collections import Counter
print(Counter(map(frozenset, y)))
Using the tuples from another answer:
In [9]: len(tuples)
Out[9]: 500000
In [10]: timeit Counter(map(frozenset, tuples))
1 loops, best of 3: 582 ms per loop
Using a frozenset means (1, 2) and (2, 1) will be considered the same:
In [12]: y = [(1, 2), (2, 3), (1, 2), (5, 6),(2, 1),(6,5)]
In [13]: from collections import Counter
In [14]:
In [14]: print(Counter(map(frozenset, y)))
Counter({frozenset({1, 2}): 3, frozenset({5, 6}): 2, frozenset({2, 3}): 1})
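For comparison, here is a minimal sketch (not from the original answer) of the same order-insensitive count using the question's own tuple(sorted(...)) normalisation together with Counter:

from collections import Counter

y = [(1, 2), (2, 3), (1, 2), (5, 6), (2, 1), (6, 5)]

# Normalise each pair so (2, 1) and (1, 2) map to the same key,
# then let Counter do the counting in a single pass.
counts = Counter(tuple(sorted(tup)) for tup in y)
print(counts)
# Counter({(1, 2): 3, (5, 6): 2, (2, 3): 1})

The result is equivalent to the frozenset version; the only difference is that the keys stay plain tuples.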
If you apply the same logic using multiprocessing, it will obviously be considerably faster; even without it, this beats the multiprocessing solutions provided here.
You can follow a MapReduce approach.
from collections import Counter
from functools import reduce
from multiprocessing import Pool

NUM_PROCESSES = 8

y = [(1, 2), (2, 3), (1, 2), (5, 6)] * 10

## http://stackoverflow.com/questions/312443/how-do-you-split-a-list-into-evenly-sized-chunks-in-python
def chunks(l, n):
    """Yield successive n-sized chunks from l."""
    for i in range(0, len(l), n):
        yield l[i:i + n]

## map: count each chunk in a separate worker process
partial_counters = Pool(NUM_PROCESSES).map(Counter, chunks(y, NUM_PROCESSES))

## reduce: merge the partial Counters into a single Counter
reduced_counter = reduce(Counter.__add__, partial_counters)

## Result is:
## Counter({(1, 2): 20, (5, 6): 10, (2, 3): 10})
The idea is: split the input into chunks, count each chunk in a separate process (map), then merge the partial Counters into one result (reduce).
EDIT: use chunks(list(map(frozenset, y)), NUM_PROCESSES) to account for unordered pairs.
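A minimal sketch of that edit under the same assumptions (Python 3, the chunks helper from above); the map iterator is materialised into a list so it can be sliced into chunks, and the Pool is wrapped in a __main__ guard so it also runs on platforms that spawn workers:

from collections import Counter
from functools import reduce
from multiprocessing import Pool

NUM_PROCESSES = 8

y = [(1, 2), (2, 3), (1, 2), (5, 6), (2, 1), (6, 5)] * 10

def chunks(l, n):
    """Yield successive n-sized chunks from l."""
    for i in range(0, len(l), n):
        yield l[i:i + n]

if __name__ == "__main__":
    # Normalise each pair to a frozenset so order inside the pair is ignored,
    # then count the chunks in parallel and merge the partial results.
    frozen = list(map(frozenset, y))
    with Pool(NUM_PROCESSES) as pool:
        partial_counters = pool.map(Counter, chunks(frozen, NUM_PROCESSES))
    print(reduce(Counter.__add__, partial_counters))
    # Counter({frozenset({1, 2}): 30, frozenset({5, 6}): 20, frozenset({2, 3}): 10})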