Multiprocessing module for updating a shared dictionary in Python

I am creating a dictionary as follows:

y = [(1,2),(2,3),(1,2),(5,6)]

counts = {}

for tup in y:
    tup = tuple(sorted(tup))
    if tup in counts:
        counts[tup] += 1
    else:
        counts[tup] = 1

However, my actual y contains about 40 million tuples. Is there a way to use multiprocessing to speed up this process?

Thanks

laila asked Dec 11 '15 10:12



2 Answers

If you want to get the counts ignoring order, use a frozenset with Counter:

from collections import Counter

print(Counter(map(frozenset, y)))

Using the tuples from another answer:

In [9]: len(tuples)
Out[9]: 500000

In [10]: timeit Counter(map(frozenset, tuples))
1 loops, best of 3: 582 ms per loop

Using a frozenset will mean (1, 2) and (2,1) will be considered the same:

In [12]: y = [(1, 2), (2, 3), (1, 2), (5, 6),(2, 1),(6,5)]

In [13]: from collections import Counter

In [14]: print(Counter(map(frozenset, y)))
Counter({frozenset({1, 2}): 3, frozenset({5, 6}): 2, frozenset({2, 3}): 1})

If you apply the same logic using multiprocessing, it will obviously be considerably faster; even without multiprocessing, it beats the multiprocessing approach given in the other answer.
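Combining the two could look something like the following minimal sketch (the helper name count_chunk, the worker count, and the chunking scheme are illustrative choices, not taken from the answer above):

from collections import Counter
from functools import reduce
from multiprocessing import Pool

def count_chunk(chunk):
    ## Count unordered pairs within one chunk.
    return Counter(map(frozenset, chunk))

if __name__ == '__main__':
    y = [(1, 2), (2, 3), (1, 2), (5, 6), (2, 1), (6, 5)] * 100000
    n_procs = 4  ## tune to the number of available cores
    size = max(1, len(y) // n_procs)
    parts = [y[i:i + size] for i in range(0, len(y), size)]
    with Pool(n_procs) as pool:
        partials = pool.map(count_chunk, parts)
    ## Merge the per-chunk Counters into one.
    print(reduce(Counter.__add__, partials))

Keep in mind that pickling 40 million tuples over to worker processes has real overhead, so it is worth benchmarking against the plain single-process Counter before committing to multiprocessing.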

Padraic Cunningham answered Oct 23 '22 05:10

You can follow a MapReduce approach.

from collections import Counter
from functools import reduce
from multiprocessing import Pool

NUM_PROCESSES = 8

y = [(1,2),(2,3),(1,2),(5,6)] * 10

## http://stackoverflow.com/questions/312443/how-do-you-split-a-list-into-evenly-sized-chunks-in-python
def chunks(l, n):
    """Yield successive n-sized chunks from l."""
    for i in range(0, len(l), n):
        yield l[i:i+n]

## map: give each process one roughly equal-sized chunk to count
chunk_size = max(1, len(y) // NUM_PROCESSES)
with Pool(NUM_PROCESSES) as pool:
    partial_counters = pool.map(Counter, chunks(y, chunk_size))

## reduce: merge the partial counts
reduced_counter = reduce(Counter.__add__, partial_counters)

## Result is:
## Counter({(1, 2): 20, (5, 6): 10, (2, 3): 10})

The idea is:

  1. split your input list into chunks
  2. feed each chunk to a separate process that will independently compute the counts
  3. merge all partial counts via a reduction operation.

EDIT: use chunks(list(map(frozenset, y)), chunk_size) to account for unordered pairs (the list(...) is needed because chunks() slices its argument).
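For illustration, the adjusted map step might look like this (reusing chunks, chunk_size, and NUM_PROCESSES from the snippet above; normalized is a name of my choosing):

## Normalize each pair so (1, 2) and (2, 1) are counted together.
normalized = list(map(frozenset, y))
with Pool(NUM_PROCESSES) as pool:
    partial_counters = pool.map(Counter, chunks(normalized, chunk_size))
## Reduce as before; keys are now frozensets, e.g. frozenset({1, 2}).
reduced_counter = reduce(Counter.__add__, partial_counters)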

mrucci answered Oct 23 '22 04:10