Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Group list of tuples efficiently

I have a large list of tuples e.g. [ (1,2), (1,3), (1,4), (2,1), (2,3) ] etc. I want to convert it to [ (1, [1,2,3,4]), (2, [1,3] ) ] efficiently. I am grouping tuples by the first element of each tuple i.e. (1,2), (1,3), (1,4) becomes (1, [2,3,4]) (also see the Haskell version below). I doubt that this can be done in one pass? The input list is always ordered.

In python in tried using defaultdict which I thought was the natural solution without reinventing the wheel. It works well but it does not preserve the order of keys. One solution is to use ordered defaultdict as explained here.

Anyway, I'd like to know the language independent and efficient solution to this problem. My current solution requires two passes and one call to set( ) on a list.

Update

I am thinking of implementing following Haskell version:

a = [ (1,2), (1,3), (1,4), (2,1), (2,3) ] 
b = groupBy (\ x y -> fst x == fst y ) 
b 
[[(1,2),(1,3),(1,4)],[(2,1),(2,3)]]  
map (\x -> (fst .head $ x, map snd x ) ) b 
[(1,[2,3,4]),(2,[1,3])]

Performance of answers

I implemented two answers (coldspeed and pm2ring). On moderate size lists (upto 10^4 elements) PM2 ring solution is faster; at size 10^5, both takes same time, on larger list COLDSPEED starts winning. Below are the numbers (with python3).

First column is number of entries in list, second is time taken by coldspeed and third column has time taken by pm2 ring solutions. All times are in second.

10 0.0001 0.0000
100 0.0001 0.0000
1000 0.0005 0.0001
10000 0.0044 0.0014
100000 0.0517 0.0452
1000000 0.5579 1.5249

Script is here http://github.com/dilawar/playground/raw/master/Python/so_group_tuple.py

With Ashwini optimization

PM 2Ring solution is even faster (roughly 3x - 5x) with Ashwini's suggestions.

10 4.887580871582031e-05 1.2636184692382812e-05
100 0.00010132789611816406 2.0742416381835938e-05
1000 0.0005109310150146484 0.000110626220703125
10000 0.004467487335205078 0.0009067058563232422
100000 0.05056118965148926 0.017516136169433594
1000000 0.6100358963012695 0.26450490951538086
10000000 6.092756509780884 2.8253660202026367

With PYPY

Somewhat mixed results. Last column is ratio of column 2 and column 3.

pypy so_group_tuple.py 
(10, [1.6927719116210938e-05, 3.409385681152344e-05], 0.4965034965034965)
(100, [4.601478576660156e-05, 8.296966552734375e-05], 0.5545977011494253)
(1000, [0.010258913040161133, 0.0019040107727050781], 5.388054094665665)
(10000, [0.0002448558807373047, 0.00021600723266601562], 1.1335540838852096)
(100000, [0.002658843994140625, 0.0018231868743896484], 1.45834967961292)
(1000000, [0.0833890438079834, 0.02979302406311035], 2.7989452709245284)
(10000000, [1.0556740760803223, 0.6789278984069824], 1.5549133841124023)

I am going with PM 2Ring solution since it is much faster till list size 10^5.

like image 826
Dilawar Avatar asked Nov 28 '22 13:11

Dilawar


2 Answers

You can do this with itertools.groupby, and using zip to rearrange the data from the collected groups:

from itertools import groupby
from operator import itemgetter

a = [(1, 2), (1, 3), (1, 4), (2, 1), (2, 3)]
b = [(k, list(list(zip(*g))[1])) for k, g in groupby(a, itemgetter(0))]
print(b)

output

[(1, [2, 3, 4]), (2, [1, 3])]

That list comp is a bit dense. Here's a variation using a traditional for loop that prints an intermediate result to make it a little easier to see what's going on.

b = []
for k, g in groupby(a, itemgetter(0)):
    t = list(zip(*g))
    print(t)
    b.append(list(t[1]))

print('Output', b)

output

[(1, 1, 1), (2, 3, 4)]
[(2, 2), (1, 3)]
Output [[2, 3, 4], [1, 3]]

As Ashwini Chaudhary mentions in the comments, nesting another list comp in there makes the code much more readable, it's probably also more efficient, since it avoids a couple of calls.

b = [(k, [x for _, x in g]) for k, g in groupby(a, itemgetter(0))]
like image 118
PM 2Ring Avatar answered Dec 05 '22 12:12

PM 2Ring


You may use the collections.OrderedDict (import collections first):

o = collections.OrderedDict()

for x in t:
    o.setdefault(x[0], []).append(x[1])  

Now, convert o.items() to a list:

list(o.items())
# [(1, [2, 3, 4]), (2, [1, 3])]
like image 35
cs95 Avatar answered Dec 05 '22 13:12

cs95