I have a large list of tuples, e.g. [(1,2), (1,3), (1,4), (2,1), (2,3)], and I want to convert it to [(1, [2,3,4]), (2, [1,3])] efficiently. That is, I am grouping the tuples by the first element of each tuple, so (1,2), (1,3), (1,4) becomes (1, [2,3,4]) (also see the Haskell version below). I doubt that this can be done in one pass, but can it? The input list is always ordered by the first element.
In Python, I tried using defaultdict, which I thought was the natural solution without reinventing the wheel. It works well, but it does not preserve the order of the keys. One solution is to use an ordered defaultdict, as explained here.
Anyway, I'd like to know a language-independent and efficient solution to this problem. My current solution requires two passes and one call to set() on a list.
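For reference, a sketch of the two-pass approach described above (this reconstruction is my own, not the exact code): a defaultdict collects the values, then a second pass over the input, deduplicated with set(), recovers the key order.

from collections import defaultdict

def group_two_pass(pairs):
    # First pass: collect values under each key.
    d = defaultdict(list)
    for k, v in pairs:
        d[k].append(v)
    # Second pass: keys in order of first appearance, using a set to deduplicate.
    seen = set()
    keys = []
    for k, _ in pairs:
        if k not in seen:
            seen.add(k)
            keys.append(k)
    return [(k, d[k]) for k in keys]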
Update
I am thinking of implementing the following Haskell version (groupBy comes from Data.List):
a = [(1,2), (1,3), (1,4), (2,1), (2,3)]
b = groupBy (\x y -> fst x == fst y) a
b
[[(1,2),(1,3),(1,4)],[(2,1),(2,3)]]
map (\x -> (fst . head $ x, map snd x)) b
[(1,[2,3,4]),(2,[1,3])]
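Since the list is ordered, a direct one-pass equivalent is possible in Python; a minimal sketch (the function name is mine):

def group_one_pass(pairs):
    result = []
    for k, v in pairs:
        # Extend the current group while the key repeats; otherwise start a new one.
        if result and result[-1][0] == k:
            result[-1][1].append(v)
        else:
            result.append((k, [v]))
    return result

group_one_pass([(1,2), (1,3), (1,4), (2,1), (2,3)])
# [(1, [2, 3, 4]), (2, [1, 3])]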
I implemented two answers (COLDSPEED's and PM 2Ring's). On moderate-size lists (up to 10^4 elements) PM 2Ring's solution is faster; at size 10^5 both take about the same time; on larger lists COLDSPEED's starts winning. The numbers below are with python3: the first column is the number of entries in the list, the second is the time taken by COLDSPEED's solution, and the third is the time taken by PM 2Ring's solution. All times are in seconds.
N        COLDSPEED  PM 2Ring
10       0.0001     0.0000
100      0.0001     0.0000
1000     0.0005     0.0001
10000    0.0044     0.0014
100000   0.0517     0.0452
1000000  0.5579     1.5249
The script is here: http://github.com/dilawar/playground/raw/master/Python/so_group_tuple.py
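The timing harness is roughly along these lines (a sketch of the approach, not the exact contents of the linked script):

import time

def bench(func, n):
    # Sorted input with repeated first elements, as in the question.
    data = [(i // 3, i % 3) for i in range(n)]
    t0 = time.time()
    func(data)
    return time.time() - t0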
PM 2Ring's solution is even faster (roughly 3x-5x) with Ashwini's suggestions:
N         COLDSPEED               PM 2Ring
10        4.887580871582031e-05   1.2636184692382812e-05
100       0.00010132789611816406  2.0742416381835938e-05
1000      0.0005109310150146484   0.000110626220703125
10000     0.004467487335205078    0.0009067058563232422
100000    0.05056118965148926     0.017516136169433594
1000000   0.6100358963012695      0.26450490951538086
10000000  6.092756509780884       2.8253660202026367
Results with pypy are somewhat mixed. In each row below, the last value is the ratio of the COLDSPEED time to the PM 2Ring time.
pypy so_group_tuple.py
(10, [1.6927719116210938e-05, 3.409385681152344e-05], 0.4965034965034965)
(100, [4.601478576660156e-05, 8.296966552734375e-05], 0.5545977011494253)
(1000, [0.010258913040161133, 0.0019040107727050781], 5.388054094665665)
(10000, [0.0002448558807373047, 0.00021600723266601562], 1.1335540838852096)
(100000, [0.002658843994140625, 0.0018231868743896484], 1.45834967961292)
(1000000, [0.0833890438079834, 0.02979302406311035], 2.7989452709245284)
(10000000, [1.0556740760803223, 0.6789278984069824], 1.5549133841124023)
I am going with PM 2Ring's solution, since it is much faster up to list size 10^5.
You can do this with itertools.groupby, using zip to rearrange the data from the collected groups:
from itertools import groupby
from operator import itemgetter
a = [(1, 2), (1, 3), (1, 4), (2, 1), (2, 3)]
b = [(k, list(list(zip(*g))[1])) for k, g in groupby(a, itemgetter(0))]
print(b)
Output:
[(1, [2, 3, 4]), (2, [1, 3])]
That list comp is a bit dense. Here's a variation using a traditional for
loop that prints an intermediate result to make it a little easier to see what's going on.
b = []
for k, g in groupby(a, itemgetter(0)):
    t = list(zip(*g))
    print(t)
    b.append(list(t[1]))
print('Output', b)
Output:
[(1, 1, 1), (2, 3, 4)]
[(2, 2), (1, 3)]
Output [[2, 3, 4], [1, 3]]
As Ashwini Chaudhary mentions in the comments, nesting another list comp in there makes the code much more readable; it's probably also more efficient, since it avoids a couple of function calls.
b = [(k, [x for _, x in g]) for k, g in groupby(a, itemgetter(0))]
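It produces the same output:

print(b)
# [(1, [2, 3, 4]), (2, [1, 3])]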
You may use collections.OrderedDict (import collections first):
o = collections.OrderedDict()
for x in a:
    # setdefault creates an empty list the first time a key is seen.
    o.setdefault(x[0], []).append(x[1])
Now, convert o.items()
to a list:
list(o.items())
# [(1, [2, 3, 4]), (2, [1, 3])]
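On Python 3.7 and later, plain dicts preserve insertion order, so an ordinary dict gives the same result:

o = {}
for k, v in a:
    o.setdefault(k, []).append(v)
list(o.items())
# [(1, [2, 3, 4]), (2, [1, 3])]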