I have a list of more than 100 million tuples, with key-value elements like this:
list_a = [(1,'a'), (2,'b'), (1,'a'), (3,'b'), (3,'b'), (1,'a')]
I need to output a second list like this:
list_b = [(1,'a',3), (2,'b',1), (3,'b',2)]
The last element of each tuple is the number of times that tuple appears in the list. The order of list_b doesn't matter.
So I wrote this code:
import collections

list_b = []
for e, c in collections.Counter(list_a).most_common():
    list_b.append((e[0], e[1], c))  # append a 3-tuple: key, value, count
Running with 1,000 tuples it takes approximately 2 seconds... imagine how long it will take with more than 100 million. Any ideas to speed it up?
Your bottleneck is the explicit Python-level loop: every iteration pays interpreter overhead for the list.append call, and most_common() also sorts every entry by count, which you don't need since order doesn't matter. You can build the list with a list comprehension over Counter.items() instead, which avoids both costs and will be much faster:
from collections import Counter

c = Counter(list_a)  # one pass over list_a; the counting itself runs in C
result = [(*k, v) for k, v in c.items()]  # unpack each key tuple and append its count
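With the sample list_a from the question, result comes out as [(1, 'a', 3), (2, 'b', 1), (3, 'b', 2)], which matches the desired list_b apart from order.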
I ran this on a 1,000-item list on my machine and it was pretty quick.
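If you want to measure the difference yourself, here is a minimal timeit sketch; the list size, the random test data, and the helper names with_append and with_comprehension are illustrative assumptions, not from the original post:

import random
import timeit
from collections import Counter

# Hypothetical test data: 100,000 random key-value tuples (illustrative size).
list_a = [(random.randint(1, 1000), random.choice('ab')) for _ in range(100_000)]

def with_append():
    # Original approach: explicit loop over most_common(), appending one by one.
    out = []
    for e, c in Counter(list_a).most_common():
        out.append((e[0], e[1], c))
    return out

def with_comprehension():
    # Suggested approach: list comprehension over Counter.items().
    return [(*k, v) for k, v in Counter(list_a).items()]

print(timeit.timeit(with_append, number=10))
print(timeit.timeit(with_comprehension, number=10))

number=10 keeps the run short; increase it for more stable timings.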