I have a list of more than 100 million tuples, with key-value elements like this:
list_a = [(1,'a'), (2,'b'), (1,'a'), (3,'b'), (3,'b'), (1,'a')]
I need to output a second list like this:
list_b = [(1,'a',3), (2,'b',1), (3,'b',2)]
The last element of each tuple is the number of times that tuple appears in the list. The order of list_b doesn't matter.
So I wrote this code:
import collections

list_b = []
for e, c in collections.Counter(list_a).most_common():
    list_b.append((e[0], e[1], c))  # append a 3-tuple: key, value, count
Running with 1,000 tuples it takes approximately 2 seconds... imagine how long it will take with more than 100 million. Any ideas to speed it up?
Your bottleneck is the explicit Python-level loop: every iteration pays interpreter overhead for the list.append call, and most_common() also sorts every entry by count, which you don't need since order doesn't matter. You can build the list with a list comprehension over Counter.items() instead, which avoids both costs and will be much faster:
from collections import Counter

c = Counter(list_a)  # one pass over list_a; the counting itself runs in C
result = [(*k, v) for k, v in c.items()]  # unpack each key tuple and append its count
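With the sample list_a from the question, result comes out as [(1, 'a', 3), (2, 'b', 1), (3, 'b', 2)], which matches the desired list_b apart from order.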
I ran this on a 1,000-item list on my machine and it was pretty quick.
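If you want to measure the difference yourself, here is a minimal timeit sketch; the list size, the random test data, and the helper names with_append and with_comprehension are illustrative assumptions, not from the original post:

import random
import timeit
from collections import Counter

# Hypothetical test data: 100,000 random key-value tuples (illustrative size).
list_a = [(random.randint(1, 1000), random.choice('ab')) for _ in range(100_000)]

def with_append():
    # Original approach: explicit loop over most_common(), appending one by one.
    out = []
    for e, c in Counter(list_a).most_common():
        out.append((e[0], e[1], c))
    return out

def with_comprehension():
    # Suggested approach: list comprehension over Counter.items().
    return [(*k, v) for k, v in Counter(list_a).items()]

print(timeit.timeit(with_append, number=10))
print(timeit.timeit(with_comprehension, number=10))

number=10 keeps the run short; increase it for more stable timings.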