I have a dataset with many lists of tokenized words, for example:
['apple','banana','tomato']
['tomato','tree','pikachu']
I have around 40k lists like these, and I want to count the 10 most common words across all of them combined.
Does anyone have an idea how to do this?
You could flatten the nested list with itertools.chain and take the most common words using Counter and its most_common method:
from itertools import chain
from collections import Counter

# a list of tokenized word lists
l = [['apple', 'banana', 'tomato'], ['tomato', 'tree', 'pikachu']]

# flatten the sub-lists and count word frequencies
Counter(chain(*l)).most_common(10)
# [('tomato', 2), ('apple', 1), ('banana', 1), ('tree', 1), ('pikachu', 1)]
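For a dataset as large as your 40k lists, chain.from_iterable is a useful alternative because it flattens lazily instead of unpacking every sub-list as a separate argument to chain. A minimal sketch, assuming your data lives in a variable I'm calling token_lists (a hypothetical name, not from your post):

from itertools import chain
from collections import Counter

# token_lists stands in for your ~40k tokenized word lists
token_lists = [
    ['apple', 'banana', 'tomato'],
    ['tomato', 'tree', 'pikachu'],
]

# chain.from_iterable flattens all sub-lists lazily, so no 40k-argument call
top_ten = Counter(chain.from_iterable(token_lists)).most_common(10)
print(top_ten)
# [('tomato', 2), ('apple', 1), ('banana', 1), ('tree', 1), ('pikachu', 1)]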