
How do I count the 10 most common words across multiple lists of tokenized words?

I have a dataset containing many lists of tokenized words. For example:

['apple','banana','tomato']
['tomato','tree','pikachu']

I have around 40k lists like these, and I want to find the 10 most common words across all of them combined.

Does anyone have any idea how to do this?

Tomer Shalhon asked Jan 21 '26 03:01



1 Answer

You could flatten the nested list with itertools.chain and take the most common words using Counter and its most_common method:

from itertools import chain
from collections import Counter

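# Sample data: each inner list holds the tokens of one document
token_lists = [['apple','banana','tomato'], ['tomato','tree','pikachu']]

# Flatten the nested lists, count every word, and keep the 10 most frequent
Counter(chain(*token_lists)).most_common(10)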
# [('tomato', 2), ('apple', 1), ('banana', 1), ('tree', 1), ('pikachu', 1)]
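
With around 40k lists, unpacking them all with `*` passes every list as a separate argument. A possible variant is `chain.from_iterable`, which reads the outer collection lazily and also accepts a generator. The sketch below is only an illustration: the helper name `top_words` and the sample data are assumptions, not part of the original answer.

```python
from collections import Counter
from itertools import chain


def top_words(token_lists, n=10):
    """Return the n most frequent tokens across an iterable of token lists."""
    # chain.from_iterable walks the inner lists one at a time,
    # so the outer iterable never has to be unpacked into arguments.
    return Counter(chain.from_iterable(token_lists)).most_common(n)


# Hypothetical sample data standing in for the ~40k lists.
sample = [
    ['apple', 'banana', 'tomato'],
    ['tomato', 'tree', 'pikachu'],
]

print(top_words(sample, n=1))
# [('tomato', 2)]
```

Because the helper only iterates over its argument, it should also work when the lists come from a generator, for example one that yields tokenized rows from a file.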
yatu answered Jan 22 '26 18:01



