I have a dataset with many lists of tokenized words, for example:
['apple','banana','tomato']
['tomato','tree','pikachu']
I have around 40k lists like these, and I want to count the 10 most common words across all of them combined.
Does anyone have an idea how to do this?
You could flatten the nested list with itertools.chain and take the most common words using Counter and its most_common method:
from itertools import chain
from collections import Counter

# a list of tokenized word lists
l = [['apple', 'banana', 'tomato'], ['tomato', 'tree', 'pikachu']]

# flatten the sub-lists and count word frequencies
Counter(chain(*l)).most_common(10)
# [('tomato', 2), ('apple', 1), ('banana', 1), ('tree', 1), ('pikachu', 1)]
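For a dataset as large as your 40k lists, chain.from_iterable is a useful alternative because it flattens lazily instead of unpacking every sub-list as a separate argument to chain. A minimal sketch, assuming your data lives in a variable I'm calling token_lists (a hypothetical name, not from your post):

from itertools import chain
from collections import Counter

# token_lists stands in for your ~40k tokenized word lists
token_lists = [
    ['apple', 'banana', 'tomato'],
    ['tomato', 'tree', 'pikachu'],
]

# chain.from_iterable flattens all sub-lists lazily, so no 40k-argument call
top_ten = Counter(chain.from_iterable(token_lists)).most_common(10)
print(top_ten)
# [('tomato', 2), ('apple', 1), ('banana', 1), ('tree', 1), ('pikachu', 1)]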