 

How do I find the most common words in multiple separate texts?

Bit of a simple question really, but I can't seem to crack it. I have a string that is formatted in the following way:

["category1",("data","data","data")]
["category2", ("data","data","data")]

I call the different categories posts and I want to get the most frequent words from the data section. So I tried:

from nltk.tokenize import wordpunct_tokenize
from collections import defaultdict
freq_dict = defaultdict(int)

for cat, text2 in posts:
    tokens = wordpunct_tokenize(text2)
    for token in tokens:
        if token in freq_dict:
            freq_dict[token] += 1
        else:
            freq_dict[token] = 1
    top = sorted(freq_dict, key=freq_dict.get, reverse=True)
    top = top[:50]
    print(top)

However, this gives me the top words per post in the string.

I need one overall top-words list for all the posts combined.
But if I move the print out of the for loop, it only gives me the results of the last post.
Does anyone have an idea?

Shifu asked Jan 13 '23 11:01

2 Answers

Why not just use Counter?

In [30]: from collections import Counter

In [31]: data=["category1",("data","data","data")]

In [32]: Counter(data[1])
Out[32]: Counter({'data': 3})

In [33]: Counter(data[1]).most_common()
Out[33]: [('data', 3)]
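A single Counter can also accumulate across every post: its update() method adds counts in place, so calling it once per post and then most_common(50) gives the overall list the question asks for. A minimal sketch, assuming posts is a list of (category, tokens) pairs:

```python
from collections import Counter

# assumed input shape: each post is (category, tuple_of_tokens)
posts = [("category1", ("data1", "data2", "data3")),
         ("category2", ("data1", "data3", "data5"))]

freq = Counter()
for cat, tokens in posts:
    freq.update(tokens)  # adds these tokens' counts in place

# top 50 (word, count) pairs across all posts, most frequent first
print(freq.most_common(50))
```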
Fredrik Pihl answered Jan 26 '23 14:01

This is a scope problem: the sorting and printing happen inside the loop, so they run once per post. Also, with a defaultdict(int) you don't need to check whether a key already exists before incrementing, which simplifies your code.

Try it like this:

posts = [["category1",("data1 data2 data3")],["category2", ("data1 data3 data5")]]

from nltk.tokenize import wordpunct_tokenize
from collections import defaultdict
freq_dict = defaultdict(int)

for cat, text2 in posts:
    tokens = wordpunct_tokenize(text2)
    for token in tokens:
        freq_dict[token] += 1

top = sorted(freq_dict, key=freq_dict.get, reverse=True)
top = top[:50]
print(top)

This, as expected, outputs

['data1', 'data3', 'data5', 'data2']

as a result (the relative order of equally frequent words can vary).

If you really have something like

posts = [["category1",("data1","data2","data3")],["category2", ("data1","data3","data5")]]

as an input, you won't need wordpunct_tokenize() as the input data is already tokenized. Then, the following would work:

posts = [["category1",("data1","data2","data3")],["category2", ("data1","data3","data5")]]

from collections import defaultdict
freq_dict = defaultdict(int)

for cat, tokens in posts:
    for token in tokens:
        freq_dict[token] += 1

top = sorted(freq_dict, key=freq_dict.get, reverse=True)
top = top[:50]
print(top)

and it also outputs the expected result:

['data1', 'data3', 'data5', 'data2']
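The defaultdict-plus-sorted steps can also be collapsed into the Counter approach from the other answer; a sketch assuming the same already-tokenized posts shape, with itertools.chain flattening the token tuples into one stream:

```python
from collections import Counter
from itertools import chain

posts = [["category1", ("data1", "data2", "data3")],
         ["category2", ("data1", "data3", "data5")]]

# flatten all token tuples into one stream and count them
freq = Counter(chain.from_iterable(tokens for cat, tokens in posts))

# most_common(50) already returns (word, count) pairs sorted by frequency
top = [word for word, count in freq.most_common(50)]
print(top)
```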
likeitlikeit answered Jan 26 '23 13:01