 

How do I find the most common words in multiple separate texts?

Bit of a simple question really, but I can't seem to crack it. I have a string that is formatted in the following way:

["category1",("data","data","data")]
["category2", ("data","data","data")]

I call the different categories posts and I want to get the most frequent words from the data section. So I tried:

from nltk.tokenize import wordpunct_tokenize
from collections import defaultdict
freq_dict = defaultdict(int)

for cat, text2 in posts:
    tokens = wordpunct_tokenize(text2)
    for token in tokens:
        if token in freq_dict:
            freq_dict[token] += 1
        else:
            freq_dict[token] = 1
    top = sorted(freq_dict, key=freq_dict.get, reverse=True)
    top = top[:50]
    print(top)

However, this gives me the top words per post in the string.

I need one overall top-words list for all the posts combined.
But if I move the print out of the for loop, it only gives me the results of the last post.
Does anyone have an idea?

Shifu asked Jan 13 '23 11:01

2 Answers

Why not just use Counter?

In [30]: from collections import Counter

In [31]: data=["category1",("data","data","data")]

In [32]: Counter(data[1])
Out[32]: Counter({'data': 3})

In [33]: Counter(data[1]).most_common()
Out[33]: [('data', 3)]
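A single Counter can also accumulate across every post: its update() method adds counts in place, so calling it once per post and then most_common(50) gives the overall list the question asks for. A minimal sketch, assuming posts is a list of (category, tokens) pairs:

```python
from collections import Counter

# assumed input shape: each post is (category, tuple_of_tokens)
posts = [("category1", ("data1", "data2", "data3")),
         ("category2", ("data1", "data3", "data5"))]

freq = Counter()
for cat, tokens in posts:
    freq.update(tokens)  # adds these tokens' counts in place

# top 50 (word, count) pairs across all posts, most frequent first
print(freq.most_common(50))
```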
Fredrik Pihl answered Jan 26 '23 14:01

This is a scope problem: the sorting and printing happen inside the loop, so they run once per post. Also, with a defaultdict(int) you don't need to check whether a key already exists before incrementing, which simplifies your code.

Try it like this:

posts = [["category1",("data1 data2 data3")],["category2", ("data1 data3 data5")]]

from nltk.tokenize import wordpunct_tokenize
from collections import defaultdict
freq_dict = defaultdict(int)

for cat, text2 in posts:
    tokens = wordpunct_tokenize(text2)
    for token in tokens:
        freq_dict[token] += 1

top = sorted(freq_dict, key=freq_dict.get, reverse=True)
top = top[:50]
print(top)

This, as expected, outputs

['data1', 'data3', 'data5', 'data2']

as a result (the relative order of equally frequent words can vary).

If you really have something like

posts = [["category1",("data1","data2","data3")],["category2", ("data1","data3","data5")]]

as an input, you won't need wordpunct_tokenize() as the input data is already tokenized. Then, the following would work:

posts = [["category1",("data1","data2","data3")],["category2", ("data1","data3","data5")]]

from collections import defaultdict
freq_dict = defaultdict(int)

for cat, tokens in posts:
    for token in tokens:
        freq_dict[token] += 1

top = sorted(freq_dict, key=freq_dict.get, reverse=True)
top = top[:50]
print(top)

and it also outputs the expected result:

['data1', 'data3', 'data5', 'data2']
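The defaultdict-plus-sorted steps can also be collapsed into the Counter approach from the other answer; a sketch assuming the same already-tokenized posts shape, with itertools.chain flattening the token tuples into one stream:

```python
from collections import Counter
from itertools import chain

posts = [["category1", ("data1", "data2", "data3")],
         ["category2", ("data1", "data3", "data5")]]

# flatten all token tuples into one stream and count them
freq = Counter(chain.from_iterable(tokens for cat, tokens in posts))

# most_common(50) already returns (word, count) pairs sorted by frequency
top = [word for word, count in freq.most_common(50)]
print(top)
```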
likeitlikeit answered Jan 26 '23 13:01