Bit of a simple question really, but I can't seem to crack it. I have a string that is formatted in the following way:
["category1",("data","data","data")]
["category2", ("data","data","data")]
I call the different categories posts and I want to get the most frequent words from the data section. So I tried:
from nltk.tokenize import wordpunct_tokenize
from collections import defaultdict
freq_dict = defaultdict(int)
for cat, text2 in posts:
    tokens = wordpunct_tokenize(text2)
    for token in tokens:
        if token in freq_dict:
            freq_dict[token] += 1
        else:
            freq_dict[token] = 1
    top = sorted(freq_dict, key=freq_dict.get, reverse=True)
    top = top[:50]
    print top
However, this gives me the top words per post.
I need a single top-words list across all posts.
If I move print top out of the for loop, though, it only gives me the results of the last post.
Does anyone have an idea?
Why not just use Counter?
In [30]: from collections import Counter
In [31]: data=["category1",("data","data","data")]
In [32]: Counter(data[1])
Out[32]: Counter({'data': 3})
In [33]: Counter(data[1]).most_common()
Out[33]: [('data', 3)]
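To get one list across all posts with this approach, a single Counter can be updated post by post. This is a minimal sketch, assuming posts is a list of [category, tokens] pairs as described in the question; the sample values are made up for illustration.

```python
from collections import Counter

# Hypothetical sample data in the shape described in the question.
posts = [
    ["category1", ("data1", "data2", "data3")],
    ["category2", ("data1", "data3", "data5")],
]

counts = Counter()
for category, tokens in posts:
    counts.update(tokens)  # accumulate counts across every post

# most_common(n) returns up to n (word, count) pairs, highest count first.
top = counts.most_common(50)
print(top)
```

Because the Counter lives outside the loop and most_common() is called once after it, the result covers all posts rather than just one.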
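This is a scope problem: top is computed and printed inside the for loop, so you get one result per post. Do the sorting and printing once, after the loop. Also, you don't need to initialize the elements of a defaultdict, so this simplifies your code.
Try it like this: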
from nltk.tokenize import wordpunct_tokenize
from collections import defaultdict

posts = [["category1", ("data1 data2 data3")], ["category2", ("data1 data3 data5")]]

freq_dict = defaultdict(int)
for cat, text2 in posts:
    tokens = wordpunct_tokenize(text2)
    for token in tokens:
        freq_dict[token] += 1  # missing keys start at 0

# Sort and print once, after all posts have been counted
top = sorted(freq_dict, key=freq_dict.get, reverse=True)
top = top[:50]
print(top)
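This, as expected, outputs
['data1', 'data3', 'data5', 'data2']
as a result. Words with equal counts (here data1/data3 and data2/data5) may appear in a different order depending on the Python version, since the sort only compares counts.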
If you really have something like
posts = [["category1",("data1","data2","data3")],["category2", ("data1","data3","data5")]]
as an input, you won't need wordpunct_tokenize(), as the input data is already tokenized. Then, the following would work:
from collections import defaultdict

posts = [["category1", ("data1", "data2", "data3")], ["category2", ("data1", "data3", "data5")]]

freq_dict = defaultdict(int)
for cat, tokens in posts:
    for token in tokens:
        freq_dict[token] += 1  # tokens are already split, so count them directly

# Sort and print once, after all posts have been counted
top = sorted(freq_dict, key=freq_dict.get, reverse=True)
top = top[:50]
print(top)
and it outputs the same words (the order among words with equal counts may vary):
['data1', 'data3', 'data5', 'data2']
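Since sorting by count alone leaves the order of tied words unspecified, a variant that breaks ties alphabetically makes the result reproducible. The following is only a sketch: top_words is a hypothetical helper name, and the regular expression is an assumed stand-in for wordpunct_tokenize so that the example runs without nltk.

```python
import re
from collections import Counter

def top_words(posts, n=50):
    """Return the n most frequent tokens across all posts (hypothetical helper)."""
    counts = Counter()
    for category, text in posts:
        if isinstance(text, str):
            # Assumption: a simple regex split stands in for wordpunct_tokenize.
            tokens = re.findall(r"\w+|[^\w\s]+", text)
        else:
            tokens = text  # already a tuple or list of tokens
        counts.update(tokens)
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, count in ranked[:n]]

posts = [
    ["category1", "data1 data2 data3"],
    ["category2", ("data1", "data3", "data5")],
]
print(top_words(posts))  # ['data1', 'data3', 'data2', 'data5']
```

The helper accepts both shapes discussed above (a raw string or an already tokenized tuple), so the same call works for either kind of input.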