
Unique word frequency using NLTK

Tags: python, token, nltk

How can I get the unique word frequency for the following sentences using NLTK?

Seq Sentence
1 Let's try to be Good.
2 Being good doesn't make sense.
3 Good is always good.

Output:
{'good': 3, 'let': 1, 'try': 1, 'to': 1, 'be': 1, 'being': 1, 'doesn': 1, 't': 1, 'make': 1, 'sense': 1, 'is': 1, 'always': 1, '.': 3, "'": 2, 's': 1}

Sam Gladio asked Apr 20 '18 10:04



2 Answers

If you specifically want to use NLTK, you can refer to the following code snippet:

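import nltk

# Note: word_tokenize requires NLTK's Punkt tokenizer data to be downloaded first
text1 = '''Seq Sentence
1   Let's try to be Good.
2   Being good doesn't make sense.
3   Good is always good.'''

# Split the text into word and punctuation tokens
words = nltk.tokenize.word_tokenize(text1)
# Count how often each token occurs
fdist1 = nltk.FreqDist(words)

# Drop tokens made up only of digits (the sequence numbers)
filtered_word_freq = {word: freq for word, freq in fdist1.items() if not word.isdigit()}

print(filtered_word_freq)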
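
The expected output in the question lists 'good' with a count of 3 although the word occurs 4 times, which suggests that each word should be counted once per sentence and that punctuation should be split off into separate tokens. A minimal sketch under that assumption: it lowercases each sentence and uses `wordpunct_tokenize`, NLTK's regex-based tokenizer, which needs no extra data download. The function name `unique_word_frequency` is hypothetical.

```python
from nltk import FreqDist
from nltk.tokenize import wordpunct_tokenize

sentences = [
    "Let's try to be Good.",
    "Being good doesn't make sense.",
    "Good is always good.",
]


def unique_word_frequency(sentences):
    """Count, for each lowercased token, how many sentences contain it."""
    fdist = FreqDist()
    for sentence in sentences:
        tokens = wordpunct_tokenize(sentence.lower())
        # set() keeps one copy of each token, so a word repeated within
        # a sentence adds only 1 to its count (assumed from the question)
        fdist.update(set(tokens))
    return dict(fdist)


result = unique_word_frequency(sentences)
print(result)
```

On the three sample sentences this gives the same counts as the output listed in the question, for example 'good': 3, '.': 3 and "'": 2.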

Hope it helps.

Some parts are adapted from:

How to check if string input is a number?

Dropping specific words out of an NLTK distribution beyond stopwords

Afsan Abdulali Gujarati answered Sep 19 '22 16:09



Try this, which stems each space-separated token with NLTK's Snowball stemmer and counts the stems:

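from collections import Counter
import pandas as pd
import nltk

# Snowball stemmer for English; it also lowercases each token
sno = nltk.stem.SnowballStemmer('english')
s = "1   Let's try to be Good. 2   Being good doesn't make sense. 3   Good is always good."
# Splitting on single spaces leaves empty strings and attached punctuation
s1 = s.split(' ')
d = pd.DataFrame(s1)
# Stem every token in the first (and only) column
s2 = d[0].apply(sno.stem)
counts = Counter(s2)
print(counts)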

Output will be:

Counter({'': 6, 'be': 2, 'good.': 2, 'good': 2, '1': 1, 'let': 1, 'tri': 1, 'to': 1, '2': 1, "doesn't": 1, 'make': 1, 'sense.': 1, '3': 1, 'is': 1, 'alway': 1})
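
The empty strings and tokens such as 'good.' in the output above come from splitting on single spaces. As a rough alternative, assuming stemming is not required and the sequence numbers should be dropped, a regular expression can extract the words before counting. This sketch uses only the standard library and counts every occurrence, not once per sentence; the pattern is an illustrative choice, not part of the original answer.

```python
import re
from collections import Counter

s = "1   Let's try to be Good. 2   Being good doesn't make sense. 3   Good is always good."

# Lowercase the text, then keep runs of letters; an apostrophe is allowed
# only between letters, so "let's" and "doesn't" stay whole while digits
# and sentence-final periods are dropped.
tokens = re.findall(r"[a-z]+(?:'[a-z]+)*", s.lower())

counts = Counter(tokens)
print(counts)
```

Here 'good' is counted 4 times and every other word once, with no empty-string or numeric keys.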

Akash Srivastava answered Sep 19 '22 16:09
