
Print the 10 most frequently occurring words of a text, both including and excluding stopwords

This question is adapted from an existing one, with my own changes. I have the following code:

from nltk.corpus import stopwords

def content_text(text):
    # Keep only the words that are not English stopwords
    stop_words = set(stopwords.words('english'))
    content = [w for w in text if w.lower() not in stop_words]
    return content

How can I print the 10 most frequently occurring words of a text, (1) including and (2) excluding stopwords?

user2064809 asked Feb 08 '15 10:02




1 Answer

NLTK provides a FreqDist class that counts how often each item occurs:

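import nltk

# `text` is assumed to be a string holding the document to analyse.
# The NLTK tokenizer and stopword data must already be downloaded (see nltk.download).
allWords = nltk.tokenize.word_tokenize(text)
allWordDist = nltk.FreqDist(w.lower() for w in allWords)

# Compare the lowercased word, because the stopword list is all lowercase
stopwords = set(nltk.corpus.stopwords.words('english'))
allWordExceptStopDist = nltk.FreqDist(w.lower() for w in allWords if w.lower() not in stopwords)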

To extract the 10 most common words:

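# most_common() returns a list of (word, count) pairs, so unpack them to get just the words
mostCommon = [word for word, count in allWordDist.most_common(10)]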
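For anyone who wants to try the idea without downloading the NLTK data first, here is a minimal sketch of the same counting logic using only the standard library. It relies on collections.Counter (which FreqDist subclasses) and swaps in a simple regex tokenizer for word_tokenize. The names top_words, STOPWORDS and sample are hypothetical, and the tiny STOPWORDS set is only a stand-in for nltk.corpus.stopwords.words('english').

```python
import re
from collections import Counter

# Hypothetical stand-in for nltk.corpus.stopwords.words('english'),
# kept tiny so the sketch runs without any NLTK data.
STOPWORDS = {"the", "a", "an", "and", "of", "in", "on", "to", "is", "it"}

def top_words(text, n=10, stopwords=None):
    """Return up to n (word, count) pairs, most frequent first.

    If `stopwords` is given, words in that collection are skipped.
    """
    # Simplified tokenizer (an assumption): runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    if stopwords is not None:
        words = [w for w in words if w not in stopwords]
    return Counter(words).most_common(n)

sample = "The cat sat on the mat and the dog sat on the cat"

including = top_words(sample)                       # 1) stopwords included
excluding = top_words(sample, stopwords=STOPWORDS)  # 2) stopwords excluded

print("Including stopwords:", including)
print("Excluding stopwords:", excluding)
```

With NLTK installed and its data available, passing set(nltk.corpus.stopwords.words('english')) as the stopwords argument should give the "excluding" list for the real English stopword list.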
igorushi answered Oct 13 '22 22:10