I took this question from here and made some changes. I have the following code:
from nltk.corpus import stopwords

def content_text(text):
    # Keep only the words that are not English stopwords
    stop_words = set(stopwords.words('english'))
    content = [w for w in text if w.lower() not in stop_words]
    return content
How can I print the 10 most frequently occurring words of a text, 1) including and 2) excluding stopwords?
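Stopwords in NLTK are very common words, such as "the", "is", and "and", that carry little meaning on their own, so you usually do not want them when describing the topic of your content. NLTK ships a predefined list for each supported language, but stopwords.words('english') returns an ordinary Python list, so you can add or remove entries to suit your data.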
NLTK provides a FreqDist class for counting word frequencies:
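import nltk

# text is assumed to be a plain string containing the document
allWords = nltk.tokenize.word_tokenize(text)
allWordDist = nltk.FreqDist(w.lower() for w in allWords)

stopwords = set(nltk.corpus.stopwords.words('english'))
# Lowercase before the membership test, since the stopword list is all lowercase
allWordExceptStopDist = nltk.FreqDist(w.lower() for w in allWords if w.lower() not in stopwords)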
To extract the 10 most common words:
# most_common(10) returns a list of (word, count) tuples, not a dict
mostCommon = [word for word, count in allWordDist.most_common(10)]
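If you want to try the idea without downloading any NLTK data, here is a minimal sketch using collections.Counter, which FreqDist subclasses, so most_common behaves the same way. The helper name top_words, the sample sentence, and the tiny SAMPLE_STOPWORDS set are all made up for illustration. In real code you would tokenize with nltk.tokenize.word_tokenize and use nltk.corpus.stopwords.words('english') instead.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword set for illustration only;
# in practice use nltk.corpus.stopwords.words('english').
SAMPLE_STOPWORDS = {"the", "a", "on", "and", "is"}


def top_words(tokens, n=10, stopwords=None):
    """Return the n most frequent lowercased tokens as (word, count) tuples.

    Non-alphabetic tokens (punctuation, numbers) are skipped, and any
    word found in `stopwords` is excluded from the count.
    """
    stopwords = stopwords or set()
    counts = Counter(
        t.lower()
        for t in tokens
        if t.isalpha() and t.lower() not in stopwords
    )
    return counts.most_common(n)


text = "The cat sat on the mat and the dog sat on the cat"
# Simple regex split as a stand-in for nltk.tokenize.word_tokenize
tokens = re.findall(r"[A-Za-z]+", text)

with_stop = top_words(tokens, 3)                        # 1) including stopwords
without_stop = top_words(tokens, 3, SAMPLE_STOPWORDS)   # 2) excluding stopwords

print(with_stop)
print(without_stop)
```

The isalpha() check is there because word_tokenize returns punctuation marks as separate tokens, and they would otherwise show up among the most frequent "words". Drop it if you want punctuation counted as well.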