Maybe this is a stupid question, but I have a problem with extracting the ten most frequent words out of a corpus with Python. This is what I've got so far. (btw, I work with NLTK for reading a corpus with two subcategories with each 10 .txt files)
import re
import string
from nltk.corpus import stopwords
stoplist = stopwords.words('dutch')
from collections import defaultdict
from operator import itemgetter
def toptenwords(mycorpus):
words = mycorpus.words()
no_capitals = set([word.lower() for word in words])
filtered = [word for word in no_capitals if word not in stoplist]
no_punct = [s.translate(None, string.punctuation) for s in filtered]
wordcounter = {}
for word in no_punct:
if word in wordcounter:
wordcounter[word] += 1
else:
wordcounter[word] = 1
sorting = sorted(wordcounter.iteritems(), key = itemgetter, reverse = True)
return sorting
If I print this function with my corpus, it gives me a list of all words with '1' behind it. It gives me a dictionary but all my values are one. And I know that for example the word 'baby' is five or six times in my corpus... And still it gives 'baby: 1'... So it doesn't function the way I want...
Can someone help me?
Using Python's NLTK Library To remove stop words from a sentence, you can divide your text into words and then remove the word if it exits in the list of stop words provided by NLTK. In the script above, we first import the stopwords collection from the nltk. corpus module.
If you're using the NLTK anyway, try the FreqDist(samples) function to first generate a frequency distribution from the given sample. Then call the most_common(n) attribute to find the n most common words in the sample, sorted by descending frequency. Something like:
from nltk.probability import FreqDist
fdist = FreqDist(stoplist)
top_ten = fdist.most_common(10)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With