I have a list of english words (approx 10000) and I'd like to sort them by their usage as they occur in literature, newspaper, blogs etc. Can I sort them in Python or other language? I heard about NLTK
which is the closest library I know that could help. Or is this task for other tool?
thank you
Python and NLTK are the perfect tools to sort your wordlist, as the NLTK comes with some corpora of the english language, from which you can extract frequency information.
The following code will print a given wordlist
in the order of word frequency in the brown corpus:
import nltk
from nltk.corpus import brown
wordlist = ["corpus","house","the","Peter","asdf"]
# collect frequency information from brown corpus, might take a few seconds
freqs = nltk.FreqDist([w.lower() for w in brown.words()])
# sort wordlist by word frequency
wordlist_sorted = sorted(wordlist, key=lambda x: freqs[x.lower()], reverse=True)
# print the sorted list
for w in wordlist_sorted:
print w
output:
>>>
the
house
Peter
corpus
asdf
If you want to use a different corpus or get more information you should have a look at chapter 2 of the nltk book.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With