
Sort words by their usage

Tags: python, nltk

I have a list of about 10,000 English words, and I'd like to sort them by how often they are used in literature, newspapers, blogs, etc. Can I do this in Python or another language? I have heard of NLTK, which is the closest library I know of that could help. Or is this a task for a different tool?

Thank you

xralf asked Dec 28 '22 12:12



1 Answer

Python and NLTK are well suited to sorting your word list: NLTK provides several corpora of English text from which you can extract word frequency information.

The following code prints a given word list in order of word frequency in the Brown corpus, most frequent first:

import nltk
from nltk.corpus import brown

# the Brown corpus must be available locally; if it is not, run nltk.download("brown") once
wordlist = ["corpus", "house", "the", "Peter", "asdf"]
# collect frequency information from the Brown corpus, might take a few seconds
freqs = nltk.FreqDist(w.lower() for w in brown.words())
# sort wordlist by word frequency, most frequent first; words not in the corpus count as 0
wordlist_sorted = sorted(wordlist, key=lambda x: freqs[x.lower()], reverse=True)
# print the sorted list
for w in wordlist_sorted:
    print(w)

Output:

>>> 
the
house
Peter
corpus
asdf
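
With a list of around 10,000 words, some will probably not occur in the corpus at all. Like asdf above, they get a count of 0 and end up at the tail of the sorted list. Here is a minimal sketch for separating them out, assuming the counts live in any Counter-like mapping (nltk.FreqDist is a Counter subclass). The function name split_by_coverage and the toy counts are made up for illustration and are not real Brown corpus figures:

```python
from collections import Counter


def split_by_coverage(wordlist, freqs):
    """Return (known, unknown): known words sorted by descending count,
    unknown words (count 0) in their original order."""
    known = [w for w in wordlist if freqs[w.lower()] > 0]
    unknown = [w for w in wordlist if freqs[w.lower()] == 0]
    known.sort(key=lambda w: freqs[w.lower()], reverse=True)
    return known, unknown


# toy counts standing in for the real FreqDist (illustrative numbers only)
toy_freqs = Counter({"the": 100, "house": 10, "peter": 5, "corpus": 1})
known, unknown = split_by_coverage(["corpus", "house", "the", "Peter", "asdf"], toy_freqs)
print(known)    # words found in the counts, most frequent first
print(unknown)  # words with no occurrences
```

Looking at the unknown list first can tell you whether the corpus covers your vocabulary well enough before you trust the ordering.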

If you want to use a different corpus or get more information, have a look at chapter 2 of the NLTK book.
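
If the different corpus is simply your own text (newspaper articles, blog posts and so on), the same idea can be sketched with the standard library alone. This is only an illustration under assumptions: the helper names count_words and sort_by_usage, the crude regex tokenizer, and the sample sentence are all hypothetical, and a real corpus would need far more text and probably a better tokenizer:

```python
import re
from collections import Counter


def count_words(text):
    """Count lowercase word tokens in a text using a simple regex tokenizer."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def sort_by_usage(wordlist, freqs):
    """Sort words by descending corpus frequency; unseen words count as 0."""
    return sorted(wordlist, key=lambda w: freqs[w.lower()], reverse=True)


# stand-in for text you would read from your own files
sample_text = "The cat sat on the mat. The dog saw the cat."
freqs = count_words(sample_text)
print(sort_by_usage(["dog", "cat", "the", "zebra"], freqs))
# prints ['the', 'cat', 'dog', 'zebra']
```

Because sort_by_usage only needs a mapping from word to count, you can pass it an nltk.FreqDist as well, so switching corpora does not change the sorting step.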

tobigue answered Jan 08 '23 03:01
