Like this question, I am interested in getting a large list of words by part of speech (a long list of nouns, a list of adjectives) to be used programmatically elsewhere. This answer has a solution using the WordNet database (in SQL format).
Is there a way to get such a list using the corpora and tools built into the Python NLTK? I could take a large body of text, parse it, and then store the nouns and adjectives. But given the dictionaries and other tools built in, is there a smarter way to simply extract the words that are already present in the NLTK datasets and encoded as nouns, adjectives, or whatever?
Thanks.
It's worth noting that WordNet is actually one of the corpora available through the NLTK downloader. So you could conceivably just use the solution you already found without having to reinvent any wheels.
For instance, you could just do something like this to get all noun synsets:
from nltk.corpus import wordnet as wn

# Iterate over every noun synset
for synset in wn.all_synsets('n'):
    print(synset)

# Or, equivalently, using the named constant
for synset in wn.all_synsets(wn.NOUN):
    print(synset)
That example gives you every noun in WordNet, grouped into synsets, so you can check that each word is being used in the sense you intend.
If you want to get them all into a list you can do something like the following (though this will vary quite a bit based on how you want to use the words and synsets):
all_nouns = []
for synset in wn.all_synsets('n'):
    all_nouns.extend(synset.lemma_names())
Or as a one-liner:
all_nouns = [word for synset in wn.all_synsets('n') for word in synset.lemma_names()]
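Since the question also asks about adjectives, here is a minimal sketch of the same idea generalized to any part of speech. The helper names `collect_lemma_names` and `wordnet_words` are my own and not part of NLTK. It assumes NLTK 3 with the WordNet data already downloaded, and it removes duplicates because the same lemma can appear in many synsets.

```python
def collect_lemma_names(synsets):
    """Return a sorted list of unique lemma names from an iterable of synsets.

    WordNet stores multiword lemmas with underscores (e.g. 'hot_dog'),
    so they are converted to spaces here.
    """
    words = set()
    for synset in synsets:
        for name in synset.lemma_names():
            words.add(name.replace("_", " "))
    return sorted(words)


def wordnet_words(pos):
    """Hypothetical convenience wrapper: all unique WordNet lemmas for one POS.

    `pos` is a WordNet POS constant such as wn.NOUN ('n') or wn.ADJ ('a').
    The import is done here so the helper above can be used without NLTK.
    """
    from nltk.corpus import wordnet as wn
    return collect_lemma_names(wn.all_synsets(pos))
```

Something like `adjectives = wordnet_words('a')` should then give you a deduplicated adjective list, and `wordnet_words('n')` the nouns.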
You should use the Moby Parts of Speech Project data. Don't be fixated on using only what is directly in NLTK by default. It would take little work to download the files for this, and it is pretty easy to parse them with NLTK once loaded.
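A rough sketch of what parsing that data might look like in plain Python. It assumes each line has the form word, delimiter, then a run of one-letter part-of-speech codes, with `N` for noun and `A` for adjective. The delimiter differs between distributions of the file, so it is a parameter here, and the backslash default is an assumption to check against your copy and its README. The function name `parse_moby_pos` is made up for illustration.

```python
def parse_moby_pos(lines, delimiter="\\"):
    """Group words by their one-letter POS codes.

    `lines` is any iterable of strings, e.g. an open file.
    The delimiter and the meaning of each code are assumptions;
    verify them against the README shipped with the data.
    """
    words_by_code = {}
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue
        # Split on the last delimiter so the codes are always the final field
        word, sep, codes = line.rpartition(delimiter)
        if not sep:
            continue  # skip lines that do not contain the delimiter
        for code in codes:
            words_by_code.setdefault(code, set()).add(word)
    return words_by_code


# Tiny in-memory sample in the assumed format
sample = ["dog\\NV", "quick\\AN", "slowly\\v"]
by_code = parse_moby_pos(sample)
nouns = by_code.get("N", set())
adjectives = by_code.get("A", set())
```

With the real file you would pass an open file handle instead of `sample`. Note that a word listed under several codes lands in several sets.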
I saw a similar question earlier this week (can't find the link), but like I said then, I don't think maintaining a list of nouns/adjectives/whatever is a great idea. This is primarily because the same word can have different parts of speech, depending on the context.
However, if you are still dead set on using these lists, then here's how I would do it (I don't have a working NLTK install on this machine, but I remember the basics):
import nltk

nouns = set()
# `my_corpus` stands for any NLTK corpus reader that provides .sents()
for sentence in my_corpus.sents():
    # each sentence is either a list of words or a list of (word, POS tag) tuples
    # remove the call to nltk.pos_tag if `sentence` is a list of tuples as described above
    for word, pos in nltk.pos_tag(sentence):
        if pos in ('NN', 'NNP'):  # feel free to add any other noun tags
            nouns.add(word)
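To soften the ambiguity problem mentioned above (the same word showing up under different tags), one option is to keep a word only if it carries the tag you want in most of its occurrences. This is a sketch of that idea, not a standard NLTK feature: `dominant_pos_words` and the `min_share` threshold are invented for illustration. The input is any iterable of (word, tag) pairs, for example what `nltk.corpus.brown.tagged_words()` returns once the Brown corpus is downloaded.

```python
from collections import Counter, defaultdict


def dominant_pos_words(tagged_words, tag_prefix, min_share=0.5):
    """Return words whose tag starts with `tag_prefix` in more than
    `min_share` of their occurrences.

    Words are lowercased before counting, so capitalized and
    lowercase forms of a word are merged.
    """
    tag_counts = defaultdict(Counter)
    for word, tag in tagged_words:
        tag_counts[word.lower()][tag] += 1

    selected = set()
    for word, counts in tag_counts.items():
        total = sum(counts.values())
        matching = sum(n for tag, n in counts.items() if tag.startswith(tag_prefix))
        if matching / total > min_share:
            selected.add(word)
    return selected


# Hand-made sample: "run" is a noun once and a verb twice
sample = [("The", "AT"), ("run", "NN"), ("was", "BEDZ"), ("long", "JJ"),
          ("They", "PPSS"), ("run", "VB"), ("daily", "RB"),
          ("run", "VB"), ("dog", "NN")]
mostly_nouns = dominant_pos_words(sample, "NN")
mostly_adjectives = dominant_pos_words(sample, "JJ")
```

Here "run" is left out of the noun set because it is tagged as a verb more often than as a noun, while "dog" is kept.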
Hope this helps