NLTK provides functions for printing all the words in the Brown (or Gutenberg) corpus, but the equivalent function does not seem to work for WordNet.
Is there a way to do this through NLTK? If there is not, how might one do it?
This works:
from nltk.corpus import brown as b
print(b.words())
This causes an AttributeError:
from nltk.corpus import wordnet as wn
print(wn.words())
WordNet is a lexical database of English created at Princeton and distributed as part of the NLTK corpus collection; related wordnets also exist for many other languages. It groups nouns, verbs, adjectives, and adverbs into sets of cognitive synonyms (synsets), each expressing a distinct concept. You can use it through NLTK to look up the meanings of words, their synonyms, antonyms, and more.
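For example, to look up senses, glosses, synonym lemmas, and antonyms for a word (a minimal sketch; the word "good" is just an illustrative choice):
from nltk.corpus import wordnet as wn

for syn in wn.synsets('good')[:3]:          # first few senses of "good"
    print(syn.name(), '-', syn.definition())

good = wn.synset('good.a.01')               # one adjective sense
print([l.name() for l in good.lemmas()])    # synonym lemmas of this sense
print(good.lemmas()[0].antonyms())          # antonym lemmas, e.g. 'bad'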
WordNet is a word-sense resource, so its entries are indexed by senses (a.k.a. synsets).
To iterate through the synsets:
>>> from nltk.corpus import wordnet as wn
>>> for ss in wn.all_synsets():
...     print(ss)
...     print(ss.definition())
...     break
...
Synset('able.a.01')
(usually followed by `to') having the necessary means or skill or know-how or authority to do something
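Removing the break walks the entire database; a quick way to gauge its size (the exact counts depend on the WordNet version shipped with your NLTK data; the figures below are for WordNet 3.0):
>>> sum(1 for _ in wn.all_synsets())      # all synsets
117659
>>> sum(1 for _ in wn.all_synsets('n'))   # noun synsets only
82115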
For each synset (sense/concept), there is a list of words attached to it, called lemmas: lemmas are the canonical ("root") forms of the words we look up when we check a dictionary.
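For instance, the first noun sense of "dog" carries three lemmas (using dog.n.01 purely as an illustration):
>>> wn.synset('dog.n.01').lemma_names()
['dog', 'domestic_dog', 'Canis_familiaris']
>>> wn.synset('dog.n.01').lemmas()
[Lemma('dog.n.01.dog'), Lemma('dog.n.01.domestic_dog'), Lemma('dog.n.01.Canis_familiaris')]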
To get the full list of lemmas in WordNet using a one-liner (chain comes from itertools):
>>> from itertools import chain
>>> lemmas_in_wordnet = set(chain(*[ss.lemma_names() for ss in wn.all_synsets()]))
Interestingly, wn.words() will also return all the lemma names:
>>> lemmas_in_words = set(i for i in wn.words())
>>> len(lemmas_in_wordnet)
148730
>>> len(lemmas_in_words)
147306
But strangely, the two totals do not match: wn.words() yields slightly fewer unique items.
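To see where the two collections differ, compare them as sets. A quick sketch (the exact differences will depend on your NLTK/WordNet version; casing of lemma names is one likely source of the gap):
only_from_synsets = lemmas_in_wordnet - lemmas_in_words
only_from_words = lemmas_in_words - lemmas_in_wordnet
print(len(only_from_synsets), sorted(only_from_synsets)[:10])
print(len(only_from_words), sorted(only_from_words)[:10])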
"Printing the full content" of wordnet into text seems to be something too ambitious, because wordnet
is structured sort of like a hierarchical graph, with synsets interconnected to each other and each synset has its own properties/attributes. That's why the wordnet files are not kept simply as a single textfile.
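That said, if a flat text approximation is good enough, you can dump one line per synset yourself (a minimal sketch; the file name and tab-separated layout are arbitrary choices):
from nltk.corpus import wordnet as wn

# One tab-separated line per synset: name, POS, lemmas, definition.
with open('wordnet_dump.txt', 'w', encoding='utf-8') as out:
    for ss in wn.all_synsets():
        out.write('\t'.join([
            ss.name(),
            ss.pos(),
            ','.join(ss.lemma_names()),
            ss.definition(),
        ]) + '\n')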
To see what a synset contains:
>>> first_synset = next(wn.all_synsets())
>>> dir(first_synset)
['__class__', '__delattr__', '__dict__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__slots__', '__str__', '__subclasshook__', '__unicode__', '__weakref__', '_all_hypernyms', '_definition', '_examples', '_frame_ids', '_hypernyms', '_instance_hypernyms', '_iter_hypernym_lists', '_lemma_names', '_lemma_pointers', '_lemmas', '_lexname', '_max_depth', '_min_depth', '_name', '_needs_root', '_offset', '_pointers', '_pos', '_related', '_shortest_hypernym_paths', '_wordnet_corpus_reader', 'also_sees', 'attributes', 'causes', 'closure', 'common_hypernyms', 'definition', 'entailments', 'examples', 'frame_ids', 'hypernym_distances', 'hypernym_paths', 'hypernyms', 'hyponyms', 'instance_hypernyms', 'instance_hyponyms', 'jcn_similarity', 'lch_similarity', 'lemma_names', 'lemmas', 'lexname', 'lin_similarity', 'lowest_common_hypernyms', 'max_depth', 'member_holonyms', 'member_meronyms', 'min_depth', 'name', 'offset', 'part_holonyms', 'part_meronyms', 'path_similarity', 'pos', 'region_domains', 'res_similarity', 'root_hypernyms', 'shortest_path_distance', 'similar_tos', 'substance_holonyms', 'substance_meronyms', 'topic_domains', 'tree', 'unicode_repr', 'usage_domains', 'verb_groups', 'wup_similarity']
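A few of these attributes in action, again with dog.n.01 purely as an example:
dog = wn.synset('dog.n.01')
print(dog.hypernyms())      # more general synsets, e.g. canine.n.02
print(dog.hyponyms()[:5])   # more specific synsets (breeds etc.)
print(dog.examples())       # example sentences for this sense
print(dog.lexname())        # lexicographer file, e.g. 'noun.animal'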
Going through this howto would be helpful for learning how to access the information you need in WordNet: http://www.nltk.org/howto/wordnet.html
This will print, for each word in WordNet, the set of synonyms collected from all of its synsets:
from nltk.corpus import wordnet as wn

synonyms = []
for word in wn.words():
    print(word, end=":")
    for syn in wn.synsets(word):
        for l in syn.lemmas():
            synonyms.append(l.name())
    print(set(synonyms), end="\n")
    synonyms.clear()
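Since that loop prints an entry for every one of the ~147k lemma names returned by wn.words(), it can help to preview only the first few, e.g. with itertools.islice (a small variation on the code above, not part of the original answer):
from itertools import islice
from nltk.corpus import wordnet as wn

for word in islice(wn.words(), 10):   # preview the first 10 entries only
    synonyms = {l.name() for syn in wn.synsets(word) for l in syn.lemmas()}
    print(word, synonyms, sep=": ")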