 

How can I print the entire contents of Wordnet (preferably with NLTK)?

NLTK provides functions for printing all the words in the Brown (or Gutenberg) corpus. But the equivalent function does not seem to work on Wordnet.

Is there a way to do this through NLTK? If there is not, how might one do it?

This works:

from nltk.corpus import brown as b
print(b.words())

This causes an AttributeError:

from nltk.corpus import wordnet as wn
print(wn.words())
zadrozny asked Nov 05 '15


People also ask

What does NLTK WordNet do?

WordNet is a lexical database for the English language, which was created by Princeton, and is part of the NLTK corpus. You can use WordNet alongside the NLTK module to find the meanings of words, synonyms, antonyms, and more.
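For example, a minimal sketch of looking up meanings, synonyms and antonyms for an illustrative word ('good'), assuming NLTK and the wordnet data are installed:

from nltk.corpus import wordnet as wn

for syn in wn.synsets('good'):
    print(syn.name(), '-', syn.definition())        # each sense and its meaning
    for lemma in syn.lemmas():                      # synonyms for this sense
        for ant in lemma.antonyms():                # antonyms, where present
            print('  antonym:', lemma.name(), '<->', ant.name())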

What is WordNet in NLP?

WordNet is a lexical database covering more than 200 languages, in which adjectives, adverbs, nouns, and verbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept.


2 Answers

WordNet is a word-sense resource, so its entries are indexed by senses (aka synsets) rather than by individual words.

To iterate through synsets:

>>> from nltk.corpus import wordnet as wn
>>> for ss in wn.all_synsets():
...     print(ss)
...     print(ss.definition())
...     break
... 
Synset('able.a.01')
(usually followed by `to') having the necessary means or skill or know-how or authority to do something

For each synset (sense/concept), there is a list of words attached to it, called lemmas: lemmas are the canonical ("root") forms of the words we look up when we check a dictionary.
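For instance, sticking with the able.a.01 synset from above (a small illustrative snippet; the output is roughly what a recent NLTK returns):

>>> ss = wn.synset('able.a.01')
>>> ss.lemmas()
[Lemma('able.a.01.able')]
>>> ss.lemma_names()
['able']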

To get a full list of lemmas in wordnet using a one-liner (chain comes from itertools):

>>> from itertools import chain
>>> lemmas_in_wordnet = set(chain(*[ss.lemma_names() for ss in wn.all_synsets()]))

Interestingly, wn.words() will also return all the lemma_names:

>>> lemmas_in_words  = set(i for i in wn.words())
>>> len(lemmas_in_wordnet)
148730
>>> len(lemmas_in_words)
147306

But strangely, there is a discrepancy in the total number of words collected with wn.words().
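One quick way to inspect that discrepancy (a sketch continuing the session above, not part of the original answer) is to diff the two sets:

>>> print(len(lemmas_in_wordnet - lemmas_in_words))   # names only in the synset-derived set
>>> print(len(lemmas_in_words - lemmas_in_wordnet))   # names only in wn.words()
>>> sorted(lemmas_in_wordnet - lemmas_in_words)[:10]  # peek at a few of the extras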

"Printing the full content" of wordnet into text seems to be something too ambitious, because wordnet is structured sort of like a hierarchical graph, with synsets interconnected to each other and each synset has its own properties/attributes. That's why the wordnet files are not kept simply as a single textfile.

To see what a synset contains:

>>> first_synset = next(wn.all_synsets())
>>> dir(first_synset)
['__class__', '__delattr__', '__dict__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__slots__', '__str__', '__subclasshook__', '__unicode__', '__weakref__', '_all_hypernyms', '_definition', '_examples', '_frame_ids', '_hypernyms', '_instance_hypernyms', '_iter_hypernym_lists', '_lemma_names', '_lemma_pointers', '_lemmas', '_lexname', '_max_depth', '_min_depth', '_name', '_needs_root', '_offset', '_pointers', '_pos', '_related', '_shortest_hypernym_paths', '_wordnet_corpus_reader', 'also_sees', 'attributes', 'causes', 'closure', 'common_hypernyms', 'definition', 'entailments', 'examples', 'frame_ids', 'hypernym_distances', 'hypernym_paths', 'hypernyms', 'hyponyms', 'instance_hypernyms', 'instance_hyponyms', 'jcn_similarity', 'lch_similarity', 'lemma_names', 'lemmas', 'lexname', 'lin_similarity', 'lowest_common_hypernyms', 'max_depth', 'member_holonyms', 'member_meronyms', 'min_depth', 'name', 'offset', 'part_holonyms', 'part_meronyms', 'path_similarity', 'pos', 'region_domains', 'res_similarity', 'root_hypernyms', 'shortest_path_distance', 'similar_tos', 'substance_holonyms', 'substance_meronyms', 'topic_domains', 'tree', 'unicode_repr', 'usage_domains', 'verb_groups', 'wup_similarity']
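A few of those attributes in action, using the familiar dog.n.01 synset (a small illustrative snippet):

>>> dog = wn.synset('dog.n.01')
>>> dog.lemma_names()
['dog', 'domestic_dog', 'Canis_familiaris']
>>> dog.hypernyms()
[Synset('canine.n.02'), Synset('domestic_animal.n.01')]
>>> dog.examples()
['the dog barked all night']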

Going through this HOWTO is helpful for learning how to access the information you need from wordnet: http://www.nltk.org/howto/wordnet.html

alvas answered Oct 05 '22


This will print, for every word in WordNet, the set of synonyms collected from all of its synsets:

from nltk.corpus import wordnet as wn

synonyms = []
for word in wn.words():
    print(word, end=":")
    for syn in wn.synsets(word):      # every sense of the word
        for l in syn.lemmas():        # every lemma of that sense
            synonyms.append(l.name())
    print(set(synonyms))              # de-duplicate before printing
    synonyms.clear()                  # reset for the next word
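An equivalent, slightly more compact variant (a sketch, not from the original answer) builds each set with a comprehension, so there is no shared list to clear:

from nltk.corpus import wordnet as wn

for word in wn.words():
    # all lemma names from every synset the word appears in
    synonyms = {lemma.name() for syn in wn.synsets(word) for lemma in syn.lemmas()}
    print(word, synonyms, sep=":")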
Raveena answered Oct 05 '22