
How to use the language option in synsets (nltk) if you load a wordnet manually?

For a specific purpose I have to use WordNet 1.6 instead of the current version bundled with the nltk package. I downloaded the old version here and tried to run a short piece of code using the French language option.

from collections import defaultdict
import nltk
#nltk.download() 
import os
import sys
from nltk.corpus import WordNetCorpusReader

cwd = os.getcwd()
nltk.data.path.append(cwd)
wordnet16_dir="wordnet-1.6/"
wn16_path = "{0}/dict".format(wordnet16_dir)
wn = WordNetCorpusReader(os.path.abspath("{0}/{1}".format(cwd, wn16_path)), nltk.data.find(wn16_path))

senses=wn.synsets('gouvernement',lang=u'fre')

It seems that the WordNet I downloaded manually cannot be linked to the nltk module's files for foreign languages. The error I get is the following:

Traceback (most recent call last):
File "C:/Users/Stephanie/Test/temp.py", line 16, in <module>
senses=wn.synsets('gouvernement',lang=u'fre')
File "C:\Python27\lib\site-packages\nltk\corpus\reader\wordnet.py", line 1419, in synsets
self._load_lang_data(lang)
File "C:\Python27\lib\site-packages\nltk\corpus\reader\wordnet.py", line 1064, in _load_lang_data
if lang not in self.langs():
File "C:\Python27\lib\site-packages\nltk\corpus\reader\wordnet.py", line 1088, in langs
fileids = self._omw_reader.fileids()
AttributeError: 'FileSystemPathPointer' object has no attribute 'fileids'

Using an English word doesn't raise any error (so the dictionary itself is loaded correctly):

senses=wn.synsets('government')
print senses

[Synset('government.n.01'), Synset('government.n.02'), Synset('government.n.03'), Synset('politics.n.02')]

If I use the current version of WordNet that ships with the nltk module, I have no problem using French (so it's not a syntax problem with the optional argument):

from nltk.corpus import wordnet as wn
senses=wn.synsets('gouvernement',lang=u'fre')
print senses
[Synset('government.n.02'), Synset('opinion.n.05'), Synset('government.n.03'), Synset('rule.n.01'), Synset('politics.n.02'), Synset('government.n.01'), Synset('regulation.n.03'), Synset('reign.n.03')]

But, as I said, I really have to use the old version. I guess this might be a path problem. I've been trying to read the code of the WordNetCorpusReader class, but I am quite new to Python and so far I don't see what the problem is, except that it can't find a particular file.

The file it needs seems to be wn-data-fre.tab, which is located in \nltk_data\corpora\omw\fre. I am pretty sure I will have to replace that file with a version compatible with WordNet 1.6, but still, why can't WordNetCorpusReader find it?

Stéphanie C asked Jul 17 '15 14:07


1 Answer

Short Answer:

NLTK does not support the language parameter for WordNet 1.6. There is no way to use lang='fre' when you load a different WordNet version through NLTK.


Long Answer:

The lang=... parameter is an addition built on the Open Multilingual WordNet (OMW: http://compling.hss.ntu.edu.sg/omw/), which links wordnets in different languages to Princeton WordNet version 3.0. See https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L1050.

The lang=... parameter calls the function:

def langs(self):
    ''' return a list of languages supported by Multilingual Wordnet '''
    import os
    langs = []
    fileids = self._omw_reader.fileids()
    for fileid in fileids:
        file_name, file_extension = os.path.splitext(fileid)
        if file_extension == '.tab':
            langs.append(file_name.split('-')[-1])

    return langs
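
As a rough illustration of what this function does, the sketch below applies the same extension check and name split to a hand-written list of file ids. The helper name langs_from_fileids and the sample file ids are assumptions made for this example; they are not part of NLTK.

```python
import os

def langs_from_fileids(fileids):
    """Hypothetical helper: collect language codes from OMW-style file ids."""
    codes = []
    for fileid in fileids:
        name, ext = os.path.splitext(fileid)
        # Only the .tab files carry language data, e.g. fre/wn-data-fre.tab
        if ext == '.tab':
            # The language code is the part after the last hyphen
            codes.append(name.split('-')[-1])
    return codes

# Made-up file ids for illustration only
sample_fileids = ['fre/wn-data-fre.tab', 'jpn/wn-data-jpn.tab', 'fre/LICENSE', 'README']
print(langs_from_fileids(sample_fileids))

# The path requested for a given language follows the same naming pattern
print('{0:}/wn-data-{0:}.tab'.format('fre'))
```

With these sample ids, only the two .tab entries contribute a language code, and the second print shows the relative path that would be requested for French.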

Loading the language data then opens the corresponding file, see https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L1070:

 f = self._omw_reader.open('{0:}/wn-data-{0:}.tab'.format(lang))

So if lang == 'fre', self._omw_reader is asked to open fre/wn-data-fre.tab.

The main reason the OMW data can't be found in nltk_data/corpora/omw/ is that you set omw_reader to the WordNet 1.6 path (nltk.data.find(wn16_path)) when initializing the WordNetCorpusReader object, see https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L1006. That argument is a FileSystemPathPointer rather than a corpus reader, which is why the traceback ends with "'FileSystemPathPointer' object has no attribute 'fileids'".

So when the French data is requested, the lookup fails before self._omw_reader.open('{0:}/wn-data-{0:}.tab'.format(lang)) can find the file. (see https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L1419 and https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L1070)
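
To make the failure mode concrete, here is a toy sketch that reproduces the same kind of AttributeError without NLTK. FakePathPointer and FakeOmwReader are hypothetical stand-ins written for this example; they only mimic the one difference that matters here, namely whether the object has a fileids() method.

```python
class FakePathPointer(object):
    """Hypothetical stand-in for a bare path object: it stores a path, nothing else."""
    def __init__(self, path):
        self.path = path


class FakeOmwReader(object):
    """Hypothetical stand-in for a corpus reader that exposes fileids()."""
    def __init__(self, fileids):
        self._fileids = list(fileids)

    def fileids(self):
        return self._fileids


def list_fileids(omw_reader):
    # Same kind of call as in langs(): it assumes a reader-like object
    return omw_reader.fileids()


reader = FakeOmwReader(['fre/wn-data-fre.tab'])
pointer = FakePathPointer('wordnet-1.6/dict')

print(list_fileids(reader))
try:
    list_fileids(pointer)
except AttributeError as err:
    # A path-like object has no fileids(), so the call fails here
    print('AttributeError: {0}'.format(err))
```

The reader-like object answers the call, while the path-like object raises AttributeError, which matches the shape of the error in the question's traceback.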


What you can try instead is to load two instances of WordNet:

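import os
import nltk  # needed for nltk.data below
from nltk.corpus import wordnet as wn
from nltk.corpus import WordNetCorpusReader

# Make the directory that contains wordnet-1.6/ visible to nltk.data.find()
cwd = os.getcwd()
nltk.data.path.append(cwd)
wordnet16_dir="wordnet-1.6/"

# Second instance: WordNet 1.6, loaded the same way as in the question
wn16_path = "{0}/dict".format(wordnet16_dir)
wn16 = WordNetCorpusReader(os.path.abspath("{0}/{1}".format(cwd, wn16_path)), nltk.data.find(wn16_path))

def synset2offset(ss):
    # Build an ID from the zero-padded 8-digit offset and the part of speech
    return str(ss.offset()).zfill(8) + '-' + ss.pos()

# Sets keep the membership test below fast
wn16_ids = set(synset2offset(ss) for ss in wn16.all_synsets())
wn30_ids = set(synset2offset(ss) for ss in wn.all_synsets())

# The French lookup goes through the default WordNet 3.0 instance
senses30 = wn.synsets('gouvernement',lang=u'fre')
# Keep only the senses whose offset ID also exists in WordNet 1.6.
# Caveat: offsets are byte positions in each version's data files, so check
# that a matching ID really denotes the same sense in both versions.
senses16 = [ss for ss in senses30 if synset2offset(ss) in wn16_ids]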
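
The ID format and the filtering step can be tried out without any WordNet data. In the sketch below, offset_id is a hypothetical stand-in for synset2offset that takes a plain offset and part of speech, and all offsets are made-up values chosen for illustration.

```python
def offset_id(offset, pos):
    """Hypothetical stand-in for synset2offset: zero-pad the offset to 8 digits."""
    return str(offset).zfill(8) + '-' + pos

# Made-up (offset, pos) pairs standing in for the synsets of an older WordNet
old_ids = set(offset_id(o, p) for o, p in [(5649312, 'n'), (123, 'v')])

# Made-up candidates standing in for the result of a lookup in the newer WordNet
candidates = [(5649312, 'n'), (8050678, 'n'), (123, 'v')]

# Keep only the candidates whose ID also appears among the old IDs
kept = [c for c in candidates if offset_id(*c) in old_ids]

print(offset_id(123, 'v'))
print(kept)
```

Building the old IDs as a set means each membership test is a hash lookup rather than a scan over every synset, which matters when the comparison runs against a full WordNet.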
alvas answered Sep 23 '22 19:09
