For a specific purpose I have to use WordNet 1.6 instead of the current version bundled with the nltk package. I downloaded the old version here and tried to run a simple piece of code using the French option.
from collections import defaultdict
import nltk
#nltk.download()
import os
import sys
from nltk.corpus import WordNetCorpusReader
cwd = os.getcwd()
nltk.data.path.append(cwd)
wordnet16_dir="wordnet-1.6/"
wn16_path = "{0}/dict".format(wordnet16_dir)
wn = WordNetCorpusReader(os.path.abspath("{0}/{1}".format(cwd, wn16_path)), nltk.data.find(wn16_path))
senses=wn.synsets('gouvernement',lang=u'fre')
It seems that the WordNet I downloaded manually cannot be linked to the files of the nltk module that deal with foreign languages. The error I get is the following:
Traceback (most recent call last):
  File "C:/Users/Stephanie/Test/temp.py", line 16, in <module>
    senses=wn.synsets('gouvernement',lang=u'fre')
  File "C:\Python27\lib\site-packages\nltk\corpus\reader\wordnet.py", line 1419, in synsets
    self._load_lang_data(lang)
  File "C:\Python27\lib\site-packages\nltk\corpus\reader\wordnet.py", line 1064, in _load_lang_data
    if lang not in self.langs():
  File "C:\Python27\lib\site-packages\nltk\corpus\reader\wordnet.py", line 1088, in langs
    fileids = self._omw_reader.fileids()
AttributeError: 'FileSystemPathPointer' object has no attribute 'fileids'
Using an English word doesn't generate any error (so the dictionary itself is loaded correctly):
senses=wn.synsets('government')
print senses
[Synset('government.n.01'), Synset('government.n.02'), Synset('government.n.03'), Synset('politics.n.02')]
If I use the current version of WordNet that ships with the nltk module, I have no problem using French (so it's not a syntax problem with the optional argument):
from nltk.corpus import wordnet as wn
senses=wn.synsets('gouvernement',lang=u'fre')
print senses
[Synset('government.n.02'), Synset('opinion.n.05'), Synset('government.n.03'), Synset('rule.n.01'), Synset('politics.n.02'), Synset('government.n.01'), Synset('regulation.n.03'), Synset('reign.n.03')]
But, as stated above, I really have to use the old version. I guess this might be a path problem. I've been trying to read the code of WordNetCorpusReader, but I am quite new to Python and I don't really see what the problem is so far, except that it doesn't find a particular file.
The needed file seems to be wn-data-fre.tab, which is located in \nltk_data\corpora\omw\fre. I am pretty sure that I have to replace the file with a version compatible with WordNet 1.6, but still, why can't WordNetCorpusReader find it?
Short Answer:
There is no WordNet 1.6 with the language parameter. There's no way to use lang='fre' when loading a different WordNet through NLTK.
Long Answer:
The lang=... parameter is an addition made using the Open Multilingual WordNet (OMW: http://compling.hss.ntu.edu.sg/omw/), which links wordnets of different languages to the Princeton WordNet version 3.0. See https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L1050.
Using the lang=... parameter calls the function:
def langs(self):
    ''' return a list of languages supported by Multilingual Wordnet '''
    import os
    langs = []
    fileids = self._omw_reader.fileids()
    for fileid in fileids:
        file_name, file_extension = os.path.splitext(fileid)
        if file_extension == '.tab':
            langs.append(file_name.split('-')[-1])
    return langs
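To see what that loop extracts, here is a minimal standalone sketch of the same filename logic. The function name langs_from_fileids and the sample file IDs are made up for illustration; only the splitext/split steps mirror the method above.

```python
import os

def langs_from_fileids(fileids):
    """Hypothetical helper: same filtering as langs(), but on a plain list."""
    langs = []
    for fileid in fileids:
        file_name, file_extension = os.path.splitext(fileid)
        # Only the .tab data files name a language, e.g. 'fre/wn-data-fre.tab'.
        if file_extension == '.tab':
            langs.append(file_name.split('-')[-1])
    return langs

# Assumed layout: one sub-folder per language, as in nltk_data/corpora/omw/fre.
sample_fileids = ['fre/wn-data-fre.tab', 'fre/LICENSE', 'jpn/wn-data-jpn.tab', 'README']
print(langs_from_fileids(sample_fileids))  # ['fre', 'jpn']
```

The language code is simply the last dash-separated piece of the file name, which is why the reader needs an object that can list file IDs in the first place.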
Once the language is known, the reader opens the data file for it, see https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L1070:
f = self._omw_reader.open('{0:}/wn-data-{0:}.tab'.format(lang))
So if lang == 'fre', then self._omw_reader is asked to open fre/wn-data-fre.tab.
The main reason the reader can't find wn-data-fre.tab in nltk_data/corpora/omw/ is that you've set the omw_reader to the object returned by nltk.data.find(wn16_path) when initializing the WordNetCorpusReader object, see https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L1006. That object is a FileSystemPathPointer, not a corpus reader, so it has no fileids() method, which is exactly the AttributeError in your traceback.
For the same reason, the call self._omw_reader.open('{0:}/wn-data-{0:}.tab'.format(lang)) that loads the French data cannot work either (see https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L1419 and https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L1070).
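The AttributeError from the traceback can be reproduced without NLTK. The sketch below uses two made-up stand-in classes (PathPointerStandIn and OmwReaderStandIn are not NLTK classes) to show the difference between an object that only knows a location and one that can list its files, which is what langs() relies on.

```python
class PathPointerStandIn(object):
    """Hypothetical stand-in for a path pointer: it stores a location, nothing else."""
    def __init__(self, path):
        self.path = path

class OmwReaderStandIn(object):
    """Hypothetical stand-in for a corpus reader: it can list its file IDs."""
    def __init__(self, fileids):
        self._fileids = list(fileids)

    def fileids(self):
        return list(self._fileids)

def list_fileids(omw_reader):
    """Call fileids() the way langs() does, reporting a failure instead of crashing."""
    try:
        return omw_reader.fileids()
    except AttributeError as exc:
        return "unusable as omw_reader: {0}".format(exc)

print(list_fileids(OmwReaderStandIn(['fre/wn-data-fre.tab'])))
print(list_fileids(PathPointerStandIn('wordnet-1.6/dict')))
```

The second call fails for the same reason as in the question: the object passed as the second constructor argument has no fileids() method.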
What you can try instead is to load two instances of WordNet and keep only the French synsets whose IDs also occur in WordNet 1.6:
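import os
import nltk
from nltk.corpus import wordnet as wn
from nltk.corpus import WordNetCorpusReader
cwd = os.getcwd()
nltk.data.path.append(cwd)
wordnet16_dir="wordnet-1.6/"
wn16_path = "{0}/dict".format(wordnet16_dir)
# Second WordNet instance, backed by the manually downloaded 1.6 data.
wn16 = WordNetCorpusReader(os.path.abspath("{0}/{1}".format(cwd, wn16_path)), nltk.data.find(wn16_path))
def synset2offset(ss):
    # Zero-padded 8-digit offset plus the POS tag, e.g. '00001740-n'.
    return str(ss.offset()).zfill(8) + '-' + ss.pos()
# Sets make the membership test below fast.
wn16_ids = set(synset2offset(ss) for ss in wn16.all_synsets())
wn30_ids = set(synset2offset(ss) for ss in wn.all_synsets())
# The French lookup goes through WordNet 3.0, the version OMW is linked to.
senses30 = wn.synsets('gouvernement',lang=u'fre')
# Caveat: offsets can differ between WordNet versions, so treat this filter as approximate.
senses16 = [ss for ss in senses30 if synset2offset(ss) in wn16_ids]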
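For reference, the ID built by synset2offset and the filtering step can be tried without any WordNet data. This is only a sketch: offset_pos_key is a hypothetical helper taking a plain offset and POS tag, and the (offset, pos) pairs are invented sample values, not real synsets.

```python
def offset_pos_key(offset, pos):
    """Build the zero-padded 'offset-pos' ID, mirroring synset2offset above."""
    return str(offset).zfill(8) + '-' + pos

# Invented (offset, pos) pairs standing in for ss.offset() and ss.pos().
wn30_sample = [(1740, 'n'), (8050678, 'n'), (2084071, 'n')]
wn16_sample = [(1740, 'n'), (6050000, 'n')]

# A set gives constant-time membership tests for the filter.
wn16_keys = set(offset_pos_key(o, p) for o, p in wn16_sample)
shared = [offset_pos_key(o, p) for o, p in wn30_sample
          if offset_pos_key(o, p) in wn16_keys]
print(shared)  # ['00001740-n']
```

Only IDs present in both samples survive, which is all the senses16 list comprehension does; whether a matching ID really denotes the same concept in both WordNet versions is a separate question.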