Missing Spanish wordnet from NLTK

I am trying to use the Spanish Wordnet from the Open Multilingual Wordnet in NLTK 3.0, but it seems that it was not downloaded with the 'omw' package. For example, with a code like the following:

from nltk.corpus import wordnet as wn

print [el.lemma_names('spa') for el in wn.synsets('bank')]

I get the following error message:

IOError: No such file or directory: u'***/nltk_data/corpora/omw/spa/wn-data-spa.tab'

According to the documentation, Spanish should be included in the 'omw' package, but it was not downloaded with it. Does anyone know why this happens?

asked Oct 20 '14 by papafe

1 Answer

Here's the full error traceback if a language is missing from the Open Multilingual WordNet in your nltk_data directory:

>>> from nltk.corpus import wordnet as wn
>>> wn.synsets('bank')[0].lemma_names('spa')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/wordnet.py", line 418, in lemma_names
    self._wordnet_corpus_reader._load_lang_data(lang)
  File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/wordnet.py", line 1070, in _load_lang_data
    f = self._omw_reader.open('{0:}/wn-data-{0:}.tab'.format(lang))
  File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/api.py", line 198, in open
    stream = self._root.join(file).open(encoding)
  File "/usr/local/lib/python2.7/dist-packages/nltk/data.py", line 309, in join
    return FileSystemPathPointer(_path)
  File "/usr/local/lib/python2.7/dist-packages/nltk/compat.py", line 380, in _decorator
    return init_func(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/nltk/data.py", line 287, in __init__
    raise IOError('No such file or directory: %r' % _path)
IOError: No such file or directory: u'/home/alvas/nltk_data/corpora/omw/spa/wn-data-spa.tab'

So the first thing is to check whether the package was installed and is up to date:

>>> import nltk
>>> nltk.download('omw')
[nltk_data] Downloading package omw to /home/alvas/nltk_data...
[nltk_data]   Package omw is already up-to-date!
True

Then check the nltk_data directory, and you will find that the 'spa' folder is missing:

alvas@ubi:~/nltk_data/corpora/omw$ ls
als  arb  cmn  dan  eng  fas  fin  fra  fre  heb  ita  jpn  mcr  msa  nor  pol  por  README  tha
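The same check can be done from Python by scanning the corpus directory. A minimal sketch, assuming the nltk_data/corpora/omw layout shown above (the function name omw_languages is my own, not part of NLTK):

```python
import os

def omw_languages(omw_root):
    """Return the language codes that have their own wn-data-<lang>.tab
    file under the local omw corpus directory."""
    langs = []
    for name in sorted(os.listdir(omw_root)):
        # NLTK looks for omw/<lang>/wn-data-<lang>.tab
        tab = os.path.join(omw_root, name, "wn-data-{0}.tab".format(name))
        if os.path.isfile(tab):
            langs.append(name)
    return langs
```

Running it against ~/nltk_data/corpora/omw would confirm that 'spa' is absent from the list.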

So here's the short term solution:

$ wget http://compling.hss.ntu.edu.sg/omw/wns/spa.zip
$ mkdir ~/nltk_data/corpora/omw/spa
$ unzip -p spa.zip mcr/wn-data-spa.tab > ~/nltk_data/corpora/omw/spa/wn-data-spa.tab

Alternatively, you can simply copy the file already shipped at nltk_data/corpora/omw/mcr/wn-data-spa.tab into a new nltk_data/corpora/omw/spa/ directory.
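That copy step can also be scripted from Python. A minimal sketch, assuming the same directory layout (ensure_omw_lang is a hypothetical helper, not an NLTK function):

```python
import os
import shutil

def ensure_omw_lang(omw_root, lang, fallback_dir="mcr"):
    """If NLTK expects omw/<lang>/wn-data-<lang>.tab but the file is
    missing, copy it from a folder that ships the same file (e.g. 'mcr')."""
    target_dir = os.path.join(omw_root, lang)
    target = os.path.join(target_dir, "wn-data-{0}.tab".format(lang))
    if os.path.isfile(target):
        return target  # already in place, nothing to do
    source = os.path.join(omw_root, fallback_dir, "wn-data-{0}.tab".format(lang))
    if not os.path.isfile(source):
        raise IOError("no wn-data-{0}.tab found under {1}".format(lang, omw_root))
    if not os.path.isdir(target_dir):
        os.makedirs(target_dir)
    shutil.copy(source, target)
    return target
```

For the case in the question, ensure_omw_lang(os.path.expanduser("~/nltk_data/corpora/omw"), "spa") would put the Spanish table where the wordnet reader looks for it.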

[out]:

>>> from nltk.corpus import wordnet as wn
>>> wn.synsets('bank')[0].lemma_names('spa')
[u'margen', u'orilla', u'vera']

Now lemma_names() works for Spanish. If you're looking for other languages from the Open Multilingual Wordnet, you can browse http://compling.hss.ntu.edu.sg/omw/, then download the data and put it in the respective nltk_data directory.

The long-term solution would be to ask the NLTK and OMW developers to keep the datasets used by the NLTK API up to date.

answered Sep 28 '22 by alvas