
How do I get the Synset offset in Wordnet for use in Imagenet

Tags: java, nlp, wordnet

I plan to use Image-Net to build a list of synonyms for a language task. According to the Image-Net API Docs,

ImageNet is based upon WordNet 3.0. To uniquely identify a synset, we use "WordNet ID" (wnid), which is a concatenation of POS ( i.e. part of speech ) and SYNSET OFFSET of WordNet.

This all seems well and good, but I cannot find any documentation on how to get the SYNSET OFFSET for a synset in WordNet. This RiTaWN tutorial explains how to get the Sense ID, but those are not the same values.

How can I get the SYNSET OFFSET so I can begin to use the Image-Net API to build my list of picturable nouns and synonyms?

Danny Delott asked Nov 01 '22 01:11


1 Answer

In index.noun, here is one of the more interesting entries:

car n 5 6 @ ~ #m #p %p - 5 2 02958343 02959942 02960501 02960352 02934451

The eight-digit numbers at the end of the line are the "synset offsets" you seek. Take the first one in the car row, 02958343, and prefix it with the second field, "n" (all entries in index.noun have "n" in the second field, of course), to get n02958343, which gives you: http://image-net.org/synset?wnid=n02958343

If you try the fifth number in the list, 02934451, you get images of cable cars instead.
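The lookup described above can be sketched in Java. This is only a minimal illustration, assuming the index line layout given in the WordNet wndb documentation (lemma, pos, synset_cnt, p_cnt, the pointer symbols, sense_cnt, tagsense_cnt, then the synset offsets). The class and method names are hypothetical and not part of any library.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only.
final class IndexLineSketch {

    /** Builds ImageNet-style wnids (pos + synset offset) from one index file line. */
    static List<String> wnidsFromIndexLine(String line) {
        String[] fields = line.trim().split("\\s+");
        String pos = fields[1];                          // "n" for every line in index.noun
        int synsetCount = Integer.parseInt(fields[2]);   // synset_cnt
        int pointerCount = Integer.parseInt(fields[3]);  // p_cnt

        // Skip lemma, pos, synset_cnt, p_cnt, the pointer symbols,
        // then sense_cnt and tagsense_cnt.
        int firstOffset = 4 + pointerCount + 2;

        List<String> wnids = new ArrayList<>();
        for (int i = 0; i < synsetCount; i++) {
            wnids.add(pos + fields[firstOffset + i]);
        }
        return wnids;
    }

    public static void main(String[] args) {
        String line = "car n 5 6 @ ~ #m #p %p - 5 2 "
                + "02958343 02959942 02960501 02960352 02934451";
        for (String wnid : wnidsFromIndexLine(line)) {
            System.out.println(wnid);
        }
    }
}
```

For the car line, the first wnid in the returned list is n02958343 and the fifth is n02934451.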

By the way, the documentation for the rest of the index.noun fields is here: https://wordnet.princeton.edu/wordnet/man/wndb.5WN.html

The same synset offset is used throughout the data.noun file (which is the file that stores all the links between synsets).
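Since the goal is a list of synonyms, here is a rough sketch of pulling the words of a synset out of one data.noun line. It assumes the data file layout from the wndb documentation linked above: synset_offset, lex_filenum, ss_type, w_cnt (two hexadecimal digits), then alternating word and lex_id fields. The class name, method names, and the sample line in the comment are hypothetical; the sample is simplified, with the pointer list dropped.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only.
final class DataLineSketch {

    /** Returns the wnid (ss_type + synset_offset) for one data file line. */
    static String wnidFromDataLine(String line) {
        String[] fields = line.trim().split("\\s+");
        return fields[2] + fields[0];
    }

    /** Returns the words (synonyms) listed on one data file line. */
    static List<String> wordsFromDataLine(String line) {
        String[] fields = line.trim().split("\\s+");
        // fields[0] = synset_offset, fields[1] = lex_filenum, fields[2] = ss_type
        int wordCount = Integer.parseInt(fields[3], 16); // w_cnt is hexadecimal

        List<String> words = new ArrayList<>();
        for (int i = 0; i < wordCount; i++) {
            // Words and lex_ids alternate, starting at field 4;
            // multi-word entries use underscores in place of spaces.
            words.add(fields[4 + 2 * i].replace('_', ' '));
        }
        return words;
    }

    public static void main(String[] args) {
        // Simplified sample line (assumption): real lines carry pointer entries
        // between the last lex_id and the "|" that starts the gloss.
        String line = "02958343 06 n 05 car 0 auto 0 automobile 0 machine 1 "
                + "motorcar 0 000 | a motor vehicle with four wheels";
        System.out.println(wnidFromDataLine(line) + " " + wordsFromDataLine(line));
    }
}
```

Because the offset comes first on each data line, the wnid built from an index.noun entry can be matched against it directly.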


BTW, the synset offsets change from release to release, and ImageNet appears to be tied to WordNet 3.0 (the alternative would be breaking every URL with each WordNet release). E.g. this is how car looks in WordNet 3.1:

car n 5 6 @ ~ #m #p %p - 5 2 02961779 02963378 02963937 02963788 02937835 

(but http://image-net.org/synset?wnid=n02961779 does not find car pictures)

This is why, when I designed MLSN, I instead used "06car0" to mean the first synset of car (06 means noun.artifact, see the WordNet docs); that unique key can then survive WordNet updates. Unfortunately it did not catch on, so people still use WordNet synset offsets.

Darren Cook answered Nov 09 '22 10:11