 

What exactly are WordNet lexicographer files? Understanding how WordNet works

I'm trying to understand the WordNet file formats; the main documents are WNDB and WNINPUT. As I understand from WNDB, there are files called index.something and data.something, where something can be noun, verb, adj, or adv.

So, if I want to know something about the word dog as a noun, I'd look into the index.noun, search for the word dog, which gives me the line:

dog n 7 5 @ ~ #m #p %p 7 1 02086723 10133978 10042764 09905672 07692347 03907626 02712903  

According to the WNDB document, this line represents these data:

lemma  pos  synset_cnt  p_cnt  [ptr_symbol...]  sense_cnt  tagsense_cnt   synset_offset  [synset_offset...] 

Where lemma is the word, pos is the identifier that tells us it's a noun, synset_cnt tells us how many synsets this word is included in, p_cnt is the number of pointer symbols that follow, and [ptr_symbol...] is the list of those symbols. I didn't understand sense_cnt and tagsense_cnt and would like an explanation. Finally, synset_offset is one or more synsets to be looked up in the data.noun file.
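For reference, here is a minimal Python sketch of splitting such an index line into the fields named above. It only follows the field order quoted from WNDB; the names `IndexEntry` and `parse_index_line` are made up for illustration and are not part of any WordNet tool.

```python
from typing import List, NamedTuple


class IndexEntry(NamedTuple):  # hypothetical container, not part of WordNet
    lemma: str
    pos: str
    synset_cnt: int
    ptr_symbols: List[str]
    sense_cnt: int
    tagsense_cnt: int
    synset_offsets: List[str]


def parse_index_line(line: str) -> IndexEntry:
    """Split one index.<pos> line into the fields listed in WNDB."""
    fields = line.split()
    lemma, pos = fields[0], fields[1]
    synset_cnt = int(fields[2])
    p_cnt = int(fields[3])
    # Exactly p_cnt pointer symbols follow the p_cnt field.
    ptr_symbols = fields[4:4 + p_cnt]
    rest = fields[4 + p_cnt:]
    sense_cnt, tagsense_cnt = int(rest[0]), int(rest[1])
    # Everything after the two counts is a synset offset into data.<pos>.
    synset_offsets = rest[2:]
    return IndexEntry(lemma, pos, synset_cnt, ptr_symbols,
                      sense_cnt, tagsense_cnt, synset_offsets)


DOG_LINE = ("dog n 7 5 @ ~ #m #p %p 7 1 02086723 10133978 10042764 "
            "09905672 07692347 03907626 02712903")
entry = parse_index_line(DOG_LINE)
print(entry.ptr_symbols)      # the five pointer symbols
print(entry.synset_offsets)   # the seven offsets to look up in data.noun
```

The offsets are kept as strings so the leading zeros survive and they can be compared directly with the first field of a data.noun line.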

Ok, so I know those pointers point to something, and here are their descriptions, as written in WNINPUT:

@     Hypernym
~     Hyponym
#m    Member holonym
#p    Part holonym
%p    Part meronym

I don't know how to find a Hypernym for this noun, but let's continue:

The other important data are the synset_offsets, which are:

02086723 10133978 10042764 09905672 07692347 03907626 02712903  

Let's look at the first one, 02086723, in data.noun:

02086723 05 n 03 dog 0 domestic_dog 0 Canis_familiaris 0 023 @ 02085998 n 0000 @ 01320032 n 0000 #m 02086515 n 0000 #m 08011383 n 0000 ~ 01325095 n 0000 ~ 02087384 n 0000 ~ 02087513 n 0000 ~ 02087924 n 0000 ~ 02088026 n 0000 ~ 02089774 n 0000 ~ 02106058 n 0000 ~ 02112993 n 0000 ~ 02113458 n 0000 ~ 02113610 n 0000 ~ 02113781 n 0000 ~ 02113929 n 0000 ~ 02114152 n 0000 ~ 02114278 n 0000 ~ 02115149 n 0000 ~ 02115478 n 0000 ~ 02115987 n 0000 ~ 02116630 n 0000 %p 02161498 n 0000 | a member of the genus Canis (probably descended from the common wolf) that has been domesticated by man since prehistoric times; occurs in many breeds; "the dog barked all night" 

As you can see, we've found the line that begins with 02086723. The contents of this line are described in WNDB as:

synset_offset  lex_filenum  ss_type  w_cnt  word  lex_id  [word  lex_id...]  p_cnt  [ptr...]  [frames...]  |   gloss 

synset_offset we already know,

lex_filenum says which lexicographer file our word is in (this is the part I understand the least),

ss_type is n which tells us that it's a noun,

w_cnt: two digit hexadecimal integer indicating the number of words in the synset, which in this case is 03, which means we have 3 words in this synset: dog 0 domestic_dog 0 Canis_familiaris 0, each one followed by a number called:

lex_id: one digit hexadecimal integer that, when appended onto lemma, uniquely identifies a sense within a lexicographer file

p_cnt: counts the number of pointers, which in our case is `023`, so we have 23 pointers, wow

After p_cnt come the pointers, each one in the format:

pointer_symbol  synset_offset  pos  source/target 

Where pointer_symbol is about the symbols like the ones I showed (@, ~, ...),

synset_offset: is the byte offset of the target synset in the data file corresponding to pos

source/target: this field distinguishes lexical and semantic pointers. It is a four byte field, containing two two-digit hexadecimal integers. The first two digits indicate the word number in the current (source) synset, and the last two digits indicate the word number in the target synset. A value of 0000 means that pointer_symbol represents a semantic relation between the current (source) synset and the target synset indicated by synset_offset.
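Putting the field descriptions above together, here is a rough Python sketch of a parser for one data.noun line. The function name `parse_data_line` and the dictionary keys are my own choices, not anything defined by WordNet, and verb frames are deliberately not handled. The sample is the canine synset line that appears further down in this question, since it contains one lexical pointer (`0101`) next to the semantic ones (`0000`).

```python
def parse_data_line(line: str) -> dict:
    """Split one data.<pos> line into the fields listed in WNDB (frames ignored)."""
    body, _, gloss = line.partition("|")
    f = body.split()
    synset_offset, lex_filenum, ss_type = f[0], int(f[1]), f[2]
    w_cnt = int(f[3], 16)                 # two-digit hexadecimal
    i = 4
    words = []
    for _ in range(w_cnt):
        words.append((f[i], int(f[i + 1], 16)))   # (word, lex_id)
        i += 2
    p_cnt = int(f[i])                     # three-digit decimal
    i += 1
    pointers = []
    for _ in range(p_cnt):
        symbol, target_offset, pos, src_tgt = f[i:i + 4]
        pointers.append({
            "symbol": symbol,
            "offset": target_offset,
            "pos": pos,
            "source": int(src_tgt[:2], 16),   # 0 means the whole synset
            "target": int(src_tgt[2:], 16),
        })
        i += 4
    return {
        "synset_offset": synset_offset,
        "lex_filenum": lex_filenum,
        "ss_type": ss_type,
        "words": words,
        "pointers": pointers,
        "gloss": gloss.strip(),
    }


CANINE_LINE = (
    "02085998 05 n 02 canine 0 canid 0 011 @ 02077948 n 0000 "
    "#m 02085690 n 0000 + 02688440 a 0101 ~ 02086324 n 0000 "
    "~ 02086723 n 0000 ~ 02116752 n 0000 ~ 02117748 n 0000 "
    "~ 02117987 n 0000 ~ 02119787 n 0000 ~ 02120985 n 0000 "
    "%p 02442560 n 0000 | any of various fissiped mammals with "
    "nonretractile claws and typically long muzzles  "
)
synset = parse_data_line(CANINE_LINE)
for ptr in synset["pointers"]:
    kind = "semantic" if ptr["source"] == 0 and ptr["target"] == 0 else "lexical"
    print(ptr["symbol"], ptr["offset"], ptr["pos"], kind)
```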

Ok, so let's examine the first pointer:

@ 02085998 n 0000

It's a pointer with symbol @, indicating it's a Hypernym; it points to the synset with offset 02085998 of type n (noun), and source/target is 0000.

When I search for it in data.noun, I get

02085998 05 n 02 canine 0 canid 0 011 @ 02077948 n 0000 #m 02085690 n 0000 + 02688440 a 0101 ~ 02086324 n 0000 ~ 02086723 n 0000 ~ 02116752 n 0000 ~ 02117748 n 0000 ~ 02117987 n 0000 ~ 02119787 n 0000 ~ 02120985 n 0000 %p 02442560 n 0000 | any of various fissiped mammals with nonretractile claws and typically long muzzles  

which is a Hypernym of dog. So that's how you find relations between synsets. I guess the pointer symbols in the index line for dog are just there to tell me which types of relations I can find for the word dog? Isn't that redundant? These pointer symbols already appear in each of the synsets, as we've seen: when we look up each synset_offset in data.noun, we can see those pointer symbols, so why are they necessary in the index.noun file?
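Since synset_offset is described as a byte offset into the data file, the lookup I did by hand can be sketched as a seek followed by reading one line. This is only an illustration: the helper names are made up, and the demo uses a two-line toy file built in memory (with offsets computed to match) instead of the real data.noun. With the real file I assume you would open it in binary mode so that `seek` works in bytes.

```python
import io
from typing import BinaryIO, List


def read_synset_line(data_file: BinaryIO, offset: str) -> str:
    """Jump to the byte offset and return the synset line that starts there."""
    data_file.seek(int(offset))
    return data_file.readline().decode("utf-8").rstrip("\n")


def hypernym_offsets(line: str) -> List[str]:
    """Return the target offsets of all '@' pointers in a data.<pos> line."""
    body = line.partition("|")[0].split()
    w_cnt = int(body[3], 16)
    p_start = 4 + 2 * w_cnt               # position of the p_cnt field
    p_cnt = int(body[p_start])
    ptrs = body[p_start + 1:p_start + 1 + 4 * p_cnt]
    # Each pointer is four tokens: symbol, offset, pos, source/target.
    return [ptrs[k + 1] for k in range(0, len(ptrs), 4) if ptrs[k] == "@"]


# Toy data file (NOT real WordNet content): the first field of each line
# equals the byte position where that line starts.
first = b"00000000 05 n 01 canine 0 000 | toy parent entry\n"
second_offset = f"{len(first):08d}"
second = (f"{second_offset} 05 n 01 dog 0 001 @ 00000000 n 0000 "
          "| toy child entry\n").encode("utf-8")
toy_file = io.BytesIO(first + second)

dog_line = read_synset_line(toy_file, second_offset)
parent_line = read_synset_line(toy_file, hypernym_offsets(dog_line)[0])
print(parent_line)
```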

Also, note that I didn't use the lexicographer files at all. I know that the lex_filenum field in data.noun tells me where the data structure for dog is located, but what is this structure for? As you can see, I could find the hypernym, and many other relations, just by looking at the index and data files; I didn't use any of the so-called lexicographer files.

PPP asked Feb 14 '17 02:02




2 Answers

Yes, the WordNet documentation is rather hard to read...

You're looking for this page: https://wordnet.princeton.edu/wordnet/man/lexnames.5WN.html

During WordNet development synsets are organized into forty-five lexicographer files based on syntactic category and logical groupings

These groupings are a kind of flat clustering that runs in parallel to the hypernym/hyponym hierarchy.

In short:

From the docs:

File Format [of the lexnames file in WordNet-3.0/dict/]

Each line in lexnames contains 3 tab separated fields, and is terminated with a newline character. The first field is the two digit decimal integer file number. (The first file in the list is numbered 00 .) The second field is the name of the lexicographer file that is represented by that number, and the third field is an integer that indicates the syntactic category of the synsets contained in the file. This is simply a shortcut for programs and scripts, since the syntactic category is also part of the lexicographer file's name.
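As a small illustration of that format, here is a Python sketch that reads lexnames-style lines into a lookup table. The function name `parse_lexnames` is hypothetical, the sample text is just four lines typed in the three-field tab-separated layout described above, and the category codes (1 noun, 2 verb, 3 adjective, 4 adverb) are as I read them in lexnames(5WN).

```python
from typing import Dict, Tuple


def parse_lexnames(text: str) -> Dict[int, Tuple[str, int]]:
    """Map file number -> (lexicographer file name, syntactic category code)."""
    table = {}
    for line in text.splitlines():
        if not line.strip():
            continue                      # skip blank lines
        number, name, category = line.split("\t")
        table[int(number)] = (name, int(category))
    return table


# Sample lines in the lexnames layout: number <TAB> name <TAB> category.
SAMPLE = "00\tadj.all\t3\n03\tnoun.Tops\t1\n05\tnoun.animal\t1\n29\tverb.body\t2\n"

lexnames = parse_lexnames(SAMPLE)
print(lexnames[5])   # entry for file number 05
```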

In layman's terms (my own explanation):

It's just a standard for how the values in the 2nd column of the data files (data.noun, data.verb, etc.) should be assigned.

Traditionally, WordNet creators/maintainers would name their files accordingly, but sometimes it's easier to just put all nouns together and use the index to denote the synset's category.

The guidelines for the categories are as follows:

File Number  Name                Contents
00           adj.all             all adjective clusters
01           adj.pert            relational adjectives (pertainyms)
02           adv.all             all adverbs
03           noun.Tops           unique beginner for nouns
04           noun.act            nouns denoting acts or actions
05           noun.animal         nouns denoting animals
06           noun.artifact       nouns denoting man-made objects
07           noun.attribute      nouns denoting attributes of people and objects
08           noun.body           nouns denoting body parts
09           noun.cognition      nouns denoting cognitive processes and contents
10           noun.communication  nouns denoting communicative processes and contents
11           noun.event          nouns denoting natural events
12           noun.feeling        nouns denoting feelings and emotions
13           noun.food           nouns denoting foods and drinks
14           noun.group          nouns denoting groupings of people or objects
15           noun.location       nouns denoting spatial position
16           noun.motive         nouns denoting goals
17           noun.object         nouns denoting natural objects (not man-made)
18           noun.person         nouns denoting people
19           noun.phenomenon     nouns denoting natural phenomena
20           noun.plant          nouns denoting plants
21           noun.possession     nouns denoting possession and transfer of possession
22           noun.process        nouns denoting natural processes
23           noun.quantity       nouns denoting quantities and units of measure
24           noun.relation       nouns denoting relations between people or things or ideas
25           noun.shape          nouns denoting two and three dimensional shapes
26           noun.state          nouns denoting stable states of affairs
27           noun.substance      nouns denoting substances
28           noun.time           nouns denoting time and temporal relations
29           verb.body           verbs of grooming, dressing and bodily care
30           verb.change         verbs of size, temperature change, intensifying, etc.
31           verb.cognition      verbs of thinking, judging, analyzing, doubting
32           verb.communication  verbs of telling, asking, ordering, singing
33           verb.competition    verbs of fighting, athletic activities
34           verb.consumption    verbs of eating and drinking
35           verb.contact        verbs of touching, hitting, tying, digging
36           verb.creation       verbs of sewing, baking, painting, performing
37           verb.emotion        verbs of feeling
38           verb.motion         verbs of walking, flying, swimming
39           verb.perception     verbs of seeing, hearing, feeling
40           verb.possession     verbs of buying, selling, owning
41           verb.social         verbs of political and social activities and events
42           verb.stative        verbs of being, having, spatial relations
43           verb.weather        verbs of raining, snowing, thawing, thundering
44           adj.ppl             participial adjectives

So for example in WordNet-3.0/dict/data.noun, we see lines:

00034213 03 n 01 phenomenon 0 008 @ 00029677 n 0000 ~ 11408559 n 0000 ~ 11408733 n 0000 ~ 11408914 n 0000 ~ 11410625 n 0000 ~ 11418138 n 0000 ~ 11418460 n 0000 ~ 11529295 n 0000 | any state or process known through the senses rather than by intuition or reasoning  
00034479 04 n 01 thing 0 001 @ 00037396 n 0000 | an action; "how could you do such a thing?"  

Look at the 2nd column: for phenomenon the value is 03, which points to noun.Tops.

For thing, it has the value 04 which refers to noun.act.
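The same lookup can be sketched in a few lines of Python. `LEX_NAMES` here is only a hand-copied excerpt of the table above and `lex_file_of` is a made-up helper name; a fuller version would presumably load all forty-five entries from the lexnames file.

```python
# Excerpt of the file-number table above (hand-copied, not the full list).
LEX_NAMES = {3: "noun.Tops", 4: "noun.act", 5: "noun.animal"}


def lex_file_of(data_line: str) -> str:
    """Return the lexicographer file name for one data.<pos> line."""
    lex_filenum = int(data_line.split()[1])   # 2nd column, two-digit decimal
    return LEX_NAMES[lex_filenum]


PHENOMENON_LINE = (
    "00034213 03 n 01 phenomenon 0 008 @ 00029677 n 0000 ~ 11408559 n 0000 "
    "~ 11408733 n 0000 ~ 11408914 n 0000 ~ 11410625 n 0000 ~ 11418138 n 0000 "
    "~ 11418460 n 0000 ~ 11529295 n 0000 | any state or process known through "
    "the senses rather than by intuition or reasoning  "
)
THING_LINE = ('00034479 04 n 01 thing 0 001 @ 00037396 n 0000 | an action; '
              '"how could you do such a thing?"  ')

for line in (PHENOMENON_LINE, THING_LINE):
    print(line.split()[4], "->", lex_file_of(line))
```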


IMHO, depending on the usage, these assignments may not be useful. They are mostly used when creating the WordNet, and they show how ontological hierarchies can easily be flattened into simple flat clusters.

alvas answered Sep 27 '22 22:09


What is useful in this information is the relationships that exist between the entries and, sometimes, the type of information. Everybody uses WordNet! Some even link it to RDF notation. But... I used WordNet a few years ago because I wanted to build a hypertree of words with their superclass(es) and subclass(es), plus a few other types of relationships that are absent from WN, and I had to drop WordNet and its jargon. I needed a 'less simplified' organization of "the real world". I came up with my own, using a mix of Wiktionary, lots of regular expressions, some YAGO, a few other ontologies that let me build hierarchies and other relationships, and some ML. I have also looked at Roger Schank's classification, Roget's thesaurus, and various attempts to identify and classify concepts (typologies), such as Wierzbicka's and others. If you want something serious, DIY.

Avner Levy answered Sep 27 '22 23:09