Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

understanding semcor corpus structure h

I'm learning NLP. I currently playing with Word Sense Disambiguation. I'm planning to use the semcor corpus as training data but I have trouble understanding the xml structure. I tried googling but did not get any resource describing the content structure of semcor.

<s snum="1">
<wf cmd="ignore" pos="DT">The</wf>
<wf cmd="done" lemma="group" lexsn="1:03:00::" pn="group" pos="NNP" rdf="group" wnsn="1">Fulton_County_Grand_Jury</wf>
<wf cmd="done" lemma="say" lexsn="2:32:00::" pos="VB" wnsn="1">said</wf>
<wf cmd="done" lemma="friday" lexsn="1:28:00::" pos="NN" wnsn="1">Friday</wf>
<wf cmd="ignore" pos="DT">an</wf>
<wf cmd="done" lemma="investigation" lexsn="1:09:00::" pos="NN" wnsn="1">investigation</wf>
<wf cmd="ignore" pos="IN">of</wf>
<wf cmd="done" lemma="atlanta" lexsn="1:15:00::" pos="NN" wnsn="1">Atlanta</wf>
<wf cmd="ignore" pos="POS">'s</wf>
<wf cmd="done" lemma="recent" lexsn="5:00:00:past:00" pos="JJ" wnsn="2">recent</wf>
<wf cmd="done" lemma="primary_election" lexsn="1:04:00::" pos="NN" wnsn="1">primary_election</wf>
<wf cmd="done" lemma="produce" lexsn="2:39:01::" pos="VB" wnsn="4">produced</wf>
<punc>``</punc>
<wf cmd="ignore" pos="DT">no</wf>
<wf cmd="done" lemma="evidence" lexsn="1:09:00::" pos="NN" wnsn="1">evidence</wf>
<punc>''</punc>
<wf cmd="ignore" pos="IN">that</wf>
<wf cmd="ignore" pos="DT">any</wf>
<wf cmd="done" lemma="irregularity" lexsn="1:04:00::" pos="NN" wnsn="1">irregularities</wf>
<wf cmd="done" lemma="take_place" lexsn="2:30:00::" pos="VB" wnsn="1">took_place</wf>
<punc>.</punc>
</s>
  • I'm assuming wnsn is 'word sense'. Is it correct?
  • What does the attribute lexsn mean? How does it map to wordnet?
  • What does the attribute pn refer to? (third line)
  • How is the rdf attribute assigned? (again third line)
  • In general, what are the possible attributes?
like image 638
Sharmila Avatar asked Jan 03 '11 10:01

Sharmila


People also ask

What is SemCor?

SemCor: semantically annotated English corpus. The SemCor corpus is an English corpus with semantically annotated texts. The semantic analysis was done manually with WordNet 1.6 senses (SemCor version 1.6) and later automatically mapped to WordNet 3.0 (SemCor version 3.0).

What is SemCor dataset?

SemCor is a manually sense-annotated corpus divided in 352 documents for a total of 226,040 sense annotations. SemCor is, to our knowledge, the largest corpus manually annotated with WordNet senses, and is the main corpus used in the literature to train supervised WSD systems (Agirre et al., 2010b; Zhong and Ng, 2010).


1 Answers

The format is described in the "doc/cxtfile.txt" file in the SemCor 1.6 archive; for some reason, documentation is not included in later versions.

like image 136
Bkkbrad Avatar answered Sep 25 '22 15:09

Bkkbrad