
Detect Proper Nouns with WordNet?

Tags: java, nlp, wordnet

I'm using JAWS to access WordNet. Given a word, is there any way to detect if it is a proper noun? It looks like the synsets have pretty coarse lexical categories.

To clarify, there is no context for the words - they are just presented individually. If a word could conceivably be used as a common noun, it is acceptable. So "mark" is fine, because although it could be someone's name it could also refer to a point. However, "Africa" is not.

Nick Heiner asked Dec 28 '09 03:12


2 Answers

Unfortunately, you're not going to be able to reliably determine proper noun information from WordNet synsets. What you are looking for is Named Entity Recognition. The Wikipedia page on the topic links to several implementations available in Java. I would personally recommend Stanford NER or LingPipe.

Updated:

Based on the added constraint of no context for words, you could use capitalization as the primary indicator and then double check WordNet to see if the word can be used as a noun. Perhaps something like this:

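import edu.smu.tw.wordnet.*; // JAWS classes: WordNetDatabase, Synset, SynsetType

String word = "foo";
boolean isProperNoun = false;
// Capitalization is the primary signal; WordNet only confirms the word can be a noun.
if (!word.isEmpty() && Character.isUpperCase(word.charAt(0))) {
    // JAWS locates the WordNet files via the "wordnet.database.dir" system property.
    WordNetDatabase database = WordNetDatabase.getFileInstance();
    Synset[] synsets = database.getSynsets(word, SynsetType.NOUN);
    isProperNoun = synsets.length > 0;
}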

That would eliminate false positives like these:

If you build it...
As you wish...
Oh Romeo, Romeo...

And still catch just the capitalized nouns in:

In the Book of Mark it says...
Have you heard The Roots or The Who recently?

but it would still give you false positives on:

Mark the first instance...
Book 'em, Danno.

because those words could be proper nouns, and without context you can't tell.

If you wanted to get really tricky, you could walk up the hypernym tree from any noun to see if you reach something obvious like 'company' or 'country'. However, the last time I was working with WordNet (4 years ago), the hypernym/hyponym relationships were not very reliable or consistent, which could cause a lot of false negatives (and it would not reduce the false positives I mentioned above, because those are completely context dependent).
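To make the hypernym idea a bit more concrete, here is a minimal sketch of the walk itself. It does not call JAWS: the HYPERNYMS map is a toy stand-in with hand-written links (hypothetical data, not real WordNet output), and the class name, reachesMarker, and the MARKERS set are all made up for illustration. With JAWS you would presumably feed the walk from NounSynset.getHypernyms() and getInstanceHypernyms() instead.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class HypernymWalk {
    // Toy stand-in for WordNet's hypernym links (hypothetical data for illustration only).
    static final Map<String, List<String>> HYPERNYMS = Map.of(
        "Africa", List.of("continent"),
        "continent", List.of("landmass"),
        "landmass", List.of("object"),
        "score", List.of("evaluation"),
        "evaluation", List.of("judgment"));

    // Categories treated as a giveaway for a named entity (an assumption, tune as needed).
    static final Set<String> MARKERS = Set.of("continent", "country", "company");

    // Breadth-first walk over the ancestors of 'start'; the start node itself is not tested,
    // so a common noun like "continent" does not match merely by being a marker.
    public static boolean reachesMarker(String start) {
        Deque<String> queue = new ArrayDeque<>(HYPERNYMS.getOrDefault(start, List.of()));
        Set<String> seen = new HashSet<>();
        while (!queue.isEmpty()) {
            String node = queue.poll();
            if (!seen.add(node)) {
                continue; // already visited; guards against cycles in the data
            }
            if (MARKERS.contains(node)) {
                return true;
            }
            queue.addAll(HYPERNYMS.getOrDefault(node, List.of()));
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(reachesMarker("Africa"));
        System.out.println(reachesMarker("score"));
    }
}
```

Note that a plain hypernym walk would also flag common nouns that sit below a marker category, which is one more reason to treat the result as a hint rather than an answer.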

Rob Van Dam answered Sep 21 '22 05:09


If you use WordNet from the Linux command line, you can use 'wn <word> -synsn' to get all the noun synsets of a word. The proper nouns will be capitalized. E.g.,

$: wn mark -synsn

   Synonyms/Hypernyms (Ordered by Estimated Frequency) of noun mark
   15 senses of mark                                                       

   Sense 1
   mark, grade, score
         => evaluation, valuation, rating
   .
   .
   .
   Sense 8
   Mark, Saint Mark, St. Mark
         INSTANCE OF=> Apostle, Apostelic Father
         INSTANCE OF=> Evangelist
         INSTANCE OF=> saint
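
The same capitalization check could be done from code, roughly like this. It is only a sketch: it assumes you have already collected the word forms of each noun sense into a list (for example by wrapping what JAWS's Synset.getWordForms() returns, which I believe keeps the stored capitalization), and isOnlyProperNoun is a made-up helper name. A word is reported as proper-noun-only when it appears in at least one sense and never in lowercase.

```java
import java.util.List;

public class ProperNounOnly {
    // nounSenses: one inner list per noun sense, holding that sense's word forms
    // exactly as WordNet stores them (capitalization preserved).
    public static boolean isOnlyProperNoun(String word, List<List<String>> nounSenses) {
        boolean sawSense = false;
        for (List<String> forms : nounSenses) {
            for (String form : forms) {
                if (form.isEmpty() || !form.equalsIgnoreCase(word)) {
                    continue; // a different synonym in this sense
                }
                sawSense = true;
                if (Character.isLowerCase(form.charAt(0))) {
                    return false; // a common-noun sense exists, so the word is acceptable
                }
            }
        }
        // Words WordNet does not know at all are reported as false (unknown).
        return sawSense;
    }

    public static void main(String[] args) {
        List<List<String>> mark = List.of(
            List.of("mark", "grade", "score"),
            List.of("Mark", "Saint Mark", "St. Mark"));
        System.out.println(isOnlyProperNoun("mark", mark));
        System.out.println(isOnlyProperNoun("Africa", List.of(List.of("Africa"))));
    }
}
```

With the two senses of "mark" shown in the output above, the lowercase sense makes the method return false, while a word listed only as "Africa" returns true.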

But, seriously, please don't rely only on WordNet for this. There are potentially gazillions of proper nouns for which WordNet has no entry at all. Try the name Henrik, for example!

You can, however, build a context for your word w from datasets like the Google n-gram corpus, and use such contexts to build a classifier that returns a confidence score (i.e., the classifier can say that w is a proper noun with confidence c, where 0 <= c <= 1).
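
As a very rough illustration of what such a score could look like, here is a sketch that is much simpler than a trained classifier: it just takes the share of a word's corpus occurrences that are capitalized. The two counts are assumed to come from whatever corpus you use (the numbers in main are invented), and the class and method names are hypothetical.

```java
public class CapitalizationConfidence {
    // Returns a value in [0, 1]: the fraction of occurrences that were capitalized.
    // With no occurrences at all there is no evidence, so return 0.5 (an arbitrary choice).
    public static double confidence(long capitalizedCount, long lowercaseCount) {
        long total = capitalizedCount + lowercaseCount;
        if (total == 0) {
            return 0.5;
        }
        return (double) capitalizedCount / total;
    }

    public static void main(String[] args) {
        // Invented counts, for illustration only.
        System.out.println(confidence(90, 10));
        System.out.println(confidence(0, 0));
    }
}
```

A real classifier would also look at the neighbouring words, and would need to discount sentence-initial capitalization, which this ratio ignores.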

Chthonic Project answered Sep 22 '22 05:09