I'm using JAWS to access WordNet. Given a word, is there any way to detect if it is a proper noun? It looks like the synsets have pretty coarse lexical categories.
To clarify, there is no context for the words - they are just presented individually. If a word could conceivably be used as a common noun, it is acceptable. So "mark" is fine, because although it could be someone's name it could also refer to a point. However, "Africa" is not.
Unfortunately, you're not going to be able to reliably determine proper noun information from WordNet synsets. What you are looking for is Named Entity Recognition. The Wikipedia page on the topic links to several implementations available in Java. I would personally recommend Stanford NER or LingPipe.
Updated:
Based on the added constraint of no context for words, you could use capitalization as the primary indicator and then double check WordNet to see if the word can be used as a noun. Perhaps something like this:
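import edu.smu.tspell.wordnet.Synset;
import edu.smu.tspell.wordnet.SynsetType;
import edu.smu.tspell.wordnet.WordNetDatabase;

String word = "foo";
boolean isProperNoun = false;
// Capitalization is the primary signal; WordNet only confirms that a noun sense exists.
if (!word.isEmpty() && Character.isUpperCase(word.charAt(0))) {
    // JAWS locates the dictionary through the wordnet.database.dir system property.
    WordNetDatabase database = WordNetDatabase.getFileInstance();
    Synset[] synsets = database.getSynsets(word, SynsetType.NOUN);
    isProperNoun = synsets.length > 0;
}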
That would eliminate false positives like this:
If you build it...
As you wish...
Oh Romeo, Romeo...
And still catch just the capitalized nouns in
In the Book of Mark it says...
Have you heard The Roots or The Who recently?
but still give you false positives on
Mark the first instance...
Book 'em, Danno.
because each of those could be a proper noun, and without context you cannot tell.
If you wanted to get really tricky, you could walk up the hypernym tree from any noun to see if you reach something obvious like 'company' or 'country'. However, the last time I was working with WordNet (4 years ago), the hypernym/hyponym relationships were not very reliable or consistent, which could cause a lot of false negatives. It also would not reduce the false positives I mentioned above, because those are completely context dependent.
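The hypernym walk could be sketched roughly as follows. To stay self-contained, this runs over a tiny hand-written hypernym map instead of a live WordNet database; the map contents, the marker set, and the class and method names are all illustrative assumptions. With JAWS, the parents would presumably come from NounSynset.getHypernyms() and getInstanceHypernyms() instead.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class HypernymWalk {
    // Hypothetical marker categories that suggest a named entity.
    static final Set<String> MARKERS = Set.of("country", "company", "person");

    // Toy hypernym edges (child -> parents); real data would come from WordNet.
    static final Map<String, List<String>> HYPERNYMS = Map.of(
        "Africa", List.of("continent"),
        "France", List.of("country"),
        "country", List.of("administrative district"),
        "continent", List.of("landmass"),
        "mark", List.of("evaluation"));

    // Breadth-first walk up the hypernym graph, stopping at the first marker.
    static boolean reachesMarker(String start) {
        Deque<String> queue = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();
        queue.add(start);
        while (!queue.isEmpty()) {
            String node = queue.remove();
            if (!seen.add(node)) continue;          // skip nodes already visited
            if (MARKERS.contains(node)) return true;
            queue.addAll(HYPERNYMS.getOrDefault(node, List.of()));
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(reachesMarker("France"));
        System.out.println(reachesMarker("Africa"));
    }
}
```

In this toy data "Africa" never reaches a marker because "continent" is not in the marker set, which is exactly the kind of false negative an incomplete marker list or an inconsistent hierarchy produces.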
If you use WordNet from the Linux command line, you can use 'wn -synsn' to get all the noun synsets of a word. The proper nouns will be capitalized. E.g.,
$: wn mark -synsn
Synonyms/Hypernyms (Ordered by Estimated Frequency) of noun mark
15 senses of mark
Sense 1
mark, grade, score
       => evaluation, valuation, rating
.
.
.
Sense 8
Mark, Saint Mark, St. Mark
       INSTANCE OF=> Apostle, Apostelic Father
       INSTANCE OF=> Evangelist
       INSTANCE OF=> saint
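The same capitalization check can be sketched in Java. The version below is a minimal, self-contained sketch that takes the word forms of each sense as plain lists; the class name, method name, and sample data are assumptions for illustration. With JAWS the lists would presumably be filled from Synset.getWordForms(), assuming those forms preserve the case stored in WordNet, which is worth verifying.

```java
import java.util.List;

class SenseCapitalization {
    // True if the word occurs in at least one sense and every occurrence is capitalized.
    static boolean onlyProperSenses(String word, List<List<String>> senses) {
        if (word.isEmpty()) return false;
        boolean found = false;
        for (List<String> forms : senses) {
            for (String form : forms) {
                if (!form.equalsIgnoreCase(word)) continue;  // other synonyms in the synset
                found = true;
                if (!Character.isUpperCase(form.charAt(0))) return false;  // a common-noun sense exists
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // Hand-copied sample senses, not read from WordNet.
        List<List<String>> mark = List.of(
            List.of("mark", "grade", "score"),
            List.of("Mark", "Saint Mark", "St. Mark"));
        List<List<String>> africa = List.of(List.of("Africa"));
        System.out.println(onlyProperSenses("mark", mark));
        System.out.println(onlyProperSenses("Africa", africa));
        System.out.println(onlyProperSenses("Henrik", List.of()));
    }
}
```

A word with no senses at all comes back as false here, so a caller that wants to treat unknown capitalized words as likely names has to handle that case separately.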
But, seriously, please don't rely only on WordNet for this. There are countless proper nouns for which WordNet will not give you any information. Try the name Henrik, for example!
You can, however, build a context for your word w from datasets like the Google n-gram corpus, and use such contexts to build a classifier that returns a confidence score (i.e., the classifier says w is a proper noun with confidence c, where 0 <= c <= 1).
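As a very rough stand-in for such a classifier, the confidence could start out as the share of corpus occurrences in which the word is capitalized. The sketch below is not a trained model; the class name, the smoothing choice, and the counts in main are assumptions for illustration, and real counts would have to come from a case-sensitive corpus. Sentence-initial capitalization will inflate the capitalized count, so mid-sentence occurrences are the better signal if the corpus lets you isolate them.

```java
class ProperNounConfidence {
    // Fraction of occurrences that are capitalized, with add-one smoothing
    // so that a word with no observations scores 0.5 instead of dividing by zero.
    static double confidence(long capitalizedCount, long lowercaseCount) {
        return (capitalizedCount + 1.0) / (capitalizedCount + lowercaseCount + 2.0);
    }

    public static void main(String[] args) {
        // Hypothetical counts, not taken from any real corpus.
        System.out.println(confidence(980, 18));   // mostly capitalized
        System.out.println(confidence(40, 958));   // mostly lowercase
        System.out.println(confidence(0, 0));      // unseen word
    }
}
```

The result always stays strictly between 0 and 1, so it can be thresholded directly or used as one feature among several in a proper classifier.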