
How to implement category-based text tagging using WordNet or a related resource?

How can I tag the words in a text by category using WordNet (with Java as the interface)?

Example

Consider the sentences:

1) Computers need keyboard, monitor, CPU to work.
2) Automobile uses gears and clutch.

My objective is to tag the example sentences as follows:

  • 1st sentence

    Computer / electronic
    keyboard / electronic
    CPU / electronic

  • 2nd sentence

    Automobile / mechanical
    gears / mechanical
    clutch / mechanical

Some extra examples:

"Clutch and gear is monitored using microchip" -> clutch / mechanical, gear / mechanical, microchip / electronic

"software used here to monitor hydrogen levels" -> software / computer, hydrogen / chemistry

I want to implement the above objective in Java, that is, to tag nouns with their related category, such as technical, mechanical, electrical, etc.

How can I do this using WordNet?

My Previous Work

To achieve my objective, I created an index of terms, with one text file per category, and matched it against a title: if the title contains a word from one of the text files, the title is classified under that category.

For example:

Automobile.txt has car, gear, wheel, clutch.
networking.txt has server, IP Address, TCP, RIP.

This is the algorithm:

String classify(String title)
{
 // Default for titles that match no word list; without it the method does not compile.
 String area = "Unknown";
 if (compareWordsFrom("Automobile.txt", title)) area = "Auto";
 if (compareWordsFrom("networking.txt", title)) area = "Networking";
 if (compareWordsFrom("metals.txt", title)) area = "Metallurgy";
 // When several lists match, the last matching one wins.
 return area;
}
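
A minimal sketch of this word-list matching, for illustration only. It assumes the word lists are held in memory instead of being read from the text files, and that a title may belong to more than one area; the class name `WordListClassifier` and its methods are hypothetical, not part of my actual code.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

class WordListClassifier {
    // Hypothetical in-memory stand-in for Automobile.txt, networking.txt, ...
    private final Map<String, Set<String>> index = new LinkedHashMap<>();

    void addCategory(String area, String... terms) {
        Set<String> set = new LinkedHashSet<>();
        for (String term : terms) {
            set.add(term.toLowerCase(Locale.ROOT));
        }
        index.put(area, set);
    }

    // Returns every area whose word list shares at least one whole term with the title.
    Set<String> classify(String title) {
        // Lower-case, replace punctuation with spaces, and pad with spaces
        // so that only whole words (or whole phrases) match.
        String normalized = " "
                + title.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9]+", " ").trim()
                + " ";
        Set<String> areas = new LinkedHashSet<>();
        for (Map.Entry<String, Set<String>> entry : index.entrySet()) {
            for (String term : entry.getValue()) {
                if (normalized.contains(" " + term + " ")) {
                    areas.add(entry.getKey());
                    break;
                }
            }
        }
        return areas;
    }

    public static void main(String[] args) {
        WordListClassifier classifier = new WordListClassifier();
        classifier.addCategory("Auto", "car", "gear", "wheel", "clutch");
        classifier.addCategory("Networking", "server", "ip address", "tcp", "rip");
        // prints [Auto, Networking]
        System.out.println(classifier.classify("Clutch and gear is monitored using a TCP server"));
    }
}
```

Because only exact terms match, "gears" does not hit the entry "gear", which is one more reason the manual index is hard to maintain.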

It is very difficult to find related words to build the index. For example, the automobile field has thousands of related terms, which are difficult to collect.

To be precise, building the index of terms manually is a heartbreaking process.

I have already tried Stanford NLP and OpenNLP, but they only tag parts of speech (POS), which does not satisfy my need.

My Need
I need an automated way to do this. Can Natural Language Processing techniques do it?

Some suggest using the WordNet library, but how can I use it, since it is organized like a dictionary? What I want is something like:

mechanical = {gear, turbine, engine, ...}, electronic = {microchip, RAM, ROM, ...}

Is there any word database available with the structure shown above?

Or is there a ready-made library available?

Ragesh D Antony asked Feb 03 '14 17:02


1 Answer

You need to categorize a bunch of nouns (e.g. "car", "gear") into predefined categories (e.g. "automobile"). Although named-entity recognition is the proper way of getting this done, it has its issues, the main one being gathering enough annotated data for training the system properly.

WordNet can help by establishing semantic similarity between nouns, thereby helping you select categories based on similarity scores. There are several ways of establishing similarity scores. Some prominent ones are:

  • Lin's information-theoretic definition of similarity
  • Lesk, a score based on the extent of overlap between the dictionary definitions of the terms
  • Wu & Palmer's score, based on synset depths

The basic idea is that similar terms are grouped under similar categories by an ontology (such as WordNet). Therefore, the distance between their categories in the category tree of the ontology will be shorter if they are closely related, and longer otherwise. Perhaps the simplest such score is the path-score:

PathScore(s1, s2) = 1/pathLength(s1, s2)

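where pathLength is the length of the shortest path between s1 and s2 in the aforementioned category tree, counted in nodes with both end nodes included, so two words in the same synset have a path length of 1.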
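
As a rough sketch of how such a score can be computed, the following finds the shortest path with a breadth-first search over a small hand-made hierarchy. The hierarchy is invented for illustration and is not real WordNet data, and the class `PathScoreDemo` is hypothetical. It assumes the path length counts nodes, end nodes included, so that a concept compared with itself scores 1.0.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class PathScoreDemo {
    // Undirected adjacency list over a hand-made is-a hierarchy (not real WordNet data).
    private final Map<String, List<String>> neighbours = new HashMap<>();

    void addIsA(String child, String parent) {
        neighbours.computeIfAbsent(child, k -> new ArrayList<>()).add(parent);
        neighbours.computeIfAbsent(parent, k -> new ArrayList<>()).add(child);
    }

    // Number of nodes on the shortest path, end nodes included; -1 if there is no path.
    int pathLength(String from, String to) {
        if (!neighbours.containsKey(from) || !neighbours.containsKey(to)) {
            return -1;
        }
        Map<String, Integer> nodesSoFar = new HashMap<>();
        Deque<String> queue = new ArrayDeque<>();
        nodesSoFar.put(from, 1);
        queue.add(from);
        while (!queue.isEmpty()) {
            String current = queue.remove();
            if (current.equals(to)) {
                return nodesSoFar.get(current);
            }
            for (String next : neighbours.get(current)) {
                if (!nodesSoFar.containsKey(next)) {
                    nodesSoFar.put(next, nodesSoFar.get(current) + 1);
                    queue.add(next);
                }
            }
        }
        return -1;
    }

    // PathScore(s1, s2) = 1 / pathLength(s1, s2); 0.0 when the terms are not connected.
    double pathScore(String s1, String s2) {
        int length = pathLength(s1, s2);
        return length < 0 ? 0.0 : 1.0 / length;
    }

    public static void main(String[] args) {
        PathScoreDemo tree = new PathScoreDemo();
        tree.addIsA("vehicle", "artifact");
        tree.addIsA("device", "artifact");
        tree.addIsA("car", "vehicle");
        tree.addIsA("truck", "vehicle");
        tree.addIsA("microchip", "device");
        System.out.println(tree.pathScore("car", "truck"));     // path: car, vehicle, truck
        System.out.println(tree.pathScore("car", "microchip")); // path: car, vehicle, artifact, device, microchip
    }
}
```

In this toy hierarchy the siblings "car" and "truck" are three nodes apart, while "car" and "microchip" are five nodes apart, so the first pair gets the higher score.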

To illustrate:

PathScore(car, automobile) = 1.0;     // path score is always between 0 and 1
WuPalmerScore(car, automobile) = 1.0; // Wu & Palmer's score is always between 0 and 1

PathScore(engine, automobile) = 0.25;
WuPalmerScore(engine, automobile) = 0.88;

PathScore(microprocessor, automobile) = 0.09;
WuPalmerScore(microprocessor, automobile) = 0.58;
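
The Wu & Palmer score is usually defined as 2 * depth(lcs) / (depth(s1) + depth(s2)), where lcs is the deepest concept that subsumes both terms. A minimal sketch on a hand-made tree follows. The tree is invented for illustration (it is not real WordNet data, so its numbers differ from the ones above), the class `WuPalmerDemo` is hypothetical, and depth is assumed to count nodes so that the root has depth 1.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class WuPalmerDemo {
    // child -> parent links of a hand-made tree (not real WordNet data); the root has no entry.
    private final Map<String, String> parent = new HashMap<>();

    void addIsA(String child, String parentNode) {
        parent.put(child, parentNode);
    }

    // Chain from a node up to the root, starting with the node itself.
    private List<String> ancestors(String node) {
        List<String> chain = new ArrayList<>();
        for (String n = node; n != null; n = parent.get(n)) {
            chain.add(n);
        }
        return chain;
    }

    // Depth counted in nodes, so the root has depth 1.
    int depth(String node) {
        return ancestors(node).size();
    }

    // Deepest node that is an ancestor of (or equal to) both arguments, or null if none.
    String lowestCommonSubsumer(String a, String b) {
        List<String> chainB = ancestors(b);
        for (String n : ancestors(a)) {
            if (chainB.contains(n)) {
                return n;
            }
        }
        return null;
    }

    double wuPalmer(String a, String b) {
        String lcs = lowestCommonSubsumer(a, b);
        if (lcs == null) {
            return 0.0;
        }
        return 2.0 * depth(lcs) / (depth(a) + depth(b));
    }

    public static void main(String[] args) {
        WuPalmerDemo tree = new WuPalmerDemo();
        tree.addIsA("vehicle", "artifact");
        tree.addIsA("device", "artifact");
        tree.addIsA("car", "vehicle");
        tree.addIsA("truck", "vehicle");
        tree.addIsA("microchip", "device");
        System.out.println(tree.wuPalmer("car", "truck"));     // lcs is "vehicle" (depth 2)
        System.out.println(tree.wuPalmer("car", "microchip")); // lcs is "artifact" (depth 1)
    }
}
```

The deeper the shared ancestor sits relative to the two terms, the closer the score gets to 1, which is why this score stays fairly high even for loosely related terms that share a deep ancestor.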

So, as you can see, terms that you want in the same category will usually have higher similarity scores. The best library for doing this is WordNet Similarity for Java (WS4J), which offers several similarity metrics for you to experiment with. They also have an online demo.
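
Once a similarity function is available, choosing a category reduces to picking the best-scoring label above some cut-off. The sketch below shows that selection step only. The similarity function is passed in as a parameter so that any metric can be plugged in; the library's own API is deliberately not shown. The class `CategoryTagger`, the 0.6 threshold, and the two scores marked as placeholders are assumptions for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.ToDoubleBiFunction;

class CategoryTagger {
    private final List<String> categories;
    private final ToDoubleBiFunction<String, String> similarity;
    private final double threshold;

    CategoryTagger(List<String> categories,
                   ToDoubleBiFunction<String, String> similarity,
                   double threshold) {
        this.categories = categories;
        this.similarity = similarity;
        this.threshold = threshold;
    }

    // Best-scoring category for one noun, or "unknown" if no score reaches the threshold.
    String tag(String noun) {
        String best = "unknown";
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String category : categories) {
            double score = similarity.applyAsDouble(noun, category);
            if (score >= threshold && score > bestScore) {
                best = category;
                bestScore = score;
            }
        }
        return best;
    }

    // Tags each noun, keeping the input order.
    Map<String, String> tagAll(List<String> nouns) {
        Map<String, String> tags = new LinkedHashMap<>();
        for (String noun : nouns) {
            tags.put(noun, tag(noun));
        }
        return tags;
    }

    public static void main(String[] args) {
        Map<String, Double> toyScores = new HashMap<>();
        toyScores.put("engine|automobile", 0.88);          // Wu & Palmer score quoted above
        toyScores.put("microprocessor|automobile", 0.58);  // Wu & Palmer score quoted above
        toyScores.put("engine|electronics", 0.50);         // placeholder, not WordNet output
        toyScores.put("microprocessor|electronics", 0.80); // placeholder, not WordNet output

        CategoryTagger tagger = new CategoryTagger(
                Arrays.asList("automobile", "electronics"),
                (noun, category) -> toyScores.getOrDefault(noun + "|" + category, 0.0),
                0.6);
        // prints {engine=automobile, microprocessor=electronics, hydrogen=unknown}
        System.out.println(tagger.tagAll(Arrays.asList("engine", "microprocessor", "hydrogen")));
    }
}
```

The threshold matters: without it, every noun would be forced into some category even when all of its scores are low.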

Caveat: WordNet will not perform well if you are trying to label proper nouns. For example, if you want Hyundai to be in the automobile category and Samsung in the electronics category, WordNet won't help at all, simply because it does not categorize these nouns. There are other ontologies built on top of WordNet that may help you in this scenario:

  • One such well-known ontology is Yago.
  • Using Wikipedia categories is another successful approach.

Chthonic Project answered Sep 30 '22 13:09