
Methods for automated synonym detection

I am currently working on a neural-network-based approach to short-document classification. Since the corpora I am working with usually consist of around ten words per document, standard statistical document-classification methods are of limited use. Because of this, I am attempting to implement some form of automated synonym detection for the matches provided in training. More specifically, my question is about resolving a situation like the following:

Say I have classifications of "Involving Food", and one of "Involving Spheres" and a data set as follows:

"Eating Apples"(Food); "Eating Marbles"(Spheres); "Eating Oranges"(Food, Spheres);
"Throwing Baseballs"(Spheres); "Throwing Apples"(Food); "Throwing Balls"(Spheres);
"Spinning Apples"(Food); "Spinning Baseballs";

I am looking for an incremental method that would move towards the following linkages:

Eating --> Food
Apples --> Food
Marbles --> Spheres
Oranges --> Food, Spheres
Throwing --> Spheres
Baseballs --> Spheres
Balls --> Spheres
Spinning --> Neutral
Involving --> Neutral

I do realize that in this specific case some of these matches are slightly suspect, but the example illustrates the problems I am having. My first thought was to increment a word's score for a category whenever it appears alongside words already in that category, but then I would end up incidentally linking everything to the word "Involving". I then thought I would decrement a word's score for appearing in conjunction with multiple synonyms, or with non-synonyms, but then I would lose the link between "Eating" and "Food". Does anyone have any clue how I could put together an algorithm that would move me in the directions indicated above?
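To make the failure mode concrete, here is a minimal frequency-counting baseline over the toy data set above (the data encoding, the threshold, and the helper name are illustrative assumptions, not an actual system). It links a word to a category only when a clear majority of its occurrences fall in that category, which correctly catches "Apples" and "Marbles" but, exactly as described, leaves "Eating" and "Throwing" neutral:

```python
from collections import defaultdict

# Toy dataset from the question: (phrase, set of categories).
# "Spinning Baseballs" has no category listed in the question, so it gets an empty set.
data = [
    ("Eating Apples", {"Food"}),
    ("Eating Marbles", {"Spheres"}),
    ("Eating Oranges", {"Food", "Spheres"}),
    ("Throwing Baseballs", {"Spheres"}),
    ("Throwing Apples", {"Food"}),
    ("Throwing Balls", {"Spheres"}),
    ("Spinning Apples", {"Food"}),
    ("Spinning Baseballs", set()),
]

# Count how often each word appears in documents of each category.
counts = defaultdict(lambda: defaultdict(int))
totals = defaultdict(int)
for phrase, cats in data:
    for word in phrase.split():
        totals[word] += 1
        for cat in cats:
            counts[word][cat] += 1

def links(word, threshold=0.75):
    """Link a word to the categories it appears in for a clear
    majority of its occurrences; otherwise leave it neutral."""
    return {c for c, n in counts[word].items() if n / totals[word] >= threshold}
```

With this sketch, `links("Apples")` yields `{"Food"}` and `links("Spinning")` is neutral, but `links("Eating")` and `links("Throwing")` also come out neutral because they co-occur with both categories, which is precisely the problem the question is asking about.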

asked Jul 06 '12 by Slater Victoroff
2 Answers

There is an unsupervised bootstrapping approach, once explained to me, for doing exactly this.

There are different ways of applying this approach, and variants, but here's a simplified version.

Concept:

Start by assuming that if two words are synonyms, then in your corpus they will appear in similar settings ("eating grapes", "eating a sandwich", etc.).

(In this variant I will use co-occurrence as the setting.)

Bootstrapping Algorithm:

We have two lists,

  • one list will contain the words that co-occur with food items
  • one list will contain the words that are food items

Supervised Part

Start by seeding one of the lists; for instance, I might write the word Apple on the food-items list.

Now let the computer take over.

Unsupervised Part

It will first find all words in the corpus that appear just before Apple, and sort them by how often they occur.

Take the top two (or however many you want) and add them to the co-occurs-with-food-items list. For example, perhaps "eating" and "delicious" are the top two.

Now use that list to find the next two top food words by ranking the words that appear to the right of each word in the list.

Continue this process, expanding each list in turn, until you are happy with the results. (You may need to manually remove some entries from the lists as you go that are clearly wrong.)
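The alternating steps above can be sketched roughly like this (the function name, the tokenized-corpus format, and the round/top-N parameters are illustrative assumptions, not the original system's API):

```python
from collections import Counter

def bootstrap(corpus, seed_items, rounds=3, top_n=2):
    """Alternately grow a list of category items and a list of
    context words that appear immediately before those items.

    corpus: list of tokenized sentences (lists of words).
    seed_items: initial hand-picked members, e.g. {"apple"}.
    """
    items = set(seed_items)
    contexts = set()
    for _ in range(rounds):
        # Words that occur just before a known item, most frequent first.
        before = Counter(
            sent[i - 1]
            for sent in corpus
            for i in range(1, len(sent))
            if sent[i] in items
        )
        contexts |= {w for w, _ in before.most_common(top_n)}
        # Words that occur just after a known context word become candidate items.
        after = Counter(
            sent[i + 1]
            for sent in corpus
            for i in range(len(sent) - 1)
            if sent[i] in contexts
        )
        items |= {w for w, _ in after.most_common(top_n)}
    return items, contexts
```

In practice you would inspect `items` and `contexts` between rounds and prune obviously wrong entries by hand, as the answer notes.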

Variants

This procedure can be made quite effective if you take into account the grammatical setting of the keywords.

Subj ate NounPhrase
NounPhrase are/is Moldy

The workers harvested the Apples. 
   subj       verb     Apples 

That might imply that harvested is an important verb for distinguishing foods.

Then look for other occurrences of the pattern subj harvested nounPhrase.
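As a toy stand-in for real grammatical analysis (a real system would use a parser; the regex here is purely an illustrative assumption), a pattern like subj harvested nounPhrase could be approximated as:

```python
import re

# Naive sketch: treat "<verb>ed the <noun>" as a verb-object pattern and
# harvest the object noun as a candidate category member.
PATTERN = re.compile(r"\b(\w+ed)\s+the\s+(\w+)", re.IGNORECASE)

def extract_pairs(text):
    """Return (verb, object) candidates found in the text."""
    return PATTERN.findall(text)
```

Running this on "The workers harvested the apples." yields the pair ("harvested", "apples"), i.e. apples becomes a candidate food word.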

You can expand this process to sort words into several categories at once, instead of a single category at each step.

My Source

This approach was used in a system developed at the University of Utah a few years back which was successful at compiling a decent list of weapon words, victim words, and place words by just looking at news articles.

It was an interesting approach and had good results. Not a neural-network approach, but an intriguing methodology nonetheless.

Edit:

The system at the University of Utah was called AutoSlog-TS; a short slide about it can be seen here, towards the end of the presentation, and a link to a paper about it here.

answered Sep 28 '22 by Xantix

You could try LDA (Latent Dirichlet Allocation), which is unsupervised. There is also a supervised version of LDA, but I can't remember the name! Stanford's tools include an implementation you can play around with. I understand it's not the NN approach you are looking for, but if you are just looking to group information together, LDA would seem appropriate, especially if you are looking for 'topics'.
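As a rough sketch of that idea using scikit-learn's LDA implementation (the toy documents and topic count are assumptions here, and this is not the Stanford tool the answer mentions):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical toy corpus; in practice you would use your own short documents.
docs = [
    "eating apples eating oranges food",
    "throwing baseballs throwing balls spheres",
    "eating marbles spheres",
    "spinning apples food",
]

vec = CountVectorizer()
X = vec.fit_transform(docs)

# Two latent topics, in the hope that they roughly align with "food" vs. "spheres".
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)  # shape: (n_docs, n_topics), rows sum to 1
```

Each row of `doc_topics` is a probability distribution over the two topics for one document; words can then be grouped by inspecting `lda.components_`.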

answered Sep 28 '22 by Steve