
Using WordNet to determine semantic similarity between two texts?

How can you determine the semantic similarity between two texts in Python using WordNet?

The obvious preprocessing would be removing stop words and stemming, but then what?

The only way I can think of would be to calculate the WordNet path distance between each word in the two texts. This is standard for unigrams. But these are large (400-word) natural language documents, with words that are not in any particular order or structure (other than those imposed by English grammar). So, which words would you compare between texts? How would you do this in Python?

Zach asked Jul 13 '12 02:07



1 Answer

One thing that you can do is:

  1. Kill the stop words
  2. Find the words whose synonym and antonym sets overlap the most with those of other words in the same doc. Let's call these "the important words"
  3. Compare the sets of important words from the two documents. The more they overlap, the more semantically similar your documents are.
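
A rough sketch of those three steps, under some loud assumptions: `TOY_THESAURUS` is a made-up stand-in for WordNet so the example runs on its own, the stop-word list is a tiny illustrative one, and the names `related`, `important_words` and `jaccard` are just picked for this example. In practice `related` could be backed by NLTK's WordNet interface (synset lemmas plus their antonyms), and Jaccard overlap is only one possible way to measure "how close" the two sets are.

```python
import re
from typing import Callable, Dict, Set

# Tiny illustrative stop-word list (assumption; a real one is much longer).
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "and", "to", "in", "on"}

# Hypothetical stand-in for WordNet: word -> its synonyms and antonyms.
TOY_THESAURUS: Dict[str, Set[str]] = {
    "car": {"automobile", "auto", "vehicle"},
    "automobile": {"car", "auto", "vehicle"},
    "truck": {"lorry", "vehicle"},
    "fast": {"quick", "rapid", "slow"},
    "quick": {"fast", "rapid", "slow"},
}


def related(word: str) -> Set[str]:
    """The word itself plus its synonyms/antonyms from the toy thesaurus."""
    return TOY_THESAURUS.get(word, set()) | {word}


def important_words(text: str,
                    lookup: Callable[[str], Set[str]] = related,
                    top_k: int = 5) -> Set[str]:
    """Steps 1 and 2: drop stop words, then keep the (up to) top_k words whose
    related-word sets overlap the most with other words in the same doc."""
    words = sorted({w for w in re.findall(r"[a-z]+", text.lower())
                    if w not in STOP_WORDS})
    scores = {w: sum(len(lookup(w) & lookup(o)) for o in words if o != w)
              for w in words}
    ranked = sorted(words, key=lambda w: (-scores[w], w))
    return {w for w in ranked[:top_k] if scores[w] > 0}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Step 3: overlap of the two important-word sets, between 0.0 and 1.0."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


doc1 = "The car is fast and the automobile is quick"
doc2 = "A quick automobile and a fast truck"
score = jaccard(important_words(doc1), important_words(doc2))
print(score)  # 0.6 with the toy thesaurus above
```

With 400-word documents you would probably want a larger `top_k`, and a real thesaurus lookup will be the slow part, so caching `lookup` results is worth considering.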

There is another way: compute sentence trees out of the sentences in each doc, then compare the two forests. I did some similar work for a course a long time ago. Here's the code (keep in mind this was a long time ago and it was for a class, so the code is extremely hacky, to say the least).

Hope this helps

inspectorG4dget answered Sep 29 '22 13:09
