
Calculating the semantic distance between words

Tags: algorithm

Does anyone know of a good way to calculate the "semantic distance" between two words?

An algorithm that counts the steps between words in a thesaurus immediately springs to mind.


OK, it looks like a similar question has already been answered: "Is there an algorithm that tells the semantic similarity of two phrases".

Ben Aston asked Dec 30 '08 00:12





2 Answers

In text mining there is an important maxim: "You shall know a word by the company it keeps". It means that it is possible to learn the meaning of a word based on the terms that frequently appear close to it.

Without going into extensive detail, let me give two simple options to estimate the semantic distance between terms:

  1. Use a resource similar to WordNet (a large lexical database of English). WordNet superficially resembles a thesaurus, in that it groups words together based on their meanings. The semantic distance between two words can be estimated as the number of edges on the shortest path that connects them.

  2. Using a large corpus (e.g. Wikipedia), count the terms that appear close to each of the words you are analyzing. Create two vectors from those counts, one per word, and compute a distance between them (e.g. cosine distance).
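As a rough illustration of option 2, here is a minimal sketch using only the Python standard library. The tiny `CORPUS` string, the window size of 2 and the function names are assumptions made for the example; a real setup would use a much larger corpus and would probably filter out stop words such as "the".

```python
import math
import re
from collections import Counter

# Hypothetical toy corpus, standing in for something like Wikipedia.
CORPUS = (
    "the cat chased the mouse. the dog chased the ball. "
    "the cat ate the fish. the dog ate the bone. "
    "the car needs new fuel."
)

def context_vector(word, text, window=2):
    """Count the terms that appear within `window` tokens of `word`."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != word:
            continue
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        # Count every neighbour in the window except the word itself.
        counts.update(t for j, t in enumerate(tokens[lo:hi], start=lo) if j != i)
    return counts

def cosine_distance(a, b):
    """Return 1 - cosine similarity of two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # no context observed: treat as maximally distant
    return 1.0 - dot / (norm_a * norm_b)

cat = context_vector("cat", CORPUS)
dog = context_vector("dog", CORPUS)
car = context_vector("car", CORPUS)
print(cosine_distance(cat, dog), cosine_distance(cat, car))
```

On this toy corpus "cat" ends up closer to "dog" than to "car", because the first two share neighbours such as "chased" and "ate".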

You can check these materials to get a good picture of the subject:

  1. http://www.saifmohammad.com/WebDocs/Mohammad_Saif_Thesis-slides.pdf

  2. http://www.umiacs.umd.edu/~saif/WebDocs/distributionalmeasures.pdf

  3. http://www.umiacs.umd.edu/~saif/WebDocs/Measuring-Semantic-Distance.pdf

mariolpantunes answered Oct 31 '22 04:10



The thesaurus idea has some merit. One idea would be to create a graph based on a thesaurus, with the nodes being the words and an edge indicating that two words are listed as synonyms in the thesaurus. You could then use a shortest path algorithm to give you the distance between the nodes as a measure of their similarity.
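A minimal sketch of that idea, assuming an unweighted, undirected synonym graph so that a breadth-first search gives the shortest path. The `THESAURUS` entries and function names are made up for illustration and do not come from any real thesaurus.

```python
from collections import deque

# Hypothetical toy thesaurus: headword -> listed synonyms.
THESAURUS = {
    "happy": ["glad", "cheerful"],
    "glad": ["pleased"],
    "cheerful": ["sunny"],
    "sunny": ["bright"],
    "bright": ["clever"],
    "cold": ["chilly"],
}

def build_graph(thesaurus):
    """Build an undirected graph: an edge joins each word to its synonyms."""
    graph = {}
    for word, synonyms in thesaurus.items():
        for synonym in synonyms:
            graph.setdefault(word, set()).add(synonym)
            graph.setdefault(synonym, set()).add(word)
    return graph

def synonym_distance(graph, start, goal):
    """Return the number of edges on the shortest path, or None if unreachable."""
    if start not in graph or goal not in graph:
        return None
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        word, dist = queue.popleft()
        for neighbour in graph[word]:
            if neighbour == goal:
                return dist + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None

graph = build_graph(THESAURUS)
print(synonym_distance(graph, "happy", "pleased"))
print(synonym_distance(graph, "happy", "clever"))
```

Note that "happy" reaches "clever" only by passing through "bright", which is used in two different senses along the way.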

One difficulty here is that some words have different meanings in different contexts. Your algorithm may need to take this into account and use directed links with the weight of the outgoing link dependent on the incoming link being followed (or ignore some outgoing links based on the incoming link).
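One possible way to approximate "ignore some outgoing links based on the incoming link" is to split each word into one node per sense, so that a path entering a word through one sense cannot leave through another. The sketch below assumes a hypothetical sense-tagged edge list (`SENSE_EDGES`) that an ordinary thesaurus would not provide directly; it is one interpretation of the suggestion, not the only one.

```python
from collections import deque

# Hypothetical sense-tagged synonym links: each node is a (word, sense) pair.
SENSE_EDGES = [
    (("bank", "river"), ("shore", "land")),
    (("shore", "land"), ("coast", "land")),
    (("bank", "finance"), ("lender", "finance")),
    (("lender", "finance"), ("creditor", "finance")),
]

def build_sense_graph(edges):
    """Build an undirected graph whose nodes are (word, sense) pairs."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph

def sense_aware_distance(graph, start_word, goal_word):
    """Shortest path from any sense of start_word to any sense of goal_word."""
    starts = [node for node in graph if node[0] == start_word]
    seen = set(starts)
    queue = deque((node, 0) for node in starts)
    while queue:
        node, dist = queue.popleft()
        if node[0] == goal_word:
            return dist
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None

sense_graph = build_sense_graph(SENSE_EDGES)
print(sense_aware_distance(sense_graph, "coast", "bank"))
print(sense_aware_distance(sense_graph, "coast", "creditor"))
```

Here "coast" reaches "bank" through the river sense, but it does not reach "creditor", because the river sense and the finance sense of "bank" are separate nodes. A plain word-level graph would have connected them through "bank".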

tvanfosson answered Oct 31 '22 06:10
