Does anyone know of a good way to calculate the "semantic distance" between two words?
Immediately an algorithm that counts the steps between words in a thesaurus springs to mind.
OK, looks like a similar question has already been answered: Is there an algorithm that tells the semantic similarity of two phrases.
Semantic distance is a measure of the closeness in meaning of two concepts. People are consis- tent judges of semantic distance. For example, we can easily tell that the concepts of “exercise” and “jog” are closer in meaning than “exercise” and “theater”.
Semantic similarity is calculated based on two semantic vectors. An order vector is formed for each sentence which considers the syntactic similarity between the sentences. Finally, semantic similarity is calculated based on semantic vectors and order vectors.
Semantic similarity between two pieces of text measures how their meanings are close. This measure usually is a score between 0 and 1. 0 means not close at all, and 1 means they almost have identical meaning.
In text mining there is an important maxim: "You shall know a word by the company it keeps". It means that it is possible to learn the meaning of a word based on the terms that frequently appear close to it.
Without entering in extensive details, let me give two simple options to estimate semantic distance between terms:
Use a resource similar to WordNet (a large lexical database of English). WordNet superficially resembles a thesaurus, in that it groups words together based on their meanings. The semantic distance between words can be estimated as the number of vertices that connect the two words.
Using a large corpus (e.g. Wikipedia), count the terms that appear close to the words you are analyzing. Create two vector and compute a distance (e.g cosine).
You can check this materials to get a get picture about the subject:
http://www.saifmohammad.com/WebDocs/Mohammad_Saif_Thesis-slides.pdf
http://www.umiacs.umd.edu/~saif/WebDocs/distributionalmeasures.pdf
http://www.umiacs.umd.edu/~saif/WebDocs/Measuring-Semantic-Distance.pdf
The thesaurus idea has some merit. One idea would be to create a graph based on a thesaurus with the nodes being the words and an edge indicating that there they are listed as synonyms in the thesaurus. You could then use a shortest path algorithm to give you the distance between the nodes as a measure of their similarity.
One difficulty here is that some words have different meanings in different contexts. Your algorithm may need to take this into account and use directed links with the weight of the outgoing link dependent on the incoming link being followed (or ignore some outgoing links based on the incoming link).
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With