Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How can I tweak Levenshtein distance in classifying linguistically similar words (e.g. verb tenses, adjective comparisons, singular and plural)

I am out of ideas on how to complete this task. I am counting the frequency of a word, actually the base form of the word (e.g. running will be counted as run). I looked up on some implementations of Levenshtein distance (one implementation I run into is from dotnerperls).

I also tried the double Metaphone, but it isn't what I'm looking for.

So, please give me some ideas on how to tweak Levenshtein distance algorithm in classifying linguistically similar words since the algorithm is only for determining the number of edits needed not considering if they are linguistically similar or not

Example: 1. "running" will be counted as one occurrence of the word "run" 2. "word" will likewise be an occurrence of "word" 3. "fear" will NOT be counted as an occurrence of "gear"

Also, I am implementing it in C#.

Thanks in advance.

Edit: I edited it as Rene suggested. Another note: I am trying to consider to consider if a word is a substring of another word but that implementation will not be as much dynamic. Another idea I think is: "if adding -s or -ing to string1, string1 == string2, then string2 is an occurrence of string1." However, this is not the case as some words have irregular plurals.

like image 732
Jinnean Avatar asked Jan 07 '12 10:01

Jinnean


People also ask

How do you do Levenshtein distance?

The Levenshtein distance is usually calculated by preparing a matrix of size (M+1)x(N+1) —where M and N are the lengths of the 2 words—and looping through said matrix using 2 for loops, performing some calculations within each iteration.

What is damerau Levenshtein distance used for?

The classical Levenshtein distance metric allows for the comparison between any two arbitrary strings. The "edit distance" measures how many additions, substitions, or deletions are needed to convert one string into another.

What is the difference between Hamming distance and Levenshtein distance?

The Hamming distance is the number of positions at which the corresponding symbols in the two strings are different. The Levenshtein distance between two strings is no greater than the sum of their Levenshtein distances from a third string (triangle inequality).


1 Answers

The task you are trying to solve is called Stemming or Lemmatisation.

As you figured out already, Levenshtein-Distance is not the way to go here. Common stemming-algorithms for english include the Porter- and Snowball-Stemmer. If you google for that I'm sure you will find a C#-implementation of one of them.

like image 61
tobigue Avatar answered Oct 20 '22 20:10

tobigue