I am out of ideas on how to complete this task. I am counting the frequency of a word, actually the base form of the word (e.g. running will be counted as run). I looked up on some implementations of Levenshtein distance (one implementation I run into is from dotnerperls).
I also tried the double Metaphone, but it isn't what I'm looking for.
So, please give me some ideas on how to tweak Levenshtein distance algorithm in classifying linguistically similar words since the algorithm is only for determining the number of edits needed not considering if they are linguistically similar or not
Example: 1. "running" will be counted as one occurrence of the word "run" 2. "word" will likewise be an occurrence of "word" 3. "fear" will NOT be counted as an occurrence of "gear"
Also, I am implementing it in C#.
Thanks in advance.
Edit: I edited it as Rene suggested. Another note: I am trying to consider to consider if a word is a substring of another word but that implementation will not be as much dynamic. Another idea I think is: "if adding -s or -ing to string1, string1 == string2, then string2 is an occurrence of string1." However, this is not the case as some words have irregular plurals.
The Levenshtein distance is usually calculated by preparing a matrix of size (M+1)x(N+1) —where M and N are the lengths of the 2 words—and looping through said matrix using 2 for loops, performing some calculations within each iteration.
The classical Levenshtein distance metric allows for the comparison between any two arbitrary strings. The "edit distance" measures how many additions, substitions, or deletions are needed to convert one string into another.
The Hamming distance is the number of positions at which the corresponding symbols in the two strings are different. The Levenshtein distance between two strings is no greater than the sum of their Levenshtein distances from a third string (triangle inequality).
The task you are trying to solve is called Stemming or Lemmatisation.
As you figured out already, Levenshtein-Distance is not the way to go here. Common stemming-algorithms for english include the Porter- and Snowball-Stemmer. If you google for that I'm sure you will find a C#-implementation of one of them.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With