Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Percentage rank of matches using Levenshtein Distance matching

I am trying to match a single search term against a dictionary of possible matches using a Levenshtein distance algorithm. The algorithm returns a distance expressed as number of operations required to convert the search string into the matched string. I want to present the results in ranked percentage list of top "N" (say 10) matches.

Since the search string can be longer or shorter than the individual dictionary strings, what would be an appropriate logic for expressing the distance as a percentage, which would qualitatively refelct how close "as a percentage" is each result to the query string, with 100% indicating an exact match.

I considered the following options:

Q = query string M = matched string PM = Percentage Match Option 1. PMi = (1 - Lev_distance(Q, Mi)/Strlen(Q)) * 100 Option 2. PMi = (1 - Lev_distance(Q, Mi)/max(Strlen(Q), strlen(Mi))) * 100 

Option 1 has the possibility of negative percentages in case the distance is greater than the search string length, where the match string is long. For example query "ABC" matched with "ABC Corp." would result in a negative match percent.

Option 2 does not appear to give a consistent percentage across a set of Mi, as each calculation would possibly use a different denominator and hence the resulting percentage values would not be normalized.

Only other way I can think of is to ditch the comparison of the lev_distance to either string lenghts, but instead present the comparitive distances of the top "N" matches as an inverse percentile rank (100-percentile-rank).

Any thoughts? Are there better approaches? I must be missing something as Levenshtein distance is probably the most common algorithm for fuzzy matches and this must be a very common problem.

like image 663
user1368587 Avatar asked May 01 '12 22:05

user1368587


People also ask

How is Levenshtein distance calculated?

The Levenshtein distance is usually calculated by preparing a matrix of size (M+1)x(N+1) —where M and N are the lengths of the 2 words—and looping through said matrix using 2 for loops, performing some calculations within each iteration.

What is Levenshtein ratio?

Informally, the Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other. It is named after the Soviet mathematician Vladimir Levenshtein, who considered this distance in 1965.

What is the difference between edit distance and Levenshtein distance?

Different definitions of an edit distance use different sets of string operations. Levenshtein distance operations are the removal, insertion, or substitution of a character in the string. Being the most common metric, the term Levenshtein distance is often used interchangeably with edit distance.

Is Levenshtein distance NLP?

The Levenshtein distance used as a metric provides a boost to accuracy of an NLP model by verifying each named entity in the entry. The vector search solution does a good job, and finds the most similar entry as defined by the vectorization.


1 Answers

I had a similar problem and this thread helped me to figure out a solution. Hope it can help others too.

int levDis = Lev_distance(Q, Mi) int bigger = max(strlen(Q), strlen(Mi)) double pct = (bigger - levDis) / bigger 

It should return 100% if both strings are exactly the same and 0% if they are totaly different.

(sorry if my english isn't that good)

like image 115
Celio Camargo Junior Avatar answered Oct 02 '22 14:10

Celio Camargo Junior