I am trying to match a single search term against a dictionary of possible matches using a Levenshtein distance algorithm. The algorithm returns a distance expressed as the number of operations required to convert the search string into the matched string. I want to present the results as a ranked percentage list of the top "N" (say 10) matches.
Since the search string can be longer or shorter than the individual dictionary strings, what would be an appropriate way of expressing the distance as a percentage that qualitatively reflects how close each result is to the query string, with 100% indicating an exact match?
I considered the following options:
Q   = query string
Mi  = matched string (i-th dictionary entry)
PMi = percentage match for Mi

Option 1: PMi = (1 - Lev_distance(Q, Mi) / strlen(Q)) * 100
Option 2: PMi = (1 - Lev_distance(Q, Mi) / max(strlen(Q), strlen(Mi))) * 100
Option 1 can produce negative percentages when the distance is greater than the length of the search string, which happens when the matched string is much longer. For example, the query "ABC" matched against "ABC Corp." would yield a negative match percentage.
Option 2 does not appear to give a consistent percentage across a set of Mi, as each calculation may use a different denominator, so the resulting percentage values are not normalized to a common scale.
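To make the problem concrete, here is a quick Python sketch of the two options applied to the example above; the distance of 6 is simply the six characters that have to be inserted to turn "ABC" into "ABC Corp." (the function and variable names are mine, for illustration only):

def pm_option1(q, m, dist):
    # Option 1: normalize the distance by the query length only
    return (1 - dist / len(q)) * 100

def pm_option2(q, m, dist):
    # Option 2: normalize the distance by the longer of the two strings
    return (1 - dist / max(len(q), len(m))) * 100

# Lev_distance("ABC", "ABC Corp.") == 6 (six characters inserted)
print(pm_option1("ABC", "ABC Corp.", 6))   # -100.0 -> Option 1 goes negative
print(pm_option2("ABC", "ABC Corp.", 6))   # ~33.3  -> Option 2 stays within 0..100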
The only other way I can think of is to ditch comparing the Levenshtein distance to either string length, and instead present the comparative distances of the top "N" matches as an inverse percentile rank (100 - percentile rank).
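Roughly, that would amount to something like this Python sketch (assuming the common "percentage of values strictly below" definition of percentile rank), which only produces a relative score within the candidate set:

def inverse_percentile_scores(distances):
    # Score each top-N distance as 100 - percentile rank,
    # so the smallest distance gets the highest score.
    n = len(distances)
    scores = []
    for d in distances:
        below = sum(1 for other in distances if other < d)
        scores.append(100.0 - 100.0 * below / n)
    return scores

# Example: Levenshtein distances of the top 5 matches
print(inverse_percentile_scores([1, 2, 2, 5, 9]))  # [100.0, 80.0, 80.0, 40.0, 20.0]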
Any thoughts? Are there better approaches? I must be missing something, as Levenshtein distance is probably the most common algorithm for fuzzy matching and this must be a very common problem.
The Levenshtein distance is usually calculated by preparing a matrix of size (M+1) x (N+1), where M and N are the lengths of the two words, and looping through that matrix with two nested for loops, performing a small calculation in each iteration.
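For reference, a plain textbook version of that matrix computation, written here in Python (not optimized; a production version would typically keep only two rows of the matrix), looks like this:

def levenshtein(a, b):
    m, n = len(a), len(b)
    # dp[i][j] = distance between the first i chars of a and the first j chars of b
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                      # delete all i characters of a
    for j in range(n + 1):
        dp[0][j] = j                      # insert all j characters of b
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return dp[m][n]

print(levenshtein("ABC", "ABC Corp."))  # 6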
Informally, the Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other. It is named after the Soviet mathematician Vladimir Levenshtein, who considered this distance in 1965.
Different definitions of an edit distance use different sets of string operations. Levenshtein distance operations are the removal, insertion, or substitution of a character in the string. Because it is the most common metric, the term Levenshtein distance is often used interchangeably with edit distance.
Used as a metric, the Levenshtein distance can boost the accuracy of an NLP model by verifying each named entity in an entry: a vector search finds the most similar entry as defined by the vectorization, and the edit distance then confirms how close that match actually is.
I had a similar problem and this thread helped me to figure out a solution. Hope it can help others too.
int levDis = Lev_distance(Q, Mi);
int bigger = max(strlen(Q), strlen(Mi));
// cast to double to avoid integer division (which would always give 0 or 1)
double pct = (double)(bigger - levDis) / bigger;
It should return 100% if both strings are exactly the same and 0% if they are totally different.
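For anyone doing this in Python, the same calculation could look like the sketch below; it assumes the third-party python-Levenshtein package for the distance, but any implementation works (such as the matrix version earlier in this thread), and Python's / is already floating-point division, so no cast is needed:

import Levenshtein  # assumes the third-party python-Levenshtein package is installed

def similarity_pct(q, m):
    lev_dis = Levenshtein.distance(q, m)
    bigger = max(len(q), len(m))
    return 100.0 * (bigger - lev_dis) / bigger

print(similarity_pct("ABC", "ABC"))        # 100.0 (exact match)
print(similarity_pct("ABC", "ABC Corp."))  # ~33.3
print(similarity_pct("ABC", "XYZ"))        # 0.0 (totally different)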
(Sorry if my English isn't that good.)