Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Most efficient way to calculate Levenshtein distance

Tags:

I just implemented a best match file search algorithm to find the closest match to a string in a dictionary. After profiling my code, I found out that the overwhelming majority of time is spent calculating the distance between the query and the possible results. I am currently implementing the algorithm to calculate the Levenshtein Distance using a 2-D array, which makes the implementation an O(n^2) operation. I was hoping someone could suggest a faster way of doing the same.

Here's my implementation:

public int calculate(String root, String query) {   int arr[][] = new int[root.length() + 2][query.length() + 2];    for (int i = 2; i < root.length() + 2; i++)   {     arr[i][0] = (int) root.charAt(i - 2);     arr[i][1] = (i - 1);   }    for (int i = 2; i < query.length() + 2; i++)   {     arr[0][i] = (int) query.charAt(i - 2);     arr[1][i] = (i - 1);   }    for (int i = 2; i < root.length() + 2; i++)   {     for (int j = 2; j < query.length() + 2; j++)     {       int diff = 0;       if (arr[0][j] != arr[i][0])       {         diff = 1;       }       arr[i][j] = min((arr[i - 1][j] + 1), (arr[i][j - 1] + 1), (arr[i - 1][j - 1] + diff));     }   }   return arr[root.length() + 1][query.length() + 1]; }  public int min(int n1, int n2, int n3) {   return (int) Math.min(n1, Math.min(n2, n3)); } 
like image 573
efficiencyIsBliss Avatar asked Jul 06 '10 02:07

efficiencyIsBliss


People also ask

How is Levenshtein distance calculated?

The Levenshtein distance is usually calculated by preparing a matrix of size (M+1)x(N+1) —where M and N are the lengths of the 2 words—and looping through said matrix using 2 for loops, performing some calculations within each iteration.

How is damerau Levenshtein distance calculated?

The Damerau–Levenshtein distance LD(CA, ABC) = 2 because CA → AC → ABC, but the optimal string alignment distance OSA(CA, ABC) = 3 because if the operation CA → AC is used, it is not possible to use AC → ABC because that would require the substring to be edited more than once, which is not allowed in OSA, and therefore ...

Is Levenshtein distance NLP?

The Levenshtein distance used as a metric provides a boost to accuracy of an NLP model by verifying each named entity in the entry. The vector search solution does a good job, and finds the most similar entry as defined by the vectorization.

What is the difference between edit distance and Levenshtein distance?

Different definitions of an edit distance use different sets of string operations. Levenshtein distance operations are the removal, insertion, or substitution of a character in the string. Being the most common metric, the term Levenshtein distance is often used interchangeably with edit distance.


1 Answers

The wikipedia entry on Levenshtein distance has useful suggestions for optimizing the computation -- the most applicable one in your case is that if you can put a bound k on the maximum distance of interest (anything beyond that might as well be infinity!) you can reduce the computation to O(n times k) instead of O(n squared) (basically by giving up as soon as the minimum possible distance becomes > k).

Since you're looking for the closest match, you can progressively decrease k to the distance of the best match found so far -- this won't affect the worst case behavior (as the matches might be in decreasing order of distance, meaning you'll never bail out any sooner) but average case should improve.

I believe that, if you need to get substantially better performance, you may have to accept some strong compromise that computes a more approximate distance (and so gets "a reasonably good match" rather than necessarily the optimal one).

like image 102
Alex Martelli Avatar answered Sep 25 '22 11:09

Alex Martelli