
Most efficient way to calculate Levenshtein distance

I just implemented a best match file search algorithm to find the closest match to a string in a dictionary. After profiling my code, I found out that the overwhelming majority of time is spent calculating the distance between the query and the possible results. I am currently implementing the algorithm to calculate the Levenshtein Distance using a 2-D array, which makes the implementation an O(n^2) operation. I was hoping someone could suggest a faster way of doing the same.

Here's my implementation:

public int calculate(String root, String query)
{
  int arr[][] = new int[root.length() + 2][query.length() + 2];

  for (int i = 2; i < root.length() + 2; i++)
  {
    arr[i][0] = (int) root.charAt(i - 2);
    arr[i][1] = (i - 1);
  }

  for (int i = 2; i < query.length() + 2; i++)
  {
    arr[0][i] = (int) query.charAt(i - 2);
    arr[1][i] = (i - 1);
  }

  for (int i = 2; i < root.length() + 2; i++)
  {
    for (int j = 2; j < query.length() + 2; j++)
    {
      int diff = 0;
      if (arr[0][j] != arr[i][0])
      {
        diff = 1;
      }
      arr[i][j] = min((arr[i - 1][j] + 1), (arr[i][j - 1] + 1), (arr[i - 1][j - 1] + diff));
    }
  }
  return arr[root.length() + 1][query.length() + 1];
}

public int min(int n1, int n2, int n3)
{
  return (int) Math.min(n1, Math.min(n2, n3));
}

573

asked Jul 06 '10 02:07

efficiencyIsBliss

1 Answer

The Wikipedia entry on Levenshtein distance has useful suggestions for optimizing the computation. The most applicable one in your case: if you can put a bound k on the maximum distance of interest (anything beyond that might as well be infinity!), you can reduce the computation to O(n * k) instead of O(n^2). Basically, you only fill in the cells within k of the main diagonal, and you give up as soon as the minimum possible distance becomes > k.
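To make the bounded idea concrete, here is a minimal sketch in Java. It is not the asker's code and not taken verbatim from the Wikipedia article; the class name `BoundedLevenshtein`, the method `distance`, and the convention of returning -1 when the distance exceeds `k` are all assumptions made for illustration. It keeps two rows instead of the full matrix, fills only the cells within `k` of the diagonal, and stops once every cell in the current row is above `k`.

```java
final class BoundedLevenshtein {
    /**
     * Returns the Levenshtein distance between a and b if it is at most k,
     * or -1 if it is larger than k (hypothetical convention for this sketch).
     */
    static int distance(String a, String b, int k) {
        if (k < 0) {
            throw new IllegalArgumentException("k must be non-negative");
        }
        int n = a.length();
        int m = b.length();
        // The distance never exceeds the longer length, so a larger k adds nothing.
        k = Math.min(k, Math.max(n, m));
        // The distance is at least the difference in lengths.
        if (Math.abs(n - m) > k) {
            return -1;
        }
        if (n == 0) {
            return m; // m <= k is guaranteed by the check above
        }
        if (m == 0) {
            return n;
        }

        final int INF = Integer.MAX_VALUE / 2; // large, but safe to add 1 to
        int[] prev = new int[m + 1];
        int[] curr = new int[m + 1];

        // Row 0: only the first k + 1 cells lie inside the band.
        int top = Math.min(m, k);
        for (int j = 0; j <= top; j++) {
            prev[j] = j;
        }
        if (top < m) {
            prev[top + 1] = INF; // sentinel just right of the band
        }

        for (int i = 1; i <= n; i++) {
            int lo = Math.max(1, i - k);
            int hi = Math.min(m, i + k);
            // Cell just left of the band: a real value in column 0, else a sentinel.
            curr[lo - 1] = (lo == 1) ? i : INF;
            int rowMin = curr[lo - 1];
            for (int j = lo; j <= hi; j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                int best = Math.min(prev[j - 1] + cost,
                        Math.min(prev[j], curr[j - 1]) + 1);
                curr[j] = best;
                rowMin = Math.min(rowMin, best);
            }
            if (hi < m) {
                curr[hi + 1] = INF; // sentinel read by the next row
            }
            // Later rows can only be >= the smallest value in this row.
            if (rowMin > k) {
                return -1;
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[m] <= k ? prev[m] : -1;
    }
}
```

Each row touches at most 2k + 1 cells, which is where the O(n * k) bound comes from. For example, `BoundedLevenshtein.distance("kitten", "sitting", 3)` returns 3, while the same call with `k = 2` returns -1.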

Since you're looking for the closest match, you can progressively decrease k to the distance of the best match found so far. This won't affect the worst-case behavior (the matches might come in decreasing order of distance, meaning you'll never bail out any sooner), but the average case should improve.
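A rough sketch of that search loop in Java follows. The names `ClosestMatch`, `distanceWithin`, and `closest` are hypothetical, and the helper here is deliberately the simpler variant: a plain two-row computation that only adds the early exit, without restricting itself to a diagonal band. Any bounded distance function with the same "return -1 when above the limit" convention could be swapped in.

```java
import java.util.List;

final class ClosestMatch {
    /**
     * Two-row Levenshtein distance that gives up once the result must exceed
     * limit. Returns -1 in that case (hypothetical convention for this sketch).
     */
    static int distanceWithin(String a, String b, int limit) {
        // The distance is at least the difference in lengths.
        if (Math.abs(a.length() - b.length()) > limit) {
            return -1;
        }
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            int rowMin = curr[0];
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                curr[j] = Math.min(prev[j - 1] + cost,
                        Math.min(prev[j], curr[j - 1]) + 1);
                rowMin = Math.min(rowMin, curr[j]);
            }
            // The final distance cannot be smaller than the minimum of any row.
            if (rowMin > limit) {
                return -1;
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        int d = prev[b.length()];
        return d <= limit ? d : -1;
    }

    /** Returns the candidate closest to query, or null if there are none. */
    static String closest(String query, List<String> candidates) {
        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        for (String candidate : candidates) {
            // Only a strictly better match matters, so the bound shrinks
            // every time a closer candidate is found.
            int d = distanceWithin(query, candidate, bestDistance - 1);
            if (d >= 0) {
                best = candidate;
                bestDistance = d;
                if (d == 0) {
                    break; // an exact match cannot be beaten
                }
            }
        }
        return best;
    }
}
```

With ties, this keeps the first candidate that reached the best distance. Checking cheap candidates first (for instance, those whose length is closest to the query's) may tighten the bound sooner, though that ordering is a suggestion rather than something the answer prescribes.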

I believe that, if you need to get substantially better performance, you may have to accept some strong compromise that computes a more approximate distance (and so gets "a reasonably good match" rather than necessarily the optimal one).

102

answered Sep 25 '22 11:09

Alex Martelli
