Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Faster C# (or other .NET) Levenshtein distance implementation

Good night,

I have been working with fuzzy string matching for some time now, and using C with some pointers I could write a very fast (for my needs) implementation of the Levenshtein distance between two strings. I tried to port the code to C# using unsafe code and the fixed keyword, but the performance was way slower. So I chose to build a C++ dll and use [DllImport] from C#, automatically marshalling every string. The problem is that, after profiling, this keeps being the most time-consuming part of my program, taking between 50-57% of the total running time of the program. Since I think I will need to do some heavy work with lots of substrings of a text field coming from some 3-million database records, I think the time the Levenshtein distance is taking is almost unacceptable. That being, I would like to know if you have any suggestions, both algorithmic or programming-related, to the code below, or if you know of any better algorithm to calculate this distance?

#define Inicio1  (*(BufferVar))
#define Inicio2  (*(BufferVar+1))
#define Fim1  (*(BufferVar+2))
#define Fim2  (*(BufferVar+3))
#define IndLinha (*(BufferVar+4))
#define IndCol  (*(BufferVar+5))
#define CompLinha (*(BufferVar+6))
#define TamTmp  (*(BufferVar+7))

int __DistanciaEdicao (char * Termo1, char * Termo2, int TamTermo1, int TamTermo2, int * BufferTab, int * BufferVar)
{
 *(BufferVar) = *(BufferVar + 1) = 0;
    *(BufferVar + 2) = TamTermo1 - 1;
    *(BufferVar + 3) = TamTermo2 - 1;

 while ((Inicio1 <= *(BufferVar + 2)) && (Inicio2 <= *(BufferVar + 3)) && *(Termo1 + Inicio1) == *(Termo2 + Inicio2))
  Inicio1 = ++Inicio2;

  if (Inicio2 > Fim2) return (Fim1 - Inicio1 + 1);

 while ((Fim1 >= 0) && (Fim2 >= 0) && *(Termo1 + Fim1) == *(Termo2 + Fim2))
  { Fim1--; Fim2--;}

  if (Inicio2 > Fim2) return (Fim1 - Inicio1 + 1);

 TamTermo1 = Fim1 - Inicio1 + 1;
 TamTermo2 = Fim2 - Inicio2 + 1;
 CompLinha = ((TamTermo1 > TamTermo2) ? TamTermo1 : TamTermo2) + 1;

 for (IndLinha = 0; IndLinha <= TamTermo2; *(BufferTab + CompLinha * IndLinha) = IndLinha++);
 for (IndCol = 0; IndCol <= TamTermo1; *(BufferTab + IndCol) = IndCol++);

 for (IndCol = 1; IndCol <= TamTermo1; IndCol++)
  for (IndLinha = 1; IndLinha <= TamTermo2; IndLinha++)
   *(BufferTab + CompLinha * IndLinha + IndCol) = ((*(Termo1 + (IndCol + Inicio1 - 1)) == *(Termo2 + (IndLinha + Inicio2 - 1))) ? *(BufferTab + CompLinha * (IndLinha - 1) + (IndCol - 1)) : ((*(BufferTab + CompLinha * (IndLinha - 1) + IndCol) < *(BufferTab + CompLinha * (IndLinha - 1) + (IndCol - 1))) ? ((*(BufferTab + CompLinha * IndLinha + (IndCol - 1)) < *(BufferTab + CompLinha * (IndLinha - 1) + IndCol)) ? *(BufferTab + CompLinha * IndLinha + (IndCol - 1)) : *(BufferTab + CompLinha * (IndLinha - 1) + IndCol)) : ((*(BufferTab + CompLinha * IndLinha + (IndCol - 1)) < *(BufferTab + CompLinha * (IndLinha - 1) + (IndCol - 1))) ? *(BufferTab + CompLinha * IndLinha + (IndCol - 1)) : *(BufferTab + CompLinha * (IndLinha - 1) + (IndCol - 1)))) + 1);

 return *(BufferTab + CompLinha * TamTermo2 + TamTermo1);
}

Please note that BufferVar and BufferTab are two external int * (in th case, int[] variables being marshalled from C#) which I do not instantiate in every function call to make the whole process faster. Still, this code is pretty slow for my needs. Can anyone please give me some suggestions, or, if possible, provide some better code?

Edit: The distance can't be bounded, I need the actual distance.

Thank you very much,

like image 885
Miguel Avatar asked Dec 21 '22 20:12

Miguel


2 Answers

1. Brute Force

Here is an implementation of the Levenshtein Distance in Python.

def levenshtein_matrix(lhs, rhs):
  def move(index): return (index+1)%2

  m = len(lhs)
  n = len(rhs)

  states = [range(n+1), [0,]*(n+1)]

  previous = 0
  current = 1

  for i in range(1, m+1):
    states[current][0] = i

    for j in range(1,n+1):
      add = states[current][j-1] + 1
      sub = states[previous][j] + 1
      repl = states[previous][j-1] + abs(cmp(lhs[i-1], rhs[j-1]))

      states[current][j] = min( repl, min(add,sub) )

    previous = move(previous)
    current = move(current)

  return states[previous][n]

It's the typical dynamic programming algorithm, just taking advantage that since one only need the last row, keeping only two rows at a time is sufficient.

For a C++ implementation, you might look at LLVM's one (line 70-130), note the use of a stack allocated array of fixed size, replaced only when necessary by a dynamically allocated array.

I just can't follow up your code to try and diagnose it... so let's change the angle of attack. Instead of micro-optimizing the distance, we'll change the algorithm altogether.

2. Doing better: using a Dictionary

One of the issue you face is that you could do much better.

The first remark is that the distance is symmetric, though it doesn't change the overall complexity it will halve the time necessary.

The second is that since you actually have a dictionary of known words, you can build on that: "actor" and "actual" share a common prefix ("act") and thus you need not recompute the first stages.

This can be exploited using a Trie (or any other sorted structure) to store your words. Next you will take one word, and compute its distance relatively to all of the words stored in the dictionary, taking advantage of the prefixes.

Let's take an example dic = ["actor", "actual", "addict", "atchoum"] and we want to compute the distance for word = "atchoum" (we remove it from the dictionary at this point)

  1. Initialize the matrix for the word "atchoum": matrix = [[0, 1, 2, 3, 4, 5, 6, 7]]
  2. Pick the next word "actor"
  3. Prefix = "a", matrix = [[0, 1, 2, 3, 4, 5, 6, 7], [1, 0, 1, 2, 3, 4, 5, 6]]
  4. Prefix = "ac", matrix = [[0, 1, 2, 3, 4, 5, 6, 7], [1, 0, 1, 2, 3, 4, 5, 6], [2, 1, 1, 2, 3, 4, 5, 6]]
  5. Prefix = "act", matrix = [[..], [..], [..], [..]]
  6. Continue until "actor", you have your distance
  7. Pick the next word "actual", rewind the matrix until the prefix is a prefix of our word, here up to "act"
  8. Prefix = "actu", matrix = [[..], [..], [..], [..], [..]]
  9. Continue until "actual"
  10. Continue for the other words

What's important here is the rewind step, by preserving the computation done for the previous word, with which you share a good-length prefix, you effectively save a lot of work.

Note that this is trivially implemented with a simple stack and does not require any recursive call.

like image 179
Matthieu M. Avatar answered Dec 31 '22 00:12

Matthieu M.


Try the simple approach first - don't use pointers and unsafe code - just code plain ordinary C#... but use the correct algorithm.

There is a simple and efficient algorithm on Wikipedia that uses dynamic programming and runs O(n*m) where n and m are the lengths of the inputs. I suggest you try implementing that algorithm first, as it is described there and only start optimizing it after you've implemented it, measured the performance and found it to be insufficient.

See also the section Possible improvements where it says:

By examining diagonals instead of rows, and by using lazy evaluation, we can find the Levenshtein distance in O(m (1 + d)) time (where d is the Levenshtein distance), which is much faster than the regular dynamic programming algorithm if the distance is small

If I had to guess where the problem is I'd probably start by looking at this line that runs inside two loops:

*(BufferTab + CompLinha * IndLinha + IndCol) = ((*(Termo1 + (IndCol + Inicio1 - 1)) == *(Termo2 + (IndLinha + Inicio2 - 1))) ? *(BufferTab + CompLinha * (IndLinha - 1) + (IndCol - 1)) : ((*(BufferTab + CompLinha * (IndLinha - 1) + IndCol) < *(BufferTab + CompLinha * (IndLinha - 1) + (IndCol - 1))) ? ((*(BufferTab + CompLinha * IndLinha + (IndCol - 1)) < *(BufferTab + CompLinha * (IndLinha - 1) + IndCol)) ? *(BufferTab + CompLinha * IndLinha + (IndCol - 1)) : *(BufferTab + CompLinha * (IndLinha - 1) + IndCol)) : ((*(BufferTab + CompLinha * IndLinha + (IndCol - 1)) < *(BufferTab + CompLinha * (IndLinha - 1) + (IndCol - 1))) ? *(BufferTab + CompLinha * IndLinha + (IndCol - 1)) : *(BufferTab + CompLinha * (IndLinha - 1) + (IndCol - 1)))) + 1);

There appears to be a lot of duplication there though it's hard for me to spot exactly what's going on. Could you factor some of that out? And you definitely need to make it more readable.

like image 33
Mark Byers Avatar answered Dec 31 '22 00:12

Mark Byers