Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How can I create a threshold for similar strings using Levenshtein distance and account for typos?

We recently encountered an interesting problem at work where we discovered duplicate user submitted data in our database. We realized that the Levenshtein distance between most of this data was simply the difference between the 2 strings in question. That indicates that if we simply add characters from one string into the other then we end up with the same string, and for most things this seems like the best way for us to account for items that are duplicate.

We also want to account for typos. So we started to think about on average how often do people make typos online per word, and try to use that data within this distance. We could not find any such statistic.

Is there any way to account for typos when creating this sort of threshold for a match of data?

Let me know if I can clarify!

like image 782
Parris Avatar asked Jul 27 '10 03:07

Parris


People also ask

How do you use Levenshtein distance?

The Levenshtein distance is usually calculated by preparing a matrix of size (M+1)x(N+1) —where M and N are the lengths of the 2 words—and looping through said matrix using 2 for loops, performing some calculations within each iteration.

How do you normalize Levenshtein distance?

To quantify the similarity, we normalize the edit distance. One approach is to calculate edit distance as usual and then divide it by the number of operations, usually called the length of the edit path.

How do you check for string similarity?

The way to check the similarity between any data point or groups is by calculating the distance between those data points. In textual data as well, we check the similarity between the strings by calculating the distance between one text to another text.

What is damerau Levenshtein distance used for?

While the original motivation was to measure distance between human misspellings to improve applications such as spell checkers, Damerau–Levenshtein distance has also seen uses in biology to measure the variation between protein sequences.


1 Answers

First off, Levenshtein distance is defined as the minimum number of edits required to transform string A to string B, where an edit is the insertion, or deletion of a single character, or the replacement of a character with another character. So it's very much the "difference between two strings", for a certain definition of distance. =)

It sounds like you're looking for a distance function F(A, B) that gives a distance between strings A and B and a threshold N where strings with distance less than N from each other are candidates for typos. In addition to Levenshtein distance you might also consider Needleman–Wunsch. It's basically the same thing but it lets you provide a function for how close a given character is to another character. You could use that algorithm with a set of weights that reflect the positions of keys on a QWERTY keyboard to do a pretty good job of finding typos. This would have issues with international keyboards though.

If you have k strings and you want to find potential typos, the number of comparisons you need to make is O(k^2). In addition, each comparison is O(len(A)*len(B)). So if you have a million strings you're going to find yourself in trouble if you do things naively. Here are a few suggestions on how to speed things up:

  • Apologies if this is obvious, but Levenshtein distance is symmetrical, so make sure you aren't computing F(A, B) and F(B, A).
  • abs(len(A) - len(B)) is a lower bound on the distance between strings A and B. So you can skip checking strings whose lengths are too different.

One issue you might run into is that "1st St." has a pretty high distance from "First Street", even though you probably want to consider those to be identical. The easiest way to handle this is probably to transform strings into a canonical form before doing the comparisons. So you might make all strings lowercase, use a dictionary that maps "1st" to "first", etc. That dictionary might get pretty big, but I don't know a better way to deal with this issues.

Since you tagged this question with php, I'm assuming you want to use php for this. PHP has a built-in levenshtein() function but both strings have to be 255 characters or less. If that's not long enough you'll have to make your own. Alternatively, you investigate using Python's difflib.

like image 181
David Alves Avatar answered Nov 10 '22 07:11

David Alves