How can I create a threshold for similar strings using Levenshtein distance and account for typos?

Tags:

We recently encountered an interesting problem at work where we discovered duplicate user submitted data in our database. We realized that the Levenshtein distance between most of this data was simply the difference between the 2 strings in question. That indicates that if we simply add characters from one string into the other then we end up with the same string, and for most things this seems like the best way for us to account for items that are duplicate.

We also want to account for typos. So we started to think about on average how often do people make typos online per word, and try to use that data within this distance. We could not find any such statistic.

Is there any way to account for typos when creating this sort of threshold for a match of data?

Let me know if I can clarify!

782

asked Jul 27 '10 03:07

Parris

1 Answers

First off, Levenshtein distance is defined as the minimum number of edits required to transform string A to string B, where an edit is the insertion, or deletion of a single character, or the replacement of a character with another character. So it's very much the "difference between two strings", for a certain definition of distance. =)

It sounds like you're looking for a distance function F(A, B) that gives a distance between strings A and B and a threshold N where strings with distance less than N from each other are candidates for typos. In addition to Levenshtein distance you might also consider Needleman–Wunsch. It's basically the same thing but it lets you provide a function for how close a given character is to another character. You could use that algorithm with a set of weights that reflect the positions of keys on a QWERTY keyboard to do a pretty good job of finding typos. This would have issues with international keyboards though.

If you have k strings and you want to find potential typos, the number of comparisons you need to make is O(k^2). In addition, each comparison is O(len(A)*len(B)). So if you have a million strings you're going to find yourself in trouble if you do things naively. Here are a few suggestions on how to speed things up:

Apologies if this is obvious, but Levenshtein distance is symmetrical, so make sure you aren't computing F(A, B) and F(B, A).
abs(len(A) - len(B)) is a lower bound on the distance between strings A and B. So you can skip checking strings whose lengths are too different.

One issue you might run into is that "1st St." has a pretty high distance from "First Street", even though you probably want to consider those to be identical. The easiest way to handle this is probably to transform strings into a canonical form before doing the comparisons. So you might make all strings lowercase, use a dictionary that maps "1st" to "first", etc. That dictionary might get pretty big, but I don't know a better way to deal with this issues.

Since you tagged this question with php, I'm assuming you want to use php for this. PHP has a built-in levenshtein() function but both strings have to be 255 characters or less. If that's not long enough you'll have to make your own. Alternatively, you investigate using Python's difflib.

181

answered Nov 10 '22 07:11

David Alves

Related questions
                            
                                Proper mechanism for sending PHP errors to the client
                            
                                How to convert date and time into unix timestamp in php?
                            
                                How to make PHP Curl request not to wait?
                            
                                How to use two forms and submit once?
                            
                                Anyone know any good backend user online file manager? [closed]
                            
                                What should a PHP/MySQL web developer know about servers?
                            
                                File download in IE6 gets the wrong filename
                            
                                Retrieve Domain Name instead of IP
                            
                                Sessions and cookies
                            
                                How to create index faster?
                            
                                What exactly does it mean when $_FILES is empty?
                            
                                Is it good practice to display the session id in the url?
                            
                                Ideal way of dealing with Solr results in PHP?
                            
                                How to find crc32 of big files?
                            
                                Cutting down a length of a PHP string and inserting an ellipses
                            
                                Web technologies for an embedded server
                            
                                Recognizing a character to be Chinese and get Chinese "pinyin" phonetics from simplified characters?
                            
                                select query for time interval in mysql and php
                            
                                CRUD Class for PHP (Simple but will Allow Relational Data)
                            
                                using conditions in magento layout xml

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

How can I create a threshold for similar strings using Levenshtein distance and account for typos?

Tags:

php

mysql

puzzle

levenshtein-distance

Parris

People also ask

1 Answers

David Alves

Recent Activity

Donate For Us