Let's say that you have a list of 10,000 email addresses, and you'd like to find what some of the closest "neighbors" in this list are - defined as email addresses that are suspiciously close to other email addresses in your list. I'm aware of how to calculate the Levenshtein distance between two strings (thanks to this question), which will give me a score of how many operations are needed to transform one string into another. Let's say that I define "suspiciously close to another email address" as two strings having a Levenshtein score less than N. Is there a more efficient way to find pairs of strings whose score is lower than this threshold besides comparing every possible string to every other possible string in the list? In other words, can this type of problem be solved quicker than <code>O(n^2)</code>? Is Levenshtein score a poor choice of algorithms for this problem?

This problem is known as clustering and is a part of a bigger deduplication problem (where you get to decide which member of the cluster is "the right" one), also known as merge-purge. I once read a few research papers on exactly this topic (the names are below) and basically, the authors used a limited-size sliding window over a sorted list of strings. They would only compare (using an edit distance algorithm) N*N strings inside the window, thereby reducing the computational complexity. If any two strings looked similar, they were combined into a cluster (by inserting a record into a separate cluster table). The first pass through the list was followed by a second pass where the strings were reversed before getting sorted. This way the strings with different heads had another chance to get close enough to be evaluated as part of the same window. On this second pass, if a string looked close enough to two (or more) strings in the window, and those strings were already parts of their own clusters (found by the first pass), the two clusters would then be merged (by updating the cluster table) and the current string would be added to the newly merged cluster. This clustering approach is known as the union-find algorithm. Then they improved the algorithm by replacing the window with a list of top X substantially unique prototypes. Each new string would be compared to each of the top X prototypes. If string looked close enough to one of the prototypes, it would then be added to the prototype's cluster. If none of the prototypes looked similar enough, the string would become a new prototype, pushing the oldest prototype out of the top X list. (There was an heuristic logic employed to decide which of the strings in the prototype's cluster should be used as the new prototype representing the entire cluster). Again, if the string looked similar to several prototypes, all of their clusters would be merged. I once implemented this algorithm for deduplication of name/address records with sizes of the lists being around 10-50 million records and it worked pretty damn fast (and found duplicates well too). Overall for such problems, the trickiest part of course is finding the right value of the similarity threshold. The idea is to capture all the dups w/o producing too many false positives. Data with different characteristics tends to require different thresholds. The choice of an edit-distance algorithm is also important as some algorithms are better for OCR errors while others are better for typos and yet others are better for phonetic errors (such as when getting a name over the phone). Once the clustering algorithm is implemented, a good way to test it is to get a list of unique samples and artificially mutate each sample to produce its variations, while preserving the fact that all the variations have come from the same parent. This list is then shuffled and fed to the algorithm. Comparing the original clustering with the clustering produced by the deduplication algorithm will give you the efficiency score. <h3>Bibliography:</h3> Hernandez M. 1995, The Merge/Purge Problem for Large Databases. Monge A. 1997, An Efficient Domain-Independent Algorithm for Detecting Approximately Duplicate Database Records.

Efficient way of calculating likeness scores of strings when sample size is large?

Tags:

string

algorithm

complexity-theory

cluster-analysis

edit-distance

Let's say that you have a list of 10,000 email addresses, and you'd like to find what some of the closest "neighbors" in this list are - defined as email addresses that are suspiciously close to other email addresses in your list.

I'm aware of how to calculate the Levenshtein distance between two strings (thanks to this question), which will give me a score of how many operations are needed to transform one string into another.

Let's say that I define "suspiciously close to another email address" as two strings having a Levenshtein score less than N.

Is there a more efficient way to find pairs of strings whose score is lower than this threshold besides comparing every possible string to every other possible string in the list? In other words, can this type of problem be solved quicker than O(n^2)?

Is Levenshtein score a poor choice of algorithms for this problem?

426

asked Oct 22 '09 20:10

matt b

2 Answers

Yup - you can find all strings within a given distance of a string in O(log n) time by using a BK-Tree. Alternate solutions involving generating every string with distance n may be faster for a levenshtein distance of 1, but the amount of work rapidly balloons out of control for longer distances.

102

answered Sep 20 '22 12:09

Nick Johnson

This problem is known as clustering and is a part of a bigger deduplication problem (where you get to decide which member of the cluster is "the right" one), also known as merge-purge.

I once read a few research papers on exactly this topic (the names are below) and basically, the authors used a limited-size sliding window over a sorted list of strings. They would only compare (using an edit distance algorithm) N*N strings inside the window, thereby reducing the computational complexity. If any two strings looked similar, they were combined into a cluster (by inserting a record into a separate cluster table).

The first pass through the list was followed by a second pass where the strings were reversed before getting sorted. This way the strings with different heads had another chance to get close enough to be evaluated as part of the same window. On this second pass, if a string looked close enough to two (or more) strings in the window, and those strings were already parts of their own clusters (found by the first pass), the two clusters would then be merged (by updating the cluster table) and the current string would be added to the newly merged cluster. This clustering approach is known as the union-find algorithm.

Then they improved the algorithm by replacing the window with a list of top X substantially unique prototypes. Each new string would be compared to each of the top X prototypes. If string looked close enough to one of the prototypes, it would then be added to the prototype's cluster. If none of the prototypes looked similar enough, the string would become a new prototype, pushing the oldest prototype out of the top X list. (There was an heuristic logic employed to decide which of the strings in the prototype's cluster should be used as the new prototype representing the entire cluster). Again, if the string looked similar to several prototypes, all of their clusters would be merged.

I once implemented this algorithm for deduplication of name/address records with sizes of the lists being around 10-50 million records and it worked pretty damn fast (and found duplicates well too).

Overall for such problems, the trickiest part of course is finding the right value of the similarity threshold. The idea is to capture all the dups w/o producing too many false positives. Data with different characteristics tends to require different thresholds. The choice of an edit-distance algorithm is also important as some algorithms are better for OCR errors while others are better for typos and yet others are better for phonetic errors (such as when getting a name over the phone).

Once the clustering algorithm is implemented, a good way to test it is to get a list of unique samples and artificially mutate each sample to produce its variations, while preserving the fact that all the variations have come from the same parent. This list is then shuffled and fed to the algorithm. Comparing the original clustering with the clustering produced by the deduplication algorithm will give you the efficiency score.

Bibliography:

Hernandez M. 1995, The Merge/Purge Problem for Large Databases.

Monge A. 1997, An Efficient Domain-Independent Algorithm for Detecting Approximately Duplicate Database Records.

answered Sep 19 '22 12:09

11 revs

Related questions
                            
                                Finding fastest path at additional condition
                            
                                Dependency Algorithm - find a minimum set of packages to install
                            
                                An efficient algorithm to count the number of integer grids
                            
                                Keeping track of the median of an expanding array
                            
                                Rolling median in C - Turlach implementation
                            
                                Floyd-Warshall: all shortest paths
                            
                                What is the state of the art in computer chess tree searching?
                            
                                Assembly mod algorithm on processor with no division operator
                            
                                Testing whether a polygon is simple or complex
                            
                                How to implement a dictionary (Trie vs HashTable and important issues)?
                            
                                Given a word, convert it into a palindrome with minimum addition of letters to it [closed]
                            
                                Calculate which products together would deliver the requested power
                            
                                What are useful ranking algorithms for documents without links?
                            
                                Number of all increasing subsequences in given sequence?
                            
                                Algorithm Challenge: Generate Continued Fractions for a float
                            
                                On complexity of recursive descent parsers
                            
                                Post-order Graph Traversal? [closed]
                            
                                Smart way to generate permutation and combination of String
                            
                                Detect black dots from color background
                            
                                Bitwise operator advantages in StringBuilder

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With