Finding groups of similar strings in a large set of strings

I have a reasonably large set of strings (say 100) containing a number of subgroups characterised by their similarity. I am trying to find or design an algorithm that finds these groups reasonably efficiently.

As an example let's say the input list is on the left below, and the output groups are on the right.

Input                           Output
-----------------               -----------------
Jane Doe                        Mr Philip Roberts
Mr Philip Roberts               Phil Roberts     
Foo McBar                       Philip Roberts   
David Jones                     
Phil Roberts                    Foo McBar        
Davey Jones            =>         
John Smith                      David Jones      
Philip Roberts                  Dave Jones       
Dave Jones                      Davey Jones      
Jonny Smith                     
                                Jane Doe         

                                John Smith       
                                Jonny Smith

Does anybody know of any ways to solve this reasonably efficiently?

The standard method for finding similar strings seems to be the Levenshtein distance, but I can't see how I can make good use of that here without having to compare every string to every other string in the list, and then somehow decide on a difference threshold for deciding if the two strings are in the same group or not.
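
For a list of around 100 strings, comparing every pair costs only about 4,950 distance computations, so the brute-force route may be acceptable. The following is a minimal sketch of that idea, assuming a hand-picked threshold (`max_distance`) and single-linkage merging via union-find. Both choices, and the function names, are assumptions of this example rather than anything the question settles.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) by dynamic programming."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


def group_by_threshold(strings, max_distance):
    """Put two strings in the same group if a chain of pairs, each within
    max_distance of the next, connects them (single linkage)."""
    parent = list(range(len(strings)))

    def find(i):
        # Follow parent links to the root, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(strings)):
        for j in range(i + 1, len(strings)):
            if levenshtein(strings[i], strings[j]) <= max_distance:
                parent[find(i)] = find(j)

    groups = {}
    for i, s in enumerate(strings):
        groups.setdefault(find(i), []).append(s)
    return list(groups.values())


names = ["Jane Doe", "Mr Philip Roberts", "Foo McBar", "David Jones",
         "Phil Roberts", "Davey Jones", "John Smith", "Philip Roberts",
         "Dave Jones", "Jonny Smith"]
for group in group_by_threshold(names, max_distance=3):
    print(group)
```

With `max_distance=3` this reproduces the five groups from the example above, but the threshold is data-dependent, and single linkage can chain unrelated strings together through intermediate ones.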

An alternative would be an algorithm that hashes strings down to an integer, where similar strings hash to integers which are close together on the number line. I have no idea what algorithm that would be though, if one even exists.

Does anybody have any thoughts/pointers?

UPDATE: @Will A: Perhaps names weren't as good an example as I first thought. As a starting point I think I can assume that in the data I will be working with, a small change in a string will not make it jump from one group to another.

asked Jul 25 '10 13:07

latentflip

3 Answers

Another popular method is to associate the strings by their Jaccard index. Start with http://en.wikipedia.org/wiki/Jaccard_index.

Here's an article about using the Jaccard index (and a couple of other methods) to solve a problem like yours:

http://matpalm.com/resemblance/
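
To make the Jaccard index concrete: it is the size of the intersection of two sets divided by the size of their union. A minimal sketch, assuming each string is turned into a set of lowercase character bigrams (that tokenisation, and the function names, are choices made for this example; word sets or longer shingles would work too):

```python
def bigrams(text: str) -> set:
    """Set of lowercase character bigrams (2-character shingles) in text."""
    t = text.lower()
    return {t[i:i + 2] for i in range(len(t) - 1)}


def jaccard(a: str, b: str) -> float:
    """Jaccard index |A & B| / |A | B| of the two bigram sets."""
    sa, sb = bigrams(a), bigrams(b)
    if not sa and not sb:
        return 1.0  # two empty sets are treated as identical
    return len(sa & sb) / len(sa | sb)


print(jaccard("Phil Roberts", "Philip Roberts"))  # 10 shared bigrams out of 14
print(jaccard("Phil Roberts", "Foo McBar"))       # no shared bigrams
```

A value near 1 means the strings share most of their bigrams; a cut-off for "same group" still has to be chosen for the data at hand.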

answered Oct 16 '22 23:10

Nordic Mainframe

The problem you're trying to solve is a typical clustering problem.

Start with the simple K-Means algorithm and use Levenshtein distance as the function for calculating the distance between elements and cluster centers.

BTW, an algorithm for calculating Levenshtein distance is implemented in Apache Commons StringUtils: StringUtils.getLevenshteinDistance

The main problem with K-Means is that you must specify the number of clusters (subgroups, in your terms) up front. So you have two options: improve K-Means with some heuristic, or use another clustering algorithm that doesn't require specifying the number of clusters (but that algorithm may perform worse and can be very difficult to implement yourself).
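
One practical wrinkle: strings have no arithmetic mean, so a cluster "center" under Levenshtein distance has to be one of the strings themselves. The sketch below therefore uses the k-medoids variant of the idea rather than K-Means proper. It assumes distinct input strings, `k` no larger than their number, and a deterministic farthest-first initialisation; all of these, and the function names, are assumptions of this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) by dynamic programming."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


def k_medoids(strings, k, max_iter=20):
    """Cluster distinct strings around k medoids (k <= len(strings))."""
    # Farthest-first initialisation: start from the first string, then keep
    # adding the string that is farthest from all medoids chosen so far.
    medoids = [strings[0]]
    while len(medoids) < k:
        medoids.append(max(
            strings, key=lambda s: min(levenshtein(s, m) for m in medoids)))

    for _ in range(max_iter):
        # Assignment step: each string joins its nearest medoid.
        clusters = {m: [] for m in medoids}
        for s in strings:
            nearest = min(medoids, key=lambda m: levenshtein(s, m))
            clusters[nearest].append(s)
        # Update step: the new medoid of a cluster is the member with the
        # smallest total distance to the other members.
        new_medoids = [
            min(members, key=lambda c: sum(levenshtein(c, o) for o in members))
            for members in clusters.values()
        ]
        if set(new_medoids) == set(medoids):
            break  # converged
        medoids = new_medoids
    return list(clusters.values())


print(k_medoids(["aaaa", "aaab", "zzzz", "zzzy"], k=2))
```

The number of clusters `k` still has to be supplied, which is exactly the limitation described above.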

answered Oct 16 '22 22:10

Roman

If we're talking about actual pronounceable words, comparing (the start of) their metaphone codes might be of assistance:

MRFLPRBRTS: Mr Philip Roberts
FLRBRTS: Phil Roberts   
FLPRBRTS: Philip Roberts 
FMKBR: Foo McBar      
TFTJNS: David Jones    
TFJNS: Dave Jones     
TFJNS: Davey Jones    
JNT: Jane Doe       
JNSM0: John Smith     
JNSM0: Jonny Smith
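
As a rough illustration of how such codes could be used, the sketch below groups names whose codes are identical. The codes are copied from the listing above and taken as given rather than recomputed; the `group_by_code` helper is a hypothetical name for this example.

```python
from collections import defaultdict

# (metaphone code, name) pairs copied from the listing above.
coded_names = [
    ("MRFLPRBRTS", "Mr Philip Roberts"),
    ("FLRBRTS", "Phil Roberts"),
    ("FLPRBRTS", "Philip Roberts"),
    ("FMKBR", "Foo McBar"),
    ("TFTJNS", "David Jones"),
    ("TFJNS", "Dave Jones"),
    ("TFJNS", "Davey Jones"),
    ("JNT", "Jane Doe"),
    ("JNSM0", "John Smith"),
    ("JNSM0", "Jonny Smith"),
]


def group_by_code(pairs):
    """Map each code to the list of names that share it exactly."""
    groups = defaultdict(list)
    for code, name in pairs:
        groups[code].append(name)
    return dict(groups)


for code, members in group_by_code(coded_names).items():
    print(code, members)
```

Exact matching pairs up Dave/Davey Jones and John/Jonny Smith, but leaves David Jones and the three Roberts variants in separate buckets, which suggests the codes still need a fuzzy comparison (for example an edit distance on the codes) rather than strict equality.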

answered Oct 16 '22 23:10

Wrikken

Tags: string, algorithm, design-patterns