 

Optimized way to find similar strings in a collection

I have a large list of 27,000 strings and I have to find which pairs of strings are similar. For this I used the python-Levenshtein library to compute the similarity between 2 strings. With the code below this is straightforward:

from Levenshtein import jaro

count = 0
for i, coll1 in enumerate(college_list):
    for j, coll2 in enumerate(college_list):
        if abs(len(coll1) - len(coll2)) <= 15:
            similarity = jaro(coll1, coll2)
            if similarity >= 0.90 and similarity != 1.0:
                print("Probable Similar Strings")
                print(coll1 + " AND " + coll2)
                print("Similarity is %s" % similarity)
                print("=" * 20)
                count += 1

But as you can see, there are 2 for loops comparing every string against every other string, and the total number of such combinations is 729,000,000 (27,000 x 27,000).

My current code takes a long time to complete, and I have to vary the similarity threshold to get results that fit my use-case. Running multiple iterations of this code with different similarity thresholds will definitely take a lot of time.

Does a nicer and faster way exist to achieve the above functionality using numpy/pandas?
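One way to avoid re-running the whole scan for every threshold (a sketch, not part of the original post) is to score every pair once, store the scores, and then filter the stored list at as many thresholds as needed. Here difflib's SequenceMatcher from the standard library stands in for jaro, and college_list is a made-up sample:

```python
from difflib import SequenceMatcher
from itertools import combinations

def similarity(a, b):
    # Stand-in scorer; swap in Levenshtein.jaro if the library is installed.
    return SequenceMatcher(None, a, b).ratio()

college_list = ["IIT Delhi", "IIT-Delhi", "BITS Pilani"]  # made-up sample

# Score every unordered pair exactly once...
scores = [(a, b, similarity(a, b)) for a, b in combinations(college_list, 2)]

# ...then filter at any number of thresholds without rescoring.
for threshold in (0.80, 0.90):
    matches = [(a, b, s) for a, b, s in scores if threshold <= s < 1.0]
    print(threshold, matches)
```

The expensive part (scoring) runs once; each threshold sweep is just a cheap list filter.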

Anurag Sharma asked Feb 11 '23

1 Answer

Before thinking about switching to numpy, you should calculate the similarity only for j < i; that halves the computation needed, because Jaro similarity is symmetric (jaro(a, b) == jaro(b, a)).

See the example below: none of the "/" cells needs to be calculated, since jaro("aa", "aa") == 1 and jaro("ab", "aa") == jaro("aa", "ab").

i/j aa ab ac
aa   /  1  1
ab   /  /  1
ac   /  /  /
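The j < i idea above can be sketched with itertools.combinations, which yields each unordered pair exactly once. Since python-Levenshtein may not be installed everywhere, a minimal pure-Python Jaro implementation stands in for the library call, and college_list is a made-up sample:

```python
from itertools import combinations

def jaro(s1, s2):
    """Minimal pure-Python Jaro similarity (stand-in for Levenshtein.jaro)."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if not len1 or not len2:
        return 0.0
    # Characters match if equal and within half the longer length of each other.
    window = max(len1, len2) // 2 - 1
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, c in enumerate(s1):
        for j in range(max(0, i - window), min(i + window + 1, len2)):
            if not matched2[j] and s2[j] == c:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count transpositions among matched characters.
    transpositions, k = 0, 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                transpositions += 1
            k += 1
    transpositions //= 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3

# combinations() only visits j < i pairs, so the 27,000 x 27,000 grid
# shrinks to 27,000 * 26,999 / 2 comparisons.
college_list = ["MARTHA", "MARHTA", "DWAYNE", "DUANE"]  # made-up sample
for coll1, coll2 in combinations(college_list, 2):
    if abs(len(coll1) - len(coll2)) <= 15:
        similarity = jaro(coll1, coll2)
        if 0.90 <= similarity < 1.0:
            print(coll1, "AND", coll2, "->", round(similarity, 4))
# prints: MARTHA AND MARHTA -> 0.9444
```

Note that the diagonal (a string compared with itself) is also skipped automatically, matching the "/" cells in the table.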
Pierre.Sassoulas answered Feb 13 '23