I have a large list of 27,000 strings and need to find which pairs of strings are similar. For this I used the python-Levenshtein library to compute the similarity between two strings. With the code below this is straightforward:
from Levenshtein import jaro

count = 0
for i, coll1 in enumerate(college_list):
    for j, coll2 in enumerate(college_list):
        # Only compare strings whose lengths are reasonably close
        if abs(len(coll1) - len(coll2)) <= 15:
            similarity = jaro(coll1, coll2)
            if similarity >= 0.90 and similarity != 1.0:
                print "Probable Similar Strings"
                print coll1 + " AND " + coll2
                print "Similarity is %s" % (similarity)
                print "=" * 20
                count += 1
But as you can see, there are two for loops comparing every string with every other, for a total of 729,000,000 combinations (27,000 × 27,000).
My current code is taking a lot of time to complete, and I have to vary the similarity threshold to get results that fit my use case. Running multiple iterations of this code with different similarity thresholds will definitely take a lot of time.
Is there a nicer and faster way to achieve this functionality using numpy/pandas?
Before thinking about switching to numpy, I think you should calculate the similarity only for j < i: it halves the computation needed, since Jaro similarity is symmetric.
See the example below: none of the "/" entries need to be calculated, since jaro("aa", "aa") == 1 and jaro("ab", "aa") == jaro("aa", "ab"); a sketch of this follows the table.
i\j   aa   ab   ac
aa     /    1    1
ab     /    /    1
ac     /    /    /
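As a minimal sketch of that idea, assuming the python-Levenshtein library from the question and a small placeholder college_list: itertools.combinations enumerates each unordered pair exactly once (the j < i triangle, diagonal excluded), and caching the scores lets you sweep several thresholds without recomputing jaro.

from itertools import combinations
from Levenshtein import jaro

# Hypothetical sample data; replace with the real 27,000-entry list
college_list = ["MIT", "M.I.T.", "Stanford", "Stanford University"]

# Each unordered pair appears once, so jaro runs at most n*(n-1)/2 times
scores = []
for coll1, coll2 in combinations(college_list, 2):
    if abs(len(coll1) - len(coll2)) <= 15:  # same length filter as the question
        scores.append((jaro(coll1, coll2), coll1, coll2))

# The scores are cached, so different thresholds can be tried
# without re-running the expensive pairwise loop
for threshold in (0.85, 0.90, 0.95):
    matches = [(s, a, b) for s, a, b in scores if threshold <= s < 1.0]
    print("threshold %.2f -> %d probable matches" % (threshold, len(matches)))

This still visits O(n^2) pairs, but only once, and the cheap length filter prunes most pairs before jaro is ever called.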