Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

All-to-All comparison of two lists in Python

I'm struggling with some performance complications. The task in hand is to extract the similarity value between two strings. For this I am using fuzzywuzzy:

from fuzzywuzzy import fuzz

print fuzz.ratio("string one", "string two")
print fuzz.ratio("string one", "string two which is significantly different")
result1 80
result2 38

However, this is OK. The problem that I'm facing is that I have two lists, one has 1500 rows and the other several thousand. I need to compare all elements of the first agains all elements of the second one. Simple for in a for loop will take ridiculously big amount of time to compute.

If anyone has a suggestion how can I speed this up, it would be highly appreciated.

like image 586
VnC Avatar asked Sep 19 '25 21:09

VnC


2 Answers

I've made something on my own for you (python 2.7):

from __future__ import division

import time
from itertools import izip

from fuzzywuzzy import fuzz


one = "different simliar"
two = "similar"


def compare(first, second):
    smaller, bigger = sorted([first, second], key=len)

    s_smaller= smaller.split()
    s_bigger = bigger.split()
    bigger_sets = [set(word) for word in s_bigger]

    counter = 0
    for word in s_smaller:
        if set(word) in bigger_sets:
            counter += len(word)
    if counter:
        return counter/len(' '.join(s_bigger))*100 # percentage match
    return counter


start_time = time.time()
print "match: ", compare(one, two)
compare_time = time.time() - start_time
print "compare: --- %s seconds ---" % (compare_time)
start_time = time.time()
print "match: ", fuzz.ratio(one, two)
fuzz_time = time.time() - start_time
print "fuzzy: --- %s seconds ---" % (fuzz_time)
print
print "<simliar or similar>/<length of bigger>*100%"
print 7/len(one)*100
print
print "Equals?"
print 7/len(one)*100 == compare(one, two)
print
print "Faster than fuzzy?"
print compare_time < fuzz_time

So I think mine is faster, but more accurate for you? You decide.

EDIT Now is not only faster, but also more accurate.

Result:

match:  41.1764705882
compare: --- 4.19616699219e-05 seconds ---
match:  50
fuzzy: --- 7.39097595215e-05 seconds ---

<simliar or similar>/<length of bigger>*100%
41.1764705882

Equals?
True

Faster than fuzzy?
True

Of course if you would have words check like fuzzywuzzy does, then here you go:

from __future__ import division
from itertools import izip
import time

from fuzzywuzzy import fuzz


one = "different simliar"
two = "similar"


def compare(first, second):
    smaller, bigger = sorted([first, second], key=len)

    s_smaller= smaller.split()
    s_bigger = bigger.split()
    bigger_sets = [set(word) for word in s_bigger]

    counter = 0
    for word in s_smaller:
        if set(word) in bigger_sets:
            counter += 1
    if counter:
        return counter/len(s_bigger)*100 # percentage match
    return counter


start_time = time.time()
print "match: ", compare(one, two)
compare_time = time.time() - start_time
print "compare: --- %s seconds ---" % (compare_time)
start_time = time.time()
print "match: ", fuzz.ratio(one, two)
fuzz_time = time.time() - start_time
print "fuzzy: --- %s seconds ---" % (fuzz_time)
print
print "Equals?"
print fuzz.ratio(one, two) == compare(one, two)
print
print "Faster than fuzzy?"
print compare_time < fuzz_time

Result:

match:  50.0
compare: --- 7.20024108887e-05 seconds ---
match:  50
fuzzy: --- 0.000125169754028 seconds ---

Equals?
True

Faster than fuzzy?
True
like image 73
turkus Avatar answered Sep 21 '25 10:09

turkus


If you need to count the number of times each of the statements appear then no, there is no way I know of to get a huge speedup over the n^2 operations needed to compare the elements in each list. You might be able to avoid some string matching by using the length to rule out the possibility that a match could occur but you still have nested for loops. You would probably spend far more time optimizing it than the amount of processing time it would save you.

like image 28
qfwfq Avatar answered Sep 21 '25 12:09

qfwfq