Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Best way to compare two large sets of strings in Python

I am using Python (and have access to pandas, numpy, scipy).

I have two sets strings set A and set B. Each set A and B contains c. 2000 elements (each element being a string). The strings are around 50-100 characters long comprising up to c. 20 words (these sets may get much larger).

I wish to check if an member of set A is also a member of set B.

Now I am thinking a naive implementation can be visualised as a matrix where members in A and B are compared to one another (e.g. A1 == B1, A1 == B2, A1 == B3 and so on...) and the booleans (0, 1) from the comparison comprise the elements of the matrix.

What is the best way to implement this efficiently?

Two further elaborations:

(i) I am also thinking that for larger sets I may use a Bloom Filter (e.g. using PyBloom, pybloomfilter) to hash each string (i.e. I dont mind fasle positives so much...). Is this a good approach or are there other strategies I should consider?

(ii) I am thinking of including a Levenshtein distance match between strings (which I know can be slow) as I may need fuzzy matches - is there a way of combining this with the approach in (i) or otherwise making it more efficient?

Thanks in advance for any help!

like image 627
user7289 Avatar asked Jun 23 '13 17:06

user7289


People also ask

How do I compare bigger strings?

Java String compareTo() Method The method returns 0 if the string is equal to the other string. A value less than 0 is returned if the string is less than the other string (less characters) and a value greater than 0 if the string is greater than the other string (more characters).

What are the 3 ways to compare two string objects?

There are three ways to compare String in Java: By Using equals() Method. By Using == Operator. By compareTo() Method.


1 Answers

Firstly, 2000 * 100 chars is'nt that big, you could use a set directly.

Secondly, if your strings are sorted, there is a quick way (which I found here)to compare them, as follows:

def compare(E1, E2):
    i, j = 0, 0
    I, J = len(E1), len(E2)
    while i < I:
        if j >= J or E1[i] < E2[j]:
            print(E1[i], "is not in E2")
            i += 1
        elif E1[i] == E2[j]:
            print(E1[i], "is in E2")
            i, j = i + 1, j + 1
        else:
            j += 1

It is certainly slower than using a set, but it doesn't need the strings to be hold into memory (only two are needed at the same time).

For the Levenshtein thing, there is a C module which you can find on Pypi, and which is quite fast.

like image 67
michaelmeyer Avatar answered Sep 17 '22 23:09

michaelmeyer