I am using Python (and have access to pandas, numpy, scipy).
I have two sets strings set A and set B. Each set A and B contains c. 2000 elements (each element being a string). The strings are around 50-100 characters long comprising up to c. 20 words (these sets may get much larger).
I wish to check if an member of set A is also a member of set B.
Now I am thinking a naive implementation can be visualised as a matrix where members in A and B are compared to one another (e.g. A1 == B1, A1 == B2, A1 == B3 and so on...) and the booleans (0, 1) from the comparison comprise the elements of the matrix.
What is the best way to implement this efficiently?
Two further elaborations:
(i) I am also thinking that for larger sets I may use a Bloom Filter (e.g. using PyBloom, pybloomfilter) to hash each string (i.e. I dont mind fasle positives so much...). Is this a good approach or are there other strategies I should consider?
(ii) I am thinking of including a Levenshtein distance match between strings (which I know can be slow) as I may need fuzzy matches - is there a way of combining this with the approach in (i) or otherwise making it more efficient?
Thanks in advance for any help!
Java String compareTo() Method The method returns 0 if the string is equal to the other string. A value less than 0 is returned if the string is less than the other string (less characters) and a value greater than 0 if the string is greater than the other string (more characters).
There are three ways to compare String in Java: By Using equals() Method. By Using == Operator. By compareTo() Method.
Firstly, 2000 * 100 chars is'nt that big, you could use a set directly.
Secondly, if your strings are sorted, there is a quick way (which I found here)to compare them, as follows:
def compare(E1, E2):
i, j = 0, 0
I, J = len(E1), len(E2)
while i < I:
if j >= J or E1[i] < E2[j]:
print(E1[i], "is not in E2")
i += 1
elif E1[i] == E2[j]:
print(E1[i], "is in E2")
i, j = i + 1, j + 1
else:
j += 1
It is certainly slower than using a set, but it doesn't need the strings to be hold into memory (only two are needed at the same time).
For the Levenshtein thing, there is a C module which you can find on Pypi, and which is quite fast.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With