I have two lists of strings, and wish to find all pairs of strings between them that contain no common characters. e.g.
list1 = ['abc', 'cde']
list2 = ['aij', 'xyz', 'abc']
desired output = [('abc', 'xyz'), ('cde', 'aij'), ('cde', 'xyz')]
I need this to be as efficient as possible because I am processing lists with millions of strings. At the moment, my code follows this general pattern:
output = []
for str1 in list1:
for str2 in list2:
if len(set(str1) & set(str2)) == 0:
output.append((str1, str2))
This is O(n^2) and is taking many hours to run, does anyone have some suggestions for how to speed this up? Perhaps there is a way to take advantage of characters in each string being sorted?
Thank you very much in advance!
Here's another tack, focusing on lowering the set operations to bit twiddling and combining words that represent the same set of letters:
import collections
import string
def build_index(words):
index = collections.defaultdict(list)
for word in words:
chi = sum(1 << string.ascii_lowercase.index(letter) for letter in set(word))
index[chi].append(word)
return index
def disjoint_pairs(words1, words2):
index1 = build_index(words1)
index2 = build_index(words2)
for chi1, words1 in index1.items():
for chi2, words2 in index2.items():
if chi1 & chi2:
continue
for word1 in words1:
for word2 in words2:
yield word1, word2
print(list(disjoint_pairs(["abc", "cde"], ["aij", "xyz", "abc"])))
Try this and tell me if there is any improvement:
import itertools
[i for i in itertools.product(list1, list2) if len(i[0]+i[1])==len(set(i[0]+i[1]))]
Output:
[('abc', 'xyz'), ('cde', 'aij'), ('cde', 'xyz')]
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With