Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Find all pairs of strings in two lists that contain no common characters

I have two lists of strings, and wish to find all pairs of strings between them that contain no common characters. e.g.

list1 = ['abc', 'cde']
list2 = ['aij', 'xyz', 'abc']

desired output = [('abc', 'xyz'), ('cde', 'aij'), ('cde', 'xyz')]

I need this to be as efficient as possible because I am processing lists with millions of strings. At the moment, my code follows this general pattern:

output = []

for str1 in list1:    
    for str2 in list2:
        if len(set(str1) & set(str2)) == 0: 
             output.append((str1, str2))

This is O(n^2) and is taking many hours to run, does anyone have some suggestions for how to speed this up? Perhaps there is a way to take advantage of characters in each string being sorted?

Thank you very much in advance!

like image 752
Cameron Chandler Avatar asked Sep 26 '20 13:09

Cameron Chandler


2 Answers

Here's another tack, focusing on lowering the set operations to bit twiddling and combining words that represent the same set of letters:

import collections
import string


def build_index(words):
    index = collections.defaultdict(list)
    for word in words:
        chi = sum(1 << string.ascii_lowercase.index(letter) for letter in set(word))
        index[chi].append(word)
    return index


def disjoint_pairs(words1, words2):
    index1 = build_index(words1)
    index2 = build_index(words2)
    for chi1, words1 in index1.items():
        for chi2, words2 in index2.items():
            if chi1 & chi2:
                continue
            for word1 in words1:
                for word2 in words2:
                    yield word1, word2


print(list(disjoint_pairs(["abc", "cde"], ["aij", "xyz", "abc"])))
like image 177
David Eisenstat Avatar answered Oct 27 '22 01:10

David Eisenstat


Try this and tell me if there is any improvement:

import itertools

[i for i in itertools.product(list1, list2) if len(i[0]+i[1])==len(set(i[0]+i[1]))]

Output:

[('abc', 'xyz'), ('cde', 'aij'), ('cde', 'xyz')]
like image 21
IoaTzimas Avatar answered Oct 26 '22 23:10

IoaTzimas