Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to find set of shortest subsequences with minimal collisions from set of strings

Tags:

I've got a list of strings like

  • Foobar
  • Foobaron
  • Foot
  • barstool
  • barfoo
  • footloose

I want to find the set of shortest possible sub-sequences that are unique to each string in the set; the characters in each sub-sequence do not need to be adjacent, just in order as they appear in the original string. For the example above, that would be (along other possibilities)

  • Fb (as unique to Foobar as it gets; collision with Foobaron unavoidable)
  • Fn (unique to Foobaron, no other ...F...n...)
  • Ft (Foot)
  • bs (barstool)
  • bf (barfoo)
  • e (footloose)

Is there an efficient way to mine such sequences and minimize the number of colliding strings (when collisions can't be avoided, e.g. when strings are substrings of other strings) from a given array of strings? More precisely, chosing the length N, what is the set of sub-sequences of up to N characters each that identify the original strings with the fewest number of collisions.

like image 994
user2722968 Avatar asked Apr 10 '17 21:04

user2722968


1 Answers

I would'nt really call that 'efficient', but you can do better than totally dumb like that:

words = ['Foobar', 'Foobaron', 'Foot', 'barstool', 'barfoo', 'footloose']
N = 2
n = len(words)
L = max([len(word) for word in words])

def generate_substrings(word, max_length=None):
    if max_length is None:
        max_length = len(word)
    set_substrings = set()
    set_substrings.add('')
    for charac in word:
        new_substr_list = []
        for substr in set_substrings:
            new_substr = substr + charac
            if len(new_substr) <= max_length:
                new_substr_list.append(new_substr)
        set_substrings.update(new_substr_list)
    return set_substrings

def get_best_substring_for_each(string_list=words, max_length=N):
    all_substrings = {}
    best = {}
    for word in string_list:
        for substring in generate_substrings(word, max_length=max_length):
            if substring not in all_substrings:
                all_substrings[substring] = 0
            all_substrings[substring] = all_substrings[substring] + 1
    for word in string_list:
        best_score = len(string_list) + 1
        best[word] = ''
        for substring in generate_substrings(word=word, max_length=max_length):
            if all_substrings[substring] < best_score:
                best[word] = substring
                best_score = all_substrings[substring]
    return best

print(get_best_substring_for_each(words, N))

This program prints the solution:

{'barfoo': 'af', 'Foobar': 'Fr', 'Foobaron': 'n', 'footloose': 'os', 'barstool': 'al', 'Foot': 'Ft'}

This can still be improved easily by a constant factor, for instance by storing the results of generate_substringsinstead of computing it twice.

The complexity is O(n*C(N, L+N)), where n is the number of words and L the maximum length of a word, and C(n, k) is the number of combinations with k elements out of n.

I don't think (not sure though) that you can do much better in the worst case, because it seems hard not to enumerate all possible substrings in the worst case (the last one to be evaluated could be the only one with no redundancy...). Maybe in average you can do better...

like image 167
gdelab Avatar answered Sep 24 '22 03:09

gdelab