Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Longest matching substring irrespective of the order of characters

I could not find an answer to the following hypothetical interview question:

Given two string sequences of length N, how can you find the maximum length of matching substrings irrespective of order.

For example, given seq1 = "ABCDEFG", and seq2 = "DBCAPFG", the maximum length window is 4. (ABCD from seq1 and DBCA from seq2).

like image 381
Amit Avatar asked Jun 08 '13 12:06

Amit


People also ask

How do you get the longest substring without repeating characters?

Run a loop from i = 0 till N – 1 and consider a visited array. Run a nested loop from j = i + 1 to N – 1 and check whether the current character S[j] has already been visited. If true, break from the loop and consider the next window. Else, mark the current character as visited and maximize the length as j – i + 1.

What is the time complexity for finding the longest substring?

To check if a substring is present in a string of a length of n, the time complexity for such operation is found to be O (n). The time complexity for finding the longest substring that is common in string S1 and S2 is Ɵ (n1 + n2).


1 Answers

Here is an O(n) solution (assuming fixed alphabet size and O(1) dictionary access).

Use a single frequency table that counts up for characters in seq1, and down for characters in seq2. We will have a matching window if this histogram ever takes same value again (as this means that we must have identical numbers of intermediate characters).

So we use a dictionary to store previous values for the histogram.

Python implementation:

def find_window(seq1,seq2):
    """Yield length of matching windows"""
    D={} # Dictionary with key=histogram, value = first position where this happens
    H=[0]*26 # H[x] is count of x in seq1 - count of x in seq2
    D[tuple(H)]=-1
    for i,(a,b) in enumerate(zip(seq1,seq2)):
        a=ord(a)-ord('A')
        b=ord(b)-ord('A') 
        H[a]+=1
        H[b]-=1
        key=tuple(H)
        if key in D:
            yield i-D[key]
        if key not in D:
            D[key]=i

print max(find_window("ABCDEFG","DBCAPFG"))

If you had a larger alphabet you could use a similar technique only storing a hash value instead of the full histogram. For example you could simply pick a random code for each character and add on the code for each letter in seq, or subtract for letters in seq2.

You would need a second pass to check that proposed matches were correct in case of a hash collision.

like image 114
Peter de Rivaz Avatar answered Sep 17 '22 12:09

Peter de Rivaz