I could not find an answer to the following hypothetical interview question:
Given two string sequences of length N, how can you find the maximum length of a window where the substrings match irrespective of order (i.e. the two windows contain the same multiset of characters)?

For example, given seq1 = "ABCDEFG" and seq2 = "DBCAPFG", the maximum window length is 4 ("ABCD" from seq1 and "DBCA" from seq2).
Here is an O(n) solution (assuming a fixed alphabet size and O(1) dictionary access).

Use a single frequency table that counts up for characters in seq1 and down for characters in seq2. We have a matching window whenever this histogram takes the same value again, because that means the two intervening substrings must contain identical counts of each character. So we use a dictionary that maps each histogram value to the first position where it occurred.
Python implementation:

def find_window(seq1, seq2):
    """Yield the length of each matching window."""
    D = {}          # key = histogram (as a tuple), value = first position where it occurs
    H = [0] * 26    # H[x] = count of x in seq1 - count of x in seq2
    D[tuple(H)] = -1
    for i, (a, b) in enumerate(zip(seq1, seq2)):
        a = ord(a) - ord('A')
        b = ord(b) - ord('A')
        H[a] += 1
        H[b] -= 1
        key = tuple(H)
        if key in D:
            yield i - D[key]
        else:
            D[key] = i

print(max(find_window("ABCDEFG", "DBCAPFG")))  # prints 4
If you had a larger alphabet, you could use a similar technique but store a hash value instead of the full histogram. For example, you could simply pick a random code for each character, add the code for each letter in seq1, and subtract the code for each letter in seq2.

You would need a second pass to check that proposed matches were correct in case of a hash collision.
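The hashing variant could be sketched as follows. This is a sketch, not part of the original answer: the function name find_window_hashed and the choice of 64-bit random codes are assumptions, and a production version would add the verification pass mentioned above, since two different histograms can in principle collide on the same hash value.

```python
import random

def find_window_hashed(seq1, seq2):
    """Return the maximum matching-window length using a random-code hash
    in place of the full histogram (sketch; collisions are not re-verified)."""
    codes = {}

    def code(c):
        # Lazily assign a random 64-bit code to each distinct character.
        if c not in codes:
            codes[c] = random.getrandbits(64)
        return codes[c]

    D = {0: -1}   # running hash value -> first index where it occurred
    h = 0
    best = 0
    for i, (a, b) in enumerate(zip(seq1, seq2)):
        h += code(a)   # add the code for the character from seq1
        h -= code(b)   # subtract the code for the character from seq2
        if h in D:
            best = max(best, i - D[h])
        else:
            D[h] = i
    return best

print(find_window_hashed("ABCDEFG", "DBCAPFG"))  # prints 4 (up to a negligible collision probability)
```

With 64-bit random codes the chance of a spurious collision on short inputs is negligible, but the second verification pass is what makes the answer exact rather than probabilistic.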