Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Given two sequences, find the maximal overlap between ending of one and beginning of the other

I need to find an efficient (pseudo)code to solve the following problem:

Given two sequences of (not necessarily distinct) integers (a[1], a[2], ..., a[n]) and (b[1], b[2], ..., b[n]), find the maximum d such that a[n-d+1] == b[1], a[n-d+2] == b[2], ..., and a[n] == b[d].

This is not homework, I actually came up with this when trying to contract two tensors along as many dimensions as possible. I suspect an efficient algorithm exists (maybe O(n)?), but I cannot come up with something that is not O(n^2). The O(n^2) approach would be the obvious loop on d and then an inner loop on the items to check the required condition until hitting the maximum d. But I suspect something better than this is possible.

like image 589
becko Avatar asked Feb 26 '20 19:02

becko


People also ask

How do you calculate overlap range?

To find the actual overlap range, you take the maximum of the two low ends, and the minimum of the two high ends: int e = Math. max(a,b); int f = Math. min(c,d); // overlapping range is [e,f], and overlap exists if e <= f.

How to check if two interval overlap?

1) Sort all intervals in increasing order of start time. This step takes O(nLogn) time. 2) In the sorted array, if start time of an interval is less than end of previous interval, then there is an overlap.

How do you find overlapping time intervals in Python?

The basic idea is 1) first take input_start to test_start (if both of them are not equal and input_start is min) 2) always take test_start and test_end 3) take test_end to input_end if test_end is less than input end (and end_input and end_test are not equal).


3 Answers

You can utilize the z algorithm, a linear time (O(n)) algorithm that:

Given a string S of length n, the Z Algorithm produces an array Z where Z[i] is the length of the longest substring starting from S[i] which is also a prefix of S

You need to concatenate your arrays (b+a) and run the algorithm on the resulting constructed array till the first i such that Z[i]+i == m+n.

For example, for a = [1, 2, 3, 6, 2, 3] & b = [2, 3, 6, 2, 1, 0], the concatenation would be [2, 3, 6, 2, 1, 0, 1, 2, 3, 6, 2, 3] which would yield Z[10] = 2 fulfilling Z[i] + i = 12 = m + n.

like image 192
Amit Avatar answered Oct 23 '22 08:10

Amit


For O(n) time/space complexity, the trick is to evaluate hashes for each subsequence. Consider the array b:

[b1 b2 b3 ... bn]

Using Horner's method, you can evaluate all the possible hashes for each subsequence. Pick a base value B (bigger than any value in both of your arrays):

from b1 to b1 = b1 * B^1
from b1 to b2 = b1 * B^1 + b2 * B^2
from b1 to b3 = b1 * B^1 + b2 * B^2 + b3 * B^3
...
from b1 to bn = b1 * B^1 + b2 * B^2 + b3 * B^3 + ... + bn * B^n

Note that you can evaluate each sequence in O(1) time, using the result of the previous sequence, hence all the job costs O(n).

Now you have an array Hb = [h(b1), h(b2), ... , h(bn)], where Hb[i] is the hash from b1 until bi.

Do the same thing for the array a, but with a little trick:

from an to an   =  (an   * B^1)
from an-1 to an =  (an-1 * B^1) + (an * B^2)
from an-2 to an =  (an-2 * B^1) + (an-1 * B^2) + (an * B^3)
...
from a1 to an   =  (a1   * B^1) + (a2 * B^2)   + (a3 * B^3) + ... + (an * B^n)

You must note that, when you step from one sequence to another, you multiply the whole previous sequence by B and add the new value multiplied by B. For example:

from an to an =    (an   * B^1)

for the next sequence, multiply the previous by B: (an * B^1) * B = (an * B^2)
now sum with the new value multiplied by B: (an-1 * B^1) + (an * B^2) 
hence:

from an-1 to an =  (an-1 * B^1) + (an * B^2)

Now you have an array Ha = [h(an), h(an-1), ... , h(a1)], where Ha[i] is the hash from ai until an.

Now, you can compare Ha[d] == Hb[d] for all d values from n to 1, if they match, you have your answer.


ATTENTION: this is a hash method, the values can be large and you may have to use a fast exponentiation method and modular arithmetics, which may (hardly) give you collisions, making this method not totally safe. A good practice is to pick a base B as a really big prime number (at least bigger than the biggest value in your arrays). You should also be careful as the limits of the numbers may overflow at each step, so you'll have to use (modulo K) in each operation (where K can be a prime bigger than B).

This means that two different sequences might have the same hash, but two equal sequences will always have the same hash.

like image 41
Daniel Avatar answered Oct 23 '22 09:10

Daniel


This can indeed be done in linear time, O(n), and O(n) extra space. I will assume the input arrays are character strings, but this is not essential.

A naive method would -- after matching k characters that are equal -- find a character that does not match, and go back k-1 units in a, reset the index in b, and then start the matching process from there. This clearly represents a O(n²) worst case.

To avoid this backtracking process, we can observe that going back is not useful if we have not encountered the b[0] character while scanning the last k-1 characters. If we did find that character, then backtracking to that position would only be useful, if in that k sized substring we had a periodic repetition.

For instance, if we look at substring "abcabc" somewhere in a, and b is "abcabd", and we find that the final character of b does not match, we must consider that a successful match might start at the second "a" in the substring, and we should move our current index in b back accordingly before continuing the comparison.

The idea is then to do some preprocessing based on string b to log back-references in b that are useful to check when there is a mismatch. So for instance, if b is "acaacaacd", we could identify these 0-based backreferences (put below each character):

index: 0 1 2 3 4 5 6 7 8
b:     a c a a c a a c d
ref:   0 0 0 1 0 0 1 0 5

For example, if we have a equal to "acaacaaca" the first mismatch happens on the final character. The above information then tells the algorithm to go back in b to index 5, since "acaac" is common. And then with only changing the current index in b we can continue the matching at the current index of a. In this example the match of the final character then succeeds.

With this we can optimise the search and make sure that the index in a can always progress forwards.

Here is an implementation of that idea in JavaScript, using the most basic syntax of that language only:

function overlapCount(a, b) {
    // Deal with cases where the strings differ in length
    let startA = 0;
    if (a.length > b.length) startA = a.length - b.length;
    let endB = b.length;
    if (a.length < b.length) endB = a.length;
    // Create a back-reference for each index
    //   that should be followed in case of a mismatch.
    //   We only need B to make these references:
    let map = Array(endB);
    let k = 0; // Index that lags behind j
    map[0] = 0;
    for (let j = 1; j < endB; j++) {
        if (b[j] == b[k]) {
            map[j] = map[k]; // skip over the same character (optional optimisation)
        } else {
            map[j] = k;
        }
        while (k > 0 && b[j] != b[k]) k = map[k]; 
        if (b[j] == b[k]) k++;
    }
    // Phase 2: use these references while iterating over A
    k = 0;
    for (let i = startA; i < a.length; i++) {
        while (k > 0 && a[i] != b[k]) k = map[k];
        if (a[i] == b[k]) k++;
    }
    return k;
}

console.log(overlapCount("ababaaaabaabab", "abaababaaz")); // 7

Although there are nested while loops, these do not have more iterations in total than n. This is because the value of k strictly decreases in the while body, and cannot become negative. This can only happen when k++ was executed that many times to give enough room for such decreases. So all in all, there cannot be more executions of the while body than there are k++ executions, and the latter is clearly O(n).

To complete, here you can find the same code as above, but in an interactive snippet: you can input your own strings and see the result interactively:

function overlapCount(a, b) {
    // Deal with cases where the strings differ in length
    let startA = 0;
    if (a.length > b.length) startA = a.length - b.length;
    let endB = b.length;
    if (a.length < b.length) endB = a.length;
    // Create a back-reference for each index
    //   that should be followed in case of a mismatch.
    //   We only need B to make these references:
    let map = Array(endB);
    let k = 0; // Index that lags behind j
    map[0] = 0;
    for (let j = 1; j < endB; j++) {
        if (b[j] == b[k]) {
            map[j] = map[k]; // skip over the same character (optional optimisation)
        } else {
            map[j] = k;
        }
        while (k > 0 && b[j] != b[k]) k = map[k]; 
        if (b[j] == b[k]) k++;
    }
    // Phase 2: use these references while iterating over A
    k = 0;
    for (let i = startA; i < a.length; i++) {
        while (k > 0 && a[i] != b[k]) k = map[k];
        if (a[i] == b[k]) k++;
    }
    return k;
}

// I/O handling

let [inputA, inputB] = document.querySelectorAll("input");
let output = document.querySelector("pre");

function refresh() {
    let a = inputA.value;
    let b = inputB.value;
    let count = overlapCount(a, b);
    let padding = a.length - count;
    // Apply some HTML formatting to highlight the overlap:
    if (count) {
        a = a.slice(0, -count) + "<b>" + a.slice(-count) + "</b>";
        b = "<b>" + b.slice(0, count) + "</b>" + b.slice(count);
    }
    output.innerHTML = count + " overlapping characters:\n" + 
                         a + "\n" + 
                         " ".repeat(padding) + b;
}

document.addEventListener("input", refresh);
refresh();
body { font-family: monospace }
b { background:yellow }
input { width: 90% }
a: <input value="acacaacaa"><br>
b: <input value="acaacaacd"><br>
<pre></pre>
like image 34
trincot Avatar answered Oct 23 '22 08:10

trincot