Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Finding a substring, with some additional conditions

Tags:

algorithm

I'm given a string which looks like this:

1011010100

And my task is to find the length of a substring which number of nulls is always <= number of ones. And this should always happen while 'scanning' substring from right to left and from left to right. So in this example, the answer would be:

10110101 => 8

I know that the complexity should be either O(n) or O(n log n), because length can reach up to 10^6

Any ideas?

like image 550
adam jackson Avatar asked Oct 20 '13 17:10

adam jackson


People also ask

How do you find all occurrences of a substring?

Find all occurrences of a substring using finditer() The finditer() method is available in the re module which is used to return the starting index of a particular substring using the start() method. If we want to return all occurrence indices, we have to use list comprehension and iterate the string using an iterator.

How do I find a particular substring in a string in Java?

To locate a substring in a string, use the indexOf() method. Let's say the following is our string. String str = "testdemo"; Find a substring 'demo' in a string and get the index.

How do you check in Python if a string contains a substring?

The easiest and most effective way to see if a string contains a substring is by using if ... in statements, which return True if the substring is detected. Alternatively, by using the find() function, it's possible to get the index that a substring starts at, or -1 if Python can't find the substring.

How do you check if a string is a substring of another in C?

The function strstr returns the first occurrence of a string in another string. This means that strstr can be used to detect whether a string contains another string. In other words, whether a string is a substring of another string.


2 Answers

The O(n) solution is quite simple actually, by building the "height array", representing the number of 1's relative to number of 0's. So a height of 2 means there are 2 more 1's than 0's. The we iterate over the height array once performing some maximality checking with some conditions.

Crucial Observation

Note that a subarray fulfilling the conditions must have its height minimum at the beginning, and maximum at the end (as relative to the subarray, not the whole array).

The height array for the sample in the question can be plotted like this, with marked answer:

       v
   /\/\/\
/\/      \
^

Proof:

Suppose the height is not minimum at the beginning, that means there is a point inside the subarray where the height is lower than the beginning. At this point, the number of 0 should be larger than the number of 1. Contradiction.

Suppose the height is not maximum at the end, that means there is a point in the subarray where the height is larger than the end, say at index j. Then at index j to the end there are more 0 than 1 (since the height decreases), and so when we "scan" the subarray from right to left we will find more 0 than 1 at index j. Contradiction.

Algorithm

Now the problem can be interpreted as finding the longest subarray which ends with the highest height in the subarray, while keeping the minimum to not exceed the height at the beginning. This is very similar to maximum subarray problem like mentioned by klrmlr ("contiguous subsequence of an array" is better said as "subarray"). And the idea is not keeping an O(n) state, but rather keeping the "maximum so far" and "maximum at this point".

Following that algorithm, below is the pseudocode, traversing the array once:

Procedure Balance_Left_Right

  1. Record the lowest and highest point so far
  2. If the height at this point is lower than the lowest point so far, then change the starting point to the index after this point
  3. If the height at this point is higher or equal to the highest point so far, then this is a valid subarray, record the length (and start and end indices, if you like)

However we will soon see a problem (as pointed by Adam Jackson through personal communication) for this test case: 1100101, visualized as follows:

 /\
/  \/\/

The correct answer is 3 (the last 101), but the above algorithm will get 2 (the first 11). This is because our answer is apparently hidden behind a "high mountain" (i.e., the lowest point in the answer is not lower than the mountain, and the highest point in the answer is not higher than the mountain).

And so we need to ensure that when we run the Procedure Balance_Left_Right (above), there is no "high mountain" hiding the answer. And so the solution is to traverse the array once from the right, try to partition the array into multiple sections where in each section, this property holds: "the number of 1's is always >= the number of 0's, as traversed from the right", and also for each section, it can't be extended to the left anymore.

Then, in each section, when traversed from the left, will have the maximum height at the end of the section, and this is the maximum. And it can be proven that with this property, the method balance_left_right will find the correct answer for this section. So, we just call our balance_left_right method on each section, and then take the maximum answer among those.

Now, you may ask, why it's sufficient to run Balance_Left_Right on each section? This is because the answer requires the property to hold from the left and from the right, and so it must lies inside one of the sections, since each of the section satisfies half of the property.

The algorithm is still O(n) because we only visit each element twice, once from the right, and once from the left.

The last test case will be partitioned as follows:

 /|\ |
/ | \|/\/
**    ***

where only the sections marked with asterisk (*) are taken.

So the new algorithm is as follows:

Procedure Max_Balance_Left_Right

  1. Partition the input where the number of 1 >= number of 0 from the right (Using Balance_Left from the right, or can call it Balance_right)
  2. Run Balance_Left_Right on each partition
  3. Take the maximum

Here's the code in Python:

def balance_left_right(arr):
    lower = 0
    upper = -2**32
    lower_idx = 0 # Optional
    upper_idx = -1 # Optional
    result = (0,0,0)
    height = 0
    length = 0
    for idx, num in enumerate(arr):
        length += 1
        height += 1 if num==1 else -1
        if height<lower:
            lower = height # Reset the lowest
            upper = height # Reset the highest
            lower_idx = idx+1 # Optional, record the starting point
            length = 0 # Reset the answer
        if height>=upper:
            upper = height
            upper_idx = idx # Optional, record the end point
            if length > result[0]: # Take maximum length
                result = (length, lower_idx, upper_idx)
    return result

def max_balance_left_right(arr):
    all_partitions = []
    start = 0
    end = len(arr)
    right_partitions = balance_left(reversed(arr[start:end]))
    for right_start, right_end in right_partitions:
        all_partitions.append((end-right_end, end-right_start))
    result = (0,0,0)
    for start, end in all_partitions:
        candidate = balance_left_right(arr[start:end])
        if result[0] < candidate[0]:
            result = (candidate[0], candidate[1]+start, candidate[2]+start)
    return result

def balance_left(arr):
    lower = 0
    start_idx = 0
    end_idx = -1
    height = 0
    result = []
    for idx, num in enumerate(arr):
        height += 1 if num==1 else -1
        if height < lower:
            if end_idx != -1:
                result.append((start_idx,end_idx))
            lower = height
            start_idx = idx+1
            end_idx = -1
        else:
            end_idx = idx+1
    if end_idx != -1:
        result.append((start_idx, end_idx))
    return result

test_cases = [
        [1,0,1,1,0,1,0,1,0,0],
        [0,0,1,0,1,0,0,1,0,1,0,0,1,1,0,1,0,1,0,0,1],
        [1,1,1,0,0,0,1,0,0,1,1,0,1,1,0,1,1,0],
        [1,1,0,0,1,0,1,0,1,1,0,0,1,0,0],
        [1,1,0,0,1,0,1],
        [1,1,1,1,1,0,0,0,1,0,1,0,1,1,0,0,1,0,0,1,0,1,1]
        ]
for test_case in test_cases:
    print 'Balance left right:'
    print test_case
    print balance_left_right(test_case)
    print 'Max balance left right:'
    print test_case
    print max_balance_left_right(test_case)
    print

which will print:

Balance left right:
[1, 0, 1, 1, 0, 1, 0, 1, 0, 0]
(8, 0, 7)
Max balance left right:
[1, 0, 1, 1, 0, 1, 0, 1, 0, 0]
(8, 0, 7)

Balance left right:
[0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1]
(6, 12, 17)
Max balance left right:
[0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1]
(6, 12, 17)

Balance left right:
[1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0]
(8, 9, 16)
Max balance left right:
[1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0]
(8, 9, 16)

Balance left right:
[1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0]
(10, 0, 9)
Max balance left right:
[1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0]
(10, 0, 9)

Balance left right:
[1, 1, 0, 0, 1, 0, 1]
(2, 0, 1)
Max balance left right:
[1, 1, 0, 0, 1, 0, 1]
(3, 4, 6)

Balance left right:
[1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1]
(5, 0, 4)
Max balance left right:
[1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1]
(6, 8, 13)

For your eyes enjoyment, the height array for the test cases:

First:
       v
   /\/\/\
/\/      \
^

Second:
\
 \/\/\           v
      \/\/\  /\/\/\
           \/      \/
            ^

Third:
                v
  /\            /\
 /  \        /\/
/    \/\  /\/
        \/
         ^

Fourth:
         v
 /\      /\
/  \/\/\/  \/\
^             \

Fifth:
 /\   v
/  \/\/
    ^

Sixth:
    /\       v
   /  \      /\
  /    \/\/\/  \/\    /
 /      ^         \/\/
/

Clarification Regarding the Question

As some of the readers are confused on what exactly OP wants, although it's already stated clearly in the question, let me explain the question by some examples.

First, the task from the question:

And my task is to find the length of a substring which number of nulls is always <= number of ones. And this should always happen while 'scanning' substring from right to left and from left to right

This refers to something like "Catalan Number Ballot Problem" or "Available Change Problem". In the Wiki you can check the "monotonic path" problem, where you can map "move right" as "1" and "move up" as "0".

The problem is to find a subarray of the original array, such that, when the subarray is traversed from left-to-right and right-to-left, this property holds:

The number of 0's seen so far should not exceed the number of 1's seen so far.

For example, the string 1010 holds the property from left-to-right, because if we scan the array from left-to-right, there will always be more 1's than 0's. But the property doesn't hold from right-to-left, since the first character encountered from the right is 0, and so at the beginning we have more 0's (there is one) than 1's (there is none).

For example given by OP, we see that the answer for the string 1011010100 is the first eight characters, namely: 10110101. Why?

Ok, so when we traverse the subarray from left to right, we see that there is always more 1's than 0's. Let's check the number of 1's and 0's as we traverse the array from left-to-right:

1: num(0) = 0, num(1) = 1
0: num(0) = 1, num(1) = 1
1: num(0) = 1, num(1) = 2
1: num(0) = 1, num(1) = 3
0: num(0) = 2, num(1) = 3
1: num(0) = 2, num(1) = 4
0: num(0) = 3, num(1) = 4
1: num(0) = 3, num(1) = 5

We can see that at any point of time the number of 0's is always less than or equal to the number of 1's. That's why the property holds from left-to-right. And the same check can be done from right-to-left.

So why isn't 1011010100 and answer?

Let's see when we traverse the string right-to-left:

0: num(0) = 1, num(1) = 0
0: num(0) = 2, num(1) = 0
1: num(0) = 2, num(1) = 1
...

I didn't put the full traversal because the property has already been violated since the first step, since we have num(0) > num(1). That's why the string 1011010100 doesn't satisfy the constraints of the problem.

You can see also that my "height array" is actually the difference between the number of 1's and the number of 0's, namely: num(1) - num(0). So in order to have the property, we must have the [relative] height positive. That, can be visualized by having the height not less than the initial height.

like image 103
justhalf Avatar answered Oct 19 '22 22:10

justhalf


Here goes my algorithm:

Start from right side:

 1. if you find 0 increment the value of count
 2. if you find 1 decrement the count

Store these values in an array i.e. v[]. e.g.

a[] = {1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1}

v[] = {0, 1, 0,-1, 0, 1, 0, 1, 2, 1, 2, 1, 0, -1}

Now the problem reduces to find indexes from V i.e. i, j such that v[i] < v[j] and i<j.

proof:

if you see here i=0 and j=11 is the possible answer and values are v[i]=0, v[j]=1. This means that till j we have one 0 extra in the string and as the v[i]=0 that means from i to j window size the extra 0 is cancelled by putting extra 1. Hence the answer.

Hope it helps. please let me know if you have doubt. Thanks.

like image 32
Trying Avatar answered Oct 19 '22 23:10

Trying