
What would be the most efficient way to find a[i] = i in a sorted array?

Given an array a[], what would be the most efficient way to determine whether at least one index i satisfies the condition a[i] == i?

All the elements in the array are sorted and distinct, but they aren't necessarily of an integer type (e.g. they might be floating point).

asked Nov 29 '12 by k53sc

1 Answer

Several people have made claims about the relevance of “sorted”, “distinct” and “aren't necessarily integers”. In fact, proper selection of an efficient algorithm for this problem hinges on exactly those characteristics. A more efficient algorithm would be possible if we knew that the values in the array were both distinct and integral, while a less efficient algorithm would be required if the values might be non-distinct, whether or not they were integral. And of course, if the array were not already sorted, you could sort it first (at average complexity O(n log n)) and then use the more efficient pre-sorted algorithm, but in the unsorted case it would be more efficient to simply leave the array unsorted and run through it directly, comparing each value to its index in linear time (O(n)).

Note that regardless of the algorithm chosen, best-case performance is O(1): at any point during execution we might come across an element where a[i] == i (for example, the very first element examined), at which point we return true. What actually matters for algorithm performance in this problem is how quickly we can exclude all elements and declare that there is no element satisfying a[i] == i.
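For reference, the naïve linear scan mentioned above (the best available option when the array is unsorted) is just this; a minimal sketch, and the function name is mine:

#include <cstddef> // size_t

// Naïve linear scan: works whether or not the array is sorted; O(n) worst case.
bool linear_scan(const double* a, size_t len)
{
    for (size_t i = 0; i < len; ++i)
        if (a[i] == i)   // found an element equal to its own index
            return true;
    return false;        // no element matches its index
}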

The problem does not state the sort order of a[], which is a pretty critical piece of missing information. If it’s ascending, the worst-case complexity will always be O(n); there’s nothing we can do to make the worst case better. But if the sort order is descending, even the worst-case complexity is O(log n): since the values in the array are distinct and descending, there is at most one index where a[i] could equal i, and basically all you have to do is a binary search for the crossover point (where the ascending index values cross over the descending element values, if there even is such a crossover) and check whether a[c] == c at that crossover index c. Since that’s pretty trivial, I’ll proceed assuming that the sort order is ascending.

Interestingly, if the elements were integers, there would be a similar “crossover-like” situation even in the ascending case (though in the ascending case there could be more than one a[i] == i match), so a binary search would also be applicable there and even the worst-case performance would be O(log n) (see Interview question - Search in sorted array X for index i such that X[i] = i). But we aren’t given that luxury in this version of the problem.
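For the descending case just mentioned, the binary search might look roughly like this (a sketch only; the function name is mine):

#include <cstddef> // size_t

// Descending, distinct values: a[i] - i is strictly decreasing, so a standard
// binary search finds the single possible crossover in O(log n).
bool find_match_descending(const double* a, size_t len)
{
    size_t lo = 0, hi = len;           // search the half-open range [lo, hi)
    while (lo < hi)
    {
        size_t mid = lo + (hi - lo) / 2;
        if (a[mid] == mid)
            return true;               // the crossover lands exactly on an index
        if (a[mid] > mid)
            lo = mid + 1;              // a[i] > i for all i <= mid; look right
        else
            hi = mid;                  // a[i] < i for all i >= mid; look left
    }
    return false;
}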

Here is how we might solve this problem:

Begin with the first element, a[0]. If its value is == 0, you’ve found an element which satisfies a[i] == i so return true. If its value is < 1, the next element (a[1]) could possibly contain the value 1, so you proceed to the next index. If, however, a[0] >= 1, you know (because the values are distinct) that the condition a[1] == 1 cannot possibly be true, so you can safely skip index 1. But you can even do better than that: For example, if a[0] == 12, you know (because the values are sorted in ascending order) that there cannot possibly be any elements that satisfy a[i] == i prior to element a[13]. Because the values in the array can be non-integral, we cannot make any further assumptions at this point, so the next element we can safely skip to directly is a[13] (e.g. a[1] through a[12] may all contain values between 12.000... and 13.000... such that a[13] could still equal exactly 13, so we have to check it).

Continuing that process yields an algorithm as follows:

// Algorithm 1
#include <cmath>   // std::floor
#include <cstddef> // size_t

bool algorithm1(double* a, size_t len)
{
    for (size_t i=0; i<len; ++i) // worst case is O(n)
    {
        if (a[i] == i)
            return true; // of course we could also return i here (as an int)...
        if (a[i] > i)    // a[i+1] through a[floor(a[i])] can't match their indices...
            i = static_cast<size_t>(std::floor(a[i])); // ...so the loop's ++i skips to floor(a[i])+1
    }
    return false; // ......in which case we’d want to return -1 here (an int)
}

This has pretty good performance if many of the values in a[] are greater than their index value, and has excellent performance if all values in a[] are greater than n (it returns false after only one iteration!), but it has dismal performance if all values are less than their index value (it will return false after n iterations). So we return to the drawing board... but all we need is a slight tweak. Consider that the algorithm could have been written to scan backwards from n down to 0 just as easily as it can scan forward from 0 to n. If we combine the logic of iterating from both ends toward the middle, we get an algorithm as follows:

// Algorithm 2
#include <cmath>   // std::floor, std::ceil
#include <cstddef> // size_t

bool algorithm2(double* a, size_t len)
{
    if (len == 0)
        return false;
    size_t i = 0, j = len - 1;
    while (i <= j) // worst case is still O(n); note i == j (the middle element) must be checked too
    {
        if (a[i]==i || a[j]==j)
            return true;
        if (a[i] > i)
            i = static_cast<size_t>(std::floor(a[i])); // skip forward past indices that can't match
        if (a[j] < j)
        {
            if (a[j] < 0.0)
                return false; // everything at or below index j is negative, so nothing can match
            j = static_cast<size_t>(std::ceil(a[j])); // skip backward past indices that can't match
        }
        ++i;
        if (j == 0)
            break; // avoid size_t underflow
        --j;
    }
    return false;
}

This has excellent performance in both of the extreme cases (all values are less than 0 or greater than n), and has pretty good performance with pretty much any other distribution of values. The worst case is if all of the values in the lower half of the array are less than their index and all of the values in the upper half are greater than their index, in which case the performance degrades to the worst-case of O(n). Best case (either extreme case) is O(1), while average case is probably O(log n) but I’m deferring to someone with a math major to determine that with certainty.
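As a quick sanity check, here is how algorithm 2 might be exercised (the test arrays are my own, chosen to hit the immediate-exit case, a matching case, and the no-match case):

#include <cstdio>

int main()
{
    double best[]  = { 5.0, 6.5, 7.2, 9.9 };   // every value exceeds every index: exits after one iteration
    double match[] = { -3.0, 0.5, 2.0, 8.7 };  // a[2] == 2
    double none[]  = { -3.0, -1.5, 0.5, 2.5 }; // no element equals its index

    std::printf("%d %d %d\n",
                algorithm2(best, 4),    // prints 0 (false)
                algorithm2(match, 4),   // prints 1 (true)
                algorithm2(none, 4));   // prints 0 (false)
    return 0;
}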

Several people have suggested a “divide and conquer” approach to the problem, without specifying how the problem could be divided and what one would do with the recursively divided sub-problems. Of course such an incomplete answer would probably not satisfy the interviewer. The naïve linear algorithm and the worst case of algorithm 2 above are both O(n), while algorithm 2 improves the average-case performance to (probably) O(log n) by skipping (not examining) elements whenever it can. A divide-and-conquer approach can only outperform algorithm 2 if, in the average case, it is somehow able to skip more elements than algorithm 2 can skip.

Let’s assume we divide the problem by recursively splitting the array into two (nearly) equal contiguous halves, and decide whether, with the resulting sub-problems, we are likely to be able to skip more elements than algorithm 2 could, especially in algorithm 2’s worst case. For the remainder of this discussion, assume an input that is worst-case for algorithm 2. After the first split, we can check each half’s top and bottom elements for the same extreme cases that give algorithm 2 its O(1) performance yet result in O(n) performance when the two halves are combined. This would be the case if all elements in the bottom half are less than 0 and all elements in the upper half are greater than n-1; any half that passes such a test can be excluded in O(1). The performance of any half that cannot be excluded remains to be determined by recursing further, dividing that half in half again, until we find a segment whose top or bottom element contains its index value.

That’s a reasonably nice performance improvement over algorithm 2, but it occurs in only certain special cases of algorithm 2’s worst case. All we’ve done with divide-and-conquer is decrease (slightly) the proportion of the problem space that evokes worst-case behavior. There are still worst-case scenarios for divide-and-conquer, and they exactly match most of the problem space that evokes worst-case behavior for algorithm 2. (A rough sketch of this divide-and-conquer scheme appears below.)
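Here is one way that divide-and-conquer scheme might be sketched (my own illustration of the idea; the function names are mine):

#include <cstddef> // size_t

// Exclude a whole segment whenever its value range cannot overlap its index
// range; otherwise split it and recurse. Worst case is still O(n).
static bool dc_helper(const double* a, size_t lo, size_t hi) // inclusive range [lo, hi]
{
    if (a[lo] == lo || a[hi] == hi)
        return true;
    if (a[hi] < lo || a[lo] > hi) // largest value < smallest index, or smallest value > largest index...
        return false;             // ...so no element in this segment can match
    if (hi - lo < 2)
        return false;             // no interior elements left to examine
    size_t mid = lo + (hi - lo) / 2;
    return dc_helper(a, lo, mid) || dc_helper(a, mid, hi);
}

bool divide_and_conquer(const double* a, size_t len)
{
    return len != 0 && dc_helper(a, 0, len - 1);
}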

So, given that the divide-and-conquer algorithm has fewer worst-case scenarios, doesn’t it make sense to go ahead and use a divide-and-conquer approach?

In a word, no. Well, maybe. If you know up front that about half of your data is less than 0 and half is greater than n, this special case would generally fare better with the divide-and-conquer approach. Or, if your system is multicore and your ‘n’ is large, it might be helpful to split the problem evenly between all of your cores, but once it’s split between them, I maintain that the sub-problems on each core are probably best solved with algorithm 2 above, avoiding further division of the problem and certainly avoiding recursion, as I argue below....
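Before getting to that argument, here is roughly what such a multicore split might look like (a sketch, not a tuned implementation; it reuses the algorithm-2 style scan, each slice must compare elements against their indices in the full array rather than indices local to the slice, and the 4096 cutoff is an arbitrary number of my own):

#include <cmath>   // std::floor, std::ceil
#include <cstddef> // size_t
#include <future>  // std::async, std::future
#include <thread>  // std::thread::hardware_concurrency
#include <vector>

// Algorithm-2 style scan over the half-open global index range [lo, hi).
static bool algorithm2_range(const double* a, size_t lo, size_t hi)
{
    if (lo >= hi)
        return false;
    size_t i = lo, j = hi - 1;
    while (i <= j)
    {
        if (a[i] == i || a[j] == j)
            return true;
        if (a[i] > i)
            i = static_cast<size_t>(std::floor(a[i]));
        if (a[j] < j)
        {
            if (a[j] < 0.0)
                return false; // everything at or below j is negative: no match possible
            j = static_cast<size_t>(std::ceil(a[j]));
        }
        ++i;
        if (j == 0)
            break;
        --j;
    }
    return false;
}

bool parallel_search(const double* a, size_t len)
{
    size_t cores = std::thread::hardware_concurrency();
    if (cores < 2 || len < 4096)          // small n isn't worth the thread overhead
        return algorithm2_range(a, 0, len);

    std::vector<std::future<bool>> parts;
    size_t chunk = len / cores;
    for (size_t c = 0; c < cores; ++c)
    {
        size_t lo = c * chunk;
        size_t hi = (c + 1 == cores) ? len : lo + chunk;
        parts.push_back(std::async(std::launch::async, algorithm2_range, a, lo, hi));
    }
    bool found = false;
    for (auto& f : parts)
        found = f.get() || found;         // no early cancellation in this sketch
    return found;
}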

At each recursion level of a recursive divide-and-conquer approach, the algorithm needs some way to remember the as-yet-unsolved 2nd half of the problem while it recurses into the 1st half. Often this is done by having the algorithm recursively call itself first for one half and then for the other, a design which maintains this information implicitly on the runtime stack. Another implementation might avoid recursive function calls by maintaining essentially this same information on an explicit stack. In terms of space growth, algorithm 2 is O(1), but any recursive implementation is unavoidably O(log n) due to having to maintain this information on some sort of stack. But aside from the space issue, a recursive implementation has extra runtime overhead of remembering the state of as-yet-unrecursed-into subproblem halves until such time as they can be recursed into. This runtime overhead is not free, and given the simplicity of algorithm 2’s implementation above, I posit that such overhead is proportionally significant. Therefore I suggest that algorithm 2 above will roundly spank any recursive implementation for the vast majority of cases.
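For what it’s worth, the explicit-stack variant might be sketched as follows (my own illustration, reusing the same segment-exclusion test as the recursive sketch above); the pending vector of as-yet-unexamined halves is exactly where the O(log n) space goes:

#include <cstddef> // size_t
#include <utility> // std::pair
#include <vector>

bool divide_and_conquer_iterative(const double* a, size_t len)
{
    if (len == 0)
        return false;
    std::vector<std::pair<size_t, size_t>> pending; // as-yet-unexamined inclusive [lo, hi] segments
    pending.emplace_back(0, len - 1);
    while (!pending.empty())
    {
        size_t lo = pending.back().first;
        size_t hi = pending.back().second;
        pending.pop_back();
        if (a[lo] == lo || a[hi] == hi)
            return true;
        if (a[hi] < lo || a[lo] > hi || hi - lo < 2)
            continue; // segment excluded or exhausted
        size_t mid = lo + (hi - lo) / 2;
        pending.emplace_back(mid, hi); // remember the second half...
        pending.emplace_back(lo, mid); // ...while we work on the first
    }
    return false;
}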

answered Sep 21 '22 by phonetagger