
Efficient search of sorted numerical values

I have an int[] array that contains values with the following properties:

  • They are sorted
  • They are unique (no duplicates)
  • They are in a known range [0..MAX)
  • MAX is typically quite a lot larger than the length of the array (say 10-100x)
  • Sometimes the numbers are evenly distributed across the range, but at other times there are quite long sequences of consecutive numbers. I estimate it is about 50/50 between the two cases.

Given this list, I want to efficiently find the index of a specific value in the array (or if the value is not present, find the next higher value).

I've already implemented a straight binary search with interval bisection that works fairly well, but I have a suspicion that the nature/distribution of the data can be exploited to converge to a solution faster.
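
Roughly, the baseline looks like this lower-bound bisection (a sketch rather than the exact implementation; the method name is just a placeholder):

static int lowerBound(int[] values, int z) {
    // Returns the index of z, or of the next higher value,
    // or values.length if z is larger than every element.
    int lo = 0;
    int hi = values.length; // exclusive upper bound
    while (lo < hi) {
        int mid = lo + (hi - lo) / 2;
        if (values[mid] < z)
            lo = mid + 1;   // z (or its successor) lies to the right
        else
            hi = mid;       // values[mid] >= z, keep it as a candidate
    }
    return lo;
}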

I'm interested in optimising the average case search time, but it is important that the worst case is never worse than O(log n) as the arrays are sometimes very large.

Question: is it possible to do much better than a plain binary search in the average case?

EDIT (to clarify additional questions / comments)

  • The constant in O(log n) definitely matters. In fact, assuming that better algorithmic complexity than O(log n) isn't possible, the constant is probably the only thing that matters.
  • It's often a one-off search, so while preprocessing is possible it's probably not going to be worth it.
asked Jan 05 '14 by mikera


2 Answers

This is in the comments and should be an answer. It's a joint effort, so I'm making it a CW answer:

You may want to look at interpolation search. In the worst case it can be worse than O(log n), so if that's a hard requirement, this wouldn't apply. But if your interpolation is decent, then depending on the data distribution an interpolation search can beat a straight binary search.

To know, you'd have to implement the interpolation search with a reasonably smart interpolation algorithm, and then run several representative data sets through both to see whether the interpolation or the binary is better suited. I'd think it'd be one of the two, but I'm not au fait with truly cutting edge searching algorithms.
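
For concreteness, a plain (non-alternating) interpolation search might look like the sketch below; note that on skewed distributions its worst case degrades to O(n), which is exactly the caveat above.

static int interpolationSearch(int[] values, int z) {
    int lo = 0;
    int hi = values.length - 1;
    // The invariant values[lo] <= z <= values[hi] keeps the probe in range.
    while (lo <= hi && z >= values[lo] && z <= values[hi]) {
        if (values[lo] == values[hi])
            return values[lo] == z ? lo : -1;
        // Probe where z "should" be if the values were spread linearly.
        int mid = lo + (int) (((long) (z - values[lo]) * (hi - lo))
                / (values[hi] - values[lo]));
        if (values[mid] == z)
            return mid;
        else if (values[mid] < z)
            lo = mid + 1;
        else
            hi = mid - 1;
    }
    return -1; // not present
}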

answered Sep 27 '22 by T.J. Crowder


Let's call the array x here and the searched value z.

Since you expect the values to be evenly distributed, you can use interpolation search. This is similar to binary search, but splits the index range at start + ((z - x[start]) * (end - start)) / (x[end] - x[start]).
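
For example (array made up for illustration): with x = {2, 5, 8, 11, 14}, start = 0, end = 4 and z = 11, the split index is 0 + ((11 - 2) * (4 - 0)) / (14 - 2) = 36 / 12 = 3, and x[3] is indeed 11.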

To get a running time of O(log n), you have to combine interpolation search with binary search (alternating between an interpolation step and a bisection step):

public int search(int[] values, int z) {
    int start = 0;
    int end = values.length - 1;

    boolean interpolation = true;

    // The invariant values[start] <= z <= values[end] keeps the interpolation
    // step inside the current range; once it breaks, z cannot be in the array.
    while (start <= end && z >= values[start] && z <= values[end]) {
        int mid;
        if (interpolation && values[start] != values[end]) {
            // Interpolation step: estimate the position of z from the
            // value distribution (the long cast avoids int overflow).
            mid = start + (int) (((long) (z - values[start]) * (end - start))
                    / (values[end] - values[start]));
        } else {
            // Bisection step: halves the range, keeping the worst case O(log n).
            mid = start + (end - start) / 2;
        }
        int v = values[mid];
        if (v == z)
            return mid;
        else if (v > z)
            end = mid - 1;   // exclude mid to guarantee progress
        else
            start = mid + 1; // exclude mid to guarantee progress
        interpolation = !interpolation;
    }
    return -1;
}

Since every second iteration of the while loop performs a bisection step, the algorithm uses at most twice the number of iterations a plain binary search would (so it stays O(log n)). And since every other step is an interpolation step, the algorithm should shrink the interval quickly when the input has the desired properties.
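
As a quick sanity check, usage could look like this (class name and data are made up for illustration):

int[] values = {1, 2, 3, 4, 5, 40, 70, 90, 95, 99};
InterpolatedSearch s = new InterpolatedSearch(); // hypothetical class holding search()
System.out.println(s.search(values, 70)); // 6
System.out.println(s.search(values, 42)); // -1 (not present)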

answered Sep 27 '22 by fabian