Binary search for multiple distinct numbers in a large array in minimum number of comparisons

I have a large array of size n (say n = 1000000) with monotonically non-decreasing values. I also have a set of k key values (say k = {1, 23, 39, 55, ...}); assume the keys are sorted. I have to find the index of each of these keys in the large array using the minimum number of comparisons. How do I use binary search to search for multiple distinct values? Doing a separate search for each key takes a lot of comparisons. Can I reuse knowledge gained in one search when I search for another element in the same big array?
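For concreteness, the per-key baseline I want to improve on looks like this (a Python sketch; the array and keys are toy stand-ins):

```python
import bisect

haystack = [1, 2, 2, 3, 5, 8, 13, 21]   # the large sorted array (toy example)
keys = [1, 23, 39, 55]                  # sorted keys to locate

# One independent binary search per key: O(k * log n) comparisons,
# reusing nothing between searches.
indices = [bisect.bisect_left(haystack, key) for key in keys]
```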

asked Dec 20 '22 by CRM


2 Answers

  1. Sort the needles (the values you will search for).
  2. Create an array of the same length as the needles, with each element being a pair of indexes. Initialize each pair with {0, len(haystack)}. These pairs represent all the knowledge we have of the possible locations of the needles.
  3. Look at the middle value in the haystack. Now do binary search for that value in your needles. For all lesser needles, set the upper bound (in the array from step 2) to the current haystack index. For all greater needles, set the lower bound.
  4. While you were doing step 3, keep track of which needle now has the largest range remaining. Bisect it and use this as your new middle value to repeat step 3. If the largest range is singular, you're done: all needles have been found (or if not found, their prospective location in the haystack is now known).

There may be some slight complication here when you have duplicate values in the haystack, but I think once you have the rest sorted out this should not be too difficult.
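The steps above can be sketched in Python as follows (an illustration, not tuned code; the function name and the return convention, bisect_left-style insertion indices for the sorted needles, are my own choices):

```python
import bisect

def multi_searchsorted(haystack, needles):
    """Locate all needles in a sorted haystack, sharing bounds between them.

    Steps 1-2: sort the needles and give each one a candidate index
    range [lo, hi) covering the whole haystack.
    """
    if not needles:
        return []
    needles = sorted(needles)
    lo = [0] * len(needles)
    hi = [len(haystack)] * len(needles)

    while True:
        # Step 4: pick the needle whose remaining range is widest.
        widest = max(range(len(needles)), key=lambda i: hi[i] - lo[i])
        if hi[widest] == lo[widest]:
            break  # every range is singular: all needles located
        mid = (lo[widest] + hi[widest]) // 2
        value = haystack[mid]
        # Step 3: one comparison against the haystack narrows the range
        # of *every* needle, located via binary search in the needles.
        split = bisect.bisect_right(needles, value)
        for i in range(split):                # needles[i] <= value
            hi[i] = min(hi[i], mid)
        for i in range(split, len(needles)):  # needles[i] > value
            lo[i] = max(lo[i], mid + 1)

    return lo  # lo[i] is where sorted needles[i] belongs in haystack
```

For example, `multi_searchsorted([1, 2, 2, 3, 5, 8, 13], [2, 6, 13, 0])` gives `[0, 1, 5, 6]`, the same indices `bisect_left` would report for the sorted needles `[0, 2, 6, 13]`.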


I was curious if NumPy implemented anything like this. The Python name for what you're doing is numpy.searchsorted(), and once you get through the API layers it comes to this:

    /*
     * Updating only one of the indices based on the previous key
     * gives the search a big boost when keys are sorted, but slightly
     * slows down things for purely random ones.
     */
    if (@TYPE@_LT(last_key_val, key_val)) {
        max_idx = arr_len;
    }
    else {
        min_idx = 0;
        max_idx = (max_idx < arr_len) ? (max_idx + 1) : arr_len;
    }

So they do not do the full-blown optimization I described, but by tracking whether the current needle is greater than the last needle, they can avoid searching the haystack below where the last needle was found. This is a simple and elegant improvement over the naive implementation, and as seen from the comments, it must be kept simple and fast because the function does not require the needles to be sorted in the first place.
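The same trick is easy to show in pure Python (a simplified sketch: unlike NumPy's version, it assumes the keys really are sorted, so every search can resume from the previous result):

```python
import bisect

def searchsorted_monotonic(arr, sorted_keys):
    """Start each binary search at the previous key's result instead of 0,
    mirroring the min_idx update in the NumPy snippet above."""
    result = []
    min_idx = 0
    for key in sorted_keys:
        min_idx = bisect.bisect_left(arr, key, lo=min_idx)
        result.append(min_idx)
    return result
```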


By the way: my proposed solution aimed for something like theoretical optimality in big-O terms, but if you have a large number of needles, the fastest thing to do is probably to sort the needles and then iterate over the haystack and the needles in tandem: linear-search for the first needle, then resume from that position for the second, and so on. You can even skip every second item in the haystack by recognizing that if a needle is greater than A and less than C, it must belong at position B (assuming you don't care about the left/right insertion order for needles not present in the haystack). That brings the cost down to about len(haystack)/2 comparisons, and the whole scan is very cache-friendly (after sorting the needles, of course).
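A sketch of that tandem scan (without the skip-every-second-item refinement), assuming the needles are already sorted:

```python
def tandem_scan(haystack, sorted_needles):
    """Walk the haystack once, resuming where the previous needle landed.
    About len(haystack) + len(needles) comparisons, all sequential access."""
    positions = []
    j = 0
    for needle in sorted_needles:
        while j < len(haystack) and haystack[j] < needle:
            j += 1
        positions.append(j)
    return positions
```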

answered Dec 22 '22 by John Zwinck


One way to reuse knowledge from previous steps is like others suggested: once you have located a key, you can restrict the search ranges for the smaller and larger keys.

Assuming N = 2^n, K = 2^k and lucky outcomes: after finding the middle key (n comparisons), you are left with two subarrays of size N/2. Perform two searches for the "quartile" keys (n-1 comparisons each), reducing to subarrays of size N/4, and so on.

In total, n + 2(n-1) + 4(n-2) + ... + 2^(k-1)(n-k+1) comparisons. After a bit of math, this equals roughly K.n-K.k = K.(n-k).

This is a best-case scenario, and the savings are not so significant compared to independent searches (K.n comparisons). In any case, the worst case (all searches producing imbalanced partitions) is no worse than independent searches.
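The recursive scheme described above might look like this (a sketch; keys assumed sorted and distinct, results follow bisect_left semantics):

```python
import bisect

def locate_keys(haystack, keys, lo=0, hi=None):
    """Find the middle key first, then recurse on each half of the keys
    with the haystack search range narrowed on both sides."""
    if hi is None:
        hi = len(haystack)
    if not keys:
        return {}
    m = len(keys) // 2
    pos = bisect.bisect_left(haystack, keys[m], lo, hi)
    result = {keys[m]: pos}
    result.update(locate_keys(haystack, keys[:m], lo, pos))      # smaller keys
    result.update(locate_keys(haystack, keys[m + 1:], pos, hi))  # larger keys
    return result
```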

UPDATE: this is an instance of the Minimum Comparison Merging problem.

Finding the locations of the K keys in the array of N values is the same as merging the two sorted sequences.

From Knuth Vol. 3, Section 5.3.2, we know that at least ceiling(lg(C(N+K,K))) comparisons are required (because there are C(N+K,K) ways to intersperse the keys in the array). When K is much smaller than N, this is close to lg(N^K / K!), or K lg(N) - K lg(K) = K.(n-k).

This bound cannot be beaten by any comparison-based method, so the K.(n-k) comparison count above is essentially optimal: the cost necessarily grows in proportion to the number of keys.
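A quick numerical check of that bound, using Python's exact binomial (the values of N and K here are just illustrative):

```python
from math import comb, log2

N, K = 1_000_000, 32
exact_bound = log2(comb(N + K, K))       # ceiling of this = min comparisons
approximation = K * (log2(N) - log2(K))  # the K.(n-k) estimate from the text

# The approximation is a slight underestimate of the exact bound,
# and close to it when K is much smaller than N.
```

With these values the approximation comes to about 478 comparisons, a little below the exact bound but far below the K.n ≈ 638 of independent searches.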

answered Dec 22 '22 by Yves Daoust