I have a large array of size n (say n = 1000000) with monotonically non-decreasing values. I have a set of 'k' key values (say k = { 1,23,39,55,..}). Assume the key values are sorted. I have to find the index of each key value in the large array using the minimum number of comparisons. How do I use binary search to search for multiple unique values? Doing a separate search for each key value takes a lot of comparisons. Can I reuse some knowledge learned in one search when I search for another element in the same big array?
Initially, all we know about each needle's possible location is the pair {0, len(haystack)}. These pairs represent all the knowledge we have of the possible locations of the needles. There may be some slight complication here when you have duplicate values in the haystack, but I think once you have the rest sorted out this should not be too difficult.
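The idea above can be sketched recursively: binary-search for the middle needle first, then the needles below it can only fall in the left part of the interval and the needles above it only in the right part. A minimal sketch (the function name and use of `bisect` are mine, not from the thread):

```python
import bisect

def multi_search(haystack, needles, lo=0, hi=None):
    """Locate every needle (sorted) in the sorted haystack,
    narrowing the {lo, hi} interval of possible positions as we go.
    Returns the left insertion point of each needle."""
    if hi is None:
        hi = len(haystack)
    if not needles:
        return []
    mid = len(needles) // 2
    # One binary search for the middle needle, restricted to [lo, hi).
    pos = bisect.bisect_left(haystack, needles[mid], lo, hi)
    # Smaller needles can only land in [lo, pos], larger ones in [pos, hi):
    # both recursive calls search a shrunken interval.
    return (multi_search(haystack, needles[:mid], lo, pos)
            + [pos]
            + multi_search(haystack, needles[mid + 1:], pos, hi))
```

For example, `multi_search([1, 3, 5, 7, 9], [3, 8])` performs the search for 8 over the whole array but the search for 3 only over the prefix ending at 8's position.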
I was curious if NumPy implemented anything like this. The Python name for what you're doing is numpy.searchsorted()
, and once you get through the API layers it comes to this:
/*
* Updating only one of the indices based on the previous key
* gives the search a big boost when keys are sorted, but slightly
* slows down things for purely random ones.
*/
if (@TYPE@_LT(last_key_val, key_val)) {
max_idx = arr_len;
}
else {
min_idx = 0;
max_idx = (max_idx < arr_len) ? (max_idx + 1) : arr_len;
}
So they do not do the full-blown optimization I described, but by tracking whether the current needle is greater than the last one, they can avoid searching the haystack below where the last needle was found. This is a simple and elegant improvement over the naive implementation, and as the comment notes, it must stay simple and fast because the function does not require the needles to be sorted in the first place.
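From Python, that optimization kicks in automatically whenever you pass the needles as one sorted array. A minimal usage sketch (the values are chosen so the result matches the indices in the question):

```python
import numpy as np

haystack = np.arange(0, 2_000_000, 2)    # sorted array of one million values
needles = np.array([2, 46, 78, 110])     # sorted keys, searched in one call
idx = np.searchsorted(haystack, needles) # left insertion point of each key
```

Since the keys are sorted, each internal binary search starts from the previous hit instead of from index 0.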
By the way: my proposed solution aimed for something like theoretical optimality in big-O terms, but if you have a large number of needles, the fastest thing to do is probably to sort the needles and then iterate over the haystack and the needles in tandem: linear-search for the first needle, then resume from there to look for the second, and so on. You can even skip every second item in the haystack: if a needle is greater than element A and less than element C, it must belong at the position of the element B between them (assuming you don't care about the left/right insertion order for needles not in the haystack). That brings it down to about len(haystack)/2 comparisons, and the whole thing is very cache-friendly (after sorting the needles, of course).
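The tandem scan (without the skip-every-second-item trick) can be sketched in a few lines; the function name is mine:

```python
def merge_search(haystack, needles):
    """Single linear pass over the sorted haystack that locates all
    sorted needles, resuming each scan where the previous one stopped.
    Returns the left insertion point of each needle."""
    out = []
    i = 0
    for needle in needles:
        # Resume from the previous needle's position; never rescan.
        while i < len(haystack) and haystack[i] < needle:
            i += 1
        out.append(i)
    return out
```

This is O(len(haystack) + len(needles)) total, which beats repeated binary searches once the needle count approaches the haystack size.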
One way to reuse knowledge from previous steps is like others suggested: once you have located a key, you can restrict the search ranges for the smaller and larger keys.
Assuming N=2^n, K=2^k and lucky outcomes: after finding the middle key, (n comparisons), you have two subarrays of size N/2. Perform 2 searches for the "quartile" keys (n-1 comparisons each), reducing to N/4 subarrays...
In total, n + 2(n-1) + 4(n-2) + ... + 2^(k-1)(n-k+1) comparisons. After a bit of math, this equals roughly K.n-K.k = K.(n-k).
This is a best-case scenario, and the savings are not so significant compared to the K.n comparisons of independent searches. In any case, the worst case (all searches producing maximally imbalanced partitions) is no worse than independent searches.
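The best-case sum above is easy to check numerically; a small sketch (the function name and the sample n, k are mine):

```python
def best_case_comparisons(n, k):
    """Evaluates n + 2(n-1) + 4(n-2) + ... + 2^(k-1)(n-k+1),
    the lucky-outcome comparison count from the analysis above."""
    return sum(2**i * (n - i) for i in range(k))

n, k = 20, 10            # i.e. N = 2^20 values, K = 2^10 keys
K = 2**k
print(best_case_comparisons(n, k), K * (n - k))
```

For these sizes the exact sum (12266) is indeed of the same order as the K.(n-k) approximation (10240).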
UPDATE: this is an instance of the Minimum Comparison Merging problem
Finding the locations of the K keys in the array of N values is the same as merging the two sorted sequences.
From Knuth Vol. 3, Section 5.3.2, we know that at least ceiling(lg(C(N+K,K))) comparisons are required (because there are C(N+K,K) ways to intersperse the keys in the array). When K is much smaller than N, this is close to lg(N^K/K!), or K lg(N) - K lg(K) = K.(n-k).
This bound cannot be beaten by any comparison-based method, so any such algorithm will take time essentially proportional to the number of keys.
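The lower bound is cheap to evaluate exactly, since Python has arbitrary-precision integers; a quick sketch (function name and sample sizes are mine):

```python
from math import ceil, comb, log2

def merge_lower_bound(N, K):
    """Information-theoretic lower bound ceil(lg(C(N+K, K))) on the
    number of comparisons needed to merge K keys into N sorted values."""
    return ceil(log2(comb(N + K, K)))

N, K = 2**20, 2**10
print(merge_lower_bound(N, K))   # slightly above the K.(n-k) = 10240 estimate
```

For N = 2^20 and K = 2^10 the exact bound comes out a bit above the K.(n-k) approximation, as expected from dropping the lower-order terms of lg(K!).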