Does there exist a way to look up a value's corresponding range quickly?
For example: I have a bunch of pre-specified ranges: [0,50), [50,100), [100,200), etc. I want to check which pre-specified range a decimal value falls under.
I know that if I'm just checking for particular discrete values, it is very simple to get it to O(1) by using a key/value data structure:
function lookupValue(value) {
  return obj[value];
}
However, if I want to check which range a value falls under, then I don't see a way to do it without iterating through the ranges, which makes it O(n):
function lookupRange(value) {
  if (value < a) {
    return 'a';
  } else if (value < b) {
    return 'b';
  } else if (value < c) {
    return 'c';
  }
}
If my input were an integer or some other discrete value then I could still use the data structure method, but if my input is a decimal, does there exist a way to do this more quickly?
One method that I tried was using the fact that the function x/abs(x) creates a step function, and then composing multiple of these functions together. For example:

x/abs(x) + (x-50)/abs(x-50) + (x-100)/abs(x-100) + (x-200)/abs(x-200)

This maps all of the decimal values to discrete outputs, and then you just apply the object key/value lookup. But the number of operations needed to compute this is still O(n).
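For concreteness, here is that mapping as a sketch (Math.sign stands in for x/abs(x), which is undefined exactly at a boundary, so exact boundary inputs are ignored here):

function stepKey(x) {
  // Each boundary already crossed contributes +1, each boundary
  // still ahead contributes -1, so the sum is a step function.
  // Still O(n): every boundary term must be evaluated.
  const boundaries = [0, 50, 100, 200];
  return boundaries.reduce((sum, b) => sum + Math.sign(x - b), 0);
}

// The discrete step outputs mapped back to range labels:
const rangeBySum = { '-2': '[0,50)', '0': '[50,100)', '2': '[100,200)' };

console.log(rangeBySum[stepKey(141.7)]); // '[100,200)'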
Since you mentioned that your ranges are contiguous, here's an O(1) algorithm for finding the range a specific value falls under. Its downside is that it can require a lot of memory when very large ranges are mixed with very small ranges.
Choose an appropriate constant s <= 1. The larger s is, the faster the algorithm, but the more memory it uses.
Create an array of arrays T. For each range i with limits lo, hi, append i to T[j] for all integers j with s*lo <= j < s*hi. Now preprocessing is done.
When looking up a value x, look in T[floor(s*x)] and T[ceil(s*x)]. Within one of these two arrays you will find the range x belongs in.
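As a sketch in JavaScript (the names buildTable and lookup are my own, and ranges are represented as [lo, hi) pairs):

// ranges: sorted, contiguous [lo, hi) pairs; s: the chosen constant.
// Assumes s is large enough that every range covers at least one
// integer bucket index j with s*lo <= j < s*hi.
function buildTable(ranges, s) {
  const T = [];
  ranges.forEach(([lo, hi], i) => {
    // Append range index i to every bucket j with s*lo <= j < s*hi.
    for (let j = Math.ceil(s * lo); j < s * hi; j++) {
      (T[j] = T[j] || []).push(i);
    }
  });
  return T;
}

function lookup(T, ranges, s, x) {
  // One of the buckets floor(s*x), ceil(s*x) holds x's range.
  for (const j of [Math.floor(s * x), Math.ceil(s * x)]) {
    for (const i of T[j] || []) {
      if (ranges[i][0] <= x && x < ranges[i][1]) return i;
    }
  }
  return -1; // x lies outside all ranges
}

const ranges = [[0, 50], [50, 100], [100, 200]];
const T = buildTable(ranges, 1 / 50); // [[0], [1], [2], [2]]
console.log(lookup(T, ranges, 1 / 50, 141)); // 2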
As an example, if I choose s = 1/50 I can make the following lookup table T for the ranges [0,50), [50,100), [100,200). In fact, since each inner array will have size 1, I can just directly construct a flat array instead of an array of arrays:

[0, 1, 2, 2]
Doing a lookup for x = 141 we look in T[floor(141/50)] = T[2] = 2 and find that indeed ranges[2] contains our range of interest.
For contiguous integer ranges we find that s = 1/gcd(lo1, lo2, lo3, ...) is the most memory-efficient choice of s that maintains the fastest lookup time, with only one range per entry in T.
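Using the sketch above, that choice of s can be computed like this (assuming integer lower bounds):

// gcd of the lower bounds gives the coarsest bucket width that
// still leaves at most one range per entry in T.
const gcd = (a, b) => (b === 0 ? a : gcd(b, a % b));
const s = 1 / ranges.map(([lo]) => lo).reduce(gcd);
// For [[0,50],[50,100],[100,200]]: gcd(0, 50, 100) = 50, so s = 1/50.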
In the general case, you can't do better than binary searching a sorted list of intervals (or some equivalent, like storing the intervals in a binary search tree, a skip list, or some other such data structure). That will make lookup O(log N) on average, where N is the number of ranges.
But if you know something about the distribution of the ranges and the probability that a query will fall in particular intervals, you might be able to get a lot closer to average O(1) lookup. (Although it must be said that O(log N) is already a good approximation to O(1) for all practical problem sizes.)
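For reference, a minimal sketch of that baseline, assuming contiguous intervals described by a sorted array of start points with the overall end appended as a sentinel:

// Returns the index i with starts[i] <= x < starts[i+1].
// Assumes starts[0] <= x < starts[starts.length - 1].
function findInterval(starts, x) {
  let lo = 0, hi = starts.length - 2, ans = -1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    if (starts[mid] <= x) { ans = mid; lo = mid + 1; }
    else hi = mid - 1;
  }
  return ans;
}

console.log(findInterval([0, 50, 100, 200], 141)); // 2, i.e. [100, 200)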
From the comments, I gather that you expect the distribution of interval sizes to be biased, so that a few intervals dominate the domain of the function. In that case, you do have some better options.
First, suppose you expect queries to be uniformly distributed. That means that most queries will fall in the large ranges, since the large intervals occupy most of the query domain. So if you can reduce the cost of querying in large intervals without adding much overhead to the cost of general queries, then you've got a win. A simple way to do that is to build an index vector which maps uniformly-spaced values in the query space to the index of the interval which contains that value. To handle a query, you first find the indices of the intervals containing the reference points which bracket the query value:
let q = floor((query-query_min) / index_width)
let lo = index[q]
let hi = index[q+1]
result = binary search between interval lo and interval hi
For queries in large intervals, it is likely that lo == hi, which means the binary search can be avoided. Since we're assuming that most queries fall into the large intervals, there is a high probability of an O(1) query. Even if lo is not hi, there will be a reduction in the cost of the binary search; for small values of hi - lo, a linear search will quite likely be faster.
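Here is one way that might look as runnable JavaScript (the helper names are mine; starts holds the sorted interval lower bounds plus a final upper bound, and queries are assumed to lie within that span):

// Build index[q] = index of the interval containing the q-th
// uniformly spaced reference point.
function buildIndex(starts, indexWidth, nPoints) {
  const index = [];
  let i = 0;
  for (let q = 0; q <= nPoints; q++) {
    const point = starts[0] + q * indexWidth;
    while (i + 1 < starts.length - 1 && starts[i + 1] <= point) i++;
    index[q] = i;
  }
  return index;
}

function queryIndexed(starts, index, indexWidth, x) {
  const q = Math.floor((x - starts[0]) / indexWidth);
  let lo = index[q], hi = index[q + 1];
  if (lo === hi) return lo; // the O(1) fast path
  while (lo < hi) {         // binary search narrowed to [lo, hi]
    const mid = (lo + hi + 1) >> 1;
    if (starts[mid] <= x) lo = mid; else hi = mid - 1;
  }
  return lo;
}

const starts = [0, 50, 100, 200];
const width = (200 - 0) / 3; // as many index points as intervals
const index = buildIndex(starts, width, 3);
console.log(queryIndexed(starts, index, width, 141)); // 2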
Some experimentation might be necessary to find an optimal spacing of index values, but the simple expedient of creating as many index values as there are intervals should be fine, and only doubles the storage cost.
However, that optimisation may not produce good results if the queries are not uniformly distributed. In some typical use cases, queries are very likely to fall in small intervals. (One example is looking up character attributes in a Unicode database. More than three-quarters of the codepoint search space falls into a handful of intervals, corresponding to unassigned and private-use-area characters. But that helps little, because it is very rare that a codepoint for such a character needs to be looked up.)
So if you expect that sort of pattern, you can use an alternative indexing approach, which attempts to speed up queries on small intervals. Once again, we construct an index mapping, but this time we use a hash table instead of a vector, and we map the index point to the interval of intervals corresponding to that index point and the next index point. When we construct the mapping, we simply omit datapoints which fall in the large intervals. That makes the index mapping quite sparse, which is why we need to use a hash table. (We need to map to a pair of interval indices because most hash tables can't tell us the successor key/value pair, and anyway it's possible that the next key is not in the hash table. But the pair of intervals doesn't occupy much space.)
With that modification, we can make the index spacing quite fine. For example, we might still define the same number of index points as there are intervals, but since the large intervals are excluded, the width of an index bucket will be slightly less than the average width of a small interval, which might be quite a lot smaller than the average width of an interval.
We also create a small auxiliary interval list, consisting only of large intervals.
Now, to handle a query, we proceed as follows:

1. Compute q, as above: the bucket number in the index table.
2. Look up q in the index hash table (which should be O(1) on average). If we find it, we can limit the binary search to the interval of intervals indicated. If all goes well, that will normally be a small set of intervals.
3. If q is not in the hash table, the query value must fall inside a large interval, so we search the small auxiliary list of large intervals instead.
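Here is a sketch of that variant (all names are mine; isLarge is an assumed predicate classifying intervals, and the bracketed set of small intervals is scanned linearly on the assumption that it is tiny):

// starts: sorted interval lower bounds plus a final upper bound.
function buildSparseIndex(starts, isLarge, bucketWidth) {
  const index = new Map(); // bucket -> [first, last] interval index
  const large = [];        // auxiliary list of large intervals
  for (let i = 0; i + 1 < starts.length; i++) {
    if (isLarge(i)) { large.push(i); continue; }
    // Register this small interval with every bucket it touches.
    const qLo = Math.floor((starts[i] - starts[0]) / bucketWidth);
    const qHi = Math.floor((starts[i + 1] - starts[0]) / bucketWidth);
    for (let q = qLo; q <= qHi; q++) {
      const pair = index.get(q);
      if (pair) {
        pair[0] = Math.min(pair[0], i);
        pair[1] = Math.max(pair[1], i);
      } else {
        index.set(q, [i, i]);
      }
    }
  }
  return { index, large };
}

function querySparse(starts, { index, large }, bucketWidth, x) {
  const pair = index.get(Math.floor((x - starts[0]) / bucketWidth));
  if (pair) {
    // Normally a tiny bracket of small intervals: scan linearly.
    for (let i = pair[0]; i <= pair[1]; i++) {
      if (starts[i] <= x && x < starts[i + 1]) return i;
    }
  }
  // No small interval matched, so x must be in a large interval;
  // search the short auxiliary list instead.
  for (const i of large) {
    if (starts[i] <= x && x < starts[i + 1]) return i;
  }
  return -1;
}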