I have a lot of ranges in the form [(1, 1000), (5000, 5678), ... ]. I'm trying to figure out the fastest way to check whether a number is within any of the ranges. The ranges are made up of longs and are too large to simply keep a set of all the numbers.
The simplest solution is this:
ranges = [(1,5), (10,20), (40,50)] # The real code has a few dozen ranges
nums = range(1000000)
%timeit [n for n in nums if any([r[0] <= n <= r[1] for r in ranges])]
# 1 loops, best of 3: 5.31 s per loop
Banyan is roughly ten times faster:
import banyan
banyan_ranges = banyan.SortedSet(updator=banyan.OverlappingIntervalsUpdator)
for r in ranges:
    banyan_ranges.add(r)
%timeit [n for n in nums if len(banyan_ranges.overlap_point(n))>0]
# 1 loops, best of 3: 452 ms per loop
Although there are only a few dozen ranges, there are millions of checks against those ranges. What's the fastest way to do these checks?
(Note: This question is similar to Python: efficiently check if integer is within *many* ranges but does not have the same Django-related restrictions and is exclusively concerned with speed)
Things to try:

1. Preprocess your ranges into a single sorted list of boundary points, converting to half-open intervals so that each even-indexed point enters a range and each odd-indexed point leaves one (see below).
2. Use the bisect module to do the searching. (Don't implement your own bisection search by hand!) Note that with the preprocessing in 1, all you'll need to know is whether the result of the bisect call is even or odd.
3. If you're checking many values at once, numpy.searchsorted can do all the searches in a single call.

Some code and timings. First the setup (here using IPython 2.1 and Python 3.4):
In [1]: ranges = [(1, 5), (10, 20), (40, 50)]
In [2]: nums = list(range(1000000)) # force a list to remove range-object overhead
Timings for the original method on my machine (but with a generator expression instead of a list comprehension):
In [3]: %timeit [n for n in nums if any(r[0] <= n <= r[1] for r in ranges)]
1 loops, best of 3: 922 ms per loop
Now we rework the ranges as a list of boundary points; each boundary point at an even index is an entry point to one of the ranges, while each boundary point at an odd index is an exit point. Note the conversion to half-open intervals, and that I've put all the numbers into a single list.
In [4]: boundaries = [1, 6, 10, 21, 40, 51]
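In general the boundary list can be built from the ranges programmatically; a minimal sketch, assuming the input ranges are closed and non-overlapping as in the question:

boundaries = []
for lo, hi in sorted(ranges):
    boundaries.append(lo)      # even index: entering the range [lo, hi]
    boundaries.append(hi + 1)  # odd index: first value past the range

With ranges = [(1, 5), (10, 20), (40, 50)] this produces [1, 6, 10, 21, 40, 51], as above.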
With this it's easy to use bisect.bisect
to get the same results as before, but rather faster.
In [5]: from bisect import bisect
In [6]: %timeit [n for n in nums if bisect(boundaries, n) % 2]
1 loops, best of 3: 298 ms per loop
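If the test is needed in more than one place, the expression can be wrapped in a small helper (the name in_any_range is ours, purely illustrative):

from bisect import bisect

def in_any_range(n, boundaries):
    # An odd insertion point means n fell between an entry boundary
    # and the matching exit boundary, i.e. inside one of the ranges.
    return bisect(boundaries, n) % 2 == 1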
Finally, depending on the context, you may be able to make use of the searchsorted function from NumPy. This is like bisect.bisect, but operates on whole collections of values at once. For example:
In [7]: import numpy
In [8]: numpy.where(numpy.searchsorted(boundaries, nums, side="right") % 2)[0]
Out[8]:
array([ 1, 2, 3, 4, 5, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 40,
41, 42, 43, 44, 45, 46, 47, 48, 49, 50])
At first glance, the %timeit
results from this are rather disappointing.
In [9]: %timeit numpy.where(numpy.searchsorted(boundaries, nums, side="right") % 2)[0]
10 loops, best of 3: 159 ms per loop
However, it turns out that much of the performance cost is in converting the inputs to searchsorted
from Python lists to NumPy arrays. Let's preconvert both lists to arrays and try again:
In [10]: boundaries = numpy.array(boundaries)
In [11]: nums = numpy.array(nums)
In [12]: %timeit numpy.where(numpy.searchsorted(boundaries, nums, side="right") % 2)[0]
10 loops, best of 3: 24.6 ms per loop
Much faster than anything else so far. However, this is cheating a bit: we can certainly preprocess boundaries
to turn it into an array, but if the values you want to test aren't naturally produced in array form then the cost of conversion will need to be taken into account. On the other hand, it shows that the cost of the search itself can be reduced to a small enough value that it's no longer likely to be the dominant factor in the running time.
Here's another option along these lines. It uses NumPy again, but does a direct non-lazy linear search per value. (Please forgive the out-of-order IPython
prompts: I added this one later. :-)
In [29]: numpy.where(numpy.logical_xor.reduce(numpy.greater_equal.outer(boundaries, nums), axis=0))
Out[29]:
(array([ 2, 3, 4, 5, 6, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 41,
42, 43, 44, 45, 46, 47, 48, 49, 50, 51]),)
(Note that these indices are shifted by one relative to the searchsorted results: greater_equal selects the half-open intervals (lo, hi + 1] rather than [lo, hi + 1). Substituting numpy.greater.outer reproduces the earlier output.)
In [30]: %timeit numpy.where(numpy.logical_xor.reduce(numpy.greater_equal.outer(boundaries, nums), axis=0))
10 loops, best of 3: 16.7 ms per loop
For these particular test data, this is faster than searchsorted, but the time will grow linearly in the number of ranges, while for searchsorted it should grow according to the log of the number of ranges. Note that it also uses an amount of memory proportional to len(boundaries) * len(nums). This isn't necessarily a problem: if you find yourself bumping into memory constraints you can probably chunk the arrays into smaller sizes (say 10000 elements at a time) without losing too much performance.
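A minimal sketch of that chunking, assuming boundaries and nums are the NumPy arrays from above (the chunk size of 10000 is arbitrary):

import numpy

def inside_indices(boundaries, nums, chunk=10000):
    # Process nums in fixed-size blocks so the intermediate
    # (len(boundaries), chunk) boolean matrix stays small.
    hits = []
    for start in range(0, len(nums), chunk):
        block = nums[start:start + chunk]
        mask = numpy.logical_xor.reduce(
            numpy.greater_equal.outer(boundaries, block), axis=0)
        hits.append(start + numpy.where(mask)[0])
    return numpy.concatenate(hits) if hits else numpy.array([], dtype=int)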
Moving up the scale, if none of these fits the bill I'd next try Cython and NumPy, writing a search function (with inputs declared as arrays of ints) that does a simple linear search on the boundaries
array. I tried this, but failed to get results better than those based on bisect.bisect
. For reference, here's the Cython code I tried; you may be able to do better:
cimport cython
cimport numpy as np

@cython.boundscheck(False)
@cython.wraparound(False)
def search(np.ndarray[long, ndim=1] boundaries, long val):
    # Linear scan: the parity of the index of the first boundary
    # exceeding val tells us whether val lies inside a range.
    cdef long j, n = len(boundaries)
    for j in range(n):
        if boundaries[j] > val:
            return j & 1
    return 0
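The .pyx module needs to be compiled before the import in the timing below will work; one possibility (of several) is pyximport, assuming the code above is saved as my_cython_extension.pyx:

import numpy, pyximport

# Compile .pyx files on import; the NumPy include path is needed
# because the extension cimports numpy.
pyximport.install(setup_args={"include_dirs": [numpy.get_include()]})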
And the timings:
In [13]: from my_cython_extension import search
In [14]: %timeit [n for n in nums if search(boundaries, n)]
1 loops, best of 3: 793 ms per loop
An implementation of @ArminRigo's comment, which is pretty fast. The timing is from CPython, not PyPy:
exec_code = "def in_range(x):\n"
first_if = True
for r in ranges:
if first_if:
exec_code += " if "
first_if = False
else:
exec_code += " elif "
exec_code += "%d <= x <= %d: return True\n" % (r[0], r[1])
exec_code += " return False"
exec(exec_code)
%timeit [n for n in nums if in_range(n)]
# 10 loops, best of 3: 173 ms per loop
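For the example ranges above, the generated source is nothing more than a chain of comparisons, which CPython evaluates with no loop or data-structure overhead:

def in_range(x):
    if 1 <= x <= 5: return True
    elif 10 <= x <= 20: return True
    elif 40 <= x <= 50: return True
    return False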