I have a large pool of objects with starting number and ending number. For example:
(999, 2333, data)
(0, 128, data)
(235, 865, data)
...
Assuming that the intervals don't overlap with each other. And I am writing a function that takes a number and locate the object that (low, high) contains it. Say given 333, I want the 3rd objects on the list.
Is there any way I can do this as efficiently as possible, short of linear search? I was thinking about binary search, but having some difficulties of coping with the range check.
Think if it worth sorting the data.
If you only want to search a few times, then it doesn't - and you cannot avoid linear search. total complexity of your searches will be O(n*k)
, where n
is the number of elements and k
is the number of searches.
If you want to search a lot of times, then you should first sort, and then search using binary search. It will be O(nlogn)
for sorting and O(klogn)
for searching k times, so you get total of O((n+k)logn)
.
Thus, sorting and then searching should be done only if k>=logn
P.S. You can use another approach to sort and search, as proposed in other answers, in all ways, the conclusion remains: do it only if k>=logn
You can use the bisect module: http://docs.python.org/library/bisect.html
You need to sort your data once, then use bisect:
import bisect
data=None
tuples=[(0, 128, None), (235, 865, None), (999, 2333, None)]
tuples.sort()
print tuples
print bisect.bisect(tuples, (-1,)) # 0
print bisect.bisect(tuples, (1,)) # 1
print bisect.bisect(tuples, (333,)) # 2
print bisect.bisect(tuples, (3333,)) # 3
If search speed is of utmost importance, then you can create a look-up table (as already commented by S.Lott). This will take O( r ) memory (where r is the size of the range), O( r ) pre-processing time, and O(1) search time. Create an array of the size of the range, and populate each element with a pointer to the data or null.
lookup = {}
for low, high, data in source_ranges:
for i in range(low,high): # or maybe high+1 if the ranges are inclusive
lookup[i] = data
Now the lookup is trivial.
First of all, it is not at all clear that binary search is warranted here. It may well be that linear search is faster when the number of intervals is small.
If you're concerned about performance, the prudent thing to do is to profile the code, and perhaps benchmark both methods on your typical inputs.
Disclaimers aside, binary search could be implemented by sorting the intervals once, and then repeatedly using the bisect
module to do the search:
import bisect
intervals = [(999, 2333, 'int1'), (0, 128, 'int2'), (235, 865, 'int3')]
intervals.sort()
def find_int(intervals, val):
pos = bisect.bisect_left([interval[1] for interval in intervals], val)
if pos < len(intervals) and val >= intervals[pos][0]:
return intervals[pos]
else:
return None
print(find_int(intervals, 0))
print(find_int(intervals, 1))
print(find_int(intervals, 200))
print(find_int(intervals, 998))
print(find_int(intervals, 999))
print(find_int(intervals, 1000))
print(find_int(intervals, 2333))
print(find_int(intervals, 2334))
In the above, I assume that the intervals are non-overlapping, and that the interval includes both its start and end points.
Lastly, to improve performance, one might consider factoring [interval[1] for interval in intervals]
out of the function and doing it just once at the start.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With