Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Is there a way to avoid the linear search on this?

I have a large pool of objects with starting number and ending number. For example:

(999, 2333, data) 
(0, 128, data) 
(235, 865, data)
...

Assuming that the intervals don't overlap with each other. And I am writing a function that takes a number and locate the object that (low, high) contains it. Say given 333, I want the 3rd objects on the list.

Is there any way I can do this as efficiently as possible, short of linear search? I was thinking about binary search, but having some difficulties of coping with the range check.

like image 634
Oliver Avatar asked Jan 18 '12 15:01

Oliver


4 Answers

Think if it worth sorting the data.
If you only want to search a few times, then it doesn't - and you cannot avoid linear search. total complexity of your searches will be O(n*k), where n is the number of elements and k is the number of searches.

If you want to search a lot of times, then you should first sort, and then search using binary search. It will be O(nlogn) for sorting and O(klogn) for searching k times, so you get total of O((n+k)logn).

Thus, sorting and then searching should be done only if k>=logn

P.S. You can use another approach to sort and search, as proposed in other answers, in all ways, the conclusion remains: do it only if k>=logn

like image 137
amit Avatar answered Nov 20 '22 11:11

amit


You can use the bisect module: http://docs.python.org/library/bisect.html

You need to sort your data once, then use bisect:

import bisect
data=None
tuples=[(0, 128, None), (235, 865, None), (999, 2333, None)]
tuples.sort()
print tuples
print bisect.bisect(tuples, (-1,))   # 0
print bisect.bisect(tuples, (1,))    # 1
print bisect.bisect(tuples, (333,))  # 2
print bisect.bisect(tuples, (3333,)) # 3
like image 26
guettli Avatar answered Nov 20 '22 12:11

guettli


If search speed is of utmost importance, then you can create a look-up table (as already commented by S.Lott). This will take O( r ) memory (where r is the size of the range), O( r ) pre-processing time, and O(1) search time. Create an array of the size of the range, and populate each element with a pointer to the data or null.

lookup = {}
for low, high, data in source_ranges:
    for i in range(low,high): # or maybe high+1 if the ranges are inclusive
        lookup[i] = data

Now the lookup is trivial.

like image 2
cyborg Avatar answered Nov 20 '22 12:11

cyborg


First of all, it is not at all clear that binary search is warranted here. It may well be that linear search is faster when the number of intervals is small.

If you're concerned about performance, the prudent thing to do is to profile the code, and perhaps benchmark both methods on your typical inputs.

Disclaimers aside, binary search could be implemented by sorting the intervals once, and then repeatedly using the bisect module to do the search:

import bisect

intervals = [(999, 2333, 'int1'), (0, 128, 'int2'), (235, 865, 'int3')]
intervals.sort()

def find_int(intervals, val):
   pos = bisect.bisect_left([interval[1] for interval in intervals], val)
   if pos < len(intervals) and val >= intervals[pos][0]:
      return intervals[pos]
   else:
      return None

print(find_int(intervals, 0))
print(find_int(intervals, 1))
print(find_int(intervals, 200))
print(find_int(intervals, 998))
print(find_int(intervals, 999))
print(find_int(intervals, 1000))
print(find_int(intervals, 2333))
print(find_int(intervals, 2334))

In the above, I assume that the intervals are non-overlapping, and that the interval includes both its start and end points.

Lastly, to improve performance, one might consider factoring [interval[1] for interval in intervals] out of the function and doing it just once at the start.

like image 1
NPE Avatar answered Nov 20 '22 12:11

NPE