Given a data set of a few millions of price ranges, we need to find the smallest range that contains a given price.
The following rules apply:
Example:
Given the following price ranges:
The result for searching price 7 should be 5-10
The result for searching price 100 should be 100-120 (smallest range containing 100).
What's the most efficient algorithm/data structure to implement this?
Searching the web, I only found solutions for searching ranges within ranges.
I've been looking at Morton count and Hilbert curve, but can't wrap my head around how to use them for this case.
Thanks.
Apart from k-d trees and balanced binary that we saw, there are other data structures as well that can be used for range searching. Quad trees as the name suggest have either zero children or four children. They can efficiently store points in any arbitrary dimension. The current two-dimensional space is divided into four boxes.
If you are wanting quick access to the largest element in a data structure, take a look at heaps for an efficient way to keep objects in some sort of order. Show activity on this post. You have to loop through the array, no other way to check all elements.
Searching Algorithms are designed to check for an element or retrieve an element from any data structure where it is stored.
Range Searching is one of the most important fields in Computational Geometry and have applications in database searching and geographical databases. We have explored Algorithms & Data Structures for Range Searching (Multi-dimensional data). This article at OpenGenus gives an Advanced knowledge in this domain.
Because you did not mention this ad hoc algorithm, I'll propose this as a simple answer to your question:
This is a python function, but it's fairly easy to understand and convert it in another language.
def min_range(ranges, value):
# ranges = [(1, 100), (50, 100), (100, 120), (5, 10), (5, 20)]
# value = 100
# INIT
import math
best_range = None
best_range_len = math.inf
# LOOP THROUGH ALL RANGES
for b, e in ranges:
# PICK THE SMALLEST
if b <= value <= e and e - b < best_range_len:
best_range = (b, e)
best_range_len = e - b
print(f'Minimal range containing {value} = {best_range}')
I believe there are more efficient and complicated solutions (if you can do some precomputation for example) but this is the first step you must take.
EDIT : Here is a better solution, probably in O(log(n)) but it's not trivial. It is a tree where each node is an interval, and has a child list of all strictly non overlapping intervals that are contained inside him. Preprocessing is done in O(n log(n)) time and queries are O(n) in worst case (when you can't find 2 ranges that don't overlap) and probably O(log(n)) in average.
2 classes: Tree that holds the tree and can query:
class tree:
def __init__(self, ranges):
# sort the ranges by lowest starting and then greatest ending
ranges = sorted(ranges, key=lambda i: (i[0], -i[1]))
# recursive building -> might want to optimize that in python
self.node = node( (-float('inf'), float('inf')) , ranges)
def __str__(self):
return str(self.node)
def query(self, value):
# bisect is for binary search
import bisect
curr_sol = self.node.inter
node_list = self.node.child_list
while True:
# which of the child ranges can include our value ?
i = bisect.bisect_left(node_list, (value, float('inf'))) - 1
# does it includes it ?
if i < 0 or i == len(node_list):
return curr_sol
if value > node_list[i].inter[1]:
return curr_sol
else:
# if it does then go deeper
curr_sol = node_list[i].inter
node_list = node_list[i].child_list
Node that holds the structure and information:
class node:
def __init__(self, inter, ranges):
# all elements in ranges will be descendant of this node !
import bisect
self.inter = inter
self.child_list = []
for i, r in enumerate(ranges):
if len(self.child_list) == 0:
# append a new child when list is empty
self.child_list.append(node(r, ranges[i + 1:bisect.bisect_left(ranges, (r[1], r[1] - 1))]))
else:
# the current range r is included in a previous range
# r is not a child of self but a descendant !
if r[0] < self.child_list[-1].inter[1]:
continue
# else -> this is a new child
self.child_list.append(node(r, ranges[i + 1:bisect.bisect_left(ranges, (r[1], r[1] - 1))]))
def __str__(self):
# fancy
return f'{self.inter} : [{", ".join([str(n) for n in self.child_list])}]'
def __lt__(self, other):
# this is '<' operator -> for bisect to compare our items
return self.inter < other
and to test that:
ranges = [(1, 100), (50, 100), (100, 120), (5, 10), (5, 20), (50, 51)]
t = tree(ranges)
print(t)
print(t.query(10))
print(t.query(5))
print(t.query(40))
print(t.query(50))
Preprocessing that generates disjoined intervals
(I call source segments as ranges and resulting segments as intervals)
For ever range border (both start and end) make tuple: (value, start/end fiels, range length, id), put them in array/list
Sort these tuples by the first field. In case of tie make longer range left for start and right for end.
Make a stack
Make StartValue variable.
Walk through the list:
if current tuple contains start:
if interval is opened: //we close it
if current value > StartValue: //interval is not empty
make interval with //note id remains in stack
(start=StartValue, end = current value, id = stack.peek)
add interval to result list
StartValue = current value //we open new interval
push id from current tuple onto stack
else: //end of range
if current value > StartValue: //interval is not empty
make interval with //note id is removed from stack
(start=StartValue, end = current value, id = stack.pop)
add interval to result list
if stack is not empty:
StartValue = current value //we open new interval
After that we have sorted list of disjointed intervals containing start/end value and id of the source range (note that many intervals might correspond to the same source range), so we can use binary search easily.
If we add source ranges one-by-one in nested order (nested after it parent), we can see that every new range might generate at most two new intervals, so overall number of intervals M <= 2*N
and overall complexity is O(Nlog N + Q * logN)
where Q is number of queries
Edit:
Added if stack is not empty
section
Result for your example 1-100, 50-100, 100-120, 5-10, 5-20 is
1-5(0), 5-10(3), 10-20(4), 20-50(0), 50-100(1), 100-120(2)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With