 

Data structure / algorithm for query: filter by A, sort by B, return N results

Imagine that you have a large set of m objects with properties A and B. What data structure can you use as an index (or which algorithm) to improve the performance of the following query?

find all objects where A between X and Y, order by B, return first N results;

That is, filter by a range on A and sort by B, but only return the first few results (say, 1000 at most). Insertions are very rare, so heavy preprocessing is acceptable. I'm not happy with the following options:

  1. With records (or index) sorted by B: Scan the records/index in B order, return the first N where A matches X-Y. In the worst cases (few objects match the range X-Y, or the matches are at the end of the records/index) this becomes O(m), which for large data sets of size m is not good enough.

  2. With records (or index) sorted by A: Do a binary search until the first object is found which matches the range X-Y. Scan and create an array of references to all k objects which match the range. Sort the array by B, return the first N. That's O(log m + k + k log k). If k is small then that's really O(log m), but if k is large then the cost of the sort becomes even worse than the cost of the linear scan over all m objects. (A code sketch of options 1 and 2 follows this list.)

  3. Adaptive 2/1: do a binary search for the first match of the range X-Y (using an index over A); do a binary search for the last match of the range. If the range is small, continue with algorithm 2; otherwise revert to algorithm 1. The problem here is the case where we revert to algorithm 1. Although we checked that "many" objects pass the filter, which is the good case for algorithm 1, the cutoff only changes the constant factor (asymptotically the O(m) scan will always win over the O(k log k) sort). So we still have an O(m) algorithm for some queries.
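To make the baselines concrete, here is a minimal Python sketch of options 1 and 2, assuming objects are (A, B) pairs kept in two pre-sorted lists; the function and variable names are mine, not part of the question:

    import bisect

    def query_option1(items_by_b, x, y, n):
        # Option 1: scan in B order, keep the first n items whose A is in [x, y].
        # items_by_b: list of (a, b) pairs pre-sorted by b.
        # Worst case O(m), when few items match or the matches sort late by B.
        out = []
        for a, b in items_by_b:
            if x <= a <= y:
                out.append((a, b))
                if len(out) == n:
                    break
        return out

    def query_option2(items_by_a, a_keys, x, y, n):
        # Option 2: binary search the A-sorted index, then sort the k matches by B.
        # items_by_a: list of (a, b) pairs pre-sorted by a; a_keys: just the a values.
        # Cost: O(log m) search + O(k) copy + O(k log k) sort.
        lo = bisect.bisect_left(a_keys, x)
        hi = bisect.bisect_right(a_keys, y)
        matches = items_by_a[lo:hi]              # the k matching objects
        matches.sort(key=lambda ab: ab[1])       # the k log k sort
        return matches[:n]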

Is there an algorithm / data structure which allows answering this query in sublinear time?

If not, what could be good compromises to achieve the necessary performance? For instance, if I don't guarantee returning the objects with the best-ranking B values (recall < 1.0), then I can scan only a fraction of the B index. But could I do that while somehow bounding the quality of the results?

asked Oct 26 '11 by Luís Marques


3 Answers

The question you are asking is essentially a more general version of:

Q. You have a sorted list of words with a weight associated with each word, and you want all words which share a prefix with a given query q, and you want this list sorted by the associated weight.

Am I right?

If so, you might want to check this paper, which discusses how to do it in O(k log n) time, where k is the number of elements in the desired output set and n is the number of records in the original input set. We assume that k > log n.

http://dhruvbird.com/autocomplete.pdf

(I am the author).

Update: A further refinement I can add is that the question you are asking is related to 2-dimensional range searching, where you want everything in a given X-range and, from that set, the top K sorted by Y.

2D range search lets you find everything in an X/Y-range (if both your ranges are known). In this case, you only know the X-range, so you would need to run the query repeatedly, binary searching on the Y-range until you get K results. Each query can be performed in O(log n) time if you employ fractional cascading, and O(log² n) with the naive approach. Either of these is sub-linear, so you should be okay.

Additionally, the time to list all entries would add an O(k) term to your running time.
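For illustration, here is a minimal Python sketch of the naive O(log² n) variant described above: a segment tree over X whose nodes store sorted Y-lists (a "merge sort tree"), plus a binary search on the Y-cutoff. The class and function names are mine, fractional cascading is omitted, and the final listing step is done naively over the X-slice rather than reported from the tree:

    import bisect
    import heapq

    class MergeSortTree:
        # Segment tree over the points sorted by x; each node stores the sorted
        # y-values of its segment. count() answers "how many points have
        # x in [x_lo, x_hi] and y <= y_max" in O(log^2 n), the naive bound.
        def __init__(self, points):              # points: list of (x, y)
            self.pts = sorted(points)            # sorted by x
            self.xs = [x for x, _ in self.pts]
            self.n = len(self.pts)
            self.ys = [None] * (4 * self.n)
            if self.n:
                self._build(1, 0, self.n - 1)

        def _build(self, node, lo, hi):
            if lo == hi:
                self.ys[node] = [self.pts[lo][1]]
                return
            mid = (lo + hi) // 2
            self._build(2 * node, lo, mid)
            self._build(2 * node + 1, mid + 1, hi)
            # linear-time merge of the children's sorted y-lists
            self.ys[node] = list(heapq.merge(self.ys[2 * node],
                                             self.ys[2 * node + 1]))

        def _count(self, node, lo, hi, i, j, y_max):
            if j < lo or hi < i:
                return 0
            if i <= lo and hi <= j:              # segment fully inside [i, j]
                return bisect.bisect_right(self.ys[node], y_max)
            mid = (lo + hi) // 2
            return (self._count(2 * node, lo, mid, i, j, y_max) +
                    self._count(2 * node + 1, mid + 1, hi, i, j, y_max))

        def count(self, x_lo, x_hi, y_max):
            i = bisect.bisect_left(self.xs, x_lo)
            j = bisect.bisect_right(self.xs, x_hi) - 1
            return self._count(1, 0, self.n - 1, i, j, y_max) if i <= j else 0

    def top_k_in_x_range(tree, x_lo, x_hi, k):
        # Binary search on the y-cutoff until at least k points fall at or below
        # it, then list and sort them: the listing is the extra O(k) term.
        ys = sorted({y for _, y in tree.pts})    # candidate cutoffs
        if not ys:
            return []
        if tree.count(x_lo, x_hi, ys[-1]) <= k:
            cutoff = ys[-1]                      # fewer than k matches: take all
        else:
            lo, hi = 0, len(ys) - 1
            while lo < hi:                       # smallest cutoff with >= k hits
                mid = (lo + hi) // 2
                if tree.count(x_lo, x_hi, ys[mid]) >= k:
                    hi = mid
                else:
                    lo = mid + 1
            cutoff = ys[lo]
        i = bisect.bisect_left(tree.xs, x_lo)
        j = bisect.bisect_right(tree.xs, x_hi)
        hits = [(x, y) for x, y in tree.pts[i:j] if y <= cutoff]
        hits.sort(key=lambda p: p[1])
        return hits[:k]

With this naive counting structure the whole query is O(log³ n + k): an O(log n) binary search over cutoffs, each step costing one O(log² n) count, plus the listing. Still sub-linear; fractional cascading would drop each count to O(log n).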

answered by dhruvbird


Assuming N << k < n, this can be done in O(log n + k + N log N), similar to what you suggested in option 2, but saving some time: you don't need to sort all k elements, only N of them, which is much smaller!

The database is sorted by A.

(1) Find the first and the last elements matching the range [X, Y] by binary
    search, and build a list of the k elements between them.
(2) Find the N'th biggest element (by B) using a selection algorithm (*); then,
    in a second pass over the list, populate a new list of size N with the N
    highest elements.
(3) Sort that last list by B.

(*) Selection algorithm (e.g. quickselect or median-of-medians): finds the N'th biggest element in O(n) time, which here is O(k), because the list's size is k.

Complexity:
Step 1 is trivially O(log n + k).
Step 2 is O(k) for the selection, and the second pass is also O(k), since the list has only k elements.
Step 3 is O(N log N), a simple sort of a list that contains only N elements.
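A minimal Python sketch of these three steps, assuming objects are (A, B) pairs in a list pre-sorted by A, and selecting the N smallest by B (flip the key for "highest"); the helper names are mine:

    import bisect
    import random

    def quickselect(lst, n, key):
        # Return the n smallest elements of lst by key (in no particular
        # order), in expected O(len(lst)) time -- the selection step (*).
        if n >= len(lst):
            return list(lst)
        pivot = key(random.choice(lst))
        lows = [e for e in lst if key(e) < pivot]
        pivots = [e for e in lst if key(e) == pivot]
        highs = [e for e in lst if key(e) > pivot]
        if n < len(lows):
            return quickselect(lows, n, key)
        if n <= len(lows) + len(pivots):
            return lows + pivots[:n - len(lows)]
        return lows + pivots + quickselect(highs, n - len(lows) - len(pivots), key)

    def filtered_top_n(items_by_a, a_keys, x, y, n):
        lo = bisect.bisect_left(a_keys, x)       # step 1: binary search the
        hi = bisect.bisect_right(a_keys, y)      # k candidates, O(log n + k)
        matches = items_by_a[lo:hi]
        best = quickselect(matches, n, key=lambda ab: ab[1])  # step 2: O(k)
        best.sort(key=lambda ab: ab[1])          # step 3: O(N log N)
        return best

Note the quickselect shown is expected O(k); a median-of-medians pivot would make the O(k) bound worst-case, as the answer's analysis assumes.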

answered by amit


If the number of items you want to return is small (up to about 1% of the total number of items), then a simple heap selection algorithm works well; see "When theory meets practice". But it's not sub-linear.

For expected sub-linear performance, you can sort the items by A. When queried, use binary search to find the first item where A >= X, and then sequentially scan items until A > Y, using the heap selection technique I outlined in that blog post.

This should give you O(log n) for the initial search, and then O(m log k), where m is the number of items where X <= A <= Y, and k is the number of items you want returned. Yes, it will still be O(n log k) for some queries. The deciding factor will be the size of m.
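A minimal Python sketch of this combination, assuming (A, B) pairs pre-sorted by A and that "first N" means ascending B; the function name is mine, and the bounded heap stands in for the technique from Jim's post:

    import bisect
    import heapq

    def heap_select_range(items_by_a, a_keys, x, y, k):
        start = bisect.bisect_left(a_keys, x)    # O(log n) initial search
        heap = []                                # size-k max-heap via negated B
        for idx in range(start, len(items_by_a)):
            a, b = items_by_a[idx]
            if a > y:                            # past the A-range: stop scanning
                break
            if len(heap) < k:
                heapq.heappush(heap, (-b, a))
            elif b < -heap[0][0]:                # better than the worst kept item
                heapq.heapreplace(heap, (-b, a))
            # each of the m scanned items costs at most O(log k) heap work
        result = [(a, -nb) for nb, a in heap]
        result.sort(key=lambda ab: ab[1])        # final ordering of k items
        return result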

answered by Jim Mischel