Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Data structure to build and lookup set of integer ranges

I have a set of uint32 integers, there may be millions of items in the set. 50-70% of them are consecutive, but in input stream they appear in unpredictable order.

I need to:

  1. Compress this set into ranges to achieve space efficient representation. Already implemented this using trivial algorithm, since ranges computed only once speed is not important here. After this transformation number of resulting ranges is typically within 5 000-10 000, many of them are single-item, of course.

  2. Test membership of some integer, information about specific range in the set is not required. This one must be very fast -- O(1). Was thinking about minimal perfect hash functions, but they do not play well with ranges. Bitsets are very space inefficient. Other structures, like binary trees, has complexity of O(log n), worst thing with them that implementation make many conditional jumps and processor can not predict them well giving poor performance.

Is there any data structure or algorithm specialized in integer ranges to solve this task?

like image 729
actual Avatar asked Jan 05 '11 07:01

actual


People also ask

Which data structure is best for lookup?

Looking at complexity analysis, Hashtables seem to be the most efficient for lookups.

What is lookup in data structure?

In computer science, a lookup table (LUT) is an array that replaces runtime computation with a simpler array indexing operation.

What is the best data structure type to store a list of numbers?

Explanation: The only two choices that make sense are Array and Linked List.

Which data structure has fast lookup?

For efficient searching, one would definitely prefer a binary search tree.


3 Answers

Regarding the second issue:

You could look-up on Bloom Filters. Bloom Filters are specifically designed to answer the membership question in O(1), though the response is either no or maybe (which is not as clear cut as a yes/no :p).

In the maybe case, of course, you need further processing to actually answer the question (unless a probabilistic answer is sufficient in your case), but even so the Bloom Filter may act as a gate keeper, and reject most of the queries outright.

Also, you might want to keep actual ranges and degenerate ranges (single elements) in different structures.

  • single elements may be best stored in a hash-table
  • actual ranges can be stored in a sorted array

This diminishes the number of elements stored in the sorted array, and thus the complexity of the binary search performed there. Since you state that many ranges are degenerate, I take it that you only have some 500-1000 ranges (ie, an order of magnitude less), and log(1000) ~ 10

I would therefore suggest the following steps:

  • Bloom Filter: if no, stop
  • Sorted Array of real ranges: if yes, stop
  • Hash Table of single elements

The Sorted Array test is performed first, because from the number you give (millions of number coalesced in a a few thousands of ranges) if a number is contained, chances are it'll be in a range rather than being single :)

One last note: beware of O(1), while it may seem appealing, you are not here in an asymptotic case. Barely 5000-10000 ranges is few, as log(10000) is something like 13. So don't pessimize your implementation by getting a O(1) solution with such a high constant factor that it actually runs slower than a O(log N) solution :)

like image 166
Matthieu M. Avatar answered Oct 22 '22 07:10

Matthieu M.


If you know in advance what the ranges are, then you can check whether a given integer is present in one of the ranges in O(lg n) using the strategy outlined below. It's not O(1), but it's still quite fast in practice.

The idea behind this approach is that if you've merged all of the ranges together, you have a collection of disjoint ranges on the number line. From there, you can define an ordering on those intervals by saying that the interval [a, b] ≤ [c, d] iff b ≤ c. This is a total ordering because all of the ranges are disjoint. You can thus put all of the intervals together into a static array and then sort them by this ordering. This means that the leftmost interval is in the first slot of the array, and the rightmost interval is in the rightmost slot. This construction takes O(n lg n) time.

To check if a some interval contains a given integer, you can do a binary search on this array. Starting at the middle interval, check if the integer is contained in that interval. If so, you're done. Otherwise, if the value is less than the smallest value in the range, continue the search on the left, and if the value is greater than the largest value in the range, continue the search on the right. This is essentially a standard binary search, and it should run in O(lg n) time.

Hope this helps!

like image 7
templatetypedef Avatar answered Oct 22 '22 07:10

templatetypedef


AFAIK there is no such algorithm that search over integer list in O(1).

One only can do O(1) search with vast amount of memory.

So it is not very promising to try to find O(1) search algorithm over list of range of integer.

On the other hand, you could try time/memory trade-off approach by carefully examining your data set (eventually building a kind of hash table).

like image 2
9dan Avatar answered Oct 22 '22 08:10

9dan