Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Storing a bucket of numbers in an efficient data structure

I have a buckets of numbers e.g. - 1 to 4, 5 to 15, 16 to 21, 22 to 34,.... I have roughly 600,000 such buckets. The range of numbers that fall in each of the bucket varies. I need to store these buckets in a suitable data structure so that the lookups for a number is as fast as possible.

So my question is what is the suitable data structure and a sorting mechanism for this type of problem.

Thanks in advance

like image 459
BlitzKrieg Avatar asked Apr 09 '10 22:04

BlitzKrieg


People also ask

What is bucket in data structure?

A bucket data structure is a data structure that uses the key values as the indices of the buckets, and store items of the same key value in the corresponding bucket. Naturally it makes the most sense to use the bucket data structure with integer key values.

What is bucketing the array?

Bucketing builds, the hash table as a 2D array instead of a single dimensional array. Every entry in the array is big, sufficient to hold M items (M is not amount of data. Just a constant). Problems. Lots of wasted space are created.

When to use bucket sort?

Bucket sort is a useful technique that is commonly used when inputs are uniformly distributed. In computer science, it is a very simple way for comparison, which is based on the algorithm of sorting. Bucket sort can also be used for partitioning an array into buckets where each bucket is individually sorted.


2 Answers

If the buckets are contiguous and disjoint, as in your example, you need to store in a vector just the left bound of each bucket (i.e. 1, 5, 16, 22) plus, as the last element, the first number that doesn't fall in any bucket (35). (I assume, of course, that you are talking about integer numbers.)

Keep the vector sorted. You can search the bucket in O(log n), with kind-of-binary search. To search which bucket does a number x belong to, just go for the only index i such that vector[i] <= x < vector[i+1]. If x is strictly less than vector[0], or if it is greater than or equal to the last element of vector, then no bucket contains it.

EDIT. Here is what I mean:

#include <stdio.h>

// ~ Binary search. Should be O(log n)
int findBucket(int aNumber, int *leftBounds, int left, int right)
{
    int middle;

    if(aNumber < leftBounds[left] || leftBounds[right] <= aNumber) // cannot find
        return -1;
    if(left + 1 == right) // found
        return left;

    middle = left + (right - left)/2;

    if( leftBounds[left] <= aNumber && aNumber < leftBounds[middle] )
        return findBucket(aNumber, leftBounds, left, middle);
    else
        return findBucket(aNumber, leftBounds, middle, right);
}


#define NBUCKETS 12
int main(void)
{
    int leftBounds[NBUCKETS+1] = {1, 4, 7, 15, 32, 36, 44, 55, 67, 68, 79, 99, 101};
    // The buckets are 1-3, 4-6, 7-14, 15-31, ...

    int aNumber;
    for(aNumber = -3; aNumber < 103; aNumber++)
    {
        int index = findBucket(aNumber, leftBounds, 0, NBUCKETS);
        if(index < 0)
            printf("%d: Bucket not found\n", aNumber);
        else
            printf("%d belongs to the bucket %d-%d\n", aNumber, leftBounds[index], leftBounds[index+1]-1);
    }   
    return 0;
}
like image 75
Federico A. Ramponi Avatar answered Oct 06 '22 01:10

Federico A. Ramponi


You will probably want some kind of sorted tree, like a B-Tree, B+ Tree, or Binary Search tree.

like image 44
Jared Updike Avatar answered Oct 05 '22 23:10

Jared Updike