Here is the problem: I have M random integers that need to be divided equally (or roughly equally) among N buckets.
To give a sense of scale, N would be around 10,000 and M could be anywhere from 100 to 5 million.
So far this looks like a small hashing problem, but here is what complicates it. The M integers arrive incrementally: initially X integers are available and you distribute them equally; then Y more become available and you distribute those as well; then Z more arrive (X + Y + Z = M).
Also, each number should be placed in such a way that its bucket number can be looked up efficiently.
So far I have thought of a couple of approaches, but none of them comes close to an equal distribution.
1) Size for the maximum up front. The largest M is 5 million, and equal distribution means 500 buckets, so start out by creating 500 buckets. They would eventually fill up equally. But this has edge cases that can be messy to handle.
2) Size the buckets according to the count currently available (X, then X+Y, then M), and rehash to increase the number of buckets when they fill up. This may be a costly exercise in my use case and I would like to avoid it.
3) Somehow treat it as a bin packing problem. But that doesn't readily tell me which bin a given integer will go into.
One obvious thing to keep in mind: since these are random numbers, if the count is, say, 100,000, one of the numbers could just as well be 500,000.
What approach do you recommend? I can provide the use case later if needed.
You are way overcomplicating this. The integers are random, so no thinking is required. If the integers were not random, then we might have to come up with a hash algorithm.
So long as the range of the integers is reasonably greater than the number of buckets, just assign each one to a bucket by taking it modulo the number of buckets.
Like this:
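void assignToBucket( int r )
{
    // In C-family languages % can return a negative result for negative r,
    // so shift the remainder into the range 0 .. NUM_BUCKETS-1 before indexing.
    int index = ( ( r % NUM_BUCKETS ) + NUM_BUCKETS ) % NUM_BUCKETS;
    bucket[ index ].add( r );
}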
It doesn't matter how many you try to insert - or if they come in all at once, or in several passes. So long as the stream is random, then the modulo will ensure they are roughly evenly distributed in the buckets.
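To make the incremental case concrete, here is a minimal sketch in Python (chosen only for brevity). It assumes uniformly random non-negative 31-bit integers, 100 buckets instead of 10,000, and three batches standing in for X, Y and Z. The names `bucket_index`, `add_batch` and `find_bucket` are made up for this illustration. It also shows the lookup side: finding a number's bucket is just recomputing the same modulo.

```python
import random

NUM_BUCKETS = 100  # illustrative; the question uses about 10,000


def bucket_index(r, num_buckets=NUM_BUCKETS):
    # Python's % is always non-negative for a positive divisor,
    # so negative values of r would also map to a valid index.
    return r % num_buckets


def add_batch(buckets, batch):
    # Each arriving batch is distributed the same way; nothing is rehashed.
    for r in batch:
        buckets[bucket_index(r, len(buckets))].append(r)


def find_bucket(buckets, r):
    # Lookup recomputes the index; only one bucket is ever inspected.
    idx = bucket_index(r, len(buckets))
    return idx if r in buckets[idx] else None


rng = random.Random(42)  # fixed seed so runs are repeatable
buckets = [[] for _ in range(NUM_BUCKETS)]

# Three batches standing in for X, Y and Z (assumed sizes).
batches = [
    [rng.randrange(0, 2**31) for _ in range(size)]
    for size in (20_000, 30_000, 50_000)
]
for batch in batches:
    add_batch(buckets, batch)

sizes = [len(b) for b in buckets]
print("smallest bucket:", min(sizes), "largest bucket:", max(sizes))
```

With 100,000 values over 100 buckets the average is 1,000 per bucket, and the smallest and largest buckets should land fairly close to that, whether the values arrive in one pass or three.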
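This won't work if the range of each r is close to the number of buckets. For example, if each r is from 0-7 and there are 6 buckets, it won't distribute evenly: 6 and 7 wrap around to buckets 0 and 1, so those two buckets receive twice as many values as the other four. And it won't work with a non-random stream.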
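The effect of the range size can be checked by counting how many distinct values map to each bucket. This is a small Python sketch; `residue_counts` is a hypothetical helper written for this illustration, and the range of 10,000 is an arbitrary "much larger" example.

```python
def residue_counts(range_size, num_buckets):
    # Count how many of the values 0 .. range_size-1 land in each bucket.
    counts = [0] * num_buckets
    for r in range(range_size):
        counts[r % num_buckets] += 1
    return counts


small = residue_counts(8, 6)        # r from 0-7, 6 buckets
large = residue_counts(10_000, 6)   # r from 0-9999, 6 buckets

print(small)  # [2, 2, 1, 1, 1, 1]
print(large)  # [1667, 1667, 1667, 1667, 1666, 1666]
```

In the small case two buckets are twice as likely as the rest. In the large case the leftover values still favour the first few buckets, but the imbalance is one part in about 1,667 and is swamped by ordinary random variation.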
For a stream with a non-random distribution, you would need to know something about the distribution to create a proper hash function.
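As an illustration of why a non-random stream needs more than a plain modulo, the sketch below feeds in a deliberately bad stream: every value is a multiple of the bucket count. It then compares plain modulo against scrambling each value first. Using BLAKE2b from Python's `hashlib` as the scrambler is an assumption made here for convenience, not something the answer prescribes, and a function tailored to the known distribution could be cheaper. The name `hashed_index` is hypothetical.

```python
import hashlib

NUM_BUCKETS = 100  # illustrative bucket count


def hashed_index(r, num_buckets=NUM_BUCKETS):
    # Scramble the integer's bytes with a general-purpose hash, then reduce.
    data = r.to_bytes(8, "little", signed=True)
    digest = hashlib.blake2b(data, digest_size=8).digest()
    return int.from_bytes(digest, "little") % num_buckets


# A non-random stream: 10,000 values, all multiples of NUM_BUCKETS.
stream = range(0, 1_000_000, NUM_BUCKETS)

plain = [0] * NUM_BUCKETS
hashed = [0] * NUM_BUCKETS
for r in stream:
    plain[r % NUM_BUCKETS] += 1
    hashed[hashed_index(r)] += 1

print("plain modulo, largest bucket:", max(plain))
print("hashed, smallest and largest:", min(hashed), max(hashed))
```

Plain modulo puts all 10,000 values into bucket 0, while the hashed version spreads them across the buckets. Lookup still works the same way, as long as the same function is used to place a value and to find it.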