The problem is: given an input file with four billion integers, provide an algorithm to generate an integer which is not contained in the file, assuming you only have 10 MB of memory.
I searched for some solutions. One of them is to store the integers into bit-vector blocks (each block representing a specific range within the 4 billion possible values, with each bit in a block representing one integer), and to use a separate counter for each block to count how many input integers fall into it. If a block's count is less than its capacity, you scan that block's bit vector to find the missing integers.
My confusion with this solution is that it is mentioned the smallest memory footprint is achieved when the array of block counters occupies the same amount of memory as the bit vector. Why is that the optimal (smallest) footprint?
Here are the calculation details I referred to:
Let N = 2^32.
counters (bytes): blocks * 4
bit vector (bytes): (N / blocks) / 8
blocks * 4 = (N / blocks) / 8
blocks^2 = N / 32
blocks = sqrt(N/2)/4
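Plugging numbers into this (my own arithmetic, just as a sanity check): with N = 2^32, blocks = sqrt(N/32) = sqrt(2^27) ≈ 11,585, so each block covers roughly 2^32 / 11,585 ≈ 370,000 integers. The counter array then takes 4 * 11,585 ≈ 46 KB, and the per-block bit vector takes about 370,000 / 8 ≈ 46 KB, so both phases need roughly the same ~46 KB, well under the 10 MB limit.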
thanks in advance, Lin
In the solution you proposed, there are two phases:

1. Count the number of integers in each block. This uses 4 * (#blocks) bytes of memory.
2. Use a bit vector, each bit representing an integer in the block. This uses blocksize / 8 bytes of memory, which is (N / blocks) / 8.

Setting the two to be equal results in blocks = sqrt(N/32), as you have mentioned.

This is optimal because the memory required is the maximum of the memory needed in each phase (both of which must be executed). After the 1st phase you can forget the counters, except for remembering which block to search in phase 2.
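Here is a minimal C sketch of those two phases, assuming the input is a text file of unsigned 32-bit integers, one per line, and that all values are distinct; the function name find_missing and the fixed block count are illustrative choices, not part of the original solution.

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

/* Sketch only: returns a 32-bit value not present in the file, or -1 if
   every value appears (or the file cannot be opened). */
int64_t find_missing(const char *path)
{
    const uint64_t N = 1ULL << 32;
    const uint32_t blocks = 11585;                      /* ~ sqrt(N / 32) */
    const uint32_t blocksize = (uint32_t)((N + blocks - 1) / blocks);

    FILE *f = fopen(path, "r");
    if (!f) return -1;

    /* Phase 1: count how many input integers fall into each block. */
    uint32_t *counts = calloc(blocks, sizeof *counts);
    unsigned long long x;
    while (fscanf(f, "%llu", &x) == 1)
        counts[x / blocksize]++;

    /* Pick a block that is not full; it must be missing at least one value. */
    uint32_t target = blocks;
    for (uint32_t b = 0; b < blocks && target == blocks; b++) {
        uint64_t cap = ((uint64_t)(b + 1) * blocksize <= N)
                           ? blocksize : N - (uint64_t)b * blocksize;
        if (counts[b] < cap) target = b;
    }
    free(counts);                    /* phase-1 memory is no longer needed */
    if (target == blocks) { fclose(f); return -1; }

    /* Phase 2: re-scan, building a bit vector for the chosen block only. */
    uint64_t lo = (uint64_t)target * blocksize;
    uint64_t hi = (lo + blocksize < N) ? lo + blocksize : N;
    unsigned char *bits = calloc((blocksize + 7) / 8, 1);
    rewind(f);
    while (fscanf(f, "%llu", &x) == 1)
        if (x >= lo && x < hi)
            bits[(x - lo) / 8] |= 1u << ((x - lo) % 8);
    fclose(f);

    /* Any clear bit corresponds to an integer not present in the file. */
    for (uint64_t v = lo; v < hi; v++)
        if (!(bits[(v - lo) / 8] & (1u << ((v - lo) % 8)))) {
            free(bits);
            return (int64_t)v;
        }
    free(bits);
    return -1;
}

Note that the two calloc allocations peak at roughly 46 KB each and are never live at the same time, which is exactly why setting the two sizes equal minimizes the footprint.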
If your counter saturates when it reaches capacity, you don't really need 4 bytes per counter, but rather 3. A counter never needs to go past the block's capacity (the number of integers the block's range can hold), so it only has to represent values up to the block size.

In this case, phase 1 uses 3 * blocks bytes of memory, and phase 2 uses (N / blocks) / 8. Therefore, the optimum is blocks = sqrt(N/24). If N is 4 billion, the number of blocks is approximately 12,910, and the block size is about 309,838 integers per block, which fits in 3 bytes.
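As a small illustration of that idea (my own sketch, not part of the answer above), 3-byte counters can be packed into a flat byte array and capped at the block capacity; the helper name bump3 and the little-endian packing are arbitrary choices.

#include <stdint.h>

/* Increment the idx-th 3-byte counter in the packed array c3, never letting
   it pass 'cap' (the block capacity, which fits in 24 bits). */
static void bump3(unsigned char *c3, uint32_t idx, uint32_t cap)
{
    unsigned char *p = c3 + 3u * idx;
    uint32_t v = p[0] | ((uint32_t)p[1] << 8) | ((uint32_t)p[2] << 16);
    if (v >= cap)
        return;                      /* saturate: a full block stays "full" */
    v++;
    p[0] = (unsigned char)(v & 0xFF);
    p[1] = (unsigned char)((v >> 8) & 0xFF);
    p[2] = (unsigned char)((v >> 16) & 0xFF);
}

With 12,910 such counters the whole array is about 3 * 12,910 ≈ 38 KB.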
The algorithm you proposed only works if all input integers are distinct. If the integers are not distinct, I suggest you simply go with a randomized candidate-set approach: select, say, 1000 candidate integers at random and check whether any of them are absent from the input file. If all of them are found, try another random set of candidates. While this has poor worst-case performance, it is faster in the average case for most inputs. For example, if the input covers 99% of the possible integers, then on average 10 out of 1000 random candidates will not be found. You can select the candidates pseudo-randomly so that you never repeat a candidate, which also guarantees that within a fixed number of tries you will have tested every possible integer.
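A minimal sketch of one round of that idea, assuming the same text-file input as above; the function name try_random_candidates, the candidate count K, and the crude rand()-based draw are all illustrative, and a real version would use a better source of randomness.

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <stdbool.h>

/* One round: draw K random candidates, scan the file once, and return any
   candidate that never appears; returns -1 if all were found (caller retries
   with a fresh set). Duplicate candidates are ignored for brevity. */
int64_t try_random_candidates(const char *path, int K)
{
    uint32_t *cand = malloc((size_t)K * sizeof *cand);
    bool *seen = calloc((size_t)K, sizeof *seen);
    for (int i = 0; i < K; i++)                 /* crude 32-bit draw */
        cand[i] = ((uint32_t)rand() << 17) ^ ((uint32_t)rand() << 2) ^ (uint32_t)rand();

    FILE *f = fopen(path, "r");
    unsigned long long x;
    while (f && fscanf(f, "%llu", &x) == 1)
        for (int i = 0; i < K; i++)             /* linear check keeps the sketch simple */
            if (cand[i] == (uint32_t)x)
                seen[i] = true;
    if (f) fclose(f);

    int64_t result = -1;
    for (int i = 0; i < K; i++)
        if (!seen[i]) { result = cand[i]; break; }
    free(cand);
    free(seen);
    return result;
}

For 1000 candidates per round you would realistically replace the inner linear check with a hash set or a sorted array and binary search.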
If each round checks sqrt(N) candidate integers (chosen without repeats), the worst case is about N * sqrt(N) work, because you might have to scan all N input integers sqrt(N) times before every possible value has been tried.
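Concretely, with N = 2^32 that is sqrt(N) = 65,536 candidates per pass and at most 65,536 passes, so on the order of 2^32 * 2^16 = 2^48 integer reads in the absolute worst case; the average case is far better whenever the input leaves many gaps.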
You can avoid the worst-case time by using this alternative first and, if it doesn't work for the first set of candidate integers, switching to your proposed solution. This can give better average-case performance (a common strategy in sorting, where for example quicksort is used first and the algorithm switches to heapsort if worst-case input appears to be present).
#include <stdio.h>

// Assumes the file is a text file of decimal integers, all non-negative and
// strictly less than INT_MAX, so that largest + 1 does not overflow.
// Very slow, but definitely uses less than 10 MB of RAM.
int generate_unique_integer(FILE *input_file)
{
    int largest = 0;
    int i;
    while (fscanf(input_file, "%d", &i) == 1)
    {
        if (i > largest)
            largest = i;
    }
    return largest + 1;   // larger than the largest integer in the input file
}