Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

using 10 MB of memory for four billion integers (about finding the optimized block size) [duplicate]

The problem is, given an input file with four billion integers, provide an algorithm to generate an integer which is not contained in the file, assume only have 10 MB of memory.

Searched for some solutions, one of which is to store integers into bit-vector blocks (each block representing a specific range of integers among 4 billion range, each bit in the block represent for an integer), and using another counter for each block, to count the number of integers in each block. So that if number of integers is less than the block capacity for integers, scan the bit-vector of the block to find which are missing integers.

My confusion for this solution is, it is mentioned the optimal smallest footprint is, when the array of block counters occupies the same memory as the bit vector. I am confused why in such situation it is the optimal smallest footprint?

Here are calculation details I referred,

Let N = 2^32.
counters (bytes): blocks * 4
bit vector (bytes): (N / blocks) / 8
blocks * 4 = (N / blocks) / 8
blocks^2 = N / 32
blocks = sqrt(N/2)/4

thanks in advance, Lin

like image 915
Lin Ma Avatar asked Oct 10 '15 03:10

Lin Ma


2 Answers

Why it is the smallest memory footprint:

In the solution you proposed, there are two phases:

  1. Count number of integers in each block

    This uses 4*(#blocks) bytes of memory.

  2. Use a bit-vector each bit representing an integer in the block.

    This uses (blocksize/8) bytes of memory, which is (N/blocks)/8.

Setting the 2 to be equal results in blocks = sqrt(N/32) as you have mentioned.

This is the optimal because the memory required is the maximum of the memory required in each phase (which must both be executed). After the 1st phase, you can forget the counters, except for which block to search in for phase 2.

Optimization

If your counter saturates when it reaches capacity, you don't really need 4 bytes per counter, but rather 3 bytes. A counter reaches capacity when it exceeds the number of integers in the block.

In this case, phase 1 uses 3*blocks of memory, and phase 2 uses (N/blocks)/8. Therefore, the optimal is blocks = sqrt(N/24). If N is 4 billion, the number of blocks is approx 12910, and the block size is 309838 integers per block. This fits in 3 bytes.

Caveats, and alternative with good average case performance

The algorithm you proposed only works if all input integers are distinct. In case the integers are not distinct, I suggest you simply go with a randomized candidate set of integers approach. In a randomized candidate set of integers approach, you can select say 1000 candidate integers at random, and check if any are not found in the input file. If you fail, you can try find another random set of candidate integers. While this has poor worst case performance, it would be faster in the average case for most input. For example, if the input integers have a coverage of 99% of possible integers, then on average, with 1000 candidate integers, 10 of them will not be found. You can select the candidate integers pseudo-randomly so that you never repeat a candidate integer, and also to guarantee that in a fixed number of tries, you will have tested all possible integers.

If each time, you check sqrt(N) candidate integers, then the worst case performance can be as good as N*sqrt(N), because you might have to scan all N integers sqrt(N) times.

You can avoid the worst case time if you use this alternative, and if it doesn't work for the first set of candidate integers, you switch to your proposed solution. This might give better average case performance (this is a common strategy in sorting where quicksort is used first, before switching to heapsort for example if it appears that the worst case input is present).

like image 50
ronalchn Avatar answered Nov 07 '22 08:11

ronalchn


# assumes all integers are positive and fit into an int
# very slow, but definitely uses less than 10MB RAM
int generate_unique_integer(file input-file)
{
  int largest=0
  while (not eof(input-file))
  {
    i=read(integer)
    if (i>largest) largest=i
  }
  return i++; #larger than the largest integer in the input file
}
like image 44
user5429970 Avatar answered Nov 07 '22 10:11

user5429970