Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Algorithm Question on File Search Indexing

Tags:

algorithm

There is one question and I have the solution to it also. But I couldn't understand the solution. Kindly help with some set of examples and shower some experience.

Question

Given a file containing roughly 300 million social security numbers (9-digit numbers), find a 9-digit number that is not in the file. You have unlimited drive space but only 2MB of RAM at your disposal.

Answer

In the first step, we build an array 2^16 integers that is initialized to 0 and for every number in the file, we take its 16 most significant bits to index into this array and increment the number.

Since there are less than 2^32 numbers in the file, there is bound to be (at least) one number in the array that is less than 2^16. This tells us that there is at least one number missing among the possible numbers with those upper bits.

In the second pass, we can focus only only on the numbers that match this criterion and use a bit-vector of size 2^16 to identify one of the missing numbers.

like image 878
AGeek Avatar asked May 28 '11 21:05

AGeek


1 Answers

To make the explanation simpler, let's say you have a list of two-digit numbers, where each digit is between 0 and 3, but you can't spare the 16 bits to remember for each of the 16 possible numbers, whether you have already encountered it. What you do is to create an array a of 4 3-bit integers and in a[i], you store how many numbers with the first digit i you encountered. (Two-bit integers wouldn't be enough, because you need the values 0, 4 and all numbers between them.)

If you had the file

00, 12, 03, 31, 01, 32, 02

your array would look like this:

4, 1, 0, 2

Now you know that all numbers starting with 0 are in the file, but for each of the remaining, there is at least one missing. Let's pick 1. We know there is at least one number starting with 1 that is not in the file. So, create an array of 4 bits, for each number starting with 1 set the appropriate bit and in the end, pick one of the bits that wasn't set, in our example if could be 0. Now we have the solution: 10.

In this case, using this method is the difference between 12 bits and 16 bits. With your numbers, it's the difference between 32 kB and 119 MB.

like image 51
svick Avatar answered Oct 07 '22 01:10

svick