check 1 billion cell-phone numbers for duplicates

Tags:

algorithm

large-data

It's an interview question:

There are 1 billion cell-phone numbers, each 11 digits long, stored in random order in a file, for example 12345678910; the first digit is always 1. Go through these numbers to see whether any of them is duplicated. You only need to know whether a duplicate exists: return True if one is found, otherwise return False. Only 10 MB of memory is allowed.

Here is my solution:

Hash all the numbers into 1000 files using hash(num) % 1000, so that any duplicates fall into the same file.

After hashing, I have 1000 small files, each containing at most 1 million numbers, right? I'm not sure about this; I simply computed 1 billion / 1000 = 1 million.
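The partitioning step can be sketched as below, in Python for illustration. The function name `partition_to_buckets` and the bucket file names are hypothetical, and the sketch assumes `num % n_buckets` stands in for `hash(num) % 1000`. Note that 1 million numbers per file is the average, not a guaranteed maximum: a skewed input can make some buckets larger, so an oversized bucket may need to be partitioned again.

```python
import os

def partition_to_buckets(src_path, out_dir, n_buckets=1000):
    """Stream src_path (one number per line) into n_buckets files by num % n_buckets.

    Assumption: the OS allows n_buckets files to be open at once, and the
    per-file write buffers fit in the memory budget (they count against it).
    """
    paths = [os.path.join(out_dir, f"bucket_{b:04d}.txt") for b in range(n_buckets)]
    files = [open(p, "w") for p in paths]
    try:
        with open(src_path) as src:
            for line in src:
                line = line.strip()
                if line:
                    # Equal numbers always map to the same bucket file.
                    files[int(line) % n_buckets].write(line + "\n")
    finally:
        for f in files:
            f.close()
    return paths
```

Only one input line is held in memory at a time, so the memory cost is dominated by the open output files rather than by the data.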

Then for each file, build a hash table to store each number and a flag representing its occurrence.

I guess it will take 5 B to represent a number: 4 B for the lower 8 digits and 1 B for the upper 3 digits. Actually 1 bit would suffice for the flag, because I only need to find out whether a duplicate exists, not how many times it occurs. But how can I attach a 1-bit flag to each number? I'm stumped, so I chose a bool for the flag, which takes 1 B. So each entry in the hash table takes 5 B (number) + 1 B (flag) + 4 B (next pointer) = 10 B, and with 1 million numbers per file the hash table takes 10 MB.
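One common way to get a true 1-bit flag per number is a bitmap: instead of storing the numbers, use each number as an index into a bit array. The sketch below is in Python for illustration and rests on an assumption not stated in the question, namely that the buckets were formed with `num % 1000` directly. Under that assumption, `(num - 10**10) // 1000` takes at most 10^7 distinct values inside one bucket, so the bitmap needs 10^7 bits = 1.25 MB. The names `bucket_has_duplicate`, `FIRST`, `N_BUCKETS`, and `SLOTS` are hypothetical.

```python
FIRST = 10_000_000_000        # smallest 11-digit number starting with 1
N_BUCKETS = 1000              # assumption: buckets were formed with num % 1000
SLOTS = 10**10 // N_BUCKETS   # distinct values that can land in one bucket

def bucket_has_duplicate(numbers):
    """Return True if any number repeats.

    All numbers must be 11-digit values starting with 1 that share the same
    num % N_BUCKETS, so each one maps to a unique bit position.
    """
    bits = bytearray(SLOTS // 8)          # 1,250,000 bytes, one bit per value
    for num in numbers:
        slot = (num - FIRST) // N_BUCKETS
        byte, mask = slot >> 3, 1 << (slot & 7)
        if bits[byte] & mask:             # bit already set: seen before
            return True
        bits[byte] |= mask
    return False
```

Compared with the 10 B per entry estimated above, this costs 1 bit per possible value regardless of how many numbers the bucket actually holds.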

That's my naive solution. Please suggest a better one.

Thanks.

FOLLOW UP:

If there are no duplicates among these 1 billion phone numbers, how can you find out whether a given phone number is among them? Use as little memory as possible.

I came up with 2 solutions:

1. A phone number can be represented in 5 B as described above. Scan through the file, reading one number at a time, and XOR the given number with each number read; if the result is 0, the given number is in the file. This takes O(n) time, right?
2. Partition the numbers into 2 smaller files according to the leading bit: numbers with a leading 1-bit go to one file and numbers with a leading 0-bit go to the other, counting how many numbers land in each file. If the given number falls into the 1-bit file and that file's count is not full, partition the 1-bit file again on the second leading bit and check the given number recursively. If the 1-bit file is full (it holds every possible value with that prefix), the given number must be in it. This takes O(log n) time, right?
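The second idea can be sketched as follows, in Python for illustration. This is a simplified in-memory version: a list stands in for each partition file, the function name `contains` is hypothetical, and the numbers are assumed to be distinct offsets in `[0, 2**bits)` (for the phone numbers, subtract 10^10 first and use 34 bits).

```python
def contains(numbers, target, bits=34):
    """Membership test by repeated partitioning on the leading bit.

    numbers: distinct integers in [0, 2**bits); target must be in that range.
    """
    current = list(numbers)
    for shift in range(bits - 1, -1, -1):
        want = (target >> shift) & 1
        # Keep only the partition whose bit at this position matches the target.
        current = [n for n in current if (n >> shift) & 1 == want]
        if not current:
            return False               # nothing shares the target's prefix
        if len(current) == 1 << shift:
            return True                # partition is full: every value is present
    return True                        # all bits matched, so target was found
```

The loop runs at most `bits` rounds, but each round still has to read every surviving number, and the first round reads all of them, so the total work is at least one full pass over the data.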


asked Oct 09 '11 11:10

Alcott

1 Answer

Fastest solution (also in terms of programmer overhead :)

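# Generate some test 'phones' (bounded with head; without it, yes never stops)
yes 1 | head -n 1000000 | perl -wne 'chomp; ++$a; print $_."$a\n";' > phones.txt

# Split phones.txt into 10 MB chunks of whole lines
# (-a 3 allows more than 676 chunks; 1 billion numbers need about 1200)
split -C 10000000 -a 3 phones.txt

# Sort each 10 MB chunk with 10 MB of memory
for i in x???; do sort -S 10M "$i" > "$i.srt"; printf '%s\0' "$i.srt" >> merge.txt; done

# Merge the sorted chunks with 10 MB of memory
sort -S 10M --files0-from=merge.txt -m > sorted.txt

# See if there are any duplicates (exit status 0 means none were found)
test -z "$(uniq -d sorted.txt)"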

You can check that the memory usage constraint is met while sort is running, for example with pmap $(pidof sort).
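For readers who want to see what the merge and duplicate check amount to, here is a minimal sketch in Python. It is an illustration only, not what GNU sort does internally, and the function name `has_duplicate_sorted` is hypothetical. It relies on the fact that after a merge of sorted inputs, equal values are always adjacent.

```python
import heapq

def has_duplicate_sorted(chunks):
    """chunks: iterables that each yield values in ascending order."""
    previous = None
    # heapq.merge keeps only one pending value per chunk in memory.
    for value in heapq.merge(*chunks):
        if value == previous:
            return True      # equal neighbours in sorted order mean a duplicate
        previous = value
    return False
```

With sorted chunk files on disk, each chunk could be passed as a lazy iterator such as `map(int, open(path))`, so memory use grows with the number of chunks rather than with the number of phone numbers.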


answered Oct 11 '22 01:10

piotr
