I went to an interview today and was asked this question:
Suppose you have one billion integers which are unsorted in a disk file. How would you determine the largest hundred numbers?
I'm not even sure where I would start on this question. What is the most efficient process to follow to give the correct result? Do I need to go through the disk file a hundred times grabbing the highest number not yet in my list, or is there a better way?
Obviously the interviewers want you to point out two key facts: one billion integers will not fit in memory, so you have to stream the file and keep only the 100 largest values seen so far; and the container holding those 100 values must efficiently support the following operations:
- `Get-Size`: Get the number of values in the container.
- `Find-Min`: Get the smallest value.
- `Delete-Min`: Remove the smallest value to replace it with a new, larger value.
- `Insert`: Insert another element into the container.

By evaluating the requirements for the data structure, a computer science professor would expect you to recommend using a Heap (Min-Heap), since it is designed to support exactly the operations we need here.
For example, for Fibonacci heaps, the operations `Get-Size`, `Find-Min` and `Insert` are all O(1), and `Delete-Min` is O(log n) (with n <= 100 in this case).
In practice, you could use a priority queue from your favorite language's standard library (e.g. `std::priority_queue` from `<queue>` in C++), which is usually implemented using a heap.
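As a minimal sketch of that idea in C++, assuming (purely for illustration) that the integers sit one per line in a text file named `numbers.txt`: `std::priority_queue` is a max-heap by default, so `std::greater` turns it into the min-heap we need.

```cpp
#include <cstdint>
#include <fstream>
#include <functional>
#include <iostream>
#include <queue>
#include <vector>

int main() {
    std::ifstream in("numbers.txt");  // hypothetical input: one integer per line

    // Min-heap: the smallest of the current top 100 sits at the root,
    // so it can be inspected in O(1) and evicted in O(log 100).
    std::priority_queue<int64_t, std::vector<int64_t>,
                        std::greater<int64_t>> top100;

    int64_t n;
    while (in >> n) {
        if (top100.size() < 100) {
            top100.push(n);           // still filling the first 100 slots
        } else if (n > top100.top()) {
            top100.pop();             // evict the smallest of the top 100
            top100.push(n);           // ...and admit the new value
        }
    }

    while (!top100.empty()) {         // prints the 100 largest, ascending
        std::cout << top100.top() << '\n';
        top100.pop();
    }
}
```

Note that almost every number fails the O(1) comparison against `top100.top()` and is discarded immediately; only the rare winners pay for the O(log 100) heap update.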
Here's my initial algorithm:
```
create array of size 100 [0..99].
read first 100 numbers and put into array.
sort array in ascending order.
while more numbers in file:
    get next number N.
    if N > array[0]:
        if N > array[99]:
            shift array[1..99] to array[0..98].
            set array[99] to N.
        else:
            find, using binary search, first index i where N <= array[i].
            shift array[1..i-1] to array[0..i-2].
            set array[i-1] to N.
        endif
    endif
endwhile
```
This has the (very slight) advantage that there's no O(n^2) shuffling for the first 100 elements, just an O(n log n) sort, and that you very quickly identify and throw away the values that are too small. It also uses a binary search (7 comparisons max) to find the correct insertion point rather than the 50 comparisons (on average) of a simplistic linear search (not that I'm suggesting anyone else proffered such a solution, just that it may impress the interviewer).
You may even get bonus points for suggesting the use of optimised shift operations like `memmove` in C (or `memcpy`, provided you can be sure the overlap isn't a problem).
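Here's a C++ sketch of the algorithm above, again assuming (for illustration only) one integer per line in `numbers.txt`; `std::copy` performs the overlapping forward shift safely, playing the role of the `memmove`-style shift just mentioned, and `std::lower_bound` does the binary search.

```cpp
#include <algorithm>
#include <cstdint>
#include <fstream>
#include <iostream>
#include <vector>

int main() {
    std::ifstream in("numbers.txt");  // hypothetical input: one integer per line
    std::vector<int64_t> top;         // kept sorted in ascending order
    top.reserve(100);

    int64_t n;
    while (in >> n) {
        if (top.size() < 100) {       // fill the first 100 slots...
            top.push_back(n);
            if (top.size() == 100)    // ...then sort once: O(100 log 100)
                std::sort(top.begin(), top.end());
            continue;
        }
        if (n <= top.front())
            continue;                 // too small: throw it away immediately

        // Binary search for the first element >= n, then shift the smaller
        // elements down one slot (dropping the smallest) and insert n.
        auto pos = std::lower_bound(top.begin(), top.end(), n);
        std::copy(top.begin() + 1, pos, top.begin());
        *(pos - 1) = n;
    }

    if (top.size() < 100)             // file held fewer than 100 numbers
        std::sort(top.begin(), top.end());

    for (int64_t v : top)             // the 100 largest, ascending
        std::cout << v << '\n';
}
```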
One other possibility you may want to consider is to maintain three lists (of up to 100 integers each):
```
read first hundred numbers into array 1 and sort them descending.
while more numbers:
    read up to next hundred numbers into array 2 and sort them descending.
    merge-sort lists 1 and 2 into list 3 (only first (largest) 100 numbers).
    if more numbers:
        read up to next hundred numbers into array 2 and sort them descending.
        merge-sort lists 3 and 2 into list 1 (only first (largest) 100 numbers).
    else:
        copy list 3 to list 1.
    endif
endwhile
```
I'm not sure, but that may end up being more efficient than the continual shuffling.
The merge-sort is a simple selection along the lines of (for merge-sorting lists 1 and 2 into 3):
```
list3.clear()
while list3.size() < 100:
    if list1.peek() >= list2.peek():
        list3.add(list1.pop())
    else:
        list3.add(list2.pop())
    endif
endwhile
```
Simply put, this pulls the top 100 values out of the combined lists by virtue of the fact that they're already sorted in descending order. I haven't checked in detail whether that would be more efficient; I'm just offering it as a possibility.
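As a rough C++ sketch of this chunked approach (same hypothetical `numbers.txt` as above), folding the three-list ping-pong into a single rolling "best" list and letting `std::merge` do the descending merge:

```cpp
#include <algorithm>
#include <cstdint>
#include <fstream>
#include <functional>
#include <iostream>
#include <iterator>
#include <vector>

// Read up to `count` integers from `in` into `out`, sorted descending.
// Returns false once the file is exhausted and nothing could be read.
static bool readChunk(std::ifstream& in, std::vector<int64_t>& out, size_t count) {
    out.clear();
    int64_t n;
    while (out.size() < count && in >> n)
        out.push_back(n);
    std::sort(out.begin(), out.end(), std::greater<int64_t>());
    return !out.empty();
}

// Merge two descending lists, keeping only the largest `limit` values.
static std::vector<int64_t> mergeTop(const std::vector<int64_t>& a,
                                     const std::vector<int64_t>& b,
                                     size_t limit) {
    std::vector<int64_t> merged;
    std::merge(a.begin(), a.end(), b.begin(), b.end(),
               std::back_inserter(merged), std::greater<int64_t>());
    if (merged.size() > limit)
        merged.resize(limit);         // keep only the first (largest) 100
    return merged;
}

int main() {
    std::ifstream in("numbers.txt");  // hypothetical input: one integer per line
    std::vector<int64_t> best, chunk;

    readChunk(in, best, 100);               // first hundred, sorted descending
    while (readChunk(in, chunk, 100))       // next (up to) hundred at a time
        best = mergeTop(best, chunk, 100);  // fold straight back into `best`

    for (int64_t v : best)                  // the 100 largest, descending
        std::cout << v << '\n';
}
```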
I suspect the interviewers would be impressed with the potential for "out of the box" thinking and the fact that you'd stated that it should be evaluated for performance.
As with most interviews, technical skill is one of the things they're looking at.