I went to an interview today and was asked this question:
Suppose you have one billion integers which are unsorted in a disk file. How would you determine the largest hundred numbers?
I'm not even sure where I would start on this question. What is the most efficient process to follow to give the correct result? Do I need to go through the disk file a hundred times grabbing the highest number not yet in my list, or is there a better way?
Obviously the interviewers want you to point out two key facts: one billion integers will not fit in memory, so you have to stream the file and keep only the 100 largest values seen so far; and the container holding those 100 values must efficiently support the following operations:
- `Get-Size`: Get the number of values in the container.
- `Find-Min`: Get the smallest value.
- `Delete-Min`: Remove the smallest value to replace it with a new, larger value.
- `Insert`: Insert another element into the container.

By evaluating the requirements for the data structure, a computer science professor would expect you to recommend using a Heap (Min-Heap), since it is designed to support exactly the operations we need here.
For example, for Fibonacci heaps, the operations `Get-Size`, `Find-Min` and `Insert` are all O(1), and `Delete-Min` is O(log n) (with n <= 100 in this case).
In practice, you could use a priority queue from your favorite language's standard library (e.g. `std::priority_queue` from `<queue>` in C++), which is usually implemented using a heap.
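As a minimal sketch of that idea in C++, assuming (purely for illustration) that the integers sit one per line in a text file named `numbers.txt`: `std::priority_queue` is a max-heap by default, so `std::greater` turns it into the min-heap we need.

```cpp
#include <cstdint>
#include <fstream>
#include <functional>
#include <iostream>
#include <queue>
#include <vector>

int main() {
    std::ifstream in("numbers.txt");  // hypothetical input: one integer per line

    // Min-heap: the smallest of the current top 100 sits at the root,
    // so it can be inspected in O(1) and evicted in O(log 100).
    std::priority_queue<int64_t, std::vector<int64_t>,
                        std::greater<int64_t>> top100;

    int64_t n;
    while (in >> n) {
        if (top100.size() < 100) {
            top100.push(n);           // still filling the first 100 slots
        } else if (n > top100.top()) {
            top100.pop();             // evict the smallest of the top 100
            top100.push(n);           // ...and admit the new value
        }
    }

    while (!top100.empty()) {         // prints the 100 largest, ascending
        std::cout << top100.top() << '\n';
        top100.pop();
    }
}
```

Note that almost every number fails the O(1) comparison against `top100.top()` and is discarded immediately; only the rare winners pay for the O(log 100) heap update.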
Here's my initial algorithm:
```
create array of size 100 [0..99].
read first 100 numbers and put into array.
sort array in ascending order.
while more numbers in file:
    get next number N.
    if N > array[0]:
        if N > array[99]:
            shift array[1..99] to array[0..98].
            set array[99] to N.
        else:
            find, using binary search, first index i where N <= array[i].
            shift array[1..i-1] to array[0..i-2].
            set array[i-1] to N.
        endif
    endif
endwhile
```
This has the (very slight) advantage that there's no O(n^2) shuffling for the first 100 elements, just an O(n log n) sort, and that you very quickly identify and throw away the values that are too small. It also uses a binary search (7 comparisons max) to find the correct insertion point rather than the 50 comparisons (on average) of a simplistic linear search (not that I'm suggesting anyone else proffered such a solution, just that it may impress the interviewer).
You may even get bonus points for suggesting the use of optimised shift operations like `memmove` in C (or `memcpy`, provided you can be sure the overlap isn't a problem).
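Here's a C++ sketch of the algorithm above, again assuming (for illustration only) one integer per line in `numbers.txt`; `std::copy` performs the overlapping forward shift safely, playing the role of the `memmove`-style shift just mentioned, and `std::lower_bound` does the binary search.

```cpp
#include <algorithm>
#include <cstdint>
#include <fstream>
#include <iostream>
#include <vector>

int main() {
    std::ifstream in("numbers.txt");  // hypothetical input: one integer per line
    std::vector<int64_t> top;         // kept sorted in ascending order
    top.reserve(100);

    int64_t n;
    while (in >> n) {
        if (top.size() < 100) {       // fill the first 100 slots...
            top.push_back(n);
            if (top.size() == 100)    // ...then sort once: O(100 log 100)
                std::sort(top.begin(), top.end());
            continue;
        }
        if (n <= top.front())
            continue;                 // too small: throw it away immediately

        // Binary search for the first element >= n, then shift the smaller
        // elements down one slot (dropping the smallest) and insert n.
        auto pos = std::lower_bound(top.begin(), top.end(), n);
        std::copy(top.begin() + 1, pos, top.begin());
        *(pos - 1) = n;
    }

    if (top.size() < 100)             // file held fewer than 100 numbers
        std::sort(top.begin(), top.end());

    for (int64_t v : top)             // the 100 largest, ascending
        std::cout << v << '\n';
}
```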
One other possibility you may want to consider is to maintain three lists (of up to 100 integers each):
```
read first hundred numbers into array 1 and sort them descending.
while more numbers:
    read up to next hundred numbers into array 2 and sort them descending.
    merge-sort lists 1 and 2 into list 3 (only first (largest) 100 numbers).
    if more numbers:
        read up to next hundred numbers into array 2 and sort them descending.
        merge-sort lists 3 and 2 into list 1 (only first (largest) 100 numbers).
    else:
        copy list 3 to list 1.
    endif
endwhile
```
I'm not sure, but that may end up being more efficient than the continual shuffling.
The merge-sort is a simple selection along the lines of (for merge-sorting lists 1 and 2 into 3):
```
list3.clear()
while list3.size() < 100:
    if list1.peek() >= list2.peek():
        list3.add(list1.pop())
    else:
        list3.add(list2.pop())
    endif
endwhile
```
Simply put, this pulls the top 100 values out of the combined lists by virtue of the fact that they're already sorted in descending order. I haven't checked in detail whether that would be more efficient; I'm just offering it as a possibility.
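As a rough C++ sketch of this chunked approach (same hypothetical `numbers.txt` as above), folding the three-list ping-pong into a single rolling "best" list and letting `std::merge` do the descending merge:

```cpp
#include <algorithm>
#include <cstdint>
#include <fstream>
#include <functional>
#include <iostream>
#include <iterator>
#include <vector>

// Read up to `count` integers from `in` into `out`, sorted descending.
// Returns false once the file is exhausted and nothing could be read.
static bool readChunk(std::ifstream& in, std::vector<int64_t>& out, size_t count) {
    out.clear();
    int64_t n;
    while (out.size() < count && in >> n)
        out.push_back(n);
    std::sort(out.begin(), out.end(), std::greater<int64_t>());
    return !out.empty();
}

// Merge two descending lists, keeping only the largest `limit` values.
static std::vector<int64_t> mergeTop(const std::vector<int64_t>& a,
                                     const std::vector<int64_t>& b,
                                     size_t limit) {
    std::vector<int64_t> merged;
    std::merge(a.begin(), a.end(), b.begin(), b.end(),
               std::back_inserter(merged), std::greater<int64_t>());
    if (merged.size() > limit)
        merged.resize(limit);         // keep only the first (largest) 100
    return merged;
}

int main() {
    std::ifstream in("numbers.txt");  // hypothetical input: one integer per line
    std::vector<int64_t> best, chunk;

    readChunk(in, best, 100);               // first hundred, sorted descending
    while (readChunk(in, chunk, 100))       // next (up to) hundred at a time
        best = mergeTop(best, chunk, 100);  // fold straight back into `best`

    for (int64_t v : best)                  // the 100 largest, descending
        std::cout << v << '\n';
}
```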
I suspect the interviewers would be impressed with the potential for "out of the box" thinking and the fact that you'd stated that it should be evaluated for performance.
As with most interviews, technical skill is one of the things they're looking at.