Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Sorting 1 trillion integers

Given a set of 1 Trillion integers on hard disk, find the smallest 1 million of them. You can fit at most 1 million integers in memory at a time.

One approach is, take the first 1 million out of 1 trillion and sort the 1 million integers and store it back in the hard disk. In this way carry on the sorting for every group of 1 million integers and store it in the hard drive. Now groups of 1 million integers are sorted up to 1 trillion. Now compare the first element of all the sorted groups the minimum of them is the minimum of the 1 trillion. Store it as the first element in the memory. Next, take the second element from the group from which the smallest element came and then check it with all other groups first element. In this way repeat the procedure until the first 1 million is sorted and stored in the memory.

Is there a more optimal approach that I'm missing?

like image 811
dreamer Avatar asked May 27 '11 18:05

dreamer


People also ask

How do you sort an array with 1 million records efficiently?

You can use counting sort. Counting sort (sometimes referred to as ultra sort or math sort) is a sorting algorithm which (like bucket sort) takes advantage of knowing the range of the numbers in the array to be sorted (array A).


2 Answers

You can do this efficiently in O(n log m) by using a heap. ( n = all the numbers, m = the size of the set of numbers you want to find ).

Go through the trillion numbers one at a time. For each new number do one of the following.

  1. If the heap has < 1 million nodes insert the new number into the heap.
  2. If the heap has exactly 1 million nodes and the top node is > than the new number, then pop the top node from the heap, and insert a node with the new number.
  3. If neither 1 or 2 are true then toss the number.

After you go through all the trillion entries then the resulting heap will have the 1 million smallest numbers.

Inserting and deleting from the heap is O(log m). The single pass through the heap is n. So, the algorithm is n*log (m)

like image 87
Himadri Choudhury Avatar answered Oct 01 '22 02:10

Himadri Choudhury


How large are the integers? If they're just 32-bit values, I would simply make an array of 4 billion 64-bit counters on disk, and upon encountering x in the input, increment the counter in position x. In general this approach is extremely costly in space, but proportionally the cost is low whenever the range of possible element values is much smaller than the number of items to be sorted, and best of all it's O(n) in time.

like image 28
R.. GitHub STOP HELPING ICE Avatar answered Oct 01 '22 02:10

R.. GitHub STOP HELPING ICE