Sorting huge Number of Integers from hard disk

Tags:

algorithm

Given 100 GB of integer data on a hard disk and only 2 GB of RAM, how would you sort the integers with the fewest disk operations? Here, fetching one number from disk counts as one disk operation (though in reality a block of data can be fetched at a time).

We can use additional disk space for temporary storage, and the cost of cleaning up that temporary space need not be considered.

asked Oct 25 '10 by Shamim Hafiz - MSFT

1 Answer

As other people have noted, you can use an O(n) counting sort. However, there are some additional problems to consider. We'll assume you're storing 32-bit integers, so 100 GB ≈ 27e9 ints.

If all the integers are the same, that value occurs ~27e9 times, which is more than a 32-bit integer can hold. Therefore your counters will have to be 64-bit integers.
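As a quick sanity check on the counter width, here is the arithmetic behind that claim (using the answer's own assumption of 100 GB of 32-bit integers):

```python
# Back-of-envelope check: can a 32-bit counter hold the worst-case count?
n_ints = 100 * 2**30 // 4      # 100 GB of 4-byte ints -> number of values
uint32_max = 2**32 - 1         # largest count a 32-bit counter can hold

print(n_ints)                  # 26843545600, i.e. ~27e9
print(n_ints > uint32_max)     # True -> counters must be 64-bit
```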

With 2 GB of RAM, you can only hold ~250e6 64-bit counters in memory at any one time (2e9 bytes / 8 bytes per counter). If we can't make any assumptions about the distribution of the integers, we would either have to:

  • individually increment the counters on the HD, or
  • ignore all integers we encounter that are not in the counter array we currently have stored in RAM.

I think the latter option is better. Since we need ~4e9 64-bit counters (32 GB of counters) but can only hold 2 GB of them at a time, we would need to run through the entire array ~16 times. The first option is clearly no good: given a sequence of integers such as 0, 1<<31, 0, the two counters involved would not be stored in RAM at the same time, so at least 2 HD writes would be required.
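The second option can be sketched as follows. This is a minimal in-memory stand-in (the function name and the tiny input are hypothetical; in the real problem the data would stream from disk and the counter window would be sized to fit in 2 GB of RAM):

```python
# Multi-pass counting sort: each pass only counts values inside a window
# of the key space and ignores everything else, as described above.
def multipass_counting_sort(data, window_size):
    out = []
    lo = 0
    hi_bound = max(data) + 1 if data else 0
    while lo < hi_bound:
        counts = [0] * window_size            # the counters that fit in "RAM"
        for x in data:                        # one full pass over the "disk"
            if lo <= x < lo + window_size:
                counts[x - lo] += 1           # in-window: count it
        for v, c in enumerate(counts):        # emit this window's values
            out.extend([lo + v] * c)
        lo += window_size                     # slide the window; next pass
    return out

print(multipass_counting_sort([7, 3, 9, 3, 0, 12], 5))  # [0, 3, 3, 7, 9, 12]
```

Each iteration of the outer loop corresponds to one of the ~16 full reads of the array in the answer's analysis.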

Because of this, I think that for the specific size of your problem (100 GB), an N-way merge sort would be better: one pass produces 50 sorted 2 GB runs, and merging them takes at most ⌈log_2(50)⌉ = 6 further passes, so the entire array is read only ~7 times (fewer still with a wide N-way merge).
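The merge-sort alternative can be sketched the same way, assuming the sorted runs would be written to temporary disk files (simulated here with in-memory lists) and then merged with a heap in a single wide merge:

```python
# External merge sort sketch: sort RAM-sized runs, then N-way merge them.
import heapq

def external_sort(stream, chunk_size):
    # Phase 1: read chunk_size items at a time and sort each run "in RAM".
    runs, chunk = [], []
    for x in stream:
        chunk.append(x)
        if len(chunk) == chunk_size:
            runs.append(sorted(chunk))   # in reality: write the run to disk
            chunk = []
    if chunk:
        runs.append(sorted(chunk))
    # Phase 2: one N-way merge over all runs via a min-heap.
    return list(heapq.merge(*runs))

print(external_sort([9, 1, 8, 2, 7, 3, 6, 4, 5], 3))  # [1, 2, 3, ..., 9]
```

With a merge this wide, the data is read only twice: once to form the runs and once to merge them; a 2-way merge would instead need the logarithmic number of passes discussed above.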

However, if the interviewer immediately changed the question to "10TB array, still 2GB of RAM", then the counting sort would easily win.

answered Jan 18 '23 by Dijkstra