
Partial sorting algorithm

Say I have 50 million features, each read from disk.

At the beginning of my program, I handle each feature and, depending on some conditions, apply modifications to some of them.

At this point in my program, I am reading a feature from disk, processing it, and writing it back, because I don't have enough RAM to hold all 50 million features at once.

Now say I want to sort these 50 million features. Is there an optimal algorithm for this, given that I can't load them all at the same time?

Like a partial sorting algorithm or something like that?

Enriquev asked May 15 '10 12:05

2 Answers

In general, the class of algorithms you're looking for is called external sorting. Perhaps the most widely known example is the external merge sort.

The idea of the external version of this algorithm is that you split the data into chunks small enough to sort in memory (say 100,000 elements), sort each chunk independently (using a standard in-memory algorithm such as quicksort), and write each sorted chunk back to disk. Then you repeatedly merge pairs of chunks (so two sorted 100k chunks become one sorted 200k chunk); since both chunks are already sorted, merging only requires streaming elements from the front of each through small buffers. The final step merges the last two runs into a single run containing all the elements in the right order.
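The two phases above can be sketched in Python. This is a minimal illustration, not a drop-in solution for your feature records: it assumes one integer per line as a stand-in for a feature, and uses `heapq.merge` to do the final k-way merge of the sorted runs lazily, so only one element per run is in memory at a time.

```python
import heapq
import os
import tempfile

def external_sort(input_path, output_path, chunk_size=100_000):
    """Sort a file of one integer per line using an external merge sort."""
    run_files = []
    # Phase 1: read fixed-size chunks, sort each in memory, spill to disk.
    with open(input_path) as f:
        while True:
            chunk = [int(line) for _, line in zip(range(chunk_size), f)]
            if not chunk:
                break
            chunk.sort()
            run = tempfile.NamedTemporaryFile("w+", delete=False)
            run.writelines(f"{x}\n" for x in chunk)
            run.flush()
            run.seek(0)
            run_files.append(run)
    # Phase 2: lazily k-way merge the sorted runs into the output file.
    iterators = [(int(line) for line in run) for run in run_files]
    with open(output_path, "w") as out:
        for x in heapq.merge(*iterators):
            out.write(f"{x}\n")
    for run in run_files:
        run.close()
        os.unlink(run.name)
```

For real feature records you would replace `int(line)` with your own deserialization and pass a `key=` function to `chunk.sort()` and `heapq.merge()`.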

Tomas Petricek answered Nov 15 '22 09:11


If you are on Unix, use sort ;)

It may seem stupid, but the command-line tool has been built to handle exactly this case, and you won't have to reimplement it.
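For example, assuming GNU sort and records serialized one per line (the file names here are placeholders), something like this works: `sort` automatically falls back to an external merge sort on disk when the input exceeds its memory buffer.

```shell
# Example data: one numeric key per line (stand-in for real records).
printf '3\n1\n2\n' > features.txt
mkdir -p sort-scratch
# -n sorts numerically, -S caps the in-memory buffer, -T names a scratch
# directory (on a disk with enough free space) for the temporary runs,
# and -o writes the sorted result.
sort -n -S 100M -T sort-scratch -o features.sorted.txt features.txt
```

With structured records you would pick the sort field via `-t` (field separator) and `-k` (key).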

Matthieu M. answered Nov 15 '22 09:11