
Sorting 20GB of data

In the past I had to work with big files, somewhere in the 0.1-3 GB range. Not all the 'columns' were needed, so it was OK to fit the remaining data in RAM. Now I have to work with files in the 1-20 GB range, and they will probably grow as time passes. That is totally different, because you cannot fit the data in RAM anymore.

My file contains several million 'entries' (I have found one with 30 million entries). One entry consists of about 10 'columns': one string (50-1000 Unicode chars) and several numbers. I have to sort the data by 'column' and show it. For the user only the top entries (1-30%) are relevant; the rest is low-quality data.

So, I need some suggestions about which direction to head in. I definitely don't want to put the data in a DB, because databases are hard to install and configure for non-computer-savvy people. I want to deliver a monolithic program.

Showing the data is not difficult at all. But sorting it... without loading the data into RAM, on regular PCs (2-6 GB of RAM)... will kill some good hours.


I was looking a bit into MMFs (memory-mapped files), but this article by Danny Thorpe suggests they may not be suitable: http://dannythorpe.com/2004/03/19/the-hidden-costs-of-memory-mapped-files/

So, I was thinking about loading into RAM only the data from the column that has to be sorted, plus a pointer to the address (in the disk file) of each 'entry'. I sort the 'column', then I use the pointer to find the entry corresponding to each column cell and restore the entry. The 'restoration' is written directly to disk, so no additional RAM is required.
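As a rough illustration of that key-plus-pointer idea, here is a minimal sketch. It assumes the entries are stored as fixed-size binary records, so the offset of entry I is just I * SizeOf(TEntry); the TEntry layout, the key column, and the file names are all made up for the example:

```pascal
program KeyIndexSort;
{$IFDEF FPC}{$mode objfpc}{$H+}{$ENDIF}

uses
  SysUtils, Classes;

type
  // Assumption: entries are fixed-size binary records, so the offset of
  // entry I in the file is simply I * SizeOf(TEntry).
  TEntry = packed record
    Text: array[0..999] of WideChar;  // padded Unicode string (assumed layout)
    Key: Double;                      // the column we sort on (assumed)
    Other: array[0..8] of Double;     // the remaining numeric columns
  end;

  // What actually stays in RAM: 16 bytes per entry, so even 30 million
  // entries need only about 480 MB for the key + pointer pairs.
  TKeyIndex = record
    Key: Double;
    Index: Int64;   // which record in the source file
  end;

var
  Pairs: array of TKeyIndex;

procedure QuickSortPairs(L, R: Int64);
var
  I, J: Int64;
  P: Double;
  T: TKeyIndex;
begin
  I := L; J := R; P := Pairs[(L + R) div 2].Key;
  repeat
    while Pairs[I].Key > P do Inc(I);  // '>' sorts descending: top entries first
    while Pairs[J].Key < P do Dec(J);
    if I <= J then
    begin
      T := Pairs[I]; Pairs[I] := Pairs[J]; Pairs[J] := T;
      Inc(I); Dec(J);
    end;
  until I > J;
  if L < J then QuickSortPairs(L, J);
  if I < R then QuickSortPairs(I, R);
end;

procedure SortFileByKey(const SrcName, DstName: string);
var
  Src, Dst: TFileStream;
  E: TEntry;
  Count, I: Int64;
begin
  Src := TFileStream.Create(SrcName, fmOpenRead or fmShareDenyWrite);
  Dst := TFileStream.Create(DstName, fmCreate);
  try
    Count := Src.Size div SizeOf(TEntry);
    SetLength(Pairs, Count);
    // Pass 1: sequential scan, keeping only key + index in RAM.
    for I := 0 to Count - 1 do
    begin
      Src.ReadBuffer(E, SizeOf(E));
      Pairs[I].Key := E.Key;
      Pairs[I].Index := I;
    end;
    if Count > 1 then QuickSortPairs(0, Count - 1);
    // Pass 2: fetch each full entry by its offset, write it out in order.
    for I := 0 to Count - 1 do
    begin
      Src.Position := Pairs[I].Index * SizeOf(TEntry);
      Src.ReadBuffer(E, SizeOf(E));
      Dst.WriteBuffer(E, SizeOf(E));
    end;
  finally
    Dst.Free;
    Src.Free;
  end;
end;

begin
  SortFileByKey('data.bin', 'data.sorted.bin');
end.
```

One caveat with this approach: pass 2 is pure random access, so on a spinning disk the seeks, rather than the in-memory sort, may dominate the running time.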

PS: I am looking for a solution that will work in both Lazarus and Delphi, because Lazarus (actually FPC) has 64-bit support for the Mac. 64 bit means more RAM available = faster sorting.

Asked Apr 03 '14 by Server Overflow

1 Answer

I think the way to go is mergesort: it's a great algorithm for sorting a large number of fixed-size records with limited memory.

General idea:

  • read N lines from the input file (with N being a value that allows you to keep all the lines in memory)

  • sort these lines and write the sorted lines to file 1

  • repeat with the next N lines to obtain file 2

    ...

  • you reach the end of the input file and you now have M files (each of which is sorted)

  • merge these files into a single file (you'll have to do this in steps as well; a minimal sketch of the whole scheme follows this list)
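Here is a minimal sketch of that scheme, assuming each entry is one text line and that plain string comparison gives the desired order; the run file names and the RUN_SIZE constant are made up for the example:

```pascal
program ExternalMergesort;
{$IFDEF FPC}{$mode objfpc}{$H+}{$ENDIF}

uses
  SysUtils, Classes;

const
  RUN_SIZE = 500000;  // N lines per run: tune so one run fits in RAM

// Pass 1: split the input into sorted run files of RUN_SIZE lines each;
// returns the number of runs written.
function MakeSortedRuns(const SrcName: string): Integer;
var
  Src: TextFile;
  Run: TStringList;
  Line: string;

  procedure FlushRun;
  begin
    if Run.Count = 0 then Exit;
    Run.Sort;  // in-memory sort of one chunk (use CustomSort for numeric keys)
    Run.SaveToFile(Format('run%d.txt', [Result]));
    Run.Clear;
    Inc(Result);
  end;

begin
  Result := 0;
  Run := TStringList.Create;
  try
    AssignFile(Src, SrcName);
    Reset(Src);
    while not Eof(Src) do
    begin
      ReadLn(Src, Line);
      Run.Add(Line);
      if Run.Count >= RUN_SIZE then FlushRun;
    end;
    CloseFile(Src);
    FlushRun;  // the last, possibly partial, run
  finally
    Run.Free;
  end;
end;

// Merge two sorted files, keeping only one line per file in memory.
procedure MergeTwo(const AName, BName, OutName: string);
var
  A, B, O: TextFile;
  LA, LB: string;
  HasA, HasB: Boolean;

  procedure Advance(var F: TextFile; var L: string; var Has: Boolean);
  begin
    Has := not Eof(F);
    if Has then ReadLn(F, L);
  end;

begin
  AssignFile(A, AName); Reset(A);
  AssignFile(B, BName); Reset(B);
  AssignFile(O, OutName); Rewrite(O);
  Advance(A, LA, HasA);
  Advance(B, LB, HasB);
  while HasA and HasB do
    if LA <= LB then
      begin WriteLn(O, LA); Advance(A, LA, HasA); end
    else
      begin WriteLn(O, LB); Advance(B, LB, HasB); end;
  while HasA do begin WriteLn(O, LA); Advance(A, LA, HasA); end;
  while HasB do begin WriteLn(O, LB); Advance(B, LB, HasB); end;
  CloseFile(A); CloseFile(B); CloseFile(O);
end;

var
  Runs, I: Integer;
begin
  Runs := MakeSortedRuns('data.txt');
  if Runs = 0 then Exit;
  // Fold the runs together one by one for brevity; balanced pairwise
  // passes (or a k-way merge) would read each line far fewer times.
  RenameFile('run0.txt', 'sorted.txt');
  for I := 1 to Runs - 1 do
  begin
    MergeTwo('sorted.txt', Format('run%d.txt', [I]), 'tmp.txt');
    DeleteFile('sorted.txt');
    DeleteFile(Format('run%d.txt', [I]));
    RenameFile('tmp.txt', 'sorted.txt');
  end;
end.
```

Note that every phase needs memory only for RUN_SIZE lines plus one line per open file, so it runs the same on a 2 GB machine as on a 16 GB one; more RAM just means bigger runs and fewer merge passes.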


You could also consider a solution based on an embedded database, e.g. Firebird Embedded: it works well with Delphi/Windows and you only have to ship a few DLLs in your program folder (I'm not sure about Lazarus/OS X).

Answered Sep 30 '22 by manlio