 

Fastest way to sort huge (50-100 GB) files when you have enough memory

There are a lot of discussions on the web about sorting huge files on Unix when the data will not fit into memory, generally using merge sort and variants.

However, suppose there were enough memory to fit the entire data set: what would be the most efficient / fastest way of sorting it? The CSV files are ~50 GB (> 1 billion rows) and there is enough memory (5x the size of the data) to hold all of it.

I can use Unix sort, but that still takes > 1 hr. I can use any language necessary, but what I am primarily looking for is speed. I understand the data could be loaded into, say, a columnar-type DB table and sorted, but it's a one-time effort, so I am looking for something more nimble ...
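For reference, what I am timing looks roughly like the following (the file name and sort key are placeholders, not my actual data):

# plain single-threaded GNU sort on a comma-delimited file, keyed on the first column
sort -t, -k1,1 -o data.sorted.csv data.csv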

Thanks in advance.

asked Jun 26 '13 by xbsd

2 Answers

I know this is old, but I figured I'd chime in with what I just found out, in the hope that it may help someone else in the future.

GNU sort, as you may already know, is pretty fast. Couple that with many CPU cores and a lot of RAM, pass a few special flags to GNU sort, and it becomes extremely fast.

* Pay close attention to the --buffer-size flag; the buffer size is the main reason for this speed-up. I have used the --parallel flag before, and by itself it wasn't as fast.

sort --parallel=32 --buffer-size=40G -u -t, -k2 -o $file.csv $file

I used a for loop to handle all the files in the folder, sorting the huge CSV files by the second key, with a comma delimiter, keeping only unique values, with the following results:

# ls -p marks directories with a trailing slash; grep -v skips those
# (along with any file names containing the characters 0-4)
for file in $(ls -p | grep -v -E "[0-4/]");
do
    time sort --parallel=32 --buffer-size=40G -u -t, -k2 -o $file.sorted.csv $file;
done

real    0m36.041s
user    1m53.291s
sys     0m32.007s

real    0m52.449s
user    1m52.812s
sys     0m38.202s

real    0m50.133s
user    1m41.124s
sys     0m38.595s

real    0m41.948s
user    1m41.080s
sys     0m35.949s

real    0m47.387s
user    1m39.998s
sys     0m34.076s

The input files are 5.5 GB with ~75,000,000 rows each. The max memory usage I saw while a sort was taking place was a little less than 20 GB, so if it scales proportionally, a 50 GB file should take a little less than 200 GB of RAM. It sorted 27.55 GB of data in under 9 minutes!

answered Nov 09 '22 by cigol on

Use parallel sorting algorithms for huge data.

Useful topic: Which parallel sorting algorithm has the best average case performance?
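As a rough sketch, the classic split / sort-in-parallel / merge pattern can be done with GNU coreutils alone (file names, chunk size, concurrency, and the sort key here are illustrative placeholders; GNU sort's --parallel flag from the other answer does much of this internally):

# split the input into ~2 GB pieces without breaking lines
split --line-bytes=2G big.csv chunk_
# sort the pieces concurrently (8 at a time), comma-delimited, keyed on column 2
ls chunk_?? | xargs -P 8 -I{} sort -t, -k2,2 -o {}.sorted {}
# merge the already-sorted pieces into the final output
sort -m -t, -k2,2 -o big.sorted.csv chunk_*.sorted
rm chunk_*

Merging already-sorted runs is a single cheap sequential pass, so most of the wall-clock time goes into the parallel chunk sorts.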

answered Nov 09 '22 by Zenn