There are a lot of discussions on the web about sorting huge files on Unix when the data will not fit into memory, generally using mergesort and its variants.
However, suppose there were enough memory to fit the entire dataset: what would be the most efficient / fastest way of sorting it? The CSV files are ~50 GB (> 1 billion rows) and there is enough memory (5x the size of the data) to hold all of it.
I can use Unix sort, but that still takes > 1 hr. I can use any language necessary, but what I am primarily looking for is speed. I understand we could load the data into, say, a columnar-type DB table and sort there, but it's a one-time effort, so I am looking for something more nimble...
Thanks in advance.
I know this is old, but I figured I'd chime in with what I just figured out in the hope that it may help someone else in the future.
GNU sort, as you may already know, is pretty fast. Couple that with many CPU cores, a lot of RAM, and a few specific flags, and it becomes extremely fast.
* Pay close attention to the '--buffer-size' flag; the buffer size is the main reason for this speed-up. I had used the '--parallel' flag before and it wasn't as fast by itself.
sort --parallel=32 --buffer-size=40G -u -t, -k2 -o $file.sorted.csv $file
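For readers less familiar with these options, here is the same command annotated flag by flag; the 32 threads and the 40G buffer are tuned to that particular machine and should be adjusted to your own core count and free RAM:

# --parallel=32      allow up to 32 concurrent sort threads
# --buffer-size=40G  use a 40 GB main-memory buffer so the file sorts in RAM instead of temp files
# -u                 output only the first of any lines that compare equal on the sort key
# -t,                use a comma as the field separator
# -k2                the sort key starts at field 2 and runs to the end of the line
# -o FILE            write the result to FILE (sort reads all input before writing, so FILE may even be the input file)
sort --parallel=32 --buffer-size=40G -u -t, -k2 -o "$file.sorted.csv" "$file"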
I used a for loop to handle all the files in the folder, sorting each huge CSV file by the second key, with a comma delimiter, keeping only unique values, with the following results:
for file in $(ls -p | grep -v -E "[0-4/]");
do
    time sort --parallel=32 --buffer-size=40G -u -t, -k2 -o $file.sorted.csv $file;
done
real 0m36.041s
user 1m53.291s
sys 0m32.007s
real 0m52.449s
user 1m52.812s
sys 0m38.202s
real 0m50.133s
user 1m41.124s
sys 0m38.595s
real 0m41.948s
user 1m41.080s
sys 0m35.949s
real 0m47.387s
user 1m39.998s
sys 0m34.076s
The input files are 5.5 GB with ~75,000,000 rows each. The max memory usage I saw while a sort was taking place was a little less than 20 GB, so if it scales proportionally, a 50 GB file should take a little less than 200 GB of memory. It sorted 27.55 GB of data in under 9 minutes!
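If you want to confirm that a run really produced ordered, de-duplicated output, GNU sort can also check a file without re-sorting it. A quick sanity check along these lines (the file name here is just an example):

# -c makes sort verify order instead of sorting; combined with -u it also flags
# lines that compare equal on the key, and it exits non-zero at the first violation.
sort -c -u -t, -k2 myfile.sorted.csv && echo "sorted and unique"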
Use parallel sorting algorithms for huge data.
Useful topic: Which parallel sorting algorithm has the best average case performance?
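If you prefer to parallelize explicitly rather than rely on sort's --parallel flag, one rough way to do it with standard GNU coreutils is to split the file on line boundaries, sort the chunks concurrently, and merge the already-sorted pieces. A minimal sketch, assuming GNU split is available and treating the file names and the chunk count of 8 as placeholders:

split -n l/8 big.csv chunk_                          # split into 8 pieces without breaking lines
for c in chunk_??; do
    sort -t, -k2 "$c" -o "$c.sorted" &               # sort each chunk in the background
done
wait                                                 # wait for all background sorts to finish
sort -m -t, -k2 chunk_??.sorted -o big.sorted.csv    # -m merges files that are already sorted
rm chunk_?? chunk_??.sorted                          # clean up the temporary chunks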