
Linux: sorting a 500GB text file with 10^10 records

I have a 500GB text file with about 10 billion rows that needs to be sorted alphabetically. What is the best algorithm to use? Can my implementation and set-up be improved?

For now, I am using the coreutils sort command:

export LANG=C
sort -k2,2 --field-separator=',' --buffer-size=(80% RAM) --temporary-directory=/volatile BigFile

I am running this on an AWS EC2 virtual machine with 120GB of RAM and 16 cores. It takes the better part of a day.

/volatile is a 10TB RAID0 array.

The 'LANG=C' trick delivers a 2x speed gain (thanks to [1]).

By default 'sort' uses 50% of the available RAM. Going up to 80-90% gives some improvement.

My understanding is that GNU 'sort' is a variant of the merge-sort algorithm, which at O(n log n) is as good as a comparison sort can get (see [2] and [3]). Would moving to QuickSort help (I'm happy with an unstable sort)?

One thing I have noticed is that only 8 cores are used. This is related to default_max_threads being set to 8 in Linux coreutils sort.c (see [4]). Would it help to recompile sort.c with 16?
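
If I recall correctly, coreutils 8.6 and later expose a --parallel option on sort, so recompiling may not be necessary. A minimal sketch, assuming the same file and key layout as above:

# Ask sort for 16 threads explicitly instead of rebuilding sort.c
export LANG=C
sort -k2,2 --field-separator=',' --parallel=16 --buffer-size=(80% RAM) --temporary-directory=/volatile BigFile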

Thanks!


FOLLOW-UP:

@dariusz

I used Chris's suggestions and yours below.

As the data was already generated in batches, I sorted each bucket separately (on several separate machines) and then combined them with 'sort --merge'. It works like a charm and is much faster: sorting K buckets of N/K records each costs roughly O(N log(N/K)) in total instead of O(N log N). A rough sketch is shown below.
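
Something like this in shell (the batch file names and paths here are placeholders for illustration, not my actual layout):

# Sort each batch independently (these runs can happen on separate machines)
for f in batch_*.csv; do
    LANG=C sort -k2,2 -t',' --temporary-directory=/volatile "$f" > "sorted_$f"
done

# Merge the already-sorted batches; --merge never re-sorts, it only interleaves
LANG=C sort --merge -k2,2 -t',' sorted_batch_*.csv > BigFile.sorted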

I also rethought the project from scratch: some data post-processing is now performed while the data is generated, so that unneeded data (noise) can be discarded before sorting takes place.

Altogether, the data size reduction and the sort/merge approach led to a massive reduction in the computing resources needed to achieve my objective.

Thanks for all your helpful comments.

asked Aug 27 '13 by Olivier Delrieu


2 Answers

The benefit of quicksort over mergesort is that it needs no additional memory. The benefit of mergesort is its guaranteed O(n log n) run time, whereas quicksort can be much worse in the event of poor pivot sampling. If you have no reason to be concerned about memory use, don't change. If you do, just make sure you pick a quicksort implementation that does solid pivot sampling.

I don't think recompiling sort.c would help spectacularly. It might, on a micro-optimization scale. But your bottleneck here is going to be memory/disk speed, not the number of available cores. My intuition is that 8 threads are already maxing out your I/O throughput, so you would see no performance improvement, but this certainly depends on your specific setup.

Also, you can gain significant performance increases by taking advantage of the distribution of your data. For example, evenly distributed data can be sorted very quickly with a single bucket-sort pass, then using mergesort to sort each bucket. This also has the added benefit of decreasing the total memory overhead of mergesort: if the memory complexity of mergesort is O(N) and you can separate your data into K buckets, your new memory overhead is O(N/K). See the sketch below.
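
A minimal sketch of the idea with awk and sort (the bucketing key, file names and paths are assumptions, since I don't know your actual data layout; it also assumes the keys start with plain ASCII letters so that glob order matches key order):

# Bucket rows by the first character of the second (sort key) field,
# so each bucket holds a disjoint alphabetical range
awk -F',' '{ print > ("bucket_" substr($2, 1, 1) ".csv") }' BigFile

# Sort each much smaller bucket independently
for f in bucket_*.csv; do
    LANG=C sort -k2,2 -t',' --temporary-directory=/volatile "$f" > "$f.sorted"
done

# Buckets cover disjoint key ranges, so concatenating them in bucket order
# yields the fully sorted output
cat bucket_*.csv.sorted > BigFile.sorted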

answered by ChrisCM


Just an idea:

I assume the file contents are generated over quite a long period of time. Write an application (a script?) that periodically moves the file generated so far to a different location, appends its contents to another file, sorts that other file, and repeats until all the data is gathered.

That way your system spends more total time sorting, but the results are available sooner, since sorting partially sorted data is faster than sorting unsorted data.
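
A rough shell sketch of that loop (the file names, done-marker and polling interval are assumptions for illustration; I've also used 'sort --merge' for the combining step so the already-sorted master is merged rather than fully re-sorted):

# Periodically fold newly generated rows into an already-sorted master file
: > master.sorted                               # start with an empty master
while [ ! -f generation_done ]; do
    sleep 600                                   # wait for the next batch (assumed interval)
    mv incoming.csv batch.csv                   # grab what has been generated so far
    LANG=C sort -k2,2 -t',' batch.csv > batch.sorted
    LANG=C sort --merge -k2,2 -t',' master.sorted batch.sorted > master.new
    mv master.new master.sorted
done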

answered by Dariusz