We are sorting a 5GB file with 37 fields, using 5 sort keys. The big file is composed of 1000 files of 5MB each.
After 190 minutes it still hasn't finished.
I am wondering if there are other methods to speed up the sorting. We chose Unix sort because we don't want it to use up all the memory, so any memory-based approach is not an option.
What is the advantage of sorting each file independently, and then using the -m option to merge-sort them?
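For reference, a minimal sketch of the split-sort-then-merge approach being asked about, assuming the pieces are named part_000 … part_999 and the same key options are used in both steps (the file names and key fields here are illustrative, not taken from the question):
# sort each 5MB piece on its own (fast, low memory)
for f in part_*; do sort -k1,1 -k3,3 "$f" > "$f.sorted"; done
# then merge the already-sorted pieces without re-sorting them
sort -m -k1,1 -k3,3 part_*.sorted > big_sorted
The -m step only merges, so it never has to hold more than a line per input file in memory at once, but every step must use identical key options or the merge output will not be correctly ordered.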
In computing, sort is a standard command-line program of Unix and Unix-like operating systems that prints the lines of its input, or the concatenation of all files listed in its argument list, in sorted order. Sorting is done based on one or more sort keys extracted from each line of input.
Buffer it in memory using -S. For example, to use (up to) 50% of your memory as a sorting buffer, do:
sort -S 50% file
Note that modern Unix sort can sort in parallel. My experience is that it automatically uses as many cores as possible, but you can set the number of threads directly using --parallel. To sort using 4 threads:
sort --parallel=4 file
So all in all, you should put everything into one file and execute something like:
sort -S 50% --parallel=4 file
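Putting it together for this case, concatenating the 1000 pieces and sorting on the five keys might look like the following sketch (the delimiter and field numbers are placeholders, since the actual field layout isn't given in the question):
cat part_* > big_file
# -t sets the field delimiter; the five -k options pick the key fields (illustrative numbers)
sort -S 50% --parallel=4 -t '|' -k1,1 -k3,3 -k5,5 -k7,7 -k9,9 big_file > big_sorted
Since sort treats its file arguments as one concatenated input, the cat step can also be skipped by passing part_* directly to sort.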