
How to sort a large file on two levels efficiently?

Tags:

unix

sorting

perl

I have a very large file, over 100GB (many billions of lines), and I would like to perform a two-level sort as quickly as possible on a Unix system with limited memory. This will be one step in a large Perl script, so I'd like to use Perl if possible.

So, how can I do this? My data looks like this:

A    129
B    192
A    388
D    148
D    911
A    117

...But for billions of lines. I need to sort first by letter, and then by number within each letter. Would it be easier to use the Unix sort, like...

sort -k1,2 myfile
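
For the sample above, the output I'm after would be:

A    117
A    129
A    388
B    192
D    148
D    911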

Or can I do this all in Perl somehow? My system has something like 16GB of RAM, but the file is about 100GB.

Thanks for any suggestions!

asked Aug 12 '13 by jake9115

2 Answers

The UNIX sort utility can handle data larger than available memory (e.g. larger than your 16GB of RAM) by doing an external merge sort, spilling temporary working files to disk.

So, I'd recommend simply using UNIX sort for this as you've suggested, invoking the option -T tmp_dir, and making sure that tmp_dir has enough disk space to hold all of the temporary working files that will be created there.
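
For example, assuming the temporary files can live in /data/tmp (the path here is just a placeholder), something like this would do the two-level sort:

sort -T /data/tmp -k1,1 -k2,2n myfile > myfile.sorted

That sorts on the first column as text and the second column numerically, spilling sorted runs to /data/tmp as needed.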

By the way, this is discussed in a previous SO question.

answered Nov 17 '22 by asf107

UNIX sort is the best option for sorting data at this scale. I would recommend compressing the temporary files with the fast LZO algorithm, which is usually distributed as lzop. Set a big sort buffer using the -S option. If you have a disk faster than the one holding the default /tmp, set -T as well. Also, if you want to sort by number, you have to declare the second sort key as numeric. So you should use a line like this for best performance:

LC_ALL=C sort -S 90% --compress-program=lzop -k1,1 -k2,2n myfile
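
Since the question mentions this runs as one step of a Perl script, here is a minimal sketch of driving the same command from Perl (the file names and the tmp path are placeholders, and it assumes GNU sort and lzop are installed):

use strict;
use warnings;

# Byte-wise comparison (no locale collation) is significantly faster.
$ENV{LC_ALL} = 'C';

my @cmd = (
    'sort',
    '-S', '90%',                  # big in-memory sort buffer
    '-T', '/fast/tmp',            # temp dir on your fastest disk (placeholder)
    '--compress-program=lzop',    # compress temporary run files with lzop
    '-k1,1',                      # first key: the letter column, as text
    '-k2,2n',                     # second key: the number column, numeric
    '-o', 'myfile.sorted',        # output file (placeholder)
    'myfile',                     # input file (placeholder)
);
system(@cmd) == 0 or die "sort failed: $?";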

answered Nov 17 '22 by Hynek -Pichi- Vychodil