 

Sorting a text file with over 100,000,000 records

I have a 5 GB text file that needs to be sorted in alphabetical order. What is the best algorithm to use?

Constraints:

Speed - as fast as possible

Memory - a PC with 1 GB RAM running Windows XP

Charles Faiga asked Dec 20 '09 07:12

2 Answers

I routinely sort text files larger than 2 GB with the Linux sort command. It usually takes 15-30 seconds, depending on server load.

Just do it, it won't take as long as you think.

Update: Since you're using Windows XP, you can get the sort command from UnxUtils. I use that one probably more than the Linux version, and it's just as fast.

The bottleneck for huge files is really disk speed. My server above has a fast SATA RAID. If your machine is a desktop (or laptop), its 7200 (or 5400) RPM IDE drive will add a few minutes to the job.
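A minimal sketch of the invocation (file names here are examples, and LC_ALL=C is an optional assumption: byte-order comparison is the fastest collation and works the same on GNU sort and the UnxUtils port):

```shell
# Create a small sample file, then sort it alphabetically.
# LC_ALL=C forces plain byte-order comparison (fastest, locale-independent);
# -o writes the result to a file (it may safely be the input file itself).
printf 'banana\napple\ncherry\n' > input.txt
LC_ALL=C sort input.txt -o sorted.txt
cat sorted.txt
```

On a real 5 GB input the command line is identical; only the running time changes.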

Seth answered Nov 03 '22 23:11


For text files, sort (at least the GNU Coreutils version found on Linux and elsewhere) works surprisingly fast.

Take a look at the --buffer-size and related options, and set --temporary-directory if your /tmp directory is too small.
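For example, something like the following (the buffer size and temp directory are illustrative values, not recommendations; both are real GNU sort options):

```shell
# Give sort a 512 MB in-memory buffer and point its spill files at a
# directory known to have enough free space for the temporary runs.
printf 'delta\nalpha\ncharlie\nbravo\n' > big.txt
mkdir -p /tmp/sort-work
sort --buffer-size=512M --temporary-directory=/tmp/sort-work big.txt > big_sorted.txt
```

With a 1 GB machine you'd pick a buffer well under physical RAM so sort doesn't push the box into swap.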

Alternatively, if you're really worried about how long it might take, you can split the file into smaller chunks, sort them individually, then merge them back together (with sort --merge). Each chunk can even be sorted on a different system in parallel.
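The split / sort / merge pipeline above can be sketched like this (chunk size and file names are illustrative; for a 5 GB file you'd split into far larger pieces, e.g. split --line-bytes=500M):

```shell
# 1. Split the input into fixed-size chunks (here: 2 lines each).
# 2. Sort each chunk in place.
# 3. Merge the already-sorted chunks in a single streaming pass.
printf 'pear\nfig\nkiwi\nplum\ndate\ngrape\n' > words.txt
split -l 2 words.txt chunk_              # produces chunk_aa, chunk_ab, chunk_ac
for f in chunk_*; do
    sort "$f" -o "$f"                    # each chunk sorted independently
done
sort --merge chunk_* > merged.txt        # merge step: O(n), little memory
```

The merge step only ever holds one line per chunk in memory, which is exactly why this approach works on a machine with far less RAM than the file size.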

ZoogieZork answered Nov 03 '22 23:11