sorting large text data

I have a large file (100 million lines of tab-separated values - about 1.5 GB in size). What is the fastest known way to sort this based on one of the fields?

I have tried Hive. I would like to see if this can be done faster using Python.

asked Aug 16 '11 by fodon

People also ask

How do I sort large text files?

For sorting a very large file, we can use an external sorting technique. External sorting is an algorithm that can handle massive amounts of data. It is required when the data to be sorted does not fit into main memory and instead resides in slower external memory. It uses a hybrid sort-merge strategy.

How do I sort large files with small memory?

Suppose we have to sort a 1 GB file of random integers and the available RAM is 200 MB; how can it be done? The easiest way is to use external sorting. We divide the source file into temporary files of size equal to the available RAM and first sort these files.

How do you sort a very large file with the external sorting technique?

We first divide the file into runs such that the size of a run is small enough to fit into main memory. Then we sort each run in main memory using merge sort. Finally, we merge the resulting runs together into successively bigger runs, until the file is sorted.
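
For illustration, here is a minimal Python sketch of that run-and-merge approach. The function name, run size, and key field are assumptions for this example (it sorts on the 4th tab-separated field and assumes every line has at least four fields):

import heapq
import tempfile
from itertools import islice

def sort_key(line):
    # Hypothetical key: the 4th tab-separated field (index 3).
    return line.split("\t")[3]

def external_sort(in_path, out_path, lines_per_run=1_000_000):
    run_files = []
    with open(in_path) as src:
        while True:
            run = list(islice(src, lines_per_run))  # read one RAM-sized run
            if not run:
                break
            run.sort(key=sort_key)              # sort the run in memory
            tmp = tempfile.TemporaryFile("w+")  # spill the sorted run to disk
            tmp.writelines(run)
            tmp.seek(0)
            run_files.append(tmp)
    with open(out_path, "w") as dst:
        # Lazily k-way merge all the sorted runs back together.
        dst.writelines(heapq.merge(*run_files, key=sort_key))
    for f in run_files:
        f.close()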


1 Answer

Have you considered using the *nix sort program? In raw terms, it'll probably be faster than most Python scripts.

Use -t $'\t' to specify that it's tab-separated, -k n to specify the field, where n is the field number (note that -k n sorts from field n through the end of the line; use -k n,n to restrict the key to that field alone), and -o outputfile if you want to output the result to a new file. Example:

sort -t $'\t' -k 4 -o sorted.txt input.txt

This will sort input.txt on its 4th field and output the result to sorted.txt.
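
If you'd still like to drive this from Python, here is a minimal sketch using subprocess. The file names are the ones from the example above; -k 4,4 restricts the key to field 4 only (per the note above), and setting LC_ALL=C is an optional, commonly used speed tweak that switches sort to bytewise comparison:

import os
import subprocess

# Run the same GNU sort invocation from Python. Passing "\t" in the
# argument list gives sort a literal tab, so $'\t' quoting isn't needed.
env = dict(os.environ, LC_ALL="C")  # bytewise comparison is usually faster
subprocess.run(
    ["sort", "-t", "\t", "-k", "4,4", "-o", "sorted.txt", "input.txt"],
    check=True,
    env=env,
)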

answered Nov 16 '22 by urschrei