I have a large file (100 million lines of tab separated values - about 1.5GB in size). What is the fastest known way to sort this based on one of the fields?
I have tried hive. I would like to see if this can be done faster using python.
For sorting a very large file, we can use an external sorting technique. External sorting is designed for data sets that do not fit into main memory and must instead reside in slower external storage (typically disk). It uses a hybrid sort-merge strategy.
Suppose we have to sort a 1 GB file of random integers and the available RAM is 200 MB. How would that be done? The easiest way is external sorting: we divide the source file into temporary files no larger than the available RAM and sort each of them first.
In other words, we first divide the file into runs small enough to fit into main memory, then sort each run in memory (merge sort works well here), and finally merge the resulting runs into successively bigger runs until the whole file is sorted.
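A minimal sketch of this split-sort-merge approach in Python might look like the following. The field index, chunk size, and file names are illustrative assumptions, not part of the original question; keys are compared as strings (lexicographically), which you would adjust for numeric fields.

import heapq
import itertools
import os
import tempfile

def external_sort(input_path, output_path, key_field=3, chunk_lines=1_000_000):
    """Sort a large tab-separated file on one field without loading it all into RAM.

    key_field=3 (i.e. the 4th column) and chunk_lines are placeholder values.
    """
    sort_key = lambda line: line.split("\t")[key_field]

    # Phase 1: split the input into runs that fit in memory, sort each run,
    # and write it to its own temporary file.
    temp_paths = []
    with open(input_path) as src:
        while True:
            chunk = list(itertools.islice(src, chunk_lines))
            if not chunk:
                break
            chunk.sort(key=sort_key)
            fd, path = tempfile.mkstemp(text=True)
            with os.fdopen(fd, "w") as tmp:
                tmp.writelines(chunk)
            temp_paths.append(path)

    # Phase 2: merge the sorted runs; heapq.merge streams the files line by
    # line, so memory use stays bounded regardless of total file size.
    run_files = [open(p) for p in temp_paths]
    try:
        with open(output_path, "w") as out:
            out.writelines(heapq.merge(*run_files, key=sort_key))
    finally:
        for f in run_files:
            f.close()
        for p in temp_paths:
            os.remove(p)

external_sort("input.txt", "sorted.txt")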
Have you considered using the *nix sort program? In raw terms, it'll probably be faster than most Python scripts.
Use -t $'\t' to specify that the input is tab-separated, -k n to specify the field (where n is the field number), and -o outputfile if you want to write the result to a new file.
Example:
sort -t $'\t' -k 4 -o sorted.txt input.txt
This will sort input.txt on its 4th field and write the result to sorted.txt.
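Since the question asks about Python, here is a small sketch (my own addition, not part of the original answer) of driving the same command from a Python script; the file names are placeholders, and forcing the C locale is a common trick to speed up byte-wise comparisons.

import os
import subprocess

# Invoke the *nix sort from Python; file names below are placeholders.
# LC_ALL=C makes sort compare raw bytes instead of doing locale-aware
# collation, which is usually much faster on large files.
subprocess.run(
    ["sort", "-t", "\t", "-k", "4", "-o", "sorted.txt", "input.txt"],
    check=True,
    env={**os.environ, "LC_ALL": "C"},
)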