
Sorting algorithm: Big text file with variable-length lines (comma-separated values)

What's a good algorithm for sorting text files that are larger than available memory (many tens of gigabytes) and contain variable-length records? All the algorithms I've seen assume either 1) the data fits in memory, or 2) the records are fixed-length. But imagine a big CSV file that I want to sort by the "BirthDate" field (the 4th field):

Id,UserId,Name,BirthDate
1,psmith,"Peter Smith","1984/01/01"
2,dmehta,"Divya Mehta","1985/11/23"
3,scohen,"Saul Cohen","1984/08/19"
...
99999999,swright,"Shaun Wright","1986/04/12"
100000000,amarkov,"Anya Markov","1984/10/31"

I know that:

  1. This would run on one machine (not distributed).
  2. The machine that I'd be running this on would have several processors.
  3. The files I'd be sorting could be larger than the physical memory of the machine.
  4. A file contains variable-length lines. Each line would consist of a fixed number of columns (delimiter-separated values). A file would be sorted by a specific field (e.g. the 4th field in the file).
  5. An ideal solution would probably be "use this existing sort utility", but I'm looking for the best algorithm.
  6. I don't expect a fully-coded, working answer; something more along the lines of "check this out, here's kind of how it works, or here's why it works well for this problem." I just don't know where to look...
  7. This isn't homework!

Thanks! ♥

Sophie asked Dec 15 '10 18:12



2 Answers

This class of algorithms is called external sorting. I would start by checking out the Wikipedia entry. It contains some discussion and pointers.
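To make the idea concrete, here is a minimal sketch of an external merge sort for a CSV file like the one in the question. It is only an illustration under some assumptions: the names `external_sort_csv`, `key_index`, `chunk_rows` and `tmp_dir` are made up for this example, the file is assumed to have a header row, and the sort key is compared as a plain string (which works for dates written as `YYYY/MM/DD`). Because lines are variable-length, the chunk size is counted in rows rather than bytes. Note that Python's `csv` writer may quote fields differently from the input.

```python
import csv
import heapq
import itertools
import os
import tempfile


def _write_run(rows, key_index, tmp_dir):
    """Sort one in-memory chunk and write it out as a temporary 'run' file."""
    rows.sort(key=lambda r: r[key_index])
    fd, path = tempfile.mkstemp(suffix=".csv", dir=tmp_dir)
    with os.fdopen(fd, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(rows)
    return path


def _read_run(path):
    """Lazily yield rows from a sorted run file, one row at a time."""
    with open(path, newline="", encoding="utf-8") as f:
        yield from csv.reader(f)


def external_sort_csv(src, dst, key_index, chunk_rows=100_000, tmp_dir=None):
    """Sort the CSV file `src` by column `key_index` and write it to `dst`.

    Phase 1: read at most `chunk_rows` rows at a time, sort them in memory,
             and spill each sorted chunk to a temporary file.
    Phase 2: k-way merge all sorted runs, holding one row per run in memory.
    """
    run_paths = []
    with open(src, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader)  # assumption: the first line is a header
        while True:
            chunk = list(itertools.islice(reader, chunk_rows))
            if not chunk:
                break
            run_paths.append(_write_run(chunk, key_index, tmp_dir))

    try:
        with open(dst, "w", newline="", encoding="utf-8") as out:
            writer = csv.writer(out)
            writer.writerow(header)
            runs = [_read_run(p) for p in run_paths]
            # heapq.merge streams the runs in sorted order without loading them.
            writer.writerows(heapq.merge(*runs, key=lambda r: r[key_index]))
    finally:
        for p in run_paths:
            os.remove(p)
```

For the example file, something like `external_sort_csv("people.csv", "sorted.csv", key_index=3)` would sort by BirthDate (the file names are placeholders). With very large inputs the number of runs can exceed the open-file limit, in which case the merge would have to be done in several passes; and on a multi-core machine the chunk-sorting phase is the natural part to parallelize. Neither refinement is shown here.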

NPE answered Sep 23 '22 05:09


I suggest the following resources:

Merge Sort: http://en.wikipedia.org/wiki/Merge_sort

Knuth, The Art of Computer Programming, Vol. 2: Seminumerical Algorithms. Addison-Wesley. ISBN 0-201-03822-6.

Chris Walton answered Sep 22 '22 05:09