How can I sort a large CSV file without loading it into memory?

Tags: c#, file, sorting, csv

I have a 20GB+ CSV file like this:

**CallId,MessageNo,Information,Number** 
1000,1,a,2
99,2,bs,3
1000,3,g,4
66,2,a,3
20,16,3,b
1000,7,c,4
99,1,lz,4 
...

I need to sort this file by CallId and then by MessageNo, both ascending. (One way would be to load it into a database, sort it there, and export it.)

How can I sort this file in C# without loading all the lines into memory (for example, line by line using StreamReader)?

Do you know of a library for this? I'd appreciate your advice, thanks.

oguzh4n asked Sep 09 '11 11:09


People also ask

How do I sort a large CSV file?

Go through the file once and split it into smaller files, e.g. alphabetically by key (each file needs to be small enough to fit into memory). Then sort each of these smaller files in memory, for example with Python's sort(). Finally, concatenate the files in key order to get one big sorted file.

How do I sort files larger than memory?

To sort a very large file, we can use an external sorting technique. External sorting is a class of algorithms that can handle massive amounts of data. It is required when the data to be sorted does not fit into main memory and instead resides in slower external memory. It typically uses a hybrid sort-merge strategy: sort chunks that fit in memory, then merge the sorted chunks.
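The sort-merge strategy described above can be sketched with standard command-line tools. This is a minimal illustration, not a tuned implementation: `external_sort`, the default chunk size, and the file names in the usage line are all hypothetical, and the input is assumed to have no header row. GNU sort already spills to temporary files and merges them internally, so the sketch mainly shows what happens under the hood.

```shell
# external_sort INPUT OUTPUT [LINES_PER_CHUNK]
# Hypothetical helper; assumes INPUT has no header row.
external_sort() (
    set -eu
    LC_ALL=C; export LC_ALL            # locale-independent comparisons
    input=$1 output=$2 lines=${3:-1000000}
    workdir=$(mktemp -d)               # scratch directory for the chunks

    # 1. Split into chunks small enough to sort in memory.
    split -l "$lines" -a 4 "$input" "$workdir/chunk."

    # 2. Sort each chunk in place: field 1 numerically, then field 2 numerically.
    for chunk in "$workdir"/chunk.*; do
        sort -t, -k1,1n -k2,2n -o "$chunk" "$chunk"
    done

    # 3. Merge the already-sorted chunks; -m merges without re-sorting.
    sort -m -t, -k1,1n -k2,2n -o "$output" "$workdir"/chunk.*

    rm -r "$workdir"
)
```

It would be called as `external_sort yourfile.csv sorted.csv`. Only one chunk is sorted at a time, so peak memory use is bounded by the chunk size rather than by the size of the whole file.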


1 Answer

You should use OS sort commands. Typically it's just

sort myfile

followed by some mystical switches. These commands typically work well with large files, and there are often options to put temporary storage on another physical hard drive. See this previous question, and the Windows sort command "man" page. Since Windows sort cannot sort numerically on comma-separated fields, which your particular problem needs, you may want to use GNU coreutils, which brings the power of Linux sort to Windows.
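As an illustration of the temporary-storage options, GNU sort takes `-T` for the directory that holds its temporary files and `-S` for the size of its in-memory buffer. The helper name `sort_big`, the 256M buffer, and the paths in the usage line are assumptions made for this sketch.

```shell
# sort_big FILE SCRATCH_DIR
# Hypothetical helper around GNU sort; writes FILE.sorted next to FILE.
sort_big() {
    # -T: directory for sort's temporary files (ideally on another drive)
    # -S: in-memory buffer size before sort spills to the -T directory
    # -o: output file
    sort -T "$2" -S 256M -o "$1.sorted" "$1"
}
```

For example, `sort_big myfile /mnt/otherdrive/tmp` would keep the temporary files off the drive that holds `myfile`.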

Solution

Here's what you need to do.

  1. Download the GNU Coreutils Binaries ZIP and extract sort.exe from the bin folder to some folder on your machine, for example the folder that contains the file you want to sort.
  2. Download the GNU Coreutils Dependencies ZIP and extract both .dll files to the same folder as sort.exe.

Now assuming that your file looks like this:

1000,1,a,2
99,2,bs,3
1000,3,g,4
66,2,a,3
20,16,3,b
1000,7,c,4
99,1,lz,4 

you can write in the command prompt:

sort.exe yourfile.csv -t, -g

which would output:

20,16,3,b
66,2,a,3
99,1,lz,4
99,2,bs,3
1000,1,a,2
1000,3,g,4
1000,7,c,4

See more command options here. If this is what you want, don't forget to provide an output file with the -o switch, like so:

sort.exe yourfile.csv -t, -g -o sorted.csv
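One caveat, as far as I can tell: with `-g` and no `-k`, the whole line is a single key, so rows that tie on the leading number fall back to a plain text comparison and `1000,16,...` would land before `1000,7,...`. The question's file also starts with a header row. The sketch below spells out both sort keys and passes the header through unchanged. `sort_calls` and the file names are hypothetical, and `head` and `tail` are assumed to be available (they ship in the same coreutils package).

```shell
# sort_calls INPUT OUTPUT
# Hypothetical helper; assumes the first line of INPUT is a header row.
sort_calls() {
    # Copy the header row through unchanged.
    head -n 1 "$1" > "$2"
    # Sort the remaining rows: field 1 (CallId) numerically,
    # then field 2 (MessageNo) numerically.
    tail -n +2 "$1" | LC_ALL=C sort -t, -k1,1n -k2,2n >> "$2"
}
```

It would be used as `sort_calls yourfile.csv sorted.csv`. On a headerless file at the Windows command prompt, the equivalent single command would be `sort.exe yourfile.csv -t, -k1,1n -k2,2n -o sorted.csv`.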
Gleno answered Sep 18 '22 08:09