
What is the optimal (speed) way of parsing a large (> 4 GB) text file with many (millions of) lines?

I'm trying to determine the fastest way to read in large text files with many rows, do some processing, and write them out to a new file. In C#/.NET, StreamReader seems like a quick way of doing this, but when I use it for this file (reading line by line), it runs at about 1/3 the speed of Python's I/O (which worries me, because I keep hearing that Python 2.6's I/O was relatively slow).

If there isn't a faster .NET solution for this, would it be possible to write something faster than StreamReader, or does it already use complicated buffering/algorithmic optimizations that I could never hope to beat?
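For reference, a minimal sketch of the kind of line-by-line read/process/write loop being described (the file names and the pass-through "processing" step are placeholders, not part of the original question):

    using System.IO;

    class LineByLineCopy
    {
        static void Main()
        {
            // Hypothetical paths, for illustration only.
            const string inputPath = "input.txt";
            const string outputPath = "output.txt";

            using (var reader = new StreamReader(inputPath))
            using (var writer = new StreamWriter(outputPath))
            {
                string line;
                // ReadLine returns null at end of stream.
                while ((line = reader.ReadLine()) != null)
                {
                    // Per-line processing would go here; this just copies the line through.
                    writer.WriteLine(line);
                }
            }
        }
    }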

asked Jan 05 '09 by llamaoo7

2 Answers

Do you have a code sample of what you're doing, or the format of the file you are reading?

Another good question: how much of the stream are you keeping in memory at a time?

answered Oct 28 '22 by Nathan Tregillus

StreamReader is pretty good - how were you reading the file in Python? Specifying a simpler encoding (e.g. ASCII) may speed things up. How much CPU is the process taking?

You can increase the buffer size by using the appropriate StreamReader constructor, but I have no idea how much difference that's likely to make.
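A sketch of what that suggestion might look like (the ASCII encoding and the 64 KB buffer size are illustrative choices, not values from the answer):

    using System.IO;
    using System.Text;

    class BufferedRead
    {
        static void Main()
        {
            // The StreamReader(path, encoding, detectEncodingFromByteOrderMarks, bufferSize)
            // overload lets you pick a simpler encoding and enlarge the internal buffer.
            using (var reader = new StreamReader("input.txt", Encoding.ASCII,
                                                 false, 64 * 1024))
            {
                string line;
                while ((line = reader.ReadLine()) != null)
                {
                    // process the line here
                }
            }
        }
    }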

answered Oct 28 '22 by Jon Skeet