 

Reading and modifying large text files (3-5GB)

I have a rather large file consisting of several million lines and there is the need to check and remove corrupt lines from the file.

I have shamelessly tried File.ReadAllLines but it didn't work. Then I tried to stream lines as below, reading from the original file and writing to a new one. While it does the job, it does so in several hours (5+). I have read about using buffers, which sounds like the only option, but how am I going to keep line integrity that way?

Solution: the StreamWriter was moved outside of the while loop. Instead of Split, Count is used.

 string tempLineValue;   // holds the current line (declared here so the snippet compiles)

 using (FileStream inputStream = File.OpenRead(localFileToProcess + ".txt"))
 {
    using (StreamReader inputReader = new StreamReader(inputStream, System.Text.Encoding.GetEncoding(1254)))
    {
       // One StreamWriter for the whole run, so its internal buffer can batch the writes.
       using (StreamWriter writer = new StreamWriter(localFileToProcess, true, System.Text.Encoding.GetEncoding(1254)))
       {
          while (!inputReader.EndOfStream)
          {
             // Keep only lines with exactly four ';' separators; Count needs System.Linq.
             if ((tempLineValue = inputReader.ReadLine()).Count(c => c == ';') == 4)
             {
                 writer.WriteLine(tempLineValue);
             }
             else
                 incrementCounter();
          }
       }
    }
 }
asked Jul 25 '13 by mechanicum

1 Answer

I think the slowest part of your original code was creating/disposing the StreamWriter. On each Dispose, the StreamWriter had to flush all unwritten data to the disk, close file handles, etc. On each open, the OS had to check security permissions, current locks, and do many other things as well.
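For illustration, the slow shape probably looked something like this. It is only a reconstruction, since you did not post the original version; the point is that the writer is opened in append mode for every kept line:

    while (!inputReader.EndOfStream)
    {
        string line = inputReader.ReadLine();
        if (line.Split(';').Length == 5)
        {
            // A new StreamWriter per line: each Dispose flushes to disk and closes the
            // file handle, and each open repeats the permission/lock checks.
            using (StreamWriter writer = new StreamWriter(localFileToProcess, true, System.Text.Encoding.GetEncoding(1254)))
            {
                writer.WriteLine(line);
            }
        }
        else
            incrementCounter();
    }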

When you started to use only one StreamWriter, its internal write buffer could do its job, writing the data to the disk in large chunks. Along with skipping the closing/reopening of the file for every write, this saves a lot of time. Disk I/O is usually the slowest part of an application.
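If you want even larger chunks, both FileStream and StreamWriter accept an explicit buffer size. Here is a minimal sketch of the same single-writer loop with that option; the 1 MB value is just an example I picked, not something from your code:

    const int bufferSize = 1 << 20; // 1 MB read/write buffers; tune for your disk
    string line;

    using (var inputStream = new FileStream(localFileToProcess + ".txt", FileMode.Open, FileAccess.Read, FileShare.Read, bufferSize))
    using (var inputReader = new StreamReader(inputStream, System.Text.Encoding.GetEncoding(1254)))
    using (var writer = new StreamWriter(localFileToProcess, true, System.Text.Encoding.GetEncoding(1254), bufferSize))
    {
        while ((line = inputReader.ReadLine()) != null)
        {
            if (line.Count(c => c == ';') == 4)   // needs System.Linq
                writer.WriteLine(line);
            else
                incrementCounter();
        }
    }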

Split(';') also had a possible speed impact, but I think it was less significant. Anyway, string operations should be done carefully in C#, because strings are immutable and can create a lot of garbage in memory. So if you can check for 4 semicolons, that is always better than calling Split(';'), which allocates an array and (in your case) creates 5 strings in memory for each line. When a lot of string operations are performed on immutable strings, it can severely hit application performance even without any disk I/O.
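If you want to avoid LINQ as well, counting the separators by hand allocates nothing at all. A small sketch of that idea (the helper name is mine):

    // Returns true only for lines with exactly four ';' separators.
    static bool HasExactlyFourSemicolons(string line)
    {
        int count = 0;
        for (int i = 0; i < line.Length; i++)
        {
            if (line[i] == ';' && ++count > 4)
                return false;   // more than four: no need to scan the rest
        }
        return count == 4;
    }

Then the loop body becomes: if (HasExactlyFourSemicolons(tempLineValue)) writer.WriteLine(tempLineValue); else incrementCounter();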

As for using a StringBuilder in your case, I don't think it would help much, because StreamWriter already has built-in buffering.

answered Oct 05 '22 by Artemix