 

Memory Leak(?) with StreamReader

I have a few very large files, each 500MB+ in size, containing integer values (in fact it's a bit more complex). I'm reading those files in a loop and calculating the max value across all of them. For some reason the memory grows constantly during processing; it looks like the GC never releases the memory acquired by the previous instances of lines.

I cannot stream the data and have to use GetFileLines for each file. Given that the memory actually required to store the lines of one file is 500MB, why do I see 5GB of RAM in use after 10 files have been processed? Eventually it crashes with an OutOfMemoryException after 15 files.

Calculation:

   int max = int.MinValue;

   for (int i = 0; i < 10; i++)
   {
      IEnumerable<string> lines = Db.GetFileLines(i);

      max = Math.Max(max, lines.Max(t=>int.Parse(t)));
   }

GetFileLines code:

   public static List<string> GetFileLines(int i)
   {
      string path = GetPath(i);

      // Read every line of the file into memory up front
      List<string> lines = new List<string>();
      string line;

      using (StreamReader reader = File.OpenText(path))
      {
         while ((line = reader.ReadLine()) != null)
         {
            lines.Add(line);
         }

         reader.Close();
         reader.Dispose(); // should I bother?
      }

      return lines;
   }
asked Oct 02 '12 by user1514042


3 Answers

For very large files, the File.ReadLines method is the best fit because of its deferred execution: it does not load all the lines into memory at once, and it is simple to use:

  max = Math.Max(max, File.ReadLines(path).Max(line => int.Parse(line)));

More information:

http://msdn.microsoft.com/en-us/library/dd383503.aspx
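
In the context of the question, the whole calculation then becomes (a sketch only, reusing the GetPath(i) helper and the 10-file loop from the question):

    int max = int.MinValue;

    for (int i = 0; i < 10; i++)
    {
        string path = GetPath(i);

        // File.ReadLines streams the file lazily, so only one line
        // needs to be held in memory at a time.
        max = Math.Max(max, File.ReadLines(path).Max(line => int.Parse(line)));
    }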

Edit:

This is, in essence, how ReadLines is implemented behind the scenes:

    public static IEnumerable<string> ReadLines(string fileName)
    {
        string line;
        using (var reader = File.OpenText(fileName))
        {
            while ((line = reader.ReadLine()) != null)
                yield return line;
        }
    }

Also, since you are processing multiple files, consider using parallel processing to improve performance.
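
For instance, a minimal PLINQ sketch (an illustration only, assuming the same GetPath(i) helper from the question and a using System.Linq directive):

    // Read the ten files in parallel and take the overall maximum.
    int max = Enumerable.Range(0, 10)
        .AsParallel()
        .Select(i => File.ReadLines(GetPath(i)).Max(line => int.Parse(line)))
        .Max();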

answered by cuongle


You could be crashing because you are keeping references to the parsed results in memory after you are finished processing them (the code you show doesn't do this, but is it the same code you actually run?). It's highly unlikely that there is such a bug in StreamReader.

Are you sure you have to read the whole file into memory at once? It may well be possible to consume the lines lazily as an IEnumerable<string> instead of loading them into a List<string> up front; nothing in this code, at least, prohibits that.
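
For example, a lazy version of GetFileLines (a sketch only, keeping the signature and GetPath helper from the question) could simply defer to File.ReadLines:

    public static IEnumerable<string> GetFileLines(int i)
    {
        string path = GetPath(i);

        // Lines are streamed as the caller enumerates them,
        // instead of being buffered in a List<string>.
        return File.ReadLines(path);
    }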

Finally, the Close and Dispose calls are redundant; the using statement takes care of that automatically.

answered by Jon


Why not implement it as follows:

int max = Int32.MinValue;
string line;

using (var reader = File.OpenText(path))
{
    while ((line = reader.ReadLine()) != null)
    {
        int current;
        if (Int32.TryParse(line, out current))
            max = Math.Max(max, current);
    }
}
answered by STO