I have a few very large files, each 500MB+ in size, containing integer values (in fact it's a bit more complex). I'm reading those files in a loop and calculating the max value across all files. For some reason the memory grows constantly during processing; it looks like the GC never releases the memory acquired by the previous instances of lines.
I cannot stream the data and have to use GetFileLines for each file. Given that the actual amount of memory required to store lines for one file is 500MB, why do I see 5GB of RAM used after 10 files have been processed? Eventually it crashes with an Out of Memory exception after 15 files.
Calculation:
int max = int.MinValue;
for (int i = 0; i < 10; i++)
{
IEnumerable<string> lines = Db.GetFileLines(i);
max = Math.Max(max, lines.Max(t=>int.Parse(t)));
}
GetFileLines code:
public static List<string> GetFileLines(int i)
{
string path = GetPath(i);
//
List<string> lines = new List<string>();
string line;
using (StreamReader reader = File.OpenText(path))
{
while ((line = reader.ReadLine()) != null)
{
lines.Add(line);
}
reader.Close();
reader.Dispose(); // should I bother?
}
return lines;
}
For very large files, the ReadLines method is the best fit because it uses deferred execution: it does not load all lines into memory, and it is simple to use:
Math.Max(max, File.ReadLines(path).Max(line => int.Parse(line)));
More information:
http://msdn.microsoft.com/en-us/library/dd383503.aspx
Edit:
This is roughly how ReadLines is implemented behind the scenes:
public static IEnumerable<string> ReadLines(string fileName)
{
string line;
using (var reader = File.OpenText(fileName))
{
while ((line = reader.ReadLine()) != null)
yield return line;
}
}
Also, using parallel processing is recommended to improve performance when you have multiple files.
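A minimal sketch of that idea using PLINQ (the paths array here is a hypothetical stand-in for however GetPath(i) resolves your files): each file is still streamed line by line via File.ReadLines, while the files themselves are processed across cores:

```csharp
using System;
using System.IO;
using System.Linq;

class MaxOverFiles
{
    static void Main()
    {
        // Hypothetical file list; adapt this to your own GetPath(i) scheme.
        string[] paths = Enumerable.Range(0, 10)
                                   .Select(i => $"data{i}.txt")
                                   .ToArray();

        // AsParallel spreads the per-file work across threads; within each
        // file, ReadLines streams lines lazily, so memory stays flat.
        int max = paths
            .AsParallel()
            .Select(path => File.ReadLines(path).Max(line => int.Parse(line)))
            .Max();

        Console.WriteLine(max);
    }
}
```

Parallelism helps here only if the work is not purely I/O-bound; with files on a single spinning disk, sequential reads may actually be faster.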
You could be crashing because you are keeping references to the parsed results in memory after you have finished processing them (the code you show doesn't do this, but is it the same code you actually run?). It's highly unlikely that there's such a bug in StreamReader.
Are you sure you have to read the whole file into memory at once? It may well be possible to use an enumerable sequence of lines as an IEnumerable<string> instead of loading up a List<string> up front. Nothing prohibits this, in this code at least.
Finally, the Close and Dispose calls are redundant; using takes care of both automatically.
Why not implement it as follows:
int max = Int32.MinValue;
string line;
using (var reader = File.OpenText(path))
{
    while ((line = reader.ReadLine()) != null)
    {
        int current;
        if (Int32.TryParse(line, out current))
            max = Math.Max(max, current);
    }
}