I recently posted an article about a log file reader failing due to out-of-memory errors: Out of memory error archiving a log file.
Before I had a chance to try the simpler approach (naming the log file with a date in it so it never needs archiving, which obviously would have meant rewriting the method), I first tried the garbage collection option, GC.Collect(), as I had never used it before.
I placed this in the first try/catch so that, if an out-of-memory error is thrown while reading the log file's contents, it frees up memory and the read is retried. It seems to free up about half the memory in use. From a debug file written during this process (the log file itself is obviously out of action, so this is to help me debug after the fact), I get this output from last night's archive process:
Attempt Archive
Take contents of current file with ReadFileString (this is my custom stream reader method I wrote which you can see in the original article)
Taken contents of current file!
In TRY/CATCH - Out of Memory Exception - Try GC.Collect()
Memory used before collection: **498671500**
Memory used after collection: **250841460**
Try again with ReadFileString
How much content have we got? Content Length is **123595955**
We have content from the old log file
So GC.Collect seems to fix the read file issue.
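For reference, the pattern is roughly this (only a sketch, not the exact code - ReadFileString is the custom reader from the original article, while the Log helper, logFilePath and the GC.GetTotalMemory calls are stand-ins for however the debug figures above are actually produced):
string content;
try
{
    // First attempt to read the whole log file into memory.
    content = ReadFileString(logFilePath);
}
catch (OutOfMemoryException)
{
    Log("In TRY/CATCH - Out of Memory Exception - Try GC.Collect()");
    Log("Memory used before collection: " + GC.GetTotalMemory(false));
    GC.Collect();
    Log("Memory used after collection: " + GC.GetTotalMemory(false));

    // Second attempt now that the collector has freed some memory.
    content = ReadFileString(logFilePath);
}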
However, I am wondering WHAT kind of memory it is freeing up, because according to this debug output it reclaims 247.83 MB when I call GC.Collect().
Therefore I would like to know what sort of objects or memory it is freeing. I thought .NET was supposed to have good built-in garbage collection, so if it is accumulating this much "freeable" memory over time, should I be calling GC.Collect() regularly to free it up? Or is this much memory only there because the first attempt to read the log file into memory failed?
Obviously archiving large files hasn't been working for some time, until I tried GC.Collect() (which I had never used before), so I was just wondering where the memory was coming from, when it gets collected normally, and whether GC.Collect() should be called elsewhere.
This is a Windows service app which hooks into a DLL that makes many HTTP calls to a third-party API using JSON throughout the day, using multiple timers to control each job it needs to run. Therefore it runs constantly unless I manually stop the service.
So should I call GC.Collect() once a night anyway, just as good practice? Other articles say to leave garbage collection to the system, but in this instance it quickly solved a problem where out-of-memory errors were being thrown (this is running on a 64-bit machine with 14 GB of RAM).
Firstly, check very carefully that you are closing (disposing) all file objects, as otherwise internal file buffers will not be released until the GC discovers you have forgotten to close the file.
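For example, wrapping every reader in a using block guarantees the handle and its internal buffer are released as soon as you are done, rather than whenever a finalizer eventually runs (a generic sketch; the path is just a placeholder):
// using guarantees Dispose() is called, even if an exception is thrown,
// so the file handle and its internal buffer are released immediately.
using (var reader = new StreamReader(@"C:\logs\service.log"))
{
    string line;
    while ((line = reader.ReadLine()) != null)
    {
        // process the line
    }
}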
If you don’t need to copy the file, just rename/move it (FileInfo.MoveTo or File.Move). This is the normal way of coping with log files.
If you don’t need to process the data, use FileInfo.CopyTo or Stream.CopyTo(Stream); that way the text will be copied using a sensibly small buffer, and memory never needs to be allocated to hold all the text at the same time.
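Both options are one-liners and never pull the file contents into managed memory (a sketch; the paths are placeholders):
// Rename/move the log out of the way - no data is read into memory at all.
File.Move(@"C:\logs\service.log", @"C:\logs\service-archived.log");

// Or, if a copy really is needed, let the framework do a buffered copy for you.
File.Copy(@"C:\logs\service.log", @"C:\archive\service.log", overwrite: true);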
If you do need to process the text, read one line at a time. This will result in lots of small strings being created, rather than one very large string; the .NET GC is very good at reclaiming small, short-lived objects. At no point should you have the entire log file in memory at the same time. Creating a custom iterator that returns the lines in the file, as sketched below, would be one way of doing it.
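A minimal sketch of such an iterator (this is essentially what the built-in File.ReadLines gives you):
using System.Collections.Generic;
using System.IO;

static IEnumerable<string> ReadLinesLazily(string path)
{
    // Yields one line at a time; only one small, short-lived string
    // is alive per iteration, never the whole file.
    using (var reader = new StreamReader(path))
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            yield return line;
        }
    }
}

// Usage: foreach (var line in ReadLinesLazily(logPath)) { /* process it */ }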
We can really only guess. Since you have enough free memory (and non-contiguous virtual address space), the problem is most likely related to not being able to allocate enough contiguous memory. The things that need the most contiguous memory are almost exclusively arrays, like the backing array for your queue. When everything works correctly, the address space is compacted regularly (as part of the GC) and you maximize the available contiguous memory. If this doesn't work, something is preventing compaction from working correctly - for example, pinned handles, like the ones used for I/O.
Why does an explicit GC.Collect() kind of help? It might very well be that you're at a point where all those pinned handles are released, and the compaction actually works. Try using something like VMMap or CLRProfiler to see how the objects are laid out in the address space - the typical case of compaction issues is when you have something like 99% free space in your memory, but nowhere near enough contiguous space to allocate a new object (strings and arrays don't work well with memory fragmentation). Another typical case is when you neglect to use GC.AddMemoryPressure when allocating unmanaged memory (e.g. for the buffers), so the GC has no idea that it should really start collecting already. Again, CLRProfiler is very helpful for watching when GC happens, and how it maps to memory usage.
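For reference, the AddMemoryPressure pattern looks roughly like this (a sketch; the 64 MB unmanaged buffer is just an example):
// Marshal lives in System.Runtime.InteropServices.
// Tell the GC that this allocation exists, so it knows the process is under
// more memory pressure than the managed heap alone suggests.
const int size = 64 * 1024 * 1024;
IntPtr unmanagedBuffer = Marshal.AllocHGlobal(size);
GC.AddMemoryPressure(size);
try
{
    // ... use the unmanaged buffer ...
}
finally
{
    Marshal.FreeHGlobal(unmanagedBuffer);
    GC.RemoveMemoryPressure(size);
}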
If memory fragmentation is indeed the problem, you need to figure out why. This is actually somewhat complex, and may require some work with something like WinDbg, which is rather hard to use, to say the least. I/O always means some pinned buffers, so if you're doing a lot of I/O in parallel, you're interfering with the proper functioning of the GC. The GC tries to deal with this by creating multiple heaps (depending on the exact configuration of GC you're running, but looking at your case, server GC should really be what you're using - you are running this on Windows Server, right?), and I've seen hundreds of heaps being created to "fix" the fragmentation issue - but ultimately, this is destined to fail.
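You can at least confirm which GC flavour the service is actually running under by logging it at startup (a sketch; GCSettings.IsServerGC requires .NET 4.5 or later, and server GC is typically enabled via a gcServer element in the service's app.config):
// GCSettings is in System.Runtime. Server GC uses one heap (and one GC thread)
// per logical core, which copes much better with heavy allocation from many threads.
Console.WriteLine("Server GC:    " + GCSettings.IsServerGC);
Console.WriteLine("Latency mode: " + GCSettings.LatencyMode);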
If you have to use pinned handles, you really want to allocate them once, and reuse them if possible. Pinning is preventing the GC from doing its job, so you should only pin stuff that doesn't need to be moved in memory (large object heap objects, pre-allocated buffers on the bottom of the heap...), or at least pin for as short a time as possible.
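If you do have to pin buffers yourself, a sketch of the allocate-once-and-reuse pattern might look like this:
// GCHandle lives in System.Runtime.InteropServices.
// Allocate one large buffer up front (over 85 000 bytes, so it lands on the
// large object heap, which the GC does not normally compact anyway) and pin it
// once for the lifetime of the component instead of pinning fresh buffers per call.
byte[] ioBuffer = new byte[1024 * 1024];
GCHandle pin = GCHandle.Alloc(ioBuffer, GCHandleType.Pinned);
IntPtr address = pin.AddrOfPinnedObject(); // pass this to native I/O as needed

// ... reuse ioBuffer/address for every operation ...

pin.Free(); // unpin when the component shuts down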
In general, reusing buffers is a good idea. Sadly, that means you want to avoid strings and similar constructs in code like this - strings being immutable means that every single line you read becomes a separately allocated object. Fortunately, you don't necessarily need to deal with strings in your case - a simple byte[] buffer will work just as well - just look for the byte values 0x0D, 0x0A instead of "\r\n". The main problem you have is that you need to hold a lot of data in memory at once - you either need to minimize that, or make sure the buffers are allocated where they're used best; for file data, a LOH buffer will help quite a bit.
One way to avoid so many allocations would be to parse the file looking for end-of-lines and remembering just the offset of the line where you want to start copying. As you go, line by line (using the reusable byte[] buffer), you'll just update the offset of the "at most 100 000th line from the end" rather than allocating and freeing strings. Of course, this does mean you have to read some of the data twice - that's just the price of dealing with data that isn't fixed-length and/or indexed :)
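A sketch of that approach (sourcePath/targetPath and the 100 000-line figure are placeholders taken from the original question; note that a trailing newline counts as one extra, empty "line" here):
// Requires System.IO.
const int KeepLines = 100000;          // "at most 100 000 lines from the end"
var buffer = new byte[64 * 1024];       // one reusable buffer for both passes
var lineStarts = new long[KeepLines];   // ring buffer of line-start offsets; line 0 starts at 0
long totalLines = 1;

// Pass 1: record where each line starts, keeping only the last KeepLines offsets.
using (var source = File.OpenRead(sourcePath))
{
    long position = 0;
    int bytesRead;
    while ((bytesRead = source.Read(buffer, 0, buffer.Length)) > 0)
    {
        for (int i = 0; i < bytesRead; i++)
        {
            if (buffer[i] == 0x0A) // '\n': the next line starts right after it
            {
                lineStarts[totalLines % KeepLines] = position + i + 1;
                totalLines++;
            }
        }
        position += bytesRead;
    }
}

// Offset of the "at most KeepLines-th line from the end" (0 if the file is shorter).
long startOffset = totalLines <= KeepLines
    ? 0
    : lineStarts[(totalLines - KeepLines) % KeepLines];

// Pass 2: stream everything from that offset into the archive file.
using (var source = File.OpenRead(sourcePath))
using (var target = File.Create(targetPath))
{
    source.Seek(startOffset, SeekOrigin.Begin);
    int bytesRead;
    while ((bytesRead = source.Read(buffer, 0, buffer.Length)) > 0)
    {
        target.Write(buffer, 0, bytesRead);
    }
}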
Another approach is to read the file from the end. How well this works is hard to predict, since it depends a lot on how well the OS and filesystem can handle backwards reading. In some cases it's just as good as forward reading - both are sequential reads, it's just a question of whether the OS/FS is smart enough to figure that out or not. In other cases it's going to be very expensive - if so, use large file buffers (e.g. 16 MiB instead of the more customary 4 kiB) to get reads that are as close to sequential as possible. Reading from the back still doesn't quite allow you to stream the data directly to another file (you'll either need to combine this with the first approach or keep the whole 100 000 lines in memory at once yet again), but it means you only ever read the data you're going to use (the most you over-read is the size of your buffer).
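If large buffers turn out to be needed, the buffer size is just a FileStream constructor argument (a sketch; sourcePath is a placeholder):
// The 16 MiB internal buffer means each actual disk read is large,
// even if you then seek backwards between reads.
using (var source = new FileStream(sourcePath, FileMode.Open, FileAccess.Read,
                                   FileShare.Read, 16 * 1024 * 1024, FileOptions.None))
{
    // read as usual, seeking towards the start of the file as you go
}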
Finally, if all else fails, you could use unmanaged memory for some of the work you're doing. I hope I don't have to say that this is much trickier than using managed memory - you'll have to be very careful about proper addressing and bounds checking, among other things. For a task like yours it's still quite manageable - ultimately, you're just moving lots of bytes around with very little "work". You had better understand the unmanaged world well, though - otherwise it's just going to lead to bugs that are very hard to track down and fix.
EDIT:
Since you made it clear that the "last 100k items" is a workaround and not the desired solution, the easiest thing is to simply stream the data rather than reading everything to RAM and writing everything in one go. If File.Copy/File.Move aren't good enough for you, you can use something like this:
var buffer = new byte[4096];

using (var sourceFile = File.OpenRead(...))
using (var targetFile = File.Create(...))
{
    // Keep copying small chunks until the end of the source file is reached.
    int bytesRead;
    while ((bytesRead = sourceFile.Read(buffer, 0, buffer.Length)) > 0)
    {
        targetFile.Write(buffer, 0, bytesRead);
    }
}
The only memory you need is for the (relatively small) buffers.