I extensively use a nested data structure based on parallel arrays (SCG.SortedList<K,V>
, looks like Outer<K, Inner<K_delta,V>>
) and a calculation engine that copies the structures a lot. The size of the structures could be very different and many of them will be in LOH, because - even though they are already nested - either inner or outer ones will be long enough and the values are mostly doubles (for doubles the limit is 1000 elements per array after which it goes to LOH, at least on x86).
The use case for the nested data structure is to accumulate high frequency data in a long running process. I could use it as a buffer to aggregate the data and keep only a minimum window required for aggregation (by dropping older inner lists and trimming the outer list), but this copying/trimming itself could create LOH fragmentation problem and take more memory instead of saving it.
After reading deeper about LOH fragmentation I could reason that in my case I will have LOH fragmentation, because what happens frequently inside my process/calculations is exactly what is described in, e.g., a great article about LOH fragmentation by Andrew Hunter: create, grow by random increment, copy arrays...
However, I cannot understand if LOH fragmentation on 64-bit platform is a problem at all? In the last comment to the same article a commenter posits that on 64-bit platformss LOH fragmentation is almost a non-issue because address space is so large that it is hard to exhaust it in reality, while memory holes/empty pages are not backed by real memory. (Another option is that MS failed badly in designing GC).
Could some expert please confirm if that statement in the comment is true for .NET managed memory? Is it true that LOH fragmentation on 64-bit systems is not something one should worry about in most of the cases due to extremely large address space? While writing this question I have found a confirmation for C++, so this question is for managed memory in .NET specifically.
The second question is how it works for 32-bit processes running on 64-bits systems? Does the same applies for both, or 32-bits processes could run out of memory pretty fast if there is LOH fragmentation? (e.g. for Excel add-ins I am limited to 32-bits architecture even on 64-bits systems, for a long time due to legacy add-ins around)
The comment from the linked article is below:
Page Cache Posted by: TruePath Posted on: Thursday, November 24, 2011 at 6:50 AM Message:
Either you are using an old OS/cpu that doesn't support 64 bit addressing (or a cheap version of win 7) or MS has radically failed to take proper advantage of 64 bit addressing.
Win 7 ultimate lets a process address up to 192 gigs, snow leopard 16TB and linux 32 exabytes. You are never going to realistically chew through that entire address space in the LOH. Thus the strategy that MS implemented for allocation sounds like the right one and it's only some artificial limit that is getting in the way.
Remember just because you've allocated a page of memory doesn't mean the OS has to always back that up with a page of RAM. The OS is perfectly free to swap pages out to disk or leave blank pages unbacked. The LOH should be allocating an integer number of psges (4K) for each object. When an object in the LOH is freed by the GC the pages allocated to it should be returned to the OS so they no longer need any backing store and don't take up disk space or strain the page allocation structures.
Indeed, the LOH on a 64 byte system should be substantially faster and less resource intensive than the normsl heaps since no objects are ever copied and all objects get an integer number of pages. Unlike the regulsr heap fragmentation isn't really an issue since this gets handled for you by the hardware supported page tables. Indeed it might be optimal to never GC the LOH until virtual memory limits are hit.
True such a strategy would end up writing huge parts of your LOH out to disk but that happens in a second concurrent process. As long as you aren't allocating objects in the LOH and dirtying their pages faster than your disk can write them you shouldn't see any slowdown except a minor competion for disk IO and a small impact of larger OS page maps. The system won't thrash because most of those pages really are free and thus are never again read back from disk. Without a GC marking objects in the LOH no false page access by the GC occurs. Thus the LIFO algorithm employed by the page cache is a pretty decent GC that trades GC overhead for some disk writes and (maybe) an occasional read.
I guess it would be superior to hold the GC metadata and any embedded pointers for objects in the LOH on a separate page from the rest of the data (it can precede it if you want and the rest of the page can be used for other heaps/metadata) so the GC can still free the pages in the LOH without triggering any unneeded page loads. Since objects in the LOH have few if any pointer members/elements (or are all pointers and must be scanned by the GC anyway) one can segregate those from the data and get the best of both worlds: no disk writes and no false page loads for the GC.
Update: Let it be an assumption that there is LOH fragmentation. The question is about the impact the fragmentation of memory address space has on actual memory on x64 platforms, and how it works in 32-bits processes running on x64.
The question is not about how to avoid it/deal with it and what data structures to use (the article discusses this). I have done a lot of testing to find out that nested sorted lists are 5+ times faster than immutable structures and compress my data by c.40% by using key delta in the inner list (uint16 vs int64), while IntMap/AVL tree take 70/50 bytes overhead per key/value pair. With very realistic 10mn pairs I prefer nested SL. So the expected/possible LOH fragmentation is the price to pay for speed and naive memory compression. I cannot test at full scale yet how much memory is actually "leaking" due to fragmentation, but from what I have read I have reasons to doubt if there are any leaks at all.
There is no bad design on part of the garbage collector in the CLR. The issue is that in order to defragment the LOH you need to make space and then reorder and compact the objects. With large objects, a reorder could be moving several large objects for very little gain in memory (e.g. if you had say 100 objects which were each 0.5MB in size, you would potentially have to copy and reorder 200MB of memory in order to compact this memory space. There is a good explanation of this phenomenon at this link
The 64bit CLR has the same size threshold for the LOH (as this was chosen based on real world applications) and it won't be any more of a problem than it was in the 32bit CLR. What will help is if you move to .Net 4.0+ which has improvements that have been made to the LOH algorithms to prevent out of memory and improve the reuse of holes in the stack. .Net 4.5 even has compaction of the LOH LOH compaction MSDN which would negate most issues with custom applications that deal with large arrays.
You will have an advantage using 64bit simply because of the size of the address space. However none of this discussion negates a quality design of your software. The Garbage Collector is one aspect of application that may be responsible for it running slowly. You should have a look at your algorithms and data structures to ensure that you are getting the efficiency gains you want. If you are approaching the limits of your app and are seeing issues with fragmentation, perhaps you should investigate using collections other than arrays and/or limit the size of your arrays so that they aren't allocated on the LOH.
Measure then optimise :)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With