
Causes for crash during garbage collection

I've been struggling for some time with a crash in a C# application that also uses a fair number of C++/CLI modules, mostly wrappers around native libraries that access device drivers. The crash is not easily reproducible, but I was able to collect half a dozen crash dumps, and all of them show the program crashing with an access violation during a garbage collection. This is the native call stack and the last event:

0:000> k
ChildEBP RetAddr  
0012d754 79f95a8f mscorwks!WKS::gc_heap::find_first_object+0x62
0012d7dc 79f933bb mscorwks!WKS::gc_heap::mark_through_cards_for_segments+0x493
0012d814 79f92cbf mscorwks!WKS::gc_heap::mark_phase+0xc3
0012d838 79f93245 mscorwks!WKS::gc_heap::gc1+0x62
0012d84c 79f92f5a mscorwks!WKS::gc_heap::garbage_collect+0x253
0012d878 79f94e26 mscorwks!WKS::GCHeap::GarbageCollectGeneration+0x1a9
0012d904 79f926ce mscorwks!WKS::gc_heap::try_allocate_more_space+0x15b
0012d918 79f92769 mscorwks!WKS::gc_heap::allocate_more_space+0x11
0012d938 79e73291 mscorwks!WKS::GCHeap::Alloc+0x3b

0:000> .lastevent
Last event: 7e8.88: Access violation - code c0000005 (first/second chance not available)
  debugger time: Mon Sep 26 11:34:53.646 2011 (UTC + 2:00)

Let me ask my question first and give more details below: besides managed heap corruption, is there any other cause for a crash during garbage collection?

Now elaborating a bit, the reason I ask this is because I'm having a really hard time trying to identify the code that is corrupting the managed heap and can't seem to find a pattern for the memory that is (supposedly) overwritten.

I already tried commenting out all the "dangerous" C++/CLI code (especially the parts that use pinned handles), but this didn't help. To find a pattern in the memory that is overwritten, I looked at the disassembled code at the point of the crash:

0:000> u .-a .+a
mscorwks!WKS::gc_heap::find_first_object+0x54:
79f935b9 89450c          mov     dword ptr [ebp+0Ch],eax
79f935bc 8bd0            mov     edx,eax
79f935be 8b02            mov     eax,dword ptr [edx]
79f935c0 83e0fe          and     eax,0FFFFFFFEh
79f935c3 f70000000080    test    dword ptr [eax],80000000h      <<<<CRASH
79f935c9 0f84b1000000    je      mscorwks!WKS::gc_heap::find_first_object+0x73

0:000> r
eax=00000000 ebx=01c81000 ecx=01c80454 edx=01c82fe0 esi=012f0000 edi=000027e1
eip=79f935c3 esp=0012d738 ebp=0012d754 iopl=0         nv up ei pl zr na pe nc
cs=001b  ss=0023  ds=0023  es=0023  fs=003b  gs=0000             efl=00010246
mscorwks!WKS::gc_heap::find_first_object+0x62:
79f935c3 f70000000080    test    dword ptr [eax],80000000h ds:0023:00000000=????????

The crash happens when dereferencing the EAX register, which is null. From what I can see, EAX was loaded from the memory that the EDX register points to, so I looked at the address stored there:

0:000> dd @edx-10
01c82fd0  06542778 00000000 00000000 01c82494
01c82fe0  00000000 00000000 00000000 00000000
01c82ff0  01b641d0 00000000 00000000 01c82380

EDIT: I now see that my analysis was wrong; I was lacking an understanding of x86 addressing modes.

So I can see that starting at address 01c82fe0 (the value stored in EDX) the next 16 bytes are null. But when looking at another similar crash dump I see the following:

0:000> dd @edx-10
018defd4  00000000 00000000 00000000 00000000
018defe4  00000000 00000000 018b468c 01742354
018deff4  00e0907f 00000000 00000000 00000000

Here the 16 bytes before the address that EDX points to, and the next 8 bytes from there, are null. The same happens in the other crash dumps that I have. I don't see a pattern here, i.e. it doesn't look as if one piece of code is simply overwriting this region of memory.
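For readers following the disassembly: the three instructions before the fault load a word from the address in EDX, clear its lowest bit, and then test a flag through the resulting pointer. Below is a minimal C++ sketch of that sequence. It assumes (this is not stated in the question) that the first word of an object is a pointer to a type descriptor whose lowest bit is borrowed as a marker. `ObjectHeader`, `TypeInfo`, and `try_read_high_flag` are hypothetical names, not CLR types. Unlike the runtime code, the sketch checks for null before dereferencing, which is the case a zeroed object header would produce.

```cpp
#include <cstdint>

// Hypothetical stand-in for the per-type descriptor the runtime points to.
struct TypeInfo {
    std::uint32_t flags;
};

// Hypothetical object header: the first word holds the type pointer,
// with the lowest bit possibly used as a marker (an assumption).
struct ObjectHeader {
    std::uintptr_t type_word;
};

// On 32-bit x86 this mask is 0xFFFFFFFE, as in "and eax,0FFFFFFFEh".
const std::uintptr_t kClearLowBit = ~static_cast<std::uintptr_t>(1);
const std::uint32_t kHighFlag = 0x80000000u;

// Mirrors "mov eax,[edx]" followed by "and eax,0FFFFFFFEh".
const TypeInfo* type_of(const ObjectHeader& obj) {
    return reinterpret_cast<const TypeInfo*>(obj.type_word & kClearLowBit);
}

// Mirrors "test dword ptr [eax],80000000h", but refuses to dereference null.
// Returns false when the header yields a null type pointer.
bool try_read_high_flag(const ObjectHeader& obj, bool& flag_out) {
    const TypeInfo* type = type_of(obj);
    if (type == nullptr) {
        return false;  // a zeroed header ends up here instead of faulting
    }
    flag_out = (type->flags & kHighFlag) != 0;
    return true;
}
```

With an all-zero header, `try_read_high_flag` reports failure instead of faulting, which corresponds to the eax=00000000 seen in the register dump.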

Going back to the question: what I would like to know is whether there is some other explanation for the crash besides a piece of code overwriting memory that it shouldn't. Any advice at all on how to proceed is also welcome; I'm really lost on this one.

(Could the pinned handles cause a problem? We have quite a few of them, and what I find odd is that I always see exactly 137 pinned handles, no more and no less, with !gchandles at the point of the crash. That seems like a strange coincidence to me.)

EDIT: I forgot to mention that we're using version 3.5 of the .NET Framework. I see reports of similar crashes in .NET 4 when the background GC is active (somewhere there is a mention that this is a bug in .NET), but I don't think that this is relevant here since, AFAIK, there is no background GC in .NET 3.5.

floyd73 asked Sep 28 '11 08:09


2 Answers

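Not sure if this helps, but in general don't rely on the GC to release unmanaged memory. Use the Dispose pattern instead: put the cleanup code in the finalizer (!MyClass) and have the destructor (~MyClass), which C++/CLI compiles to IDisposable::Dispose(), call it:

ref class MyClass
{
  UnsafeObject data;
public:
  MyClass()
  {
    data = CreateUnsafeDataObject();
  }
  ~MyClass()  // Destructor: compiled to IDisposable::Dispose()
  {
    this->!MyClass();  // release the unmanaged data deterministically
  }
protected:
  !MyClass()  // Finalizer: run by the GC only if Dispose was never called
  {
    DeleteUnsafeDataObject(data);
  }
};

This implements the IDisposable pattern on the object. Call Dispose (delete in C++/CLI, using in C#) to release the unmanaged data; at worst, you'll have a better chance of figuring out exactly what happens.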
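For comparison, here is a rough plain C++ analogue of the same idea (deterministic release of a native resource, with an explicit and repeatable close). This is standard C++, not C++/CLI, and only a sketch: `UnsafeObject`, `CreateUnsafeDataObject`, and `DeleteUnsafeDataObject` are given stand-in bodies here, and `NativeOwner`, `close`, and `g_live_objects` are hypothetical names introduced for illustration.

```cpp
#include <memory>

// Stand-in for the native type; the real definition is not shown in the answer.
struct UnsafeObject {
    int payload;
};

// Counts outstanding native objects, purely so the behaviour can be observed.
int g_live_objects = 0;

// Stand-in implementations of the functions named in the answer.
UnsafeObject* CreateUnsafeDataObject() {
    ++g_live_objects;
    return new UnsafeObject{0};
}

void DeleteUnsafeDataObject(UnsafeObject* object) {
    if (object != nullptr) {
        --g_live_objects;
        delete object;
    }
}

// Deleter that routes unique_ptr cleanup through the native release function.
struct UnsafeObjectDeleter {
    void operator()(UnsafeObject* object) const { DeleteUnsafeDataObject(object); }
};

class NativeOwner {
public:
    NativeOwner() : data_(CreateUnsafeDataObject()) {}

    bool is_open() const { return data_ != nullptr; }

    // Explicit release, comparable to calling Dispose; safe to call twice.
    void close() { data_.reset(); }

private:
    // Released automatically when the owner goes out of scope.
    std::unique_ptr<UnsafeObject, UnsafeObjectDeleter> data_;
};
```

The owner releases the native object either when `close` is called or when it goes out of scope, whichever comes first, so the release point is always known.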

GeirGrusom answered Oct 11 '22 03:10


Unfortunately my question was a bit misleading, since I was looking for alternative explanations besides managed heap corruption, which turned out to be the problem in the end (caused by an unsafe copy of an unmanaged struct to a managed struct). The problem is now solved and I'm posting my findings here in a separate answer; I hope that this is OK.
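The actual code is not shown here, so the following is only a hypothetical illustration in plain C++ of how a struct copy with the wrong size can damage a neighbouring object. It assumes a native struct that is larger than its managed counterpart and simulates the heap with a byte buffer, so nothing is really written out of bounds. `NativeSample`, `ManagedSample`, `copy_bounded`, and `neighbour_after_copy` are invented names.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical native struct, larger than the managed-side layout.
struct NativeSample {
    std::uint32_t id;
    std::uint32_t value;
    std::uint32_t extra[2];  // fields the managed counterpart does not have
};

// Hypothetical layout of the managed-side counterpart.
struct ManagedSample {
    std::uint32_t id;
    std::uint32_t value;
};

// Arbitrary non-zero value standing in for the next object's header word.
const std::uint32_t kNeighbourHeader = 0x11223344u;

// Copies at most dest_size bytes so nothing past the destination is touched.
// Returns the number of bytes copied.
std::size_t copy_bounded(void* dest, std::size_t dest_size,
                         const void* src, std::size_t src_size) {
    const std::size_t count = src_size < dest_size ? src_size : dest_size;
    std::memcpy(dest, src, count);
    return count;
}

// Simulated heap: a ManagedSample slot directly followed by the header word
// of the next object. Returns that header word after the copy.
std::uint32_t neighbour_after_copy(bool bounded) {
    std::vector<unsigned char> heap(32, 0xAB);
    unsigned char* slot = heap.data();
    unsigned char* neighbour = slot + sizeof(ManagedSample);
    std::memcpy(neighbour, &kNeighbourHeader, sizeof(kNeighbourHeader));

    const NativeSample source = {1, 2, {0, 0}};
    if (bounded) {
        copy_bounded(slot, sizeof(ManagedSample), &source, sizeof(source));
    } else {
        // The bug being illustrated: sized by the source, not the destination.
        std::memcpy(slot, &source, sizeof(source));
    }

    std::uint32_t header = 0;
    std::memcpy(&header, neighbour, sizeof(header));
    return header;
}
```

In this simulation the source-sized copy zeroes the neighbouring header word, while the bounded copy leaves it intact. Zeroed header words of that kind would be consistent with the null runs in the dumps above, though whether the real bug looked like this is an assumption.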

floyd73 answered Oct 11 '22 02:10