Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

I've found a bug in the JIT/CLR - now how do I debug or reproduce it?

I have a computationally-expensive multi-threaded C# app that seems to crash consistently after 30-90 minutes of running. The error it gives is

The runtime has encountered a fatal error. The address of the error was at 0xec37ebae, on thread 0xbcc. The error code is 0xc0000005. This error may be a bug in the CLR or in the unsafe or non-verifiable portions of user code. Common sources of this bug include user marshaling errors for COM-interop or PInvoke, which may corrupt the stack.

(0xc0000005 is the error-code for Access Violation)

My app does not invoke any native code, or use any unsafe blocks, or even any non-CLS compliant types like uint. In fact, the line of code that the debugger says caused the crash is

overallLength += distanceTravelled;

Where both values are of type double


Given all this, I believe the crash must be due to a bug in the compiler or CLR or JIT. I'd like to figure out what causes it, or at the very least write a smaller reproduction to send into Microsoft, but I have no idea where to even begin. I've never had to view the CIL-binary, or the compiled JIT output, or the native stacktrace (there is no managed stacktrace at the time of the crash), so I'm not sure how. I can't even figure out how to view the state of all the variables at the time of the crash (VS unfortunately won't tell me like it does after managed-exceptions, and outputting them to console/a file would slow down the app 1000-fold, which is obviously not an option).

So, how do I go about debugging this?


[Edit] Compiled under VS 2010 SP1, running latest version of .Net 4.0 Client Profile. Apparently it's ".Net 4.0C/.Net 4.0E, .Net CLR 1.1.4322"

like image 340
BlueRaja - Danny Pflughoeft Avatar asked Sep 25 '12 20:09

BlueRaja - Danny Pflughoeft


2 Answers

I'd like to figure out what causes it, or at the very least write a smaller reproduction to send into Microsoft, but I have no idea where to even begin.

"Smaller reproduction" definitely sounds like a great idea here... even if "smaller" won't mean "quicker to reproduce".

Before you even start, try to reproduce the error on another machine. If you can't reproduce it on another machine, that suggests a whole different set of tests to do - hardware, installation etc.

Also, check you're on the latest version of everything. It would be annoying to spend days debugging this (which is likely, I'm afraid) and then end up with a response of "Yes, we know about this - it was a bug in .NET 4 which was fixed in .NET 4.5" for example. If you can reproduce it on a variety of framework versions, that would be even better :)

Next, cut out everything you can in the program:

  • Does it have a user interface at all? If possible, remove that.
  • Does it use a database? See if you can remove all database access: definitely any output which isn't used later, and ideally input too. If you can hard code the input within the app, that would be ideal - but if not, files are simpler for reproductions than database access.
  • Is it data-sensitive? Again, without knowing much about the app it's hard to know whether this is useful, but assuming it's processing a lot of data, can you use a binary search to find a relatively small amount of data which causes the problem?
  • Does it have to be multi-threaded? If you can remove all the threading, obviously that may well then take much longer to reproduce the problem - but does it still happen at all?
  • Try removing bits of business logic: if your app is componentized appropriately, you can probably fake out whole significant components by first creating a stub implementation, and then simply removing the calls.

All of this will gradually reduce the size of the app until it's more manageable. At each step, you'll need to run the app again until it either crashes or you're convinced it won't crash. If you have a lot of machines available to you, that should help...

like image 167
Jon Skeet Avatar answered Nov 10 '22 16:11

Jon Skeet


tl;dr Make sure you're compiling to .Net 4.5


This sounds suspiciously like the same error found here. From the MSDN page:

This bug can be encountered when the Garbage Collector is freeing and compacting memory. The error can happen when the Concurrent Garbage Collection is enabled and a certain combination of foreground Garbage Collection and background Garbage Collection occurs. When this situation happens you will see the same call stack over and over. On the heap you will see one free object and before it ends you will see another free object corrupting the heap.

The fix is to compile to .Net 4.5. If for some reason you can't do this, you can also disable concurrent garbage collection by disabling gcConcurrent in the app.config file:

<configuration>
   <runtime>
       <gcConcurrent enabled="false"/>
   </runtime>
</configuration>

Or just compile to x86.

like image 10
ponsfonze Avatar answered Nov 10 '22 16:11

ponsfonze