Intermittent crash of w3wp.exe with ThreadAbortException after .NET 4.6 upgrade

For the last couple of days we have seen intermittent crashes of the w3wp.exe worker process serving the main application pool for our corporate web site. Sometimes the crashes are isolated, and IIS is able to restart the worker process successfully. But if more than 5 crashes happen within 5 minutes, IIS Rapid Fail Protection kicks in and stops the application pool. Here is an example entry from the Application event log just before the crash:
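
To make the Rapid Fail Protection behaviour described above concrete, here is a minimal Python sketch of a sliding-window check. It models only the rule as stated in this post (more than 5 crashes within 5 minutes stops the pool); the function name `pool_would_stop` and its parameters are hypothetical, and this is not IIS's actual implementation.

```python
from collections import deque
from datetime import datetime, timedelta


def pool_would_stop(crash_times, max_failures=5, interval=timedelta(minutes=5)):
    """Return True if more than `max_failures` crashes fall inside any `interval` window.

    Assumption: this mirrors the rule as described in the post, not IIS internals.
    """
    window = deque()
    for t in sorted(crash_times):
        window.append(t)
        # Drop crashes that are older than the window ending at the current crash.
        while t - window[0] > interval:
            window.popleft()
        if len(window) > max_failures:
            return True
    return False


start = datetime(2015, 10, 7, 8, 0, 0)
burst = [start + timedelta(seconds=30 * i) for i in range(6)]      # 6 crashes in 2.5 minutes
isolated = [start + timedelta(minutes=10 * i) for i in range(6)]   # 6 crashes, 10 minutes apart
print(pool_would_stop(burst), pool_would_stop(isolated))
```

Feeding it crash timestamps taken from the event log would show whether a given cluster of crashes was dense enough to trip the threshold, or whether each crash was isolated and recoverable.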

An unhandled exception occurred and the process was terminated.
Application ID: /LM/W3SVC/2/ROOT
Process ID: 3640
Exception: System.Threading.ThreadAbortException
Message: Thread was being aborted.
StackTrace:    at System.Web.HttpRuntime.ProcessRequestNotificationPrivate(IIS7WorkerRequest wr, HttpContext context)
   at System.Web.Hosting.PipelineRuntime.ProcessRequestNotificationHelper(IntPtr rootedObjectsPointer, IntPtr nativeRequestContext, IntPtr moduleData, Int32 flags)
   at System.Web.Hosting.PipelineRuntime.ProcessRequestNotification(IntPtr rootedObjectsPointer, IntPtr nativeRequestContext, IntPtr moduleData, Int32 flags)

Immediately after the crash due to the ThreadAbortException, there is a more serious event logged:

Faulting application name: w3wp.exe, version: 8.0.9200.16384, time stamp: 0x5010885f
Faulting module name: KERNELBASE.dll, version: 6.2.9200.17366, time stamp: 0x554d16f6
Exception code: 0xe0434352
Fault offset: 0x00010192
Faulting process id: 0xe38
Faulting application start time: 0x01d100dc662652d6
Faulting application path: C:\Windows\SysWOW64\inetsrv\w3wp.exe
Faulting module path: C:\Windows\SYSTEM32\KERNELBASE.dll
Report Id: db5b0d5b-6cd0-11e5-9418-005056900458
Faulting package full name: 
Faulting package-relative application ID: 
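
The hexadecimal fields in that second event can be decoded to cross-check it against the first one. The Python sketch below is only an illustration with hypothetical helper names. It assumes the start time is a Windows FILETIME (100-nanosecond ticks since 1601-01-01 UTC) and relies on 0xe0434352 being, as far as I know, the generic code the CLR uses when it raises a managed exception, so the code itself says nothing about which exception was thrown.

```python
from datetime import datetime, timedelta, timezone

# Assumption: "Faulting application start time" is a FILETIME value in UTC.
FILETIME_EPOCH = datetime(1601, 1, 1, tzinfo=timezone.utc)


def filetime_to_utc(hex_value):
    """Convert a hex FILETIME (100 ns ticks since 1601-01-01) to a UTC datetime."""
    ticks = int(hex_value, 16)
    return FILETIME_EPOCH + timedelta(microseconds=ticks // 10)


def exception_code_tag(hex_value):
    """Return the low three bytes of an exception code as ASCII text."""
    code = int(hex_value, 16)
    return (code & 0xFFFFFF).to_bytes(3, "big").decode("ascii")


pid = int("0xe38", 16)                         # "Faulting process id"
tag = exception_code_tag("0xe0434352")         # "Exception code"
started = filetime_to_utc("0x01d100dc662652d6")  # "Faulting application start time"
print(pid, tag, started.isoformat())
```

The decoded process id is 3640, the same Process ID reported in the ThreadAbortException entry, so both events describe the same worker process.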

Now, a ThreadAbortException should never cause w3wp.exe to crash, seeing as it is thrown every time a standard Response.Redirect() is performed. MSDN confirms this, and I also confirmed it with a simple test. However, at least one other person has seen a similar crash recently with a similar environment: Thread.Abort in ASP.NET app causes w3wp.exe to crash. (But that may be an unrelated issue.)

Our environment:

  • Corporate web site with shopping cart and partner web services; targets .NET 4.5. (900,000+ lines of custom code including business logic DLLs and in-house libraries.)
  • 2 VMware web servers in a load-balanced pool using Windows NLB
  • IIS 8.0 / Windows Server 2012 Standard / .NET 4.6.00081
  • App pool running in 32-bit mode because we have to support a handful of classic ASP pages calling a legacy VB6 DLL.

Background:

A couple days prior to the start of crashes, we upgraded to .NET 4.6. We have the new RyuJIT enabled (the default setting) and we have installed all updates to address the critical compiler issue described here: http://blogs.msdn.com/b/dotnet/archive/2015/07/28/ryujit-bug-advisory-in-the-net-framework-4-6.aspx.

We had also deployed a new version of our web code (as we do several times per week). Naturally we double-checked the code changes for any potential crash vulnerabilities, but none of our changes seem vulnerable to infinite loops, recursive stack overflows, or high memory usage -- the normal culprits when w3wp.exe crashes with an unhandled exception.

Sometimes the crash affects one web server within minutes of another, but other times only one web server is affected.

Things I've tried:

  • Restarted the servers and installed all Windows Updates.
  • Analyzed the IIS logs to see if any suspicious/bad requests are coming in just before the crashes. I couldn't find any pattern -- all the requests look normal.
  • Enabled automatic crash minidumps for w3wp.exe (as described at https://msdn.microsoft.com/en-us/library/bb787181.aspx) and analyzed them using WinDbg. Unfortunately the CLR "stack trace of interest" does not show anything useful, just a couple empty GC frames not related to our code:
> 0:026> !clrstack
> OS Thread Id: 0x1ff0 (26)
> Child SP       IP Call Site
> 2321f96c 771bdf8c [GCFrame: 2321f96c]
> 2321f9a4 771bdf8c [GCFrame: 2321f9a4]
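
For reference, the IIS log review mentioned in the list above can be sketched roughly as follows in Python: parse the W3C extended log lines and list the requests that arrived shortly before a crash. The function names, the 30-second lookback, and the sample log lines are all hypothetical. The sketch assumes the logs include the `date` and `time` fields and that the crash time is expressed in the same time zone as the log.

```python
from datetime import datetime, timedelta


def parse_w3c_log(lines):
    """Yield one dict per request, keyed by the names in the '#Fields:' header."""
    fields = []
    for line in lines:
        line = line.strip()
        if line.startswith("#Fields:"):
            fields = line[len("#Fields:"):].split()
        elif line and not line.startswith("#") and fields:
            values = line.split()
            if len(values) != len(fields):
                continue  # skip truncated or malformed lines
            row = dict(zip(fields, values))
            row["timestamp"] = datetime.strptime(
                row["date"] + " " + row["time"], "%Y-%m-%d %H:%M:%S"
            )
            yield row


def requests_before(rows, crash_time, lookback=timedelta(seconds=30)):
    """Return the requests logged within `lookback` before `crash_time`."""
    return [r for r in rows if crash_time - lookback <= r["timestamp"] <= crash_time]


# Hypothetical sample data, not taken from the real logs.
sample = [
    "#Fields: date time cs-method cs-uri-stem sc-status",
    "2015-10-07 08:44:10 GET /cart 200",
    "2015-10-07 08:44:30 POST /checkout 302",
    "2015-10-07 08:50:00 GET /home 200",
]
suspects = requests_before(list(parse_w3c_log(sample)), datetime(2015, 10, 7, 8, 44, 37))
print([r["cs-uri-stem"] for r in suspects])
```

Running this over every crash time and counting the URLs that recur across crashes is one way to check for a pattern. Keep in mind that a request still in flight when the process dies may never be written to the log.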

Any ideas?

Update:

We have rolled back .NET 4.6 and recent Windows Updates on our web servers. We have been monitoring this for either 2 or 3 days, depending on when the server was rolled back, and in each case, there have been zero subsequent crashes, despite maintaining the same application code. This pretty definitively proves that either .NET 4.6 or the other Windows Updates caused the intermittent crashing, not our code, because w3wp.exe was previously crashing several times per day.

We are now trying to prove this to Microsoft Support, but it's an uphill battle because the issue was random, intermittent, and we could not reproduce it reliably. (They have provided a dump analysis but it seems to be a red herring.) We are also in the process of reapplying the updates in groups and waiting several days to watch for crashes, in an effort to isolate the faulty update. Obviously this is a tedious process.

Update #2:

We've now re-applied all the pre-.NET 4.6 Windows Updates that were removed in troubleshooting, and the servers have been running for several days without crashes. The only things left to re-apply are .NET 4.6 and its own updates, but my management is understandably reluctant to install things that will likely cause crashes in production. So I'm continuing to work with MS to analyze different crash dumps to pinpoint the problem.

Jordan Rieger asked Oct 07 '15 19:10


2 Answers

You didn't show any code, but the evidence suggests this is an issue with your application code, and not with .NET 4.6 or with ThreadAbortException specifically.

Basic troubleshooting steps here: you said there were code changes AND environment changes; so rule one of them out.

  • Test the app on a VM with .NET 4.5 installed. If you do not get the error, .NET 4.6 may be the cause.

  • Test an older version of your app on the same server. If the issue does not appear, the code change is the likely cause.

  • Test the app on a machine with Visual Studio installed, attach to the w3wp.exe process, and debug it (Tools > Attach to Process). Catch the ThreadAbortException and trace through it.

  • If you don't already, you should be logging the event that your w3wp.exe process stops, though this obviously will not handle all exceptions. Google this, but this guy describes a solution that I also use.

  • If you don't already, define an Application_Error handler in Global to log exceptions. Microsoft demonstrates this. Create a System.Web.Configuration option that you can toggle in your web.config file to enable different levels of logging, including writing to a local file and writing to the Windows event logs, for example. You can also install a logging handler tool like ELMAH.

  • Create a barebones web app and test Response.Redirect to verify whether it crashes w3wp.exe (the worker process) with .NET 4.6. I did this, and it didn't, so I suspect your code, or possibly a weird emergent issue with the server or patch level. These steps should help you pinpoint it.

Side note: Even though it shouldn't affect the app process, I recommend fixing the Response.Redirect() issues. We did this recently in an enterprise app, and yes, it was a change of wide scope, but we no longer get ThreadAbortExceptions. The fix is simple: pass false as the second (endResponse) argument, i.e. call Response.Redirect(url, false);, and then make sure that no code runs after that call (return immediately, for example). This post explains.

nothingisnecessary answered Nov 11 '22 12:11


@Jordan Rieger, this bug should be fixed in .NET 4.6.1. Can you please confirm whether the problem is fixed in the new framework, or whether it still persists? Thanks.

Swaroop Sridhar answered Nov 11 '22 13:11