
ThreadLocks and Static Constructors

Given:

An ASP.NET Web API application hosted in IIS. The application spawns about 30 app domains, one for each plugin, and each plugin does some external work.

The application serves a lot of users and runs very well most of the time, but occasionally (after days or even weeks) it suddenly hangs.

Problem:

One web application sometimes "hangs", and the only remedy is to restart w3wp.exe.

After examining dumps taken in this state, we found that at these moments there are a lot of threads (sometimes about 15,000).

In the normal case we never observe more than one hundred threads.

DebugDiag reports that one thread is blocking the others:

[screenshot: DebugDiag blocking-thread report]

We have seen that thread 44 (and many others, about 90% of them) has the same call at the end of its stack:

[screenshot: call stack of thread 44]

The method itself doesn't have any locking or threading behavior, but it has one uncommon aspect concerning its static constructor. The ctor looks like this:

    static TimeZoneHelper()
    {
        using (StringReader reader = new StringReader(Resources.TimeZones))
        {
            string line;

            while ((line = reader.ReadLine()) != null)
            {
                string[] parts = line.Split(';');

                TimeZoneInfo timeZone = TimeZoneInfo.FindSystemTimeZoneById(parts[1]);

                timeZones[parts[0]] = timeZone;
            }
        }
    }

Furthermore, the debug analysis indicates that the application was in the middle of an active GC (and in case you ask: we never trigger GC.Collect manually).

[screenshot: GC state in the dump]

Question: Is there any evidence that this type of code is problematic in a static ctor, even though it contains no task or threading code? Could it be related to the GC itself (the object is disposable, even though it has no real dispose code)?

TimeZoneHelper

I created a gist containing the main methods of this class, including the ctor and the method that was called, TimeZoneHelper.ToTimeZoneOffset:

https://gist.github.com/Gentlehag/9d564555261da0e73366

The method essentially boils down to a Dictionary.TryGetValue call on the dictionary that was created in the ctor.


Edit: By the way, I also want to add that an AssemblyResolve event handler is bound in each app domain. The code can be seen here:

https://gist.github.com/Gentlehag/4726b6d888adb149684d


Important update: I am a colleague and just want to add some more information. We also found another, very similar scenario. I have the stack trace from the thread that owns the lock:

000000c898897560 00007ff8855b7e5d System.Collections.Generic.Dictionary`2[[System.__Canon, mscorlib],[System.__Canon, mscorlib]].FindEntry(System.__Canon)
000000c8988975d0 00007ff8855b7d34 System.Collections.Generic.Dictionary`2[[System.__Canon, mscorlib],[System.__Canon, mscorlib]].TryGetValue(System.__Canon, System.__Canon ByRef)
000000c898897610 00007ff88f6152b3 GP.Components.Extensions.AppDomains.RemotingRunner.CurrentDomain_AssemblyResolve(System.Object, System.ResolveEventArgs)
000000c8988978a0 00007ff886f7276c System.AppDomain.OnAssemblyResolveEvent(System.Reflection.RuntimeAssembly, System.String)
000000c898897bd0 00007ff8e4b2a7f3 [GCFrame: 000000c898897bd0] 
000000c898899b78 00007ff8e4b2a7f3 [HelperMethodFrame_PROTECTOBJ: 000000c898899b78] System.Reflection.RuntimeAssembly._nLoad(System.Reflection.AssemblyName, System.String, System.Security.Policy.Evidence, System.Reflection.RuntimeAssembly, System.Threading.StackCrawlMark ByRef, IntPtr, Boolean, Boolean, Boolean)
000000c898899c80 00007ff886f7224e System.Reflection.RuntimeAssembly.InternalGetSatelliteAssembly(System.String, System.Globalization.CultureInfo, System.Version, Boolean, System.Threading.StackCrawlMark ByRef)
000000c898899d60 00007ff886f716c8 System.Resources.ManifestBasedResourceGroveler.GetSatelliteAssembly(System.Globalization.CultureInfo, System.Threading.StackCrawlMark ByRef)
000000c898899df0 00007ff885b932fb System.Resources.ManifestBasedResourceGroveler.GrovelForResourceSet(System.Globalization.CultureInfo, System.Collections.Generic.Dictionary`2, Boolean, Boolean, System.Threading.StackCrawlMark ByRef)
000000c898899eb0 00007ff885b92ecb System.Resources.ResourceManager.InternalGetResourceSet(System.Globalization.CultureInfo, Boolean, Boolean, System.Threading.StackCrawlMark ByRef)
000000c898899fa0 00007ff885b92b73 System.Resources.ResourceManager.InternalGetResourceSet(System.Globalization.CultureInfo, Boolean, Boolean)
000000c898899ff0 00007ff885b92014 System.Resources.ResourceManager.GetString(System.String, System.Globalization.CultureInfo)
000000c89889a0a0 00007ff89914aa62 NewRelic.Agent.Core.Config.ConfigurationLoader.InitializeFromXml(System.String, System.String)
000000c89889a140 00007ff89914a838 NewRelic.Agent.Core.Config.ConfigurationLoader.Initialize(System.String)
000000c89889a1a0 00007ff899143be9 NewRelic.Agent.Core.Config.ConfigurationLoader.Initialize()
000000c89889a210 00007ff899123a27 NewRelic.Agent.Core.Agent+AgentSingleton.CreateInstance()
000000c89889a280 00007ff8991239c2 NewRelic.Agent.Core.Singleton`1[[System.__Canon, mscorlib]]..ctor(System.__Canon)
000000c89889a2c0 00007ff89912388b NewRelic.Agent.Core.Agent..cctor()
000000c89889a700 00007ff8e4b2a7f3 [GCFrame: 000000c89889a700] 
000000c89889ce88 00007ff8e4b2a7f3 [PrestubMethodFrame: 000000c89889ce88] NewRelic.Agent.Core.Agent.get_Instance()
000000c89889cef0 00007ff89912358c NewRelic.Agent.Core.AgentShim.GetTracer(System.String, UInt32, System.String, System.String, System.Type, System.String, System.String, System.String, System.Object, System.Object[])
000000c89889d280 00007ff8e4b2a7f3 [DebuggerU2MCatchHandlerFrame: 000000c89889d280]

This one does not involve the TimeZoneHelper class, but there is an interesting common aspect: both classes load a resource in their static constructor (either the config file for NewRelic or the file with the time zones). So the scenario seems to be the following:

  1. Multiple threads try to use the class
  2. The first thread gets the lock for the static constructor and runs this constructor
  3. A resource is loaded and the .NET runtime tries to load a resource assembly.
  4. Our AssemblyResolve handler runs in order to load the resource assembly and somehow causes a deadlock. The question is: how?
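
Steps 1 and 2 can be illustrated in isolation. The sketch below is not the production code: SlowInit and TypeInitDemo are hypothetical names, and Thread.Sleep stands in for the resource load. It only shows that a second thread touching the class waits until the first thread's static constructor has finished, which is why a stalled constructor can pile up threads.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

// Hypothetical class whose static constructor is slow.
public static class SlowInit
{
    public static readonly int Value;

    static SlowInit()
    {
        Thread.Sleep(500); // stands in for the resource load (assumption)
        Value = 42;
    }
}

public static class TypeInitDemo
{
    // Returns how long the calling thread had to wait for SlowInit.Value, in ms.
    public static long MeasureSecondThreadWait()
    {
        // First thread triggers the static constructor.
        var first = new Thread(() => GC.KeepAlive(SlowInit.Value));
        first.Start();

        Thread.Sleep(100); // give the first thread time to enter the constructor

        var stopwatch = Stopwatch.StartNew();
        GC.KeepAlive(SlowInit.Value); // blocks until the type initializer completes
        stopwatch.Stop();

        first.Join();
        return stopwatch.ElapsedMilliseconds;
    }
}
```

Every thread that touches the class while the constructor is still running waits in the same way, so a constructor that never returns would leave all of them blocked.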
Boas Enkler asked Oct 07 '15 14:10





1 Answer

Here is my guess on what happens.

UPDATE: I think it's a recursion problem with the AssemblyResolve event. Based on the comments, a stack overflow did not occur, but there could still be a recursion problem, so the answer still applies.

There is an indication that this bug depends on the order in which resources are accessed. Most likely it happens when the first thing accessed is one of the static classes you mentioned.

When you access a resource for the first time, the AssemblyResolve event fires several times. Subsequent resource requests do not raise AssemblyResolve events. This can be demonstrated by the following code:

AppDomain.CurrentDomain.AssemblyResolve += (sender, eventArgs) =>
{
    Console.WriteLine("Resolve {0}", eventArgs.Name);
    return null;
};
Console.WriteLine(Resource1.String1);
Console.WriteLine(Resource1.String1);

Result:

Resolve ConsoleApplication1.resources, Version=1.0.0.0, Culture=ru-RU, PublicKeyToken=null
Resolve ConsoleApplication1.resources, Version=1.0.0.0, Culture=ru-RU, PublicKeyToken=null
Resolve ConsoleApplication1.resources, Version=1.0.0.0, Culture=ru, PublicKeyToken=null
Resolve ConsoleApplication1.resources, Version=1.0.0.0, Culture=ru, PublicKeyToken=null
Value from resource
Value from resource

The logger accesses resources, as these frames show:

000000c898899ff0 00007ff885b92014 System.Resources.ResourceManager.GetString(System.String, System.Globalization.CultureInfo)
000000c89889a0a0 00007ff89914aa62 NewRelic.Agent.Core.Config.ConfigurationLoader.InitializeFromXml(System.String, System.String)
000000c89889a140 00007ff89914a838 NewRelic.Agent.Core.Config.ConfigurationLoader.Initialize(System.String)
000000c89889a1a0 00007ff899143be9 NewRelic.Agent.Core.Config.ConfigurationLoader.Initialize()
000000c89889a210 00007ff899123a27 NewRelic.Agent.Core.Agent+AgentSingleton.CreateInstance()
000000c89889a280 00007ff8991239c2 NewRelic.Agent.Core.Singleton`1[[System.__Canon, mscorlib]]..ctor(System.__Canon)
000000c89889a2c0 00007ff89912388b NewRelic.Agent.Core.Agent..cctor()
000000c89889a700 00007ff8e4b2a7f3 [GCFrame: 000000c89889a700] 
000000c89889ce88 00007ff8e4b2a7f3 [PrestubMethodFrame: 000000c89889ce88] NewRelic.Agent.Core.Agent.get_Instance()
000000c89889cef0 00007ff89912358c NewRelic.Agent.Core.AgentShim.GetTracer(System.String, UInt32, System.String, System.String, System.Type, System.String, System.String, System.String, System.Object, System.Object[])

My conclusion here is that if the logger runs for the first time while no AssemblyResolve handler is bound, it initializes successfully and never raises an AssemblyResolve event afterwards.

If you access a resource for the first time from inside an AssemblyResolve handler, a recursive call happens, which leads to a StackOverflowException. This is easy to model:

AppDomain.CurrentDomain.AssemblyResolve += (sender, eventArgs) =>
{
    Console.WriteLine("Resolve {0}", eventArgs.Name);
    Console.WriteLine(Resource1.String1);
    return null;
};

Console.WriteLine(Resource1.String1);

And your handler does contain a call to the logger:

catch
{
    context.RunnerLog.Error(string.Format(CultureInfo.InvariantCulture, "Failed to load assembly {0}.", args.Name));

    result = null;
}

The outcome could differ if the logger was initialized before the AssemblyResolve handler was bound, or if some other condition kept the logger from triggering a failing AssemblyResolve event.

When the first access is a call to a static class and an exception occurs in AssemblyResolve, which you catch and log, the call to the logger accesses a resource. That access raises another AssemblyResolve event, and this recursion leads to a stack overflow.

While the first request holds the lock on the static class constructor, other requests are blocked if that operation runs for a long time before the StackOverflowException. That does not matter, though, because they would fail with a TypeInitializationException. Even that would never happen, because the domain would start to unload after the StackOverflowException anyway.

The fact that a Dictionary lookup method is on top of the stack does not matter either; it was probably just the last straw that contributed to a stack overflow.

One thing I would recommend is to use another kind of logger inside the AssemblyResolve event handlers.
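
As a rough sketch of what such a handler could look like: GuardedResolver and its Lookup hook are hypothetical names, Trace.WriteLine merely stands in for "another kind of logger" (whether a given logger is really free of resource or assembly loads has to be verified separately), and the per-thread reentrancy guard is an additional safety net that goes beyond the recommendation above.

```csharp
using System;
using System.Diagnostics;
using System.Reflection;

public static class GuardedResolver
{
    // Per-thread flag: true while this thread is already inside the handler.
    [ThreadStatic]
    private static bool resolving;

    // Hypothetical hook for the real lookup logic; returns null by default.
    public static Func<string, Assembly> Lookup = name => null;

    public static Assembly OnAssemblyResolve(object sender, ResolveEventArgs args)
    {
        if (resolving)
        {
            // A nested request caused by the handler itself: decline it
            // instead of recursing.
            return null;
        }

        resolving = true;
        try
        {
            return Lookup(args.Name);
        }
        catch (Exception)
        {
            // Deliberately minimal logging: no formatted, localized messages.
            Trace.WriteLine("Failed to load assembly " + args.Name);
            return null;
        }
        finally
        {
            resolving = false;
        }
    }
}
```

It would be bound with AppDomain.CurrentDomain.AssemblyResolve += GuardedResolver.OnAssemblyResolve;. Returning null for the nested request simply tells the runtime that this handler cannot resolve the assembly.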

Another thing is that I would try to avoid any blocking I/O in static constructors, such as resource access or manual assembly loading. Just initialize the basic stuff there, and use another concurrency mechanism for lazy initialization in the public methods themselves.
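
One possible shape for this, as a sketch only: LazyTimeZoneHelper and TryGetTimeZone are hypothetical names, and the TimeZoneData constant stands in for Resources.TimeZones (the "key;systemTimeZoneId" line format is inferred from the constructor in the question). The only static initialization left is the creation of a Lazy<T> wrapper; the actual loading happens on first use.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Threading;

public static class LazyTimeZoneHelper
{
    // Stand-in for Resources.TimeZones (assumption): "key;systemTimeZoneId" per line.
    private const string TimeZoneData = "utc;UTC";

    // Creating the Lazy<T> is cheap; Load runs only when Value is first read.
    private static readonly Lazy<Dictionary<string, TimeZoneInfo>> TimeZones =
        new Lazy<Dictionary<string, TimeZoneInfo>>(
            Load, LazyThreadSafetyMode.ExecutionAndPublication);

    private static Dictionary<string, TimeZoneInfo> Load()
    {
        var result = new Dictionary<string, TimeZoneInfo>();

        using (var reader = new StringReader(TimeZoneData))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                string[] parts = line.Split(';');
                result[parts[0]] = TimeZoneInfo.FindSystemTimeZoneById(parts[1]);
            }
        }

        return result;
    }

    // Public entry point: the first caller triggers the load.
    public static bool TryGetTimeZone(string key, out TimeZoneInfo timeZone)
    {
        return TimeZones.Value.TryGetValue(key, out timeZone);
    }
}
```

With ExecutionAndPublication the first load is still serialized, but under the Lazy<T> instance's own lock rather than the runtime's type initializer lock.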

However, I don't think that the cause of the suspected stack overflow has to do with the static constructors.

Also, there might be no stack overflow at all if the recursion proceeded too slowly for one to occur. In that case the domain could start unloading for other reasons, for example because of an IIS resource consumption guard, such as a limit on the number of threads or on overall memory consumption. This would be likely to happen if requests block for a long time.

George Polevoy answered Oct 13 '22 12:10
