Parallel.For() slows down with repeated execution. What should I look at?

I wrote a naive Parallel.For() loop in C#, shown below. I also did the same work using a regular for() loop to compare single-thread vs. multi-thread. The single thread version took about five seconds every time I ran it. The parallel version took about three seconds at first, but if I ran it about four times, it would slow down dramatically. Most often it took about thirty seconds. One time it took eighty seconds. If I restarted the program, the parallel version would start out fast again, but slow down after three or four parallel runs. Sometimes the parallel runs would speed up again to the original three seconds then slow down.

I wrote another Parallel.For() loop for computing Mandelbrot set members (discarding the results) because I figured that the problem might be related to memory issues allocating and manipulating a large array. The Parallel.For() implementation of this second problem does indeed execute faster than the single-thread version every time, and the times are consistent too.

What data should I be looking at to understand why my first naive program slows down after a number of runs? Is there something in Perfmon I should be watching? I still suspect it is memory related, but I allocate the array outside the timer. I also tried a GC.Collect() at the end of each run, but that didn't seem to help, not consistently anyway. Might it be a cache alignment issue somewhere on the processor? How would I figure that out? Is there anything else that might be the cause?

JR

    const int _meg = 1024 * 1024;
    const int _len = 1024 * _meg;

    private void ParallelArray() {
        int[] stuff = new int[_meg];
        System.Diagnostics.Stopwatch s = new System.Diagnostics.Stopwatch();
        lblStart.Content = DateTime.Now.ToString();
        s.Start();

        Parallel.For(0,
            _len,
            i => {
                stuff[i % _meg] = i;
            }
            );
        s.Stop();

        lblResult.Content = DateTime.Now.ToString();

        lblDiff.Content = s.ElapsedMilliseconds.ToString();

    }
asked Sep 11 '14 by jrv


2 Answers

I have profiled your code, and it indeed looks strange; there should be no such deviations. It is not an allocation issue (the GC is fine, and you are allocating only one array per run).

The problem can be reproduced on my Haswell CPU where the parallel version suddenly takes much longer to execute. I have CLR version 4.0.30319.34209 FX452RTMGDR.

On x64 it works fine and has no issues; only x86 builds seem to suffer from it. I have profiled it with the Windows Performance Toolkit and found what looks like a CLR issue in the code where the TPL tries to find the next work item. Sometimes the call

System.Threading.Tasks.RangeWorker.FindNewWork(Int64 ByRef, Int64 ByRef)
System.Threading.Tasks.Parallel+<>c__DisplayClassf`1[[System.__Canon, mscorlib]].<ForWorker>b__c()
System.Threading.Tasks.Task.InnerInvoke()
System.Threading.Tasks.Task.InnerInvokeWithArg(System.Threading.Tasks.Task)
System.Threading.Tasks.Task+<>c__DisplayClass11.<ExecuteSelfReplicating>b__10(System.Object)
System.Threading.Tasks.Task.InnerInvoke()

seems to "hang" inside the CLR itself, at clr!COMInterlocked::ExchangeAdd64+0x4d.

When I compare the sampled stacks with a slow and fast run I find:

ntdll.dll!__RtlUserThreadStart  -52%
kernel32.dll!BaseThreadInitThunk  -52%
ntdll.dll!_RtlUserThreadStart  -52% 
clr.dll!Thread::intermediateThreadProc  -48%
clr.dll!ThreadpoolMgr::ExecuteWorkRequest  -48%
clr.dll!ManagedPerAppDomainTPCount::DispatchWorkItem  -48%
clr.dll!ManagedThreadBase_FullTransitionWithAD  -48%
clr.dll!ManagedThreadBase_DispatchOuter  -48%
clr.dll!ManagedThreadBase_DispatchMiddle  -48%
clr.dll!ManagedThreadBase_DispatchInner  -48%
clr.dll!QueueUserWorkItemManagedCallback  -48% 
clr.dll!MethodDescCallSite::CallTargetWorker  -48%
clr.dll!CallDescrWorkerWithHandler  -48%
mscorlib.ni.dll!System.Threading._ThreadPoolWaitCallback.PerformWaitCallback()  -48%
mscorlib.ni.dll!System.Threading.Tasks.Task.System.Threading.IThreadPoolWorkItem.ExecuteWorkItem()  -48%
mscorlib.ni.dll!System.Threading.Tasks.Task.ExecuteEntry(Boolean)  -48%
mscorlib.ni.dll!System.Threading.Tasks.Task.ExecuteWithThreadLocal(System.Threading.Tasks.TaskByRef)  -48%
mscorlib.ni.dll!System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext System.Threading.ContextCallback System.Object Boolean)  -48%
mscorlib.ni.dll!System.Threading.Tasks.Task.ExecutionContextCallback(System.Object)  -48%
mscorlib.ni.dll!System.Threading.Tasks.Task.Execute()  -48%
mscorlib.ni.dll!System.Threading.Tasks.Task.InnerInvoke()  -48%
mscorlib.ni.dll!System.Threading.Tasks.Task+<>c__DisplayClass11.<ExecuteSelfReplicating>b__10(System.Object)  -48%
mscorlib.ni.dll!System.Threading.Tasks.Task.InnerInvokeWithArg(System.Threading.Tasks.Task)  -48%
mscorlib.ni.dll!System.Threading.Tasks.Task.InnerInvoke()  -48%
ParllelForSlowDown.exe!ParllelForSlowDown.Program+<>c__DisplayClass1::<ParallelArray>b__0  -24%
ParllelForSlowDown.exe!ParllelForSlowDown.Program+<>c__DisplayClass1::<ParallelArray>b__0<itself>  -24%
...
clr.dll!COMInterlocked::ExchangeAdd64  +50%

In the dysfunctional case most of the time (50%) is spent in clr.dll!COMInterlocked::ExchangeAdd64. This method was compiled with FPO (frame pointer omission) to squeeze out more performance, which is why the stacks were broken in the middle. I had thought such code was not allowed in the Windows code base because it makes profiling harder; it looks like the optimizations have gone too far here. When I single-step with the debugger to the actual exchange operation

eax=01c761bf ebx=01c761cf ecx=00000000 edx=00000000 esi=00000000 edi=0274047c
eip=747ca4bd esp=050bf6fc ebp=01c761bf iopl=0         nv up ei pl zr na pe nc
cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00000246
clr!COMInterlocked::ExchangeAdd64+0x49:
747ca4bd f00fc70f        lock cmpxchg8b qword ptr [edi] ds:002b:0274047c=0000000001c761bf

cmpxchg8b compares EDX:EAX = 1c761bf with the memory location and, if the values are equal, copies the new value ECX:EBX = 1c761cf to the memory location. When you look at the registers, you find that at index 0x1c761bf = 29,843,903 the values are not equal. It looks like there is a race condition (or excessive contention) when incrementing the global loop counter, which only surfaces when your loop body does so little work that the counter update dominates.
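The effect of such contention on a shared 64-bit loop counter can be illustrated with a small sketch (my own simplified model, not the CLR's actual RangeWorker code): every thread claims indices from one shared counter via a CAS loop, and when the loop body is nearly empty, the threads spend most of their time retrying the CAS against each other.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

class ContentionSketch
{
    // Hypothetical illustration: a shared 64-bit counter that many threads
    // advance with a compare-and-swap loop.
    static long _next;

    static long ClaimIndex(long limit)
    {
        while (true)
        {
            long current = Volatile.Read(ref _next);
            if (current >= limit) return -1;                  // range exhausted
            // The CAS fails whenever another thread advanced the counter
            // first, so under heavy contention this loop spins.
            if (Interlocked.CompareExchange(ref _next, current + 1, current) == current)
                return current;
        }
    }

    static void Main()
    {
        const long limit = 1_000_000;
        long claimed = 0;
        Parallel.For(0, Environment.ProcessorCount, _ =>
        {
            while (ClaimIndex(limit) >= 0)
                Interlocked.Increment(ref claimed);           // the "work" per index
        });
        Console.WriteLine(claimed); // every index claimed exactly once: 1000000
    }
}
```

On x86, Interlocked operations on a long compile down to the same lock cmpxchg8b seen in the disassembly above, which is exactly where the contended retries burn their cycles.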

Congrats, you have found a real bug in the .NET Framework! You should report it on the Connect website to make them aware of this issue.

To be absolutely sure that it is not another issue you can try the parallel loop with an empty delegate:

    System.Diagnostics.Stopwatch s = new System.Diagnostics.Stopwatch();
    s.Start();
    Parallel.For(0,_len, i => {});
    s.Stop();
    System.Console.WriteLine(s.ElapsedMilliseconds.ToString());

This also reproduces the issue, so it is definitely a CLR issue. Normally we at SO tell people not to try to write lock-free code since it is very hard to get right, but even the smartest guys at MS seem to get it wrong sometimes ....

Update: I have opened a bug report here: https://connect.microsoft.com/VisualStudio/feedbackdetail/view/969699/parallel-for-causes-random-slowdowns-in-x86-processes

answered Nov 14 '22 by Alois Kraus


Based on your program, I wrote a program to reproduce the problem. I think it is related to the .NET Large Object Heap and to how Parallel.For is implemented.

    class Program
    {
        static void Main(string[] args)
        {
            for (int i = 0; i < 10; i++)
                //ParallelArray();
                SingleFor();
        }

        const int _meg = 1024 * 1024;
        const int _len = 1024 * _meg;

        static void ParallelArray()
        {
            int[] stuff = new int[_meg];
            System.Diagnostics.Stopwatch s = new System.Diagnostics.Stopwatch();
            s.Start();
            Parallel.For(0,
                _len,
                i =>
                {
                    stuff[i % _meg] = i;
                });
            s.Stop();
            System.Console.WriteLine(s.ElapsedMilliseconds.ToString());
        }

        static void SingleFor()
        {
            int[] stuff = new int[_meg];
            System.Diagnostics.Stopwatch s = new System.Diagnostics.Stopwatch();
            s.Start();
            for (int i = 0; i < _len; i++)
            {
                stuff[i % _meg] = i;
            }
            s.Stop();
            System.Console.WriteLine(s.ElapsedMilliseconds.ToString());
        }
    }

I compiled it with VS2013 (release build) and ran it without the debugger. If ParallelArray() is called in the main loop, the result I got is:

1631
1510
51302
1874
45243
2045
1587
1976
44257
1635

if the function SingleFor() is called, the result is:

898
901
897
897
897
898
897
897
899
898

I went through some documentation on MSDN about Parallel.For, and this caught my attention: "Writing to shared variables. If the body of a loop writes to a shared variable, there is a loop body dependency. This is a common case that occurs when you are aggregating values." In our Parallel.For loop, we are writing to the shared array stuff.

The article Parallel Aggregation explains how .NET deals with this case: "The Parallel Aggregation pattern uses unshared, local variables that are merged at the end of the computation to give the final result. Using unshared, local variables for partial, locally calculated results is how the steps of a loop can become independent of each other. Parallel aggregation demonstrates the principle that it's usually better to make changes to your algorithm than to add synchronization primitives to an existing algorithm." In other words, it creates local copies of data instead of using locks to guard the shared variable, and at the end these partitions need to be combined, which brings a performance penalty.
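As a sketch of the pattern the article describes (the numbers here are illustrative, not the poster's benchmark): the Parallel.For overload with localInit/localFinally lets each partition accumulate into an unshared local, so only the final merge touches shared state.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

class ParallelAggregation
{
    static void Main()
    {
        const int n = 1_000_000;
        long total = 0;

        Parallel.For(0, n,
            () => 0L,                                    // per-partition local init
            (i, state, local) => local + i,              // no shared writes in the body
            local => Interlocked.Add(ref total, local)); // merge once per partition

        Console.WriteLine(total); // 499999500000 == n * (n - 1) / 2
    }
}
```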

When I ran the test program with Parallel.For, I used Process Explorer to count the threads: it had 11 threads, so Parallel.For created 10 partitions for the loop, which means it creates 10 local copies with size 100K, and these objects will be placed on the Large Object Heap.

There are two different heaps in .NET: the Small Object Heap (SOH) and the Large Object Heap (LOH). If an object is larger than 85,000 bytes, it is allocated on the LOH. When doing a GC, .NET treats the two heaps differently.

As explained in the blog post No More Memory Fragmentation on the .NET Large Object Heap: "One of the key differences between the heaps is that the SOH compacts memory and hence reduces the chance of memory fragmentation dramatically while the LOH does not employ compaction. As a result, excessive usage of the LOH may result in memory fragmentation that can become severe enough to cause problems in applications."

As you are continuously allocating big arrays (> 85,000 bytes), the LOH becomes fragmented and performance goes down.

If you're using .NET 4.5.1, you can set GCSettings.LargeObjectHeapCompactionMode to GCLargeObjectHeapCompactionMode.CompactOnce to have the LOH compacted on the next GC.Collect().
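A minimal sketch of that setting (assuming .NET 4.5.1 or later; the array size here is illustrative):

```csharp
using System;
using System.Runtime;

class LohCompaction
{
    static void Main()
    {
        // A 4 MB array is well over the 85,000-byte threshold, so it lands
        // on the LOH; large objects report the maximum GC generation.
        int[] big = new int[1024 * 1024];
        big[0] = 1;
        Console.WriteLine(GC.GetGeneration(big)); // prints 2

        // Request LOH compaction on the next blocking collection.
        GCSettings.LargeObjectHeapCompactionMode =
            GCLargeObjectHeapCompactionMode.CompactOnce;
        GC.Collect(); // the LOH is compacted during this collection

        // The mode automatically resets to Default afterwards.
        Console.WriteLine(
            GCSettings.LargeObjectHeapCompactionMode ==
            GCLargeObjectHeapCompactionMode.Default); // prints True
    }
}
```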

Another good article to understand this problem is: Large Object Heap Uncovered

Further investigation is needed, but I don't have time now.

answered Nov 14 '22 by Matt