Why doesn't my threaded .Net app scale linearly when allocating large amounts of memory?

I’ve run into something strange about the effect of large memory allocations on the scalability of the .Net runtime. In my test application I create lots of strings in a tight loop for a fixed number of cycles and print the rate of loop iterations per millisecond. The weirdness comes in when I run this loop in several threads: the rate does not appear to increase linearly. The problem gets even worse when you create large strings.

Let me show you the results. My machine is an 8 GB, 8-core box running Windows Server 2008 R1, 32-bit. It has two 4-core Intel Xeon 1.83 GHz (E5320) processors. The "work" performed is a set of alternating calls to ToUpper() and ToLower() on a string. I run the test for one thread, two threads, and so on, up to the maximum. The columns in the table below are:

  • Rate: The number of loops across all threads divided by the duration in milliseconds (the code uses Stopwatch.ElapsedMilliseconds).
  • Linear Rate: The ideal rate if performance were to scale linearly. It is calculated as the rate achieved by one thread multiplied by the number of threads for that test.
  • Variance: Calculated as the percentage by which the rate falls short of the linear rate.
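The variance column can be reproduced by hand. The following is a minimal sketch of that calculation; the class and method names (ScalingMath, VariancePercent) are made up for illustration and are not part of the test program. The sample figures are the 8-thread row of Example 1.

```csharp
using System;

public static class ScalingMath
{
    // Percentage by which the measured rate falls short of ideal linear scaling.
    // A negative result means the run beat the linear projection.
    public static double VariancePercent(double rate, double singleThreadRate, int threads)
    {
        double linearRate = singleThreadRate * threads;
        return (1.0 - rate / linearRate) * 100.0;
    }

    public static void Main()
    {
        // Measured rate with 8 threads, single-thread rate, thread count.
        double variance = VariancePercent(2051.28, 322.58, 8);
        Console.WriteLine("{0:0.00} %", variance);
    }
}
```

With those inputs the result comes out at roughly 20.5 %, which matches the last row of the first table.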

Example 1: 10,000 loops, 8 threads, 1024 chars per string

The first example starts off with one thread, then two threads and eventually runs the test with eight threads. Each thread creates 10,000 strings of 1024 chars each:

Creating 10000 strings per thread, 1024 chars each, using up to 8 threads
GCMode = Server

Rate          Linear Rate   % Variance    Threads
--------------------------------------------------------
322.58        322.58        0.00 %        1
689.66        645.16        -6.90 %       2
882.35        967.74        8.82 %        3
1081.08       1290.32       16.22 %       4
1388.89       1612.90       13.89 %       5
1666.67       1935.48       13.89 %       6
2000.00       2258.07       11.43 %       7
2051.28       2580.65       20.51 %       8
Done.

Example 2: 10,000 loops, 8 threads, 32,000 chars per string

In the second example I’ve increased the number of chars for each string to 32,000.

Creating 10000 strings per thread, 32000 chars each, using up to 8 threads
GCMode = Server

Rate          Linear Rate   % Variance    Threads
--------------------------------------------------------
14.10         14.10         0.00 %        1
24.36         28.21         13.64 %       2
33.15         42.31         21.66 %       3
40.98         56.42         27.36 %       4
48.08         70.52         31.83 %       5
61.35         84.63         27.51 %       6
72.61         98.73         26.45 %       7
67.85         112.84        39.86 %       8
Done.

Notice the difference in variance from the linear rate: with eight threads, the actual rate in the second table is almost 40% below the linear rate, compared with about 20% in the first table.

My question is: Why does this app not scale linearly?

My Observations

False Sharing

I initially thought that this could be due to False Sharing but, as you’ll see in the source code, I’m not sharing any collections and the strings are quite big. The only overlap that could exist is at the beginning of one string and the end of another.

Server-mode Garbage Collector

I’m using gcServer enabled=true so that each core gets its own heap and garbage collector thread.
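One way to judge whether the collector is part of the slowdown is to count how many collections a run triggers. The sketch below is an assumption about how one might measure that, not something taken from the test program; GcActivity and CountCollections are hypothetical names. The counts returned by GC.CollectionCount are process-wide, so with several worker threads they cover all of them together.

```csharp
using System;

public static class GcActivity
{
    // Runs the given work and returns, per generation, how many collections
    // happened while it ran (index 0 = gen 0, and so on).
    public static int[] CountCollections(Action work)
    {
        int generations = GC.MaxGeneration + 1;
        int[] before = new int[generations];
        for (int g = 0; g < generations; g++)
        {
            before[g] = GC.CollectionCount(g);
        }

        work();

        int[] delta = new int[generations];
        for (int g = 0; g < generations; g++)
        {
            delta[g] = GC.CollectionCount(g) - before[g];
        }
        return delta;
    }

    public static void Main()
    {
        string source = new string('a', 32000);
        int[] counts = CountCollections(() =>
        {
            for (int i = 0; i < 10000; i++)
            {
                // Same alternating pattern as the test program.
                string result = (i % 2 == 0) ? source.ToUpper() : source.ToLower();
                GC.KeepAlive(result);
            }
        });

        for (int g = 0; g < counts.Length; g++)
        {
            Console.WriteLine("Gen {0} collections: {1}", g, counts[g]);
        }
    }
}
```

Comparing the counts for the 1024-char and 32,000-char runs may show whether the larger strings simply cause more frequent collections.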

Large Object Heap

I don't think that the objects I allocate are being sent to the Large Object Heap, because they are smaller than 85,000 bytes. Even the 32,000-char strings take about 64,000 bytes at two bytes per char.
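The Large Object Heap assumption can be checked directly. This sketch relies on the fact that the runtime reports objects on the Large Object Heap as generation 2 immediately after allocation, while ordinary allocations start in generation 0. LohCheck and GenerationOfNewString are hypothetical names; the 50,000-char case is included only as a comparison that should cross the threshold.

```csharp
using System;

public static class LohCheck
{
    // Allocates a string of the given length and returns the generation the
    // GC reports for it straight away. Large Object Heap allocations are
    // reported as generation 2; small-object allocations normally start at 0.
    public static int GenerationOfNewString(int length)
    {
        string s = new string('a', length);
        return GC.GetGeneration(s);
    }

    public static void Main()
    {
        // The two sizes used in the question, plus one that is well over
        // 85,000 bytes (50,000 chars at two bytes each).
        Console.WriteLine("1024 chars:  gen {0}", GenerationOfNewString(1024));
        Console.WriteLine("32000 chars: gen {0}", GenerationOfNewString(32000));
        Console.WriteLine("50000 chars: gen {0}", GenerationOfNewString(50000));
    }
}
```

If the first two lines report a generation below 2, the strings in both examples are staying on the small object heap, as assumed above. A collection that happens to run between the allocation and the check could promote a small object, so a single odd reading is not conclusive.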

String Interning

I thought that string values may be shared under the hood due to interning (MSDN), so I tried compiling with interning disabled. This produced worse results than those shown above.

Other data types

I tried the same example using small and large integer arrays, in which I loop through each element and change the value. It produces similar results, following the trend of performing worse with larger allocations.

Source Code

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading;
using System.Diagnostics;
using System.Runtime;
using System.Runtime.CompilerServices;

namespace StackOverflowExample
{
  public class Program
  {
    private static int columnWidth = 14;

    static void Main(string[] args)
    {
      int loopCount, maxThreads, stringLength;
      loopCount = maxThreads = stringLength = 0;
      try
      {
        loopCount = args.Length != 0 ? Int32.Parse(args[0]) : 1000;
        maxThreads = args.Length != 0 ? Int32.Parse(args[1]) : 4;
        stringLength = args.Length != 0 ? Int32.Parse(args[2]) : 1024;
      }
      catch
      {
        Console.WriteLine("Usage: StackOverFlowExample.exe [loopCount] [maxThreads] [stringLength]");
        System.Environment.Exit(2);
      }

      float rate;
      float linearRate = 0;
      Stopwatch stopwatch;
      Console.WriteLine("Creating {0} strings per thread, {1} chars each, using up to {2} threads", loopCount, stringLength, maxThreads);
      Console.WriteLine("GCMode = {0}", GCSettings.IsServerGC ? "Server" : "Workstation");
      Console.WriteLine();
      PrintRow("Rate", "Linear Rate", "% Variance", "Threads");
      PrintRow(4, "".PadRight(columnWidth, '-'));

      for (int runCount = 1; runCount <= maxThreads; runCount++)
      {
        // Create the workers
        Worker[] workers = new Worker[runCount];
        workers.Length.Range().ForEach(index => workers[index] = new Worker());

        // Start timing and kick off the threads
        stopwatch = Stopwatch.StartNew();
        workers.ForEach(w => new Thread(
          new ThreadStart(
            () => w.DoWork(loopCount, stringLength)
          )
        ).Start());

        // Wait until all threads are complete
        WaitHandle.WaitAll(
          workers.Select(p => p.Complete).ToArray());
        stopwatch.Stop();

        // Print the results (rate is loops per millisecond, summed across all threads)
        rate = (float)loopCount * runCount / stopwatch.ElapsedMilliseconds;
        if (runCount == 1) { linearRate = rate; }

        PrintRow(String.Format("{0:#0.00}", rate),
          String.Format("{0:#0.00}", linearRate * runCount),
          String.Format("{0:#0.00} %", (1 - rate / (linearRate * runCount)) * 100),
          runCount.ToString()); 
      }
      Console.WriteLine("Done.");
    }

    private static void PrintRow(params string[] columns)
    {
      columns.ForEach(c => Console.Write(c.PadRight(columnWidth)));
      Console.WriteLine();
    }

    private static void PrintRow(int repeatCount, string column)
    {
      for (int counter = 0; counter < repeatCount; counter++)
      {
        Console.Write(column.PadRight(columnWidth));
      }
      Console.WriteLine();
    }
  }

  public class Worker
  {
    public ManualResetEvent Complete { get; private set; }

    public Worker()
    {
      Complete = new ManualResetEvent(false);
    }

    public void DoWork(int loopCount, int stringLength)
    {
      // Build the string
      string theString = "".PadRight(stringLength, 'a');
      for (int counter = 0; counter < loopCount; counter++)
      {
        if (counter % 2 == 0) { theString.ToUpper(); }
        else { theString.ToLower(); }
      }
      Complete.Set();
    }
  }

  public static class HandyExtensions
  {
    public static IEnumerable<int> Range(this int max)
    {
      for (int counter = 0; counter < max; counter++)
      {
        yield return counter;
      }
    }

    public static void ForEach<T>(this IEnumerable<T> items, Action<T> action)
    {
      foreach(T item in items)
      {
        action(item);
      }
    }
  }
}

App.Config

<?xml version="1.0" encoding="utf-8" ?>
<configuration>
  <runtime>
    <gcServer enabled="true"/>
  </runtime>
</configuration>

Running the Example

To run StackOverflowExample.exe on your box, call it with these command-line parameters:

StackOverFlowExample.exe [loopCount] [maxThreads] [stringLength]

  • loopCount: The number of times each thread will manipulate the string.
  • maxThreads: The number of threads to progress to.
  • stringLength: the number of characters to fill the string with.

user141682 asked Jan 15 '10 15:01


2 Answers

You may want to look at this question of mine.

I ran into a similar problem that was due to the fact that the CLR performs inter-thread synchronization when allocating memory to avoid overlapping allocations. Now, with the server GC, the locking algorithm may be different - but something along those same lines may be affecting your code.
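To test whether allocation is what limits scaling, one could swap in a worker that does the same case-flipping without allocating inside the loop. This is only a sketch under that assumption: BufferWorker is a hypothetical replacement for the Worker.DoWork body in the question, and it changes the workload slightly because it converts chars in place instead of calling ToUpper() and ToLower() on a string.

```csharp
using System;

public static class BufferWorker
{
    // Flips the case of a private buffer in place. The buffer is allocated
    // once before the loop, so the loop body makes no managed allocations.
    public static char[] DoWork(int loopCount, int stringLength)
    {
        char[] buffer = new string('a', stringLength).ToCharArray();
        for (int counter = 0; counter < loopCount; counter++)
        {
            bool toUpper = counter % 2 == 0;
            for (int i = 0; i < buffer.Length; i++)
            {
                buffer[i] = toUpper
                    ? char.ToUpperInvariant(buffer[i])
                    : char.ToLowerInvariant(buffer[i]);
            }
        }
        return buffer;
    }

    public static void Main()
    {
        // Small demonstration run; the final pass (counter = 2) upper-cases.
        char[] result = DoWork(3, 8);
        Console.WriteLine(new string(result));
    }
}
```

If this variant scales noticeably closer to linear across threads, that would point toward allocation and collection overhead. If it scales just as poorly, the cause is more likely outside the allocator, for example memory bandwidth or cache contention.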

LBushkin answered Sep 16 '22 23:09


The hardware you're running this on is not capable of linear scaling of multiple processes or threads.

You have a single memory bank, and that's a bottleneck. Multi-channel memory may improve access, but not for more processes than you have memory banks (it seems the E5320 processor supports 1 to 4 memory channels).

There is only one memory controller per physical CPU package (two in your case). That's a bottleneck.

There are two L2 caches per CPU package. That's a bottleneck. Cache coherency issues will arise if that cache is exhausted.

This doesn't even get to the OS/RTL/VM issues in managing process scheduling and memory management, which will also contribute to non-linear scaling.

I think you're getting pretty reasonable results: a significant speedup with multiple threads, and at each increment up to 8...

Truly, have you ever read anything to suggest that commodity multi-CPU hardware is capable of linear scaling of multiple processes/threads? I haven't.

SuperMagic answered Sep 18 '22 23:09