
Mono multiprocessing performance issue

I am having severe performance issues when running compute-intensive multiprocessed code on Mono. The simple snippet below, which estimates the value of pi using Monte Carlo methods, demonstrates the issue.

The program spawns a number of threads equal to the number of logical cores on the current machine, and performs an identical computation on each. When run on an Intel Core i7 laptop with Windows 7 using the .NET Framework 4.5, the entire process runs in 4.2 s, and the relative standard deviation among the threads' respective execution times is 2%.

However, when run on the same machine (and operating system) using Mono 2.10.9, the overall execution time shoots up to 18 s. There is a huge variance among the respective threads’ performances, with the fastest completing in just 5.6 s whilst the slowest takes 18 s. The average is 14 s, and the relative standard deviation is 28%.
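
For reference, the relative standard deviation quoted here is the standard deviation of the per-thread durations divided by their mean. A minimal sketch, assuming the population (rather than sample) standard deviation, which is the variant that reproduces roughly 2% and 28% from the per-thread durations listed in the outputs further down; the class and method names are illustrative only:

```csharp
using System;
using System.Linq;

static class TimingStats
{
    // Relative standard deviation: population standard deviation divided by the mean.
    public static double RelativeStdDev(double[] values)
    {
        double mean = values.Average();
        double variance = values.Select(v => (v - mean) * (v - mean)).Average();
        return Math.Sqrt(variance) / mean;
    }

    public static void Main()
    {
        // Per-thread durations in milliseconds, copied from the outputs in this post.
        double[] dotNet = { 4199, 3987, 4002, 4032, 3956, 3980, 4036, 4160 };
        double[] mono = { 10023, 13203, 14776, 15564, 17888, 16776, 16050, 5561 };

        Console.WriteLine(".NET: " + RelativeStdDev(dotNet).ToString("P1"));
        Console.WriteLine("Mono: " + RelativeStdDev(mono).ToString("P1"));
    }
}
```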

The cause does not appear to be thread scheduling. Pinning each thread to a distinct core (by calling BeginThreadAffinity and SetThreadAffinityMask) does not have any significant effect on the threads’ durations or variances.
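
The pinning can be sketched roughly as follows. This is a Windows-only illustration that assumes P/Invoke declarations for the kernel32 functions GetCurrentThread and SetThreadAffinityMask; the helper names (ThreadPinning, MaskForCore, PinCurrentThreadToCore) are hypothetical and not part of the program shown below:

```csharp
using System;
using System.Runtime.InteropServices;
using System.Threading;

static class ThreadPinning
{
    [DllImport("kernel32.dll")]
    static extern IntPtr GetCurrentThread();

    [DllImport("kernel32.dll", SetLastError = true)]
    static extern UIntPtr SetThreadAffinityMask(IntPtr hThread, UIntPtr dwThreadAffinityMask);

    // Bit i of the affinity mask selects logical processor i.
    public static ulong MaskForCore(int coreIndex)
    {
        return 1UL << coreIndex;
    }

    // Call at the start of the thread's work; call Thread.EndThreadAffinity() when done.
    public static void PinCurrentThreadToCore(int coreIndex)
    {
        // Tell the runtime that the following code depends on the identity of the OS thread.
        Thread.BeginThreadAffinity();

        UIntPtr mask = new UIntPtr(MaskForCore(coreIndex));
        UIntPtr previous = SetThreadAffinityMask(GetCurrentThread(), mask);

        // The Win32 call returns the previous mask, or zero on failure.
        if (previous == UIntPtr.Zero)
            throw new InvalidOperationException(
                "SetThreadAffinityMask failed, error " + Marshal.GetLastWin32Error());
    }
}
```

With a helper like this, each worker would call PinCurrentThreadToCore with its own index before starting its stopwatch; passing the same index on every thread pins them all to one processor.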

Similarly, running the computation on each thread multiple times (and timing them individually) also gives seemingly ad hoc durations. Thus, the issue doesn’t appear to be caused by per-processor warm-up times either.
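
A minimal sketch of how such repeated, individually timed runs can be collected on one thread; the helper name TimeRuns and its parameters are illustrative and do not appear in the program below:

```csharp
using System;
using System.Diagnostics;

static class RepeatedTiming
{
    // Runs `work` `repetitions` times on the calling thread and times each run separately.
    public static TimeSpan[] TimeRuns(Action work, int repetitions)
    {
        TimeSpan[] durations = new TimeSpan[repetitions];

        for (int run = 0; run < repetitions; ++run)
        {
            Stopwatch stopwatch = Stopwatch.StartNew();
            work();
            stopwatch.Stop();
            durations[run] = stopwatch.Elapsed;
        }

        return durations;
    }
}
```

Inside each worker thread, this could wrap the sampling loop, for example TimeRuns(() => SamplePoints(new Random(0)), 5). If warm-up were the cause, the later runs on a given thread would be expected to settle down to similar durations.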

What did make a difference was pinning all 8 threads to the same processor. In this case, the overall execution time was 25 s, which is only 1% slower than executing 8× the work on a single thread. Furthermore, the relative standard deviation dropped to under 1%. Thus, the issue lies not in Mono's multithreading per se, but in its multiprocessing.

Does anyone have a solution on how to fix this performance issue?

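// Namespaces required by the members below, which belong inside a class:
using System;
using System.Diagnostics;
using System.Linq;
using System.Threading;

static long limit = 1L << 26;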

static long[] results;
static TimeSpan[] timesTaken;

internal static void Main(string[] args)
{
    int processorCount = Environment.ProcessorCount;

    Console.WriteLine("Thread count: " + processorCount);
    Console.WriteLine("Number of points per thread: " + limit.ToString("N0"));

    Thread[] threads = new Thread[processorCount];            
    results = new long[processorCount];
    timesTaken = new TimeSpan[processorCount];

    for (int i = 0; i < processorCount; ++i)
        threads[i] = new Thread(ComputeMonteCarloPi);

    Stopwatch stopwatch = Stopwatch.StartNew();

    for (int i = 0; i < processorCount; ++i)
        threads[i].Start(i);

    for (int i = 0; i < processorCount; ++i)
        threads[i].Join();

    stopwatch.Stop();

    double average = results.Average();
    double ratio = average / limit;
    double pi = ratio * 4;

    Console.WriteLine("Pi: " + pi);

    Console.WriteLine("Overall duration:   " + FormatTime(stopwatch.Elapsed));
    Console.WriteLine();

    for (int i = 0; i < processorCount; ++i)
        Console.WriteLine("Thread " + i.ToString().PadLeft(2, '0') + " duration: " + FormatTime(timesTaken[i]));

    Console.ReadKey();
}

static void ComputeMonteCarloPi(object o)
{
    int processorID = (int)o;

    // Same seed on every thread, so each thread performs an identical computation.
    Random random = new Random(0);
    Stopwatch stopwatch = Stopwatch.StartNew();

    long hits = SamplePoints(random);

    stopwatch.Stop();

    timesTaken[processorID] = stopwatch.Elapsed;
    results[processorID] = hits;
}

private static long SamplePoints(Random random)
{
    long hits = 0;

    for (long i = 0; i < limit; ++i)
    {
        double x = random.NextDouble() - 0.5;
        double y = random.NextDouble() - 0.5;

        if (x * x + y * y <= 0.25)
            hits++;
    }

    return hits;
}

static string FormatTime(TimeSpan time, int padLeft = 7)
{
    return time.TotalMilliseconds.ToString("N0").PadLeft(padLeft);
}

Output on .NET:

Thread count: 8
Number of points per thread: 67,108,864
Pi: 3.14145541191101
Overall duration:     4,234

Thread 00 duration:   4,199
Thread 01 duration:   3,987
Thread 02 duration:   4,002
Thread 03 duration:   4,032
Thread 04 duration:   3,956
Thread 05 duration:   3,980
Thread 06 duration:   4,036
Thread 07 duration:   4,160

Output on Mono:

Thread count: 8
Number of points per thread: 67,108,864
Pi: 3.14139330387115
Overall duration:    17,890

Thread 00 duration:  10,023
Thread 01 duration:  13,203
Thread 02 duration:  14,776
Thread 03 duration:  15,564
Thread 04 duration:  17,888
Thread 05 duration:  16,776
Thread 06 duration:  16,050
Thread 07 duration:   5,561

Output on Mono, with all threads pinned to same processor:

Thread count: 8
Number of points per thread: 67,108,864
Pi: 3.14139330387115
Overall duration:    25,260

Thread 00 duration:  24,704
Thread 01 duration:  25,191
Thread 02 duration:  24,689
Thread 03 duration:  24,697
Thread 04 duration:  24,716
Thread 05 duration:  24,725
Thread 06 duration:  24,707
Thread 07 duration:  24,720

Output on Mono, single thread:

Thread count: 1
Number of points per thread: 536,870,912
Pi: 3.14153660088778
Overall duration:    25,090
Asked by Douglas, Jul 09 '13 17:07


1 Answer

Running with mono --gc=sgen fixed it for me, as expected (using Mono 3.0.10).

The underlying issue is that thread-local allocation for the Boehm garbage collector requires some special tuning when used in conjunction with typed allocation or large blocks. This is not only somewhat non-trivial but also has some downsides: you either make marking more complicated/expensive or you require one freelist per thread and type (well, per memory layout).

Thus, by default, the Boehm GC only supports thread-local allocation for completely pointer-free memory areas or areas where every word can be a pointer, up to a maximum of 256 bytes or so.

But without thread-local allocation, each allocation acquires a global lock, which becomes a bottleneck.
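
To illustrate the general effect, here is a rough sketch that compares threads contending for a single lock on every step with threads that work on private state and publish once at the end. It is only a loose analogy and not Boehm's actual allocator code; all names and the iteration count are made up for the example, and the measured gap will vary by runtime and machine:

```csharp
using System;
using System.Diagnostics;
using System.Threading;

static class LockContentionSketch
{
    public const int Iterations = 5000000;

    static readonly object globalLock = new object();
    static long sharedCounter;

    // Every step takes the same lock, loosely like an allocator without thread-local free lists.
    public static void SharedLockWork()
    {
        for (int i = 0; i < Iterations; ++i)
            lock (globalLock) { sharedCounter++; }
    }

    // Each thread works on private state and publishes once, loosely like thread-local allocation.
    public static void ThreadLocalWork()
    {
        long local = 0;
        for (int i = 0; i < Iterations; ++i)
            local++;
        Interlocked.Add(ref sharedCounter, local);
    }

    // Runs `work` on `threadCount` threads and returns the final counter value.
    public static long Run(ThreadStart work, int threadCount, out TimeSpan elapsed)
    {
        sharedCounter = 0;
        Thread[] threads = new Thread[threadCount];
        for (int i = 0; i < threadCount; ++i)
            threads[i] = new Thread(work);

        Stopwatch stopwatch = Stopwatch.StartNew();
        foreach (Thread t in threads) t.Start();
        foreach (Thread t in threads) t.Join();
        stopwatch.Stop();

        elapsed = stopwatch.Elapsed;
        return sharedCounter;
    }

    public static void Main()
    {
        int threadCount = Environment.ProcessorCount;
        TimeSpan shared, local;
        Run(SharedLockWork, threadCount, out shared);
        Run(ThreadLocalWork, threadCount, out local);
        Console.WriteLine("Global lock:  " + shared.TotalMilliseconds.ToString("N0") + " ms");
        Console.WriteLine("Thread-local: " + local.TotalMilliseconds.ToString("N0") + " ms");
    }
}
```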

The SGen garbage collector is custom-written for Mono, specifically designed to work fast in a multi-threaded system, and does not have these problems.

Answered by Reimer Behrends, Nov 11 '22 16:11