What heuristic uses TPL to determine when to use multiple cores

Question

We knows that TPL (so PLINQ too) doesn't consume all cores if he think that task is easy and executes it on single core. But he does it even for a complicated task! For example, here is code from article about Java parallelism:

import org.openjdk.jmh.infra.Blackhole;
import org.openjdk.jmh.annotations.*;
import java.util.concurrent.TimeUnit;
import java.util.stream.IntStream;
import java.math.BigInteger;

@Warmup(iterations=5)
@Measurement(iterations=10)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@State(Scope.Benchmark)
@Fork(2)
public class Factorial {
    private static final BigInteger ONE = BigInteger.valueOf(1);

    @Param({"10", "100", "1000", "10000", "50000"})
    private int n;

    public static BigInteger naive(int n) {
        BigInteger r = ONE;
        for (int i = 2; i <= n; ++i)
            r = r.multiply(BigInteger.valueOf(i));
        return r;
    }

    public static BigInteger streamed(int n) {
        if(n < 2) return ONE;
        return IntStream.rangeClosed(2, n).mapToObj(BigInteger::valueOf).reduce(BigInteger::multiply).get();
    }

    public static BigInteger streamedParallel(int n) {
        if(n < 2) return ONE;
        return IntStream.rangeClosed(2, n).parallel().mapToObj(BigInteger::valueOf).reduce(BigInteger::multiply).get();
    }

    public static BigInteger fourBlocks(int n) {
        if(n < 2) return ONE;
        BigInteger r1 = ONE, r2 = ONE, r3 = ONE, r4 = ONE;
        int i;
        for (i = n; i > 4; i -= 4)
        {
            r1 = r1.multiply(BigInteger.valueOf(i));
            r2 = r2.multiply(BigInteger.valueOf(i - 1));
            r3 = r3.multiply(BigInteger.valueOf(i - 2));
            r4 = r4.multiply(BigInteger.valueOf(i - 3));
        }
        int mult = i == 4 ? 24 : i == 3 ? 6 : i == 2 ? 2 : 1;
        return r1.multiply(r2).multiply(r3.multiply(r4)).multiply(BigInteger.valueOf(mult));
    }

    public static BigInteger streamedShift(int n) {
        if(n < 2) return ONE;
        int p = 0, c = 0;
        while ((n >> p) > 1) {
            p++;
            c += n >> p;
        }
        return IntStream.rangeClosed(2, n).map(i -> i >> Integer.numberOfTrailingZeros(i))
                .mapToObj(BigInteger::valueOf).reduce(BigInteger::multiply).get().shiftLeft(c);
    }

    public static BigInteger streamedParallelShift(int n) {
        if(n < 2) return ONE;
        int p = 0, c = 0;
        while ((n >> p) > 1) {
            p++;
            c += n >> p;
        }
        return IntStream.rangeClosed(2, n).parallel().map(i -> i >> Integer.numberOfTrailingZeros(i))
                .mapToObj(BigInteger::valueOf).reduce(BigInteger::multiply).get().shiftLeft(c);
    }

    @Benchmark    
    public void testNaive(Blackhole bh) {
        bh.consume(naive(n));
    }

    @Benchmark    
    public void testStreamed(Blackhole bh) {
        bh.consume(streamed(n));
    }

    @Benchmark    
    public void testStreamedParallel(Blackhole bh) {
        bh.consume(streamedParallel(n));
    }

    @Benchmark    
    public void testFourBlocks(Blackhole bh) {
        bh.consume(fourBlocks(n));
    }

    @Benchmark    
    public void testStreamedShift(Blackhole bh) {
        bh.consume(streamedShift(n));
    }

    @Benchmark    
    public void testStreamedParallelShift(Blackhole bh) {
        bh.consume(streamedParallelShift(n));
    }
}

and results:

Benchmark                              (n)  Mode  Cnt       Score       Error  Units
Factorial.testFourBlocks                10  avgt   20       0.409 ±     0.027  us/op
Factorial.testFourBlocks               100  avgt   20       4.752 ±     0.147  us/op
Factorial.testFourBlocks              1000  avgt   20     113.801 ±     7.159  us/op
Factorial.testFourBlocks             10000  avgt   20   10626.187 ±    54.785  us/op
Factorial.testFourBlocks             50000  avgt   20  281522.808 ± 13619.674  us/op
Factorial.testNaive                     10  avgt   20       0.297 ±     0.002  us/op
Factorial.testNaive                    100  avgt   20       5.060 ±     0.036  us/op
Factorial.testNaive                   1000  avgt   20     277.902 ±     1.311  us/op
Factorial.testNaive                  10000  avgt   20   32471.921 ±  1092.640  us/op
Factorial.testNaive                  50000  avgt   20  970355.227 ± 64386.653  us/op
Factorial.testStreamed                  10  avgt   20       0.326 ±     0.002  us/op
Factorial.testStreamed                 100  avgt   20       5.393 ±     0.190  us/op
Factorial.testStreamed                1000  avgt   20     265.550 ±     1.772  us/op
Factorial.testStreamed               10000  avgt   20   29871.366 ±   234.457  us/op
Factorial.testStreamed               50000  avgt   20  894549.237 ±  5453.425  us/op
Factorial.testStreamedParallel          10  avgt   20       6.114 ±     0.500  us/op
Factorial.testStreamedParallel         100  avgt   20      10.719 ±     0.786  us/op
Factorial.testStreamedParallel        1000  avgt   20      72.225 ±     0.509  us/op
Factorial.testStreamedParallel       10000  avgt   20    2811.977 ±    14.599  us/op
Factorial.testStreamedParallel       50000  avgt   20   49501.716 ±   729.646  us/op
Factorial.testStreamedParallelShift     10  avgt   20       6.684 ±     0.549  us/op
Factorial.testStreamedParallelShift    100  avgt   20      11.176 ±     0.779  us/op
Factorial.testStreamedParallelShift   1000  avgt   20      71.056 ±     3.918  us/op
Factorial.testStreamedParallelShift  10000  avgt   20    2641.108 ±   142.571  us/op
Factorial.testStreamedParallelShift  50000  avgt   20   46480.544 ±   405.648  us/op
Factorial.testStreamedShift             10  avgt   20       0.402 ±     0.006  us/op
Factorial.testStreamedShift            100  avgt   20       5.086 ±     0.039  us/op
Factorial.testStreamedShift           1000  avgt   20     237.279 ±     1.566  us/op
Factorial.testStreamedShift          10000  avgt   20   27572.709 ±   135.489  us/op
Factorial.testStreamedShift          50000  avgt   20  874699.213 ± 53645.087  us/o

you can see that multithreaded version executed ~19 times faster than single thread (Core i7-4702MQ used). But in C# version

static BigInteger Streamed(int n)
{
    return n < 2 ? 1 : Enumerable.Range(2, n - 1).Aggregate(BigInteger.One, (acc, elm) => acc*elm);
}

static BigInteger StreamedParallel(int n)
{
    return n < 2 ? 1 : Enumerable.Range(2, n - 1).AsParallel().Aggregate(BigInteger.One, (acc, elm) => acc * elm);
}

this code has worst performance comparing to all others, which is not surprising because of TPL overhead without performance benefit from multithreading.

So question is: why Java standard multithread library is so wise (any operation that takes 100us+ will be boosted, see reference http://gee.cs.oswego.edu/dl/html/StreamParallelGuidance.html ), while C# cannot boost an operation that takes 1500ms on my machine.

I like C# and not really very like Java, this is why it hurts and I want to learn why it is what it is...

Kevin Gosse · Accepted Answer

When using the Aggregate method like this, PLinq will execute the aggregation sequentially, and therefore on a single thread. Of course, the multiplication could be executed in any order, but PLinq has no way to guess that. If the operation was a division for instance, changing the order of execution would change the end result.

One way to tell PLinq the query can be parallelized, is to use another Aggregate overload, that indicates how results from multiple threads can be merged:

return n < 2 ? 1 : Enumerable.Range(2, n - 1).AsParallel().Aggregate(BigInteger.One, (acc, elm) => acc * elm, (i, j) => i * j, i => i);

With this version, with n = 100000, it takes around 9000 ms for the sequential version, and 4400 ms for the parallel version. That's pretty much twice as fast, which is consistent with my hardware (dual core processor).

You can read this article for more information about how aggregations work with PLinq: http://blogs.msdn.com/b/pfxteam/archive/2008/01/22/7211660.aspx

What heuristic uses TPL to determine when to use multiple cores

Tags:

java

performance

c#

.net

multithreading

Alex Zhukovskiy

1 Answers

Kevin Gosse

Recent Activity

Donate For Us

What heuristic uses TPL to determine when to use multiple cores

Tags:

java

performance

c#

.net

multithreading

Alex Zhukovskiy

1 Answers

Kevin Gosse

Related questions

Recent Activity

Donate For Us