We knows that TPL
(so PLINQ
too) doesn't consume all cores if he think that task is easy and executes it on single core. But he does it even for a complicated task! For example, here is code from article about Java parallelism:
import org.openjdk.jmh.infra.Blackhole;
import org.openjdk.jmh.annotations.*;
import java.util.concurrent.TimeUnit;
import java.util.stream.IntStream;
import java.math.BigInteger;
@Warmup(iterations=5)
@Measurement(iterations=10)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@State(Scope.Benchmark)
@Fork(2)
public class Factorial {
private static final BigInteger ONE = BigInteger.valueOf(1);
@Param({"10", "100", "1000", "10000", "50000"})
private int n;
public static BigInteger naive(int n) {
BigInteger r = ONE;
for (int i = 2; i <= n; ++i)
r = r.multiply(BigInteger.valueOf(i));
return r;
}
public static BigInteger streamed(int n) {
if(n < 2) return ONE;
return IntStream.rangeClosed(2, n).mapToObj(BigInteger::valueOf).reduce(BigInteger::multiply).get();
}
public static BigInteger streamedParallel(int n) {
if(n < 2) return ONE;
return IntStream.rangeClosed(2, n).parallel().mapToObj(BigInteger::valueOf).reduce(BigInteger::multiply).get();
}
public static BigInteger fourBlocks(int n) {
if(n < 2) return ONE;
BigInteger r1 = ONE, r2 = ONE, r3 = ONE, r4 = ONE;
int i;
for (i = n; i > 4; i -= 4)
{
r1 = r1.multiply(BigInteger.valueOf(i));
r2 = r2.multiply(BigInteger.valueOf(i - 1));
r3 = r3.multiply(BigInteger.valueOf(i - 2));
r4 = r4.multiply(BigInteger.valueOf(i - 3));
}
int mult = i == 4 ? 24 : i == 3 ? 6 : i == 2 ? 2 : 1;
return r1.multiply(r2).multiply(r3.multiply(r4)).multiply(BigInteger.valueOf(mult));
}
public static BigInteger streamedShift(int n) {
if(n < 2) return ONE;
int p = 0, c = 0;
while ((n >> p) > 1) {
p++;
c += n >> p;
}
return IntStream.rangeClosed(2, n).map(i -> i >> Integer.numberOfTrailingZeros(i))
.mapToObj(BigInteger::valueOf).reduce(BigInteger::multiply).get().shiftLeft(c);
}
public static BigInteger streamedParallelShift(int n) {
if(n < 2) return ONE;
int p = 0, c = 0;
while ((n >> p) > 1) {
p++;
c += n >> p;
}
return IntStream.rangeClosed(2, n).parallel().map(i -> i >> Integer.numberOfTrailingZeros(i))
.mapToObj(BigInteger::valueOf).reduce(BigInteger::multiply).get().shiftLeft(c);
}
@Benchmark
public void testNaive(Blackhole bh) {
bh.consume(naive(n));
}
@Benchmark
public void testStreamed(Blackhole bh) {
bh.consume(streamed(n));
}
@Benchmark
public void testStreamedParallel(Blackhole bh) {
bh.consume(streamedParallel(n));
}
@Benchmark
public void testFourBlocks(Blackhole bh) {
bh.consume(fourBlocks(n));
}
@Benchmark
public void testStreamedShift(Blackhole bh) {
bh.consume(streamedShift(n));
}
@Benchmark
public void testStreamedParallelShift(Blackhole bh) {
bh.consume(streamedParallelShift(n));
}
}
and results:
Benchmark (n) Mode Cnt Score Error Units
Factorial.testFourBlocks 10 avgt 20 0.409 ± 0.027 us/op
Factorial.testFourBlocks 100 avgt 20 4.752 ± 0.147 us/op
Factorial.testFourBlocks 1000 avgt 20 113.801 ± 7.159 us/op
Factorial.testFourBlocks 10000 avgt 20 10626.187 ± 54.785 us/op
Factorial.testFourBlocks 50000 avgt 20 281522.808 ± 13619.674 us/op
Factorial.testNaive 10 avgt 20 0.297 ± 0.002 us/op
Factorial.testNaive 100 avgt 20 5.060 ± 0.036 us/op
Factorial.testNaive 1000 avgt 20 277.902 ± 1.311 us/op
Factorial.testNaive 10000 avgt 20 32471.921 ± 1092.640 us/op
Factorial.testNaive 50000 avgt 20 970355.227 ± 64386.653 us/op
Factorial.testStreamed 10 avgt 20 0.326 ± 0.002 us/op
Factorial.testStreamed 100 avgt 20 5.393 ± 0.190 us/op
Factorial.testStreamed 1000 avgt 20 265.550 ± 1.772 us/op
Factorial.testStreamed 10000 avgt 20 29871.366 ± 234.457 us/op
Factorial.testStreamed 50000 avgt 20 894549.237 ± 5453.425 us/op
Factorial.testStreamedParallel 10 avgt 20 6.114 ± 0.500 us/op
Factorial.testStreamedParallel 100 avgt 20 10.719 ± 0.786 us/op
Factorial.testStreamedParallel 1000 avgt 20 72.225 ± 0.509 us/op
Factorial.testStreamedParallel 10000 avgt 20 2811.977 ± 14.599 us/op
Factorial.testStreamedParallel 50000 avgt 20 49501.716 ± 729.646 us/op
Factorial.testStreamedParallelShift 10 avgt 20 6.684 ± 0.549 us/op
Factorial.testStreamedParallelShift 100 avgt 20 11.176 ± 0.779 us/op
Factorial.testStreamedParallelShift 1000 avgt 20 71.056 ± 3.918 us/op
Factorial.testStreamedParallelShift 10000 avgt 20 2641.108 ± 142.571 us/op
Factorial.testStreamedParallelShift 50000 avgt 20 46480.544 ± 405.648 us/op
Factorial.testStreamedShift 10 avgt 20 0.402 ± 0.006 us/op
Factorial.testStreamedShift 100 avgt 20 5.086 ± 0.039 us/op
Factorial.testStreamedShift 1000 avgt 20 237.279 ± 1.566 us/op
Factorial.testStreamedShift 10000 avgt 20 27572.709 ± 135.489 us/op
Factorial.testStreamedShift 50000 avgt 20 874699.213 ± 53645.087 us/o
you can see that multithreaded version executed ~19 times faster than single thread (Core i7-4702MQ
used). But in C# version
static BigInteger Streamed(int n)
{
return n < 2 ? 1 : Enumerable.Range(2, n - 1).Aggregate(BigInteger.One, (acc, elm) => acc*elm);
}
static BigInteger StreamedParallel(int n)
{
return n < 2 ? 1 : Enumerable.Range(2, n - 1).AsParallel().Aggregate(BigInteger.One, (acc, elm) => acc * elm);
}
this code has worst performance comparing to all others, which is not surprising because of TPL overhead without performance benefit from multithreading.
So question is: why Java standard multithread library is so wise (any operation that takes 100us+ will be boosted, see reference
http://gee.cs.oswego.edu/dl/html/StreamParallelGuidance.html ), while C# cannot boost an operation that takes 1500ms on my machine.
I like C# and not really very like Java, this is why it hurts and I want to learn why it is what it is...
When using the Aggregate
method like this, PLinq will execute the aggregation sequentially, and therefore on a single thread. Of course, the multiplication could be executed in any order, but PLinq has no way to guess that. If the operation was a division for instance, changing the order of execution would change the end result.
One way to tell PLinq the query can be parallelized, is to use another Aggregate overload, that indicates how results from multiple threads can be merged:
return n < 2 ? 1 : Enumerable.Range(2, n - 1).AsParallel().Aggregate(BigInteger.One, (acc, elm) => acc * elm, (i, j) => i * j, i => i);
With this version, with n = 100000, it takes around 9000 ms for the sequential version, and 4400 ms for the parallel version. That's pretty much twice as fast, which is consistent with my hardware (dual core processor).
You can read this article for more information about how aggregations work with PLinq: http://blogs.msdn.com/b/pfxteam/archive/2008/01/22/7211660.aspx
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With