
Difference between serial and parallel execution with parallelism=1

Can you explain (ideally with a reference) why there is a significant difference in execution time between the following two factorial implementations using the Java Stream API:

  1. Serial implementation
  2. Parallel implementation (using Stream.parallel()) executed in a custom fork join pool with parallelism set to 1

My expectation was that the execution times would be close; however, the parallel version shows a speedup by a factor of about 2. I did not run any specialized benchmarks, but the execution times should not differ this much even on a cold-start JVM. Below is the source code of the two implementations:

  • Parallel
public class FastFactorialSupplier implements FactorialSupplier {
  private final ExecutorService executorService;

  public FastFactorialSupplier(ExecutorService executorService) {
      this.executorService = executorService;
  }

  @Override
  public BigInteger get(long k) {
      try {
          return executorService
                  .submit(
                          () -> LongStream.range(2, k + 1)
                                  .parallel()
                                  .mapToObj(BigInteger::valueOf)
                                  .reduce(BigInteger.ONE, (current, factSoFar) -> factSoFar.multiply(current))
                  )
                  .get();
      } catch (InterruptedException | ExecutionException e) {
          e.printStackTrace();
      }

      return BigInteger.ZERO;
  }
}
  • Serial
public class MathUtils {

  public static BigInteger factorial(long k) {
      return LongStream.range(2, k + 1)
              .mapToObj(BigInteger::valueOf)
              .reduce(BigInteger.ONE, (current, factSoFar) -> factSoFar.multiply(current));
  }
}

Here are the test cases, with sample execution times attached as comments, based on what the IntelliJ JUnit runner showed.

    @Test
    public void testWithoutParallel() {
        //2s 403
        runTest(new DummyFactorialSupplier()); // uses MathUtils.factorial
    }

    @Test
    public void testParallelismWorkStealing1() {
        //1s 43
        runTest(new FastFactorialSupplier(Executors.newWorkStealingPool(1)));
    }

    @Test
    public void testParallelismForkJoin1() {
        // 711ms
        runTest(new FastFactorialSupplier(new ForkJoinPool(1)));
    }

    @Test
    public void testExecutorForkJoin() {
        //85ms
        runTest(new FastFactorialSupplier(new ForkJoinPool()));
    }

    private void runTest(FactorialSupplier factorialSupplier) {
        BigInteger result = factorialSupplier.get(100000);
        assertNotNull(result);
//        assertEquals(456574, result.toString().length());
    }

The tests were run using Java 11, since there was an issue in Java 8 with custom fork join pools: https://bugs.openjdk.java.net/browse/JDK-8190974

Can there be an optimisation related to the pseudo-parallel processing and the way the execution is scheduled, one that does not apply when the execution is purely sequential?

Edit:

I also ran a microbenchmark using JMH.

Parallel:

public class FastFactorialSupplierP1Test {

    @Benchmark
    @BenchmarkMode({Mode.AverageTime, Mode.SampleTime, Mode.SingleShotTime, Mode.Throughput, Mode.All})
    @Fork(value = 1, warmups = 1)
    public void measure() {
        runTest(new FastFactorialSupplier(new ForkJoinPool(1)));
    }

    private void runTest(FactorialSupplier factorialSupplier) {
        BigInteger result = factorialSupplier.get(100000);
        assertNotNull(result);
    }

    public static void main(String[] args) throws Exception {
        org.openjdk.jmh.Main.main(args);
    }
}

Serial:

public class SerialFactorialSupplierTest {
    @Benchmark
    @BenchmarkMode({Mode.AverageTime, Mode.SampleTime, Mode.SingleShotTime, Mode.Throughput, Mode.All})
    @Fork(value = 1, warmups = 1)
    public void measure() {
        runTest(new DummyFactorialSupplier());
    }

    private void runTest(FactorialSupplier factorialSupplier) {
        BigInteger result = factorialSupplier.get(100000);
        assertNotNull(result);
    }

    public static void main(String[] args) throws Exception {
        org.openjdk.jmh.Main.main(args);
    }
}
public class IterativeFactorialTest {
    @Benchmark
    @BenchmarkMode({Mode.AverageTime, Mode.SampleTime, Mode.SingleShotTime, Mode.Throughput, Mode.All})
    @Fork(value = 1, warmups = 1)
    public void measure() {
        runTest(new IterativeFact());
    }

    private void runTest(FactorialSupplier factorialSupplier) {
        BigInteger result = factorialSupplier.get(100000);
        assertNotNull(result);
    }

    public static void main(String[] args) throws Exception {
        org.openjdk.jmh.Main.main(args);
    }

    class IterativeFact implements FactorialSupplier {

        @Override
        public BigInteger get(long k) {
            BigInteger result = BigInteger.ONE;

            // note: the original `while (k-- != 0)` decremented before the
            // multiply, so it skipped the factor k and finally multiplied by zero
            while (k > 1) {
                result = result.multiply(BigInteger.valueOf(k--));
            }

            return result;
        }
    }
}

Results:

FastFactorialSupplierP1Test.measure                    avgt    5  0.437 ± 0.006   s/op
IterativeFactorialTest.measure                         avgt    5  2.643 ± 0.383   s/op
SerialFactorialSupplierTest.measure                    avgt    5  2.226 ± 0.044   s/op
asked Jun 02 '19 by radpet


1 Answer

You have chosen an operation whose performance depends on the order of evaluation. Just consider that the performance of BigInteger.multiply depends on the magnitude of the two factors. Then, running through a sequence of BigInteger instances with an accumulating value as a factor to the next multiplication will make the operation more and more expensive, the farther you get.
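The growing cost of the left-fold can be made visible directly. The following sketch is my own addition (not part of the original answer): it prints the accumulator's bit length at a few checkpoints. Since BigInteger.multiply gets more expensive as its operands grow, each later step of the sequential reduction does strictly more work than the earlier ones.

```java
import java.math.BigInteger;

// Illustrative only: in the sequential left-fold, the accumulator's bit length
// grows with every step, so the multiplications get progressively costlier.
public class AccumulatorGrowth {
    public static void main(String[] args) {
        BigInteger acc = BigInteger.ONE;
        for (long i = 2; i <= 100_000; i++) {
            acc = acc.multiply(BigInteger.valueOf(i));
            if (i % 25_000 == 0) {
                System.out.println("after factor " + i + ": " + acc.bitLength() + " bits");
            }
        }
    }
}
```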

In contrast, when you split the range of values into smaller ranges, perform the multiplication individually for each range and multiply the results of the ranges, you get a performance advantage, even if these sub-ranges are not evaluated concurrently.

So when a parallel stream splits the work into chunks, to be potentially picked up by other worker threads, but ends up evaluating them in the same thread, you still get a performance improvement, in this specific setup, due to the changed evaluation order.

We can test this by removing all Stream and thread pool related artifacts:

public static BigInteger multiplyAll(long from, long to, int split) {
    if(split < 1 || to - from < 2) return serial(from, to);
    split--;
    long middle = (from + to) >>> 1;
    return multiplyAll(from, middle, split).multiply(multiplyAll(middle, to, split));
}

private static BigInteger serial(long l1, long l2) {
    BigInteger bi = BigInteger.valueOf(l1++);
    for(; l1 < l2; l1++) {
        bi = bi.multiply(BigInteger.valueOf(l1));
    }
    return bi;
}
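For convenience, here is a self-contained harness around the two methods above (my addition; the class name SplitTiming and the chosen split depths are arbitrary). It times 100000! for increasing split counts using plain wall-clock measurement, not JMH, so treat the numbers as indicative only.

```java
import java.math.BigInteger;

// Rough timing harness for the answer's multiplyAll/serial methods.
// Expect split=1 to roughly halve the time of split=0, with diminishing
// returns for higher split counts, all on a single thread.
public class SplitTiming {
    public static BigInteger multiplyAll(long from, long to, int split) {
        if (split < 1 || to - from < 2) return serial(from, to);
        split--;
        long middle = (from + to) >>> 1;
        return multiplyAll(from, middle, split).multiply(multiplyAll(middle, to, split));
    }

    private static BigInteger serial(long l1, long l2) {
        BigInteger bi = BigInteger.valueOf(l1++);
        for (; l1 < l2; l1++) bi = bi.multiply(BigInteger.valueOf(l1));
        return bi;
    }

    public static void main(String[] args) {
        for (int split = 0; split <= 4; split++) {
            long start = System.nanoTime();
            BigInteger result = multiplyAll(2, 100_001, split); // product 2..100000 = 100000!
            long ms = (System.nanoTime() - start) / 1_000_000;
            System.out.println("split=" + split + ": " + ms + " ms, " + result.bitLength() + " bits");
        }
    }
}
```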

I don’t have a JMH setup at hand to post reliable numbers, but a simple run revealed that the order of magnitude matches your results: a single split already roughly halves the evaluation time, and higher split counts still improve the performance, though the curve becomes flatter.

As explained in the documentation of ForkJoinTask.getSurplusQueuedTaskCount(), it is a reasonable strategy to split the work such that there are a few additional tasks per worker, which can potentially be picked up by other threads and compensate for unbalanced workloads, e.g. when some elements are cheaper to process than others. Apparently, parallel streams have no special code for handling the case that there are no additional worker threads; hence, you witness the effects of splitting the work even when there is only one thread to process it.
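To make the splitting explicit without streams, the same balanced evaluation order can be sketched as a RecursiveTask (an illustrative sketch, not the actual stream implementation; the THRESHOLD value is an arbitrary assumption). Submitted to new ForkJoinPool(1), the forked sub-tasks are still processed by the single worker, so the balanced multiplication order, and most of the speedup, survives even without real parallelism:

```java
import java.math.BigInteger;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Sketch: recursive range splitting for the factorial product [from, to).
// Even with a single-worker pool, forked chunks are multiplied in a balanced
// order, which is what produces the speedup discussed above.
public class FactorialTask extends RecursiveTask<BigInteger> {
    private static final long THRESHOLD = 1_000; // chunk size; arbitrary choice

    private final long from, to; // product over the half-open range [from, to)

    FactorialTask(long from, long to) { this.from = from; this.to = to; }

    @Override
    protected BigInteger compute() {
        if (to - from <= THRESHOLD) {
            BigInteger bi = BigInteger.ONE;
            for (long i = from; i < to; i++) bi = bi.multiply(BigInteger.valueOf(i));
            return bi;
        }
        long middle = (from + to) >>> 1;
        FactorialTask left = new FactorialTask(from, middle);
        left.fork(); // queued; picked up by the same worker when parallelism is 1
        return new FactorialTask(middle, to).compute().multiply(left.join());
    }

    public static void main(String[] args) {
        BigInteger result = new ForkJoinPool(1).invoke(new FactorialTask(2, 100_001));
        System.out.println("100000! has " + result.bitLength() + " bits");
    }
}
```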

answered Oct 26 '22 by Holger