
Difference between serial and parallel execution with parallelism=1

Can you explain (ideally with a reference) why there is a significant difference in execution time between the following two factorial implementations using the Java Stream API:

  1. Serial implementation
  2. Parallel implementation (using Stream.parallel()) executed in a custom fork join pool with parallelism set to 1

My expectation was that the execution times would be close; however, the parallel version shows a speedup by a factor of about 2. I did not run any specialized benchmarks, but the execution times should not differ this much even on a cold-start JVM. Below is the source code of the two implementations:

  • Parallel
public class FastFactorialSupplier implements FactorialSupplier {
  private final ExecutorService executorService;

  public FastFactorialSupplier(ExecutorService executorService) {
      this.executorService = executorService;
  }

  @Override
  public BigInteger get(long k) {
      try {
          return executorService
                  .submit(
                          () -> LongStream.range(2, k + 1)
                                  .parallel()
                                  .mapToObj(BigInteger::valueOf)
                                  .reduce(BigInteger.ONE, (current, factSoFar) -> factSoFar.multiply(current))
                  )
                  .get();
      } catch (InterruptedException | ExecutionException e) {
          e.printStackTrace();
      }

      return BigInteger.ZERO;
  }
}
  • Serial
public class MathUtils {

  public static BigInteger factorial(long k) {
      return LongStream.range(2, k + 1)
              .mapToObj(BigInteger::valueOf)
              .reduce(BigInteger.ONE, (current, factSoFar) -> factSoFar.multiply(current));
  }
}

Here are the test cases, with sample execution times attached as comments, based on what the IntelliJ JUnit runner showed.

    @Test
    public void testWithoutParallel() {
        //2s 403
        runTest(new DummyFactorialSupplier()); // uses MathUtils.factorial
    }

    @Test
    public void testParallelismWorkStealing1() {
        //1s 43
        runTest(new FastFactorialSupplier(Executors.newWorkStealingPool(1)));
    }

    @Test
    public void testParallelismForkJoin1() {
        // 711ms
        runTest(new FastFactorialSupplier(new ForkJoinPool(1)));
    }

    @Test
    public void testExecutorForkJoin() {
        //85ms
        runTest(new FastFactorialSupplier(new ForkJoinPool()));
    }

    private void runTest(FactorialSupplier factorialSupplier) {
        BigInteger result = factorialSupplier.get(100000);
        assertNotNull(result);
//        assertEquals(456574, result.toString().length());
    }

The tests were run using Java 11, since there was an issue in Java 8 with custom fork join pools: https://bugs.openjdk.java.net/browse/JDK-8190974

Can there be an optimisation related to the pseudo-parallel processing and the way the execution is scheduled, one that does not apply when the execution is purely sequential?

Edit:

I also ran a microbenchmark using JMH.

Parallel:

public class FastFactorialSupplierP1Test {

    @Benchmark
    @BenchmarkMode({Mode.AverageTime, Mode.SampleTime, Mode.SingleShotTime, Mode.Throughput, Mode.All})
    @Fork(value = 1, warmups = 1)
    public void measure() {
        runTest(new FastFactorialSupplier(new ForkJoinPool(1)));
    }

    private void runTest(FactorialSupplier factorialSupplier) {
        BigInteger result = factorialSupplier.get(100000);
        assertNotNull(result);
    }

    public static void main(String[] args) throws Exception {
        org.openjdk.jmh.Main.main(args);
    }
}

Serial:

public class SerialFactorialSupplierTest {
    @Benchmark
    @BenchmarkMode({Mode.AverageTime, Mode.SampleTime, Mode.SingleShotTime, Mode.Throughput, Mode.All})
    @Fork(value = 1, warmups = 1)
    public void measure() {
        runTest(new DummyFactorialSupplier());
    }

    private void runTest(FactorialSupplier factorialSupplier) {
        BigInteger result = factorialSupplier.get(100000);
        assertNotNull(result);
    }

    public static void main(String[] args) throws Exception {
        org.openjdk.jmh.Main.main(args);
    }
}
public class IterativeFactorialTest {
    @Benchmark
    @BenchmarkMode({Mode.AverageTime, Mode.SampleTime, Mode.SingleShotTime, Mode.Throughput, Mode.All})
    @Fork(value = 1, warmups = 1)
    public void measure() {
        runTest(new IterativeFact());
    }

    private void runTest(FactorialSupplier factorialSupplier) {
        BigInteger result = factorialSupplier.get(100000);
        assertNotNull(result);
    }

    public static void main(String[] args) throws Exception {
        org.openjdk.jmh.Main.main(args);
    }

    class IterativeFact implements FactorialSupplier {

        @Override
        public BigInteger get(long k) {
            BigInteger result = BigInteger.ONE;

            // note: the original `while (k-- != 0)` decremented before the
            // multiply, so it skipped the factor k and finally multiplied by zero
            while (k > 1) {
                result = result.multiply(BigInteger.valueOf(k--));
            }

            return result;
        }
    }
}

Results:

FastFactorialSupplierP1Test.measure                    avgt    5  0.437 ± 0.006   s/op
IterativeFactorialTest.measure                         avgt    5  2.643 ± 0.383   s/op
SerialFactorialSupplierTest.measure                    avgt    5  2.226 ± 0.044   s/op
asked Jun 02 '19 by radpet


1 Answer

You have chosen an operation whose performance depends on the order of evaluation. Just consider that the performance of BigInteger.multiply depends on the magnitude of the two factors. Then, running through a sequence of BigInteger instances with an accumulating value as a factor to the next multiplication will make the operation more and more expensive, the farther you get.
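The growing cost of the left-fold can be made visible directly. The following sketch is my own addition (not part of the original answer): it prints the accumulator's bit length at a few checkpoints. Since BigInteger.multiply gets more expensive as its operands grow, each later step of the sequential reduction does strictly more work than the earlier ones.

```java
import java.math.BigInteger;

// Illustrative only: in the sequential left-fold, the accumulator's bit length
// grows with every step, so the multiplications get progressively costlier.
public class AccumulatorGrowth {
    public static void main(String[] args) {
        BigInteger acc = BigInteger.ONE;
        for (long i = 2; i <= 100_000; i++) {
            acc = acc.multiply(BigInteger.valueOf(i));
            if (i % 25_000 == 0) {
                System.out.println("after factor " + i + ": " + acc.bitLength() + " bits");
            }
        }
    }
}
```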

In contrast, when you split the range of values into smaller ranges, perform the multiplication individually for each range and multiply the results of the ranges, you get a performance advantage, even if these sub-ranges are not evaluated concurrently.

So when a parallel stream splits the work into chunks, to be potentially picked up by other worker threads, but ends up evaluating them in the same thread, you still get a performance improvement, in this specific setup, due to the changed evaluation order.

We can test this by removing all Stream and thread pool related artifacts:

public static BigInteger multiplyAll(long from, long to, int split) {
    if(split < 1 || to - from < 2) return serial(from, to);
    split--;
    long middle = (from + to) >>> 1;
    return multiplyAll(from, middle, split).multiply(multiplyAll(middle, to, split));
}

private static BigInteger serial(long l1, long l2) {
    BigInteger bi = BigInteger.valueOf(l1++);
    for(; l1 < l2; l1++) {
        bi = bi.multiply(BigInteger.valueOf(l1));
    }
    return bi;
}
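For convenience, here is a self-contained harness around the two methods above (my addition; the class name SplitTiming and the chosen split depths are arbitrary). It times 100000! for increasing split counts using plain wall-clock measurement, not JMH, so treat the numbers as indicative only.

```java
import java.math.BigInteger;

// Rough timing harness for the answer's multiplyAll/serial methods.
// Expect split=1 to roughly halve the time of split=0, with diminishing
// returns for higher split counts, all on a single thread.
public class SplitTiming {
    public static BigInteger multiplyAll(long from, long to, int split) {
        if (split < 1 || to - from < 2) return serial(from, to);
        split--;
        long middle = (from + to) >>> 1;
        return multiplyAll(from, middle, split).multiply(multiplyAll(middle, to, split));
    }

    private static BigInteger serial(long l1, long l2) {
        BigInteger bi = BigInteger.valueOf(l1++);
        for (; l1 < l2; l1++) bi = bi.multiply(BigInteger.valueOf(l1));
        return bi;
    }

    public static void main(String[] args) {
        for (int split = 0; split <= 4; split++) {
            long start = System.nanoTime();
            BigInteger result = multiplyAll(2, 100_001, split); // product 2..100000 = 100000!
            long ms = (System.nanoTime() - start) / 1_000_000;
            System.out.println("split=" + split + ": " + ms + " ms, " + result.bitLength() + " bits");
        }
    }
}
```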

I don’t have a JMH setup at hand to post reliable numbers, but a simple run revealed that the order of magnitude matches your results: a single split already roughly halves the evaluation time, and higher split counts still improve the performance, though the curve becomes flatter.

As explained in the documentation of ForkJoinTask.getSurplusQueuedTaskCount(), it is a reasonable strategy to split the work such that there are a few additional tasks per worker, which can potentially be picked up by other threads and compensate for unbalanced workloads, e.g. when some elements are cheaper to process than others. Apparently, parallel streams have no special code for handling the case that there are no additional worker threads; hence, you witness the effects of splitting the work even when there is only one thread to process it.
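To make the splitting explicit without streams, the same balanced evaluation order can be sketched as a RecursiveTask (an illustrative sketch, not the actual stream implementation; the THRESHOLD value is an arbitrary assumption). Submitted to new ForkJoinPool(1), the forked sub-tasks are still processed by the single worker, so the balanced multiplication order, and most of the speedup, survives even without real parallelism:

```java
import java.math.BigInteger;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Sketch: recursive range splitting for the factorial product [from, to).
// Even with a single-worker pool, forked chunks are multiplied in a balanced
// order, which is what produces the speedup discussed above.
public class FactorialTask extends RecursiveTask<BigInteger> {
    private static final long THRESHOLD = 1_000; // chunk size; arbitrary choice

    private final long from, to; // product over the half-open range [from, to)

    FactorialTask(long from, long to) { this.from = from; this.to = to; }

    @Override
    protected BigInteger compute() {
        if (to - from <= THRESHOLD) {
            BigInteger bi = BigInteger.ONE;
            for (long i = from; i < to; i++) bi = bi.multiply(BigInteger.valueOf(i));
            return bi;
        }
        long middle = (from + to) >>> 1;
        FactorialTask left = new FactorialTask(from, middle);
        left.fork(); // queued; picked up by the same worker when parallelism is 1
        return new FactorialTask(middle, to).compute().multiply(left.join());
    }

    public static void main(String[] args) {
        BigInteger result = new ForkJoinPool(1).invoke(new FactorialTask(2, 100_001));
        System.out.println("100000! has " + result.bitLength() + " bits");
    }
}
```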

answered Oct 26 '22 by Holger