Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why are two separate loops faster than one?

Tags:

I want to understand what kind of optimizations Java does to consecutive for loops. More precisely, I'm trying to check if loop fusion is performed. Theoretically, I was expecting that this optimization was not done automatically and was expecting to confirm that the fused version was faster than the version with two loops.

However, after running the benchmarks, the results show that two separate (and consecutive) loops are faster than one single loop doing all the work.

I already tried using JMH for creating the benchmarks and got the same results.

I used the javap command and it shows that the generated bytecode for the source file with two loops actually corresponds to two loops being executed (no loop unrolling or other optimization was performed).

Code being measured for BenchmarkMultipleLoops.java:

private void work() {         List<Capsule> intermediate = new ArrayList<>();         List<String> res = new ArrayList<>();         int totalLength = 0;          for (Capsule c : caps) {             if(c.getNumber() > 100000000){                 intermediate.add(c);             }         }          for (Capsule c : intermediate) {             String s = "new_word" + c.getNumber();             res.add(s);         }          //Loop to assure the end result (res) is used for something         for(String s : res){             totalLength += s.length();         }          System.out.println(totalLength);     } 

Code being measured for BenchmarkSingleLoop.java:

private void work(){         List<String> res = new ArrayList<>();         int totalLength = 0;          for (Capsule c : caps) {             if(c.getNumber() > 100000000){                 String s = "new_word" + c.getNumber();                 res.add(s);             }         }          //Loop to assure the end result (res) is used for something         for(String s : res){             totalLength += s.length();         }          System.out.println(totalLength);     } 

And here is the code for Capsule.java:

public class Capsule {     private int number;     private String word;      public Capsule(int number, String word) {         this.number = number;         this.word = word;     }      public int getNumber() {         return number;     }      @Override     public String toString() {         return "{" + number +                 ", " + word + '}';     } } 

caps is an ArrayList<Capsule> with 20 million elements populated like this in the beginning:

private void populate() {         Random r = new Random(3);          for(int n = 0; n < POPSIZE; n++){             int randomN = r.nextInt();             Capsule c = new Capsule(randomN, "word" + randomN);             caps.add(c);         }     } 

Before measuring, a warm up phase is executed.

I ran each of the benchmarks 10 times or, in other words, the work() method is executed 10 times for each benchmark and the average times to complete are presented below (in seconds). After each iteration, the GC was executed along with a few sleeps:

  • MultipleLoops: 4.9661 seconds
  • SingleLoop: 7.2725 seconds

OpenJDK 1.8.0_144 running on an Intel i7-7500U (Kaby Lake).

Why is the MultipleLoops version faster than the SingleLoop version, even though it has to traverse two different data structures?

UPDATE 1:

As suggested in the comments, if I change the implementation to calculate the totalLength while strings are being generated, avoiding the creation of the res list, the single loop version becomes faster.

However, that variable was only introduced so that some work was done after creating the result list, in order to avoid discarding the elements if nothing was done with them.

In other words, the intended result is to produce the final list. But this suggestion helps in better understanding what is going on.

Results:

  • MultipleLoops: 0.9339 seconds
  • SingleLoop: 0.66590005 seconds

UPDATE 2:

Here is a link for the code I used for the JMH benchmark: https://gist.github.com/FranciscoRibeiro/2d3928761f76e4f7cecfcfcdf7fc96d5

Results:

  • MultipleLoops: 7.397 seconds
  • SingleLoop: 8.092 seconds
like image 576
Francisco Ribeiro Avatar asked Feb 23 '18 16:02

Francisco Ribeiro


People also ask

Are two for loops slower than one?

It would depend on the logic inside of them (nested loops etc). Theoretically, They'd run the same, as they're both linear. (Both are 100 iterations). So the Big O Time Complexity is O(N), where N is the size of the loop.

Is single loop faster than nested loop?

Nested loops are not faster than single loops.

Why does a nested loop perform much faster than the flattened one?

The two kinds of loops above have the same iteration times. However the nested loop(2.43s) is running much faster that the flattened one(18.22s). The difference is bigger with the increasing of the iteration time.

Which loops is faster?

For loop (forward and reverse) The traditional for loop is the fastest, so you should always use that right? Not so fast - performance is not the only thing that matters. Code Readability is usually more important, so default to the style that fits your application.


2 Answers

I investigated this "phenomena" and look like got something like an answer.
Let's add .jvmArgs("-verbose:gc") to JMHs OptionsBuilder. Results for 1 Iteration:

Single Loop: [Full GC (Ergonomics) [PSYoungGen: 2097664K->0K(2446848K)] [ParOldGen: 3899819K->4574771K(5592576K)] 5997483K->4574771K(8039424K), [Metaspace: 6208K->6208K(1056768K)], 5.0438301 secs] [Times: user=37.92 sys=0.10, real=5.05 secs] 4.954 s/op

Multiple Loops: [Full GC (Ergonomics) [PSYoungGen: 2097664K->0K(2446848K)] [ParOldGen: 3899819K->4490913K(5592576K)] 5997483K->4490913K(8039424K), [Metaspace: 6208K->6208K(1056768K)], 3.7991573 secs] [Times: user=26.84 sys=0.08, real=3.80 secs] 4.187 s/op

JVM spent huge amount of CPU time for GC. Once per 2 test runs, JVM have to make Full GC (move 600Mb to OldGen and collect 1.5Gb of trash from previous cycles). Both garbage collectors done the same work, but spent ~25% less application time for multiple loops test case. If we decrease the POPSIZE to 10_000_000 or add to before bh.consume() Thread.sleep(3000), or add -XX:+UseG1GC to JVM args, then multiple loop boosting effect gone. I run it once again with .addProfiler(GCProfiler.class). The main difference:

Multiple Loops: gc.churn.PS_Eden_Space 374.417 ± 23 MB/sec

Single Loop: gc.churn.PS_Eden_Space 336.037 MB/sec ± 19 MB/sec

I think, we see speed up in such specific circumstances, because old-good Compare and Swap GC algorithm has CPU bottleneck for multiples test runs and uses extra "senseless" cycle for Collect Garbage from earlier runs. It's even easier to reproduce with @Threads(2), if you have enough RAM. It looks like this, if you try to profile Single_Loop test:

profiling

like image 88
Anton Kot Avatar answered Sep 21 '22 05:09

Anton Kot


To understand what is happening under the hood, you could add JMX behaviour to analyse running app in jvisualvm, located in JAVA_HOME\bin With 20M size of a capsule list in the memory, it ran out of memory and visualvm went in non-responding state. I had reduced capsule list size to 200k and 100M to 1M in if condition to test. After observing behaviour on visualvm, single loop execution completed prior to multiple loops. Maybe this is not the right approach, but you could experiment with it.

LoopBean.java

import java.util.List; public interface LoopMBean {     void multipleLoops();     void singleLoop();     void printResourcesStats(); } 

Loop.java

import java.util.ArrayList; import java.util.List; import java.util.Random;  public class Loop implements LoopMBean {      private final List<Capsule> capsules = new ArrayList<>();      {         Random r = new Random(3);         for (int n = 0; n < 20000000; n++) {             int randomN = r.nextInt();             capsules.add(new Capsule(randomN, "word" + randomN));         }     }      @Override     public void multipleLoops() {          System.out.println("----------------------Before multiple loops execution---------------------------");         printResourcesStats();          final List<Capsule> intermediate = new ArrayList<>();         final List<String> res = new ArrayList<>();         int totalLength = 0;          final long start = System.currentTimeMillis();          for (Capsule c : capsules)             if (c.getNumber() > 100000000) {                 intermediate.add(c);             }          for (Capsule c : intermediate) {             String s = "new_word" + c.getNumber();             res.add(s);         }          for (String s : res)             totalLength += s.length();          System.out.println("multiple loops=" + totalLength + " | time taken=" + (System.currentTimeMillis() - start) + " milliseconds");          System.out.println("----------------------After multiple loops execution---------------------------");         printResourcesStats();          res.clear();     }      @Override     public void singleLoop() {          System.out.println("----------------------Before single loop execution---------------------------");         printResourcesStats();          final List<String> res = new ArrayList<>();         int totalLength = 0;          final long start = System.currentTimeMillis();          for (Capsule c : capsules)             if (c.getNumber() > 100000000) {                 String s = "new_word" + c.getNumber();                 res.add(s);             }          for (String s : res)             totalLength += s.length();          System.out.println("Single loop=" + totalLength + " | time taken=" + (System.currentTimeMillis() - start) + " milliseconds");         System.out.println("----------------------After single loop execution---------------------------");         printResourcesStats();          res.clear();     }      @Override     public void printResourcesStats() {         System.out.println("Max Memory= " + Runtime.getRuntime().maxMemory());         System.out.println("Available Processors= " + Runtime.getRuntime().availableProcessors());         System.out.println("Total Memory= " + Runtime.getRuntime().totalMemory());         System.out.println("Free Memory= " + Runtime.getRuntime().freeMemory());     } } 

LoopClient.java

import javax.management.MBeanServer; import javax.management.ObjectName; import java.lang.management.ManagementFactory;  public class LoopClient {      void init() {          final MBeanServer mBeanServer = ManagementFactory.getPlatformMBeanServer();         try {             mBeanServer.registerMBean(new Loop(), new ObjectName("LOOP:name=LoopBean"));         } catch (Exception e) {             e.printStackTrace();         }      }      public static void main(String[] args) {          final LoopClient client = new LoopClient();         client.init();         System.out.println("Loop client is running...");         waitForEnterPressed();     }      private static void waitForEnterPressed() {         try {             System.out.println("Press  to continue...");             System.in.read();         } catch (Exception e) {             e.printStackTrace();         }     } } 

Execute with following command:

java -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=9999 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false LoopClient 

You could add -Xmx3072M extra option for quick memory increase to avoid OutOfMemoryError

like image 30
Adi Avatar answered Sep 20 '22 05:09

Adi