Here is the test case:
cat sum100000000.cpp && cat sum100000000.java
#include <cstdio>
using namespace std;

int main(){
    long N = 1000000000, sum = 0;
    for (long i = 0; i < N; i++) sum += i;
    printf("%ld\n", sum);
}
public class sum100000000 {
    public static void main(String[] args) {
        long sum = 0;
        for (long i = 0; i < 1000000000; i++) sum += i;
        System.out.println(sum);
    }
}
This is the result:
time ./a.out && time java sum100000000
499999999500000000
real 0m2.675s
user 0m2.673s
sys 0m0.002s
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
499999999500000000
real 0m0.439s
user 0m0.470s
sys 0m0.027s
I don't see anything unusual in the disassembled binary, yet the C binary is significantly slower, which I can't explain. My guess is that there might be some problem with the toolchain:
clang -v
Apple LLVM version 5.1 (clang-503.0.40) (based on LLVM 3.4svn)
Target: x86_64-apple-darwin13.4.0
Thread model: posix
uname -a
Darwin MacBook-Pro.local 13.4.0 Darwin Kernel Version 13.4.0: Sun Aug 17 19:50:11 PDT 2014; root:xnu-2422.115.4~1/RELEASE_X86_64 x86_64
Added: the C/C++ code was compiled without any special flags; switching compilers doesn't change the result anyway:
gcc sum1b.cpp
clang sum1b.cpp
Added: for those concerned about LLVM, using gcc really doesn't change anything:
$ gcc sum100000000.cpp && time ./a.out
499999999500000000
real 0m2.722s
user 0m2.717s
sys 0m0.003s
Modification: -O2 is a lot faster, but that looks like a cheat:
$ otool -tV a.out
a.out:
(__TEXT,__text) section
_main:
0000000100000f50 pushq %rbp
0000000100000f51 movq %rsp, %rbp
0000000100000f54 leaq 0x37(%rip), %rdi ## literal pool for: "%ld\n"
0000000100000f5b movabsq $0x6f05b59b5e49b00, %rsi
0000000100000f65 xorl %eax, %eax
0000000100000f67 callq 0x100000f70 ## symbol stub for: _printf
0000000100000f6c xorl %eax, %eax
0000000100000f6e popq %rbp
0000000100000f6f ret
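Note the movabsq $0x6f05b59b5e49b00, %rsi above: that immediate is 499999999500000000 in decimal, i.e. the final answer. In other words, at -O2 the compiler folded the entire loop into a constant at compile time (the closed form N(N-1)/2). A minimal sketch to verify this, using only the values already shown above (the class name is mine):
public class VerifySum {
    public static void main(String[] args) {
        long N = 1000000000L;
        // Closed form for 0 + 1 + ... + (N-1), Gauss's formula:
        long closedForm = N * (N - 1) / 2;
        // The immediate the compiler baked into the -O2 binary:
        long immediate = 0x6f05b59b5e49b00L;
        System.out.println(closedForm);              // 499999999500000000
        System.out.println(closedForm == immediate); // true
    }
}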
I'm convinced now that this is related to optimisation, so the question is now more about what exactly the JIT did to speed up this calculation.
Large integers: long variables can hold numbers from -9,223,372,036,854,775,808 through 9,223,372,036,854,775,807, but operations on long are slightly slower than on int; int runs much faster. Conclusion: use the int data type instead of long wherever you can to get better execution performance in interpreted-only mode.
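To see that gap in interpreted-only mode, here is a minimal timing sketch (the class name is mine; -Xint is the standard HotSpot flag that disables the JIT so the loops stay interpreted):
public class IntVsLong {
    public static void main(String[] args) {
        long t0 = System.nanoTime();
        int sumInt = 0; // overflows int; we only care about the timing here
        for (int i = 0; i < 100000000; i++) sumInt += i;
        long t1 = System.nanoTime();
        long sumLong = 0;
        for (long i = 0; i < 100000000; i++) sumLong += i;
        long t2 = System.nanoTime();
        System.out.println("int:  " + (t1 - t0) / 1000000 + " ms (sum=" + sumInt + ")");
        System.out.println("long: " + (t2 - t1) / 1000000 + " ms (sum=" + sumLong + ")");
    }
}
Run it with java -Xint IntVsLong to keep the JIT out of the picture.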
The issue is most likely that you're not compiling the C version with optimization enabled. If you enable aggressive optimization, the binary produced by gcc should win. The JVM's JIT is good, but the simple fact is that the JVM has to load and then apply the JIT at runtime; gcc can optimize the binary at compile-time.
Leaving off all gcc flags gives me a binary that performs fairly slowly, like yours. Using -O2 gives me a binary that just barely loses out to the Java version. Using -O3 gives me one that easily beats the Java version. (This is on my Linux Mint 16 64-bit machine with gcc 4.8.1 and Java 1.8.0_20 [i.e., Java 8 Update 20].) larsmans examined the disassembly of the -O3 version and assures me that the compiler isn't precomputing the result (my C and assembly fu is very weak these days; many thanks to larsmans for double-checking that). Interestingly, though, thanks to investigation by Mat, it looks like that's actually a byproduct of my using gcc 4.8.1; earlier and later versions of gcc seem to be willing to precompute the result. Happy accident for us, though.
Here's my pure C version [I've also updated it to account for Ajay's comment about your using a constant in the Java version but the variable N in the C version (didn't make any real difference, but...)]:
sum.c:
#include <stdio.h>

int main(){
    long sum = 0;
    long i;
    for (i = 0; i < 1000000000; i++) sum += i;
    printf("%ld\n", sum);
}
My Java version is unchanged from yours other than the class name (I lost track of the zeroes too easily):
sum.java:
public class sum {
    public static void main(String[] args) {
        long sum = 0;
        for (long i = 0; i < 1000000000; i++) sum += i;
        System.out.println(sum);
    }
}
Results:
C binary run (compiled via gcc sum.c):
$ time ./a.out
499999999500000000
real 0m2.436s
user 0m2.429s
sys 0m0.004s
Java run (compiled with no special flags, run with no special runtime flags):
$ time java sum
499999999500000000
real 0m0.691s
user 0m0.684s
sys 0m0.020s
Java run (compiled with no special flags, run with -server -noverify, tiny improvement):
$ time java -server -noverify sum
499999999500000000
real 0m0.651s
user 0m0.649s
sys 0m0.016s
C binary run (compiled via gcc -O2 sum.c):
$ time ./a.out
499999999500000000
real 0m0.733s
user 0m0.732s
sys 0m0.000s
C binary run (compiled via gcc -O3 sum.c):
$ time ./a.out
499999999500000000
real 0m0.373s
user 0m0.372s
sys 0m0.000s
Here's the main result of objdump -d a.out on my -O3 version:
0000000000400470 <main>:
  400470: 66 0f 6f 1d 08 02 00   movdqa 0x208(%rip),%xmm3   # 400680
  400477: 00
  400478: 31 c0                  xor    %eax,%eax
  40047a: 66 0f ef c9            pxor   %xmm1,%xmm1
  40047e: 66 0f 6f 05 ea 01 00   movdqa 0x1ea(%rip),%xmm0   # 400670
  400485: 00
  400486: eb 0c                  jmp    400494
  400488: 0f 1f 84 00 00 00 00   nopl   0x0(%rax,%rax,1)
  40048f: 00
  400490: 66 0f 6f c2            movdqa %xmm2,%xmm0
  400494: 66 0f 6f d0            movdqa %xmm0,%xmm2
  400498: 83 c0 01               add    $0x1,%eax
  40049b: 66 0f d4 c8            paddq  %xmm0,%xmm1
  40049f: 3d 00 65 cd 1d         cmp    $0x1dcd6500,%eax
  4004a4: 66 0f d4 d3            paddq  %xmm3,%xmm2
  4004a8: 75 e6                  jne    400490
  4004aa: 66 0f 6f e1            movdqa %xmm1,%xmm4
  4004ae: be 64 06 40 00         mov    $0x400664,%esi
  4004b3: bf 01 00 00 00         mov    $0x1,%edi
  4004b8: 31 c0                  xor    %eax,%eax
  4004ba: 66 0f 73 dc 08         psrldq $0x8,%xmm4
  4004bf: 66 0f d4 cc            paddq  %xmm4,%xmm1
  4004c3: 66 0f 7f 4c 24 e8      movdqa %xmm1,-0x18(%rsp)
  4004c9: 48 8b 54 24 e8         mov    -0x18(%rsp),%rdx
  4004ce: e9 8d ff ff ff         jmpq   400460
As I say, my assembly-fu is very weak, but I'm seeing a loop there, rather than the compiler having done the math.
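What the xmm loop above is doing is gcc's auto-vectorization: each iteration handles two values of i at once (note the trip count, cmp $0x1dcd6500 = 500,000,000, i.e. half a billion), keeping two 64-bit partial sums in the lanes of %xmm1 and combining them at the end with the psrldq/paddq pair. A scalar sketch of the same shape in Java, just to show the idea (names are mine):
public class VectorSketch {
    public static void main(String[] args) {
        // Two accumulators mirror the two 64-bit lanes of %xmm1;
        // the loop runs N/2 = 500,000,000 times, as in the disassembly.
        long lane0 = 0, lane1 = 0;
        for (long i = 0; i < 1000000000L; i += 2) {
            lane0 += i;     // even values of i
            lane1 += i + 1; // odd values of i
        }
        // The trailing psrldq + paddq is this horizontal add of the lanes:
        System.out.println(lane0 + lane1); // 499999999500000000
    }
}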
And just for completeness, the main part of the result of javap -c sum:
public static void main(java.lang.String[]);
  Code:
     0: lconst_0
     1: lstore_1
     2: lconst_0
     3: lstore_3
     4: lload_3
     5: ldc2_w        #2   // long 1000000000l
     8: lcmp
     9: ifge          23
    12: lload_1
    13: lload_3
    14: ladd
    15: lstore_1
    16: lload_3
    17: lconst_1
    18: ladd
    19: lstore_3
    20: goto          4
    23: getstatic     #4   // Field java/lang/System.out:Ljava/io/PrintStream;
    26: lload_1
    27: invokevirtual #5   // Method java/io/PrintStream.println:(J)V
    30: return
It's not precomputing the result at the bytecode level; I can't say what the JIT is doing.
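If you want to see what HotSpot's JIT actually emits for this loop, the JVM can dump the generated machine code with java -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly sum; note that this requires the hsdis disassembler plugin to be installed, and the exact output depends on your JVM version.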
It makes sense to optimize the Java run when comparing against gcc-optimized machine code. The accepted answer did not include the -server -noverify switches. Optionally, -Xcomp could be applied as well, but it is not recommended.
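For reference, -Xcomp forces HotSpot to compile every method before its first execution instead of interpreting it first; on a short-running benchmark like this the extra compile time usually outweighs the gain, which is why it's not recommended here (e.g. time java -Xcomp sum).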
CPU: Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz
java version "1.8.0_25" Java(TM) SE Runtime Environment (build 1.8.0_25-b17) Java HotSpot(TM) 64-Bit Server VM (build 25.25-b02, mixed mode)
gcc (Ubuntu 4.8.2-19ubuntu1) 4.8.2
sum.java with flags:
time java -server -noverify sum
499999999500000000
real 0m0.299s
user 0m0.303s
sys 0m0.004s
and sum.c compiled with the -O3 flag:
time ./a.out
499999999500000000
real 0m0.233s
user 0m0.233s
sys 0m0.000s