Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Is there a useful difference between (p ^ q) and (p != q) for booleans?

Java has two ways of checking whether two booleans differ. You can compare them with !=, or with ^ (xor). Of course, these two operators produce the same result in all cases. Still, it makes sense for both of them to be included, as discussed, for example, in What's the difference between XOR and NOT-EQUAL-TO?. It even makes sense for developers to prefer one over the other depending on context - sometimes "is exactly one of these booleans true" reads better, and other times "are these two booleans different" communicates intent better. So, perhaps which one to use should be a matter of taste and style.

What surprised me is that javac does not treat these identically! Consider this class:

class Test {
  public boolean xor(boolean p, boolean q) {
    return p ^ q;
  }
  public boolean inequal(boolean p, boolean q) {
    return p != q;
  }
}

Obviously, the two methods have the same visible behavior. But they have different bytecode:

$ javap -c Test
Compiled from "Test.java"
class Test {
  Test();
    Code:
       0: aload_0
       1: invokespecial #1                  // Method java/lang/Object."<init>":()V
       4: return

  public boolean xor(boolean, boolean);
    Code:
       0: iload_1
       1: iload_2
       2: ixor
       3: ireturn

  public boolean inequal(boolean, boolean);
    Code:
       0: iload_1
       1: iload_2
       2: if_icmpeq     9
       5: iconst_1
       6: goto          10
       9: iconst_0
      10: ireturn
}

If I had to guess, I'd say that xor performs better, since it just returns the result of its comparison; adding in a jump and an extra load just seems like wasted work. But instead of guessing, I benchmarked a few billion calls to both methods using Clojure's "criterium" benchmarking tool. It's close enough that while it looks like xor is a bit faster I'm not good enough at statistics to say whether the results are significant:

user=> (let [t (Test.)] (bench (.xor t true false)))
Evaluation count : 4681301040 in 60 samples of 78021684 calls.
             Execution time mean : 4.273428 ns
    Execution time std-deviation : 0.168423 ns
   Execution time lower quantile : 4.044192 ns ( 2.5%)
   Execution time upper quantile : 4.649796 ns (97.5%)
                   Overhead used : 8.723577 ns

Found 2 outliers in 60 samples (3.3333 %)
    low-severe   2 (3.3333 %)
 Variance from outliers : 25.4745 % Variance is moderately inflated by outliers
user=> (let [t (Test.)] (bench (.inequal t true false)))
Evaluation count : 4570766220 in 60 samples of 76179437 calls.
             Execution time mean : 4.492847 ns
    Execution time std-deviation : 0.162946 ns
   Execution time lower quantile : 4.282077 ns ( 2.5%)
   Execution time upper quantile : 4.813433 ns (97.5%)
                   Overhead used : 8.723577 ns

Found 2 outliers in 60 samples (3.3333 %)
    low-severe   2 (3.3333 %)
 Variance from outliers : 22.2554 % Variance is moderately inflated by outliers

Is there some reason to prefer writing one over the other, performance-wise1? Some context in which the difference in their implementation makes one more suitable than the other? Or, does anyone know why javac implements these two identical operations so differently?

1 Of course, I will not recklessly use this information to micro-optimize. I'm just curious how this all works.

like image 228
amalloy Avatar asked Apr 09 '19 17:04

amalloy


1 Answers

Well, I am going to provide how the CPU translates that shortly and update the post, but in the meanwhile, you are looking at waaaay too small difference to care.

byte-code in java is not an indication of how fast (or not) a method will execute, there are two JIT compilers that will make this method look entirely different once they are hot enough. also javac is known to do very little optimizations once it compiles the code, the real optimizations come from JIT.

I've put up some tests using JMH for this using either C1 compiler only or replacing C2 with GraalVM or no JIT at all... (lots of testing code follows, you can skip it and just look at the results, this is done using jdk-12 btw). This code is using JMH - the de facto tool to use in java world of micro-benchmarks (which are notoriously error-prone if done by hand).

@Warmup(iterations = 10)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Measurement(iterations = 2, time = 2, timeUnit = TimeUnit.SECONDS)
public class BooleanCompare {

    public static void main(String[] args) throws Exception {
        Options opt = new OptionsBuilder()
            .include(BooleanCompare.class.getName())
            .build();

        new Runner(opt).run();
    }

    @Benchmark
    @BenchmarkMode(Mode.AverageTime)
    @Fork(1)
    public boolean xor(BooleanExecutionPlan plan) {
        return plan.booleans()[0] ^ plan.booleans()[1];
    }

    @Benchmark
    @BenchmarkMode(Mode.AverageTime)
    @Fork(1)
    public boolean plain(BooleanExecutionPlan plan) {
        return plan.booleans()[0] != plan.booleans()[1];
    }

    @Benchmark
    @BenchmarkMode(Mode.AverageTime)
    @Fork(value = 1, jvmArgsAppend = "-Xint")
    public boolean xorNoJIT(BooleanExecutionPlan plan) {
        return plan.booleans()[0] != plan.booleans()[1];
    }

    @Benchmark
    @BenchmarkMode(Mode.AverageTime)
    @Fork(value = 1, jvmArgsAppend = "-Xint")
    public boolean plainNoJIT(BooleanExecutionPlan plan) {
        return plan.booleans()[0] != plan.booleans()[1];
    }

    @Benchmark
    @BenchmarkMode(Mode.AverageTime)
    @Fork(value = 1, jvmArgsAppend = "-XX:-TieredCompilation")
    public boolean xorC2Only(BooleanExecutionPlan plan) {
        return plan.booleans()[0] != plan.booleans()[1];
    }

    @Benchmark
    @BenchmarkMode(Mode.AverageTime)
    @Fork(value = 1, jvmArgsAppend = "-XX:-TieredCompilation")
    public boolean plainC2Only(BooleanExecutionPlan plan) {
        return plan.booleans()[0] != plan.booleans()[1];
    }

    @Benchmark
    @BenchmarkMode(Mode.AverageTime)
    @Fork(value = 1, jvmArgsAppend = "-XX:TieredStopAtLevel=1")
    public boolean xorC1Only(BooleanExecutionPlan plan) {
        return plan.booleans()[0] != plan.booleans()[1];
    }

    @Benchmark
    @BenchmarkMode(Mode.AverageTime)
    @Fork(value = 1, jvmArgsAppend = "-XX:TieredStopAtLevel=1")
    public boolean plainC1Only(BooleanExecutionPlan plan) {
        return plan.booleans()[0] != plan.booleans()[1];
    }

    @Benchmark
    @BenchmarkMode(Mode.AverageTime)
    @Fork(value = 1,
        jvmArgsAppend = {
            "-XX:+UnlockExperimentalVMOptions",
            "-XX:+EagerJVMCI",
            "-Dgraal.ShowConfiguration=info",
            "-XX:+UseJVMCICompiler",
            "-XX:+EnableJVMCI"
        })
    public boolean xorGraalVM(BooleanExecutionPlan plan) {
        return plan.booleans()[0] != plan.booleans()[1];
    }

    @Benchmark
    @BenchmarkMode(Mode.AverageTime)
    @Fork(value = 1,
        jvmArgsAppend = {
            "-XX:+UnlockExperimentalVMOptions",
            "-XX:+EagerJVMCI",
            "-Dgraal.ShowConfiguration=info",
            "-XX:+UseJVMCICompiler",
            "-XX:+EnableJVMCI"
        })
    public boolean plainGraalVM(BooleanExecutionPlan plan) {
        return plan.booleans()[0] != plan.booleans()[1];
    }

}

And the results:

BooleanCompare.plain         avgt    2    3.125          ns/op
BooleanCompare.xor           avgt    2    2.976          ns/op

BooleanCompare.plainC1Only   avgt    2    3.400          ns/op
BooleanCompare.xorC1Only     avgt    2    3.379          ns/op

BooleanCompare.plainC2Only   avgt    2    2.583          ns/op
BooleanCompare.xorC2Only     avgt    2    2.685          ns/op

BooleanCompare.plainGraalVM  avgt    2    2.980          ns/op
BooleanCompare.xorGraalVM    avgt    2    3.868          ns/op

BooleanCompare.plainNoJIT    avgt    2  243.348          ns/op
BooleanCompare.xorNoJIT      avgt    2  201.342          ns/op

I am not a versatile enough person to read assembler, though I sometimes like to do that... Here are some interesting things. If we do:

C1 compiler only with !=

/*
 * run many iterations of this with :
 *  java -XX:+UnlockDiagnosticVMOptions  
 *       -XX:TieredStopAtLevel=1  
 *       "-XX:CompileCommand=print,com/so/BooleanCompare.compare"  
 *       com.so.BooleanCompare
 */
public static boolean compare(boolean left, boolean right) {
    return left != right;
}

we get:

  0x000000010d1b2bc7: push   %rbp
  0x000000010d1b2bc8: sub    $0x30,%rsp  ;*iload_0 {reexecute=0 rethrow=0 return_oop=0}
                                         ; - com.so.BooleanCompare::compare@0 (line 22)

  0x000000010d1b2bcc: cmp    %edx,%esi
  0x000000010d1b2bce: mov    $0x0,%eax
  0x000000010d1b2bd3: je     0x000000010d1b2bde
  0x000000010d1b2bd9: mov    $0x1,%eax
  0x000000010d1b2bde: and    $0x1,%eax
  0x000000010d1b2be1: add    $0x30,%rsp
  0x000000010d1b2be5: pop    %rbp

To me, this code is a bit obvious: put 0 into eax, compare (edx, esi) -> if not equal put 1 into eax. return eax & 1.

C1 compiler with ^:

public static boolean compare(boolean left, boolean right) {
     return left ^ right;
}



  # parm0:    rsi       = boolean
  # parm1:    rdx       = boolean
  #           [sp+0x40]  (sp of caller)
  0x000000011326e5c0: mov    %eax,-0x14000(%rsp)
  0x000000011326e5c7: push   %rbp
  0x000000011326e5c8: sub    $0x30,%rsp   ;*iload_0 {reexecute=0 rethrow=0 return_oop=0}
                                          ; - com.so.BooleanCompare::compare@0 (line 22)

  0x000000011326e5cc: xor    %rdx,%rsi
  0x000000011326e5cf: and    $0x1,%esi
  0x000000011326e5d2: mov    %rsi,%rax
  0x000000011326e5d5: add    $0x30,%rsp
  0x000000011326e5d9: pop    %rbp

I don't really know why and $0x1,%esi is needed here, otherwise this is fairly simple too, I guess.

But if I enable C2 compiler, things are a lot more interesting.

/**
 * run with java
 * -XX:+UnlockDiagnosticVMOptions
 * -XX:CICompilerCount=2
 * -XX:-TieredCompilation
 * "-XX:CompileCommand=print,com/so/BooleanCompare.compare"
 * com.so.BooleanCompare
 */
public static boolean compare(boolean left, boolean right) {
    return left != right;
}



  # parm0:    rsi       = boolean
  # parm1:    rdx       = boolean
  #           [sp+0x20]  (sp of caller)
  0x000000011a2bbfa0: sub    $0x18,%rsp
  0x000000011a2bbfa7: mov    %rbp,0x10(%rsp)                

  0x000000011a2bbfac: xor    %r10d,%r10d
  0x000000011a2bbfaf: mov    $0x1,%eax
  0x000000011a2bbfb4: cmp    %edx,%esi
  0x000000011a2bbfb6: cmove  %r10d,%eax                     

  0x000000011a2bbfba: add    $0x10,%rsp
  0x000000011a2bbfbe: pop    %rbp

I don't even see the classic epilog push ebp; mov ebp, esp; sub esp, x, instead something very un-usual (at least for me) via:

 sub    $0x18,%rsp
 mov    %rbp,0x10(%rsp)

 ....
 add    $0x10,%rsp
 pop    %rbp

Again, someone more versatile than me, can explain hopefully. Otherwise it's like a better version of the C1 generated:

xor    %r10d,%r10d // put zero into r10d
mov    $0x1,%eax   // put 1 into eax
cmp    %edx,%esi   // compare edx and esi
cmove  %r10d,%eax  // conditionally move the contents of r10d into eax

AFAIK cmp/cmove is better than cmp/je because of branch-prediction - this is at least what I've read...

XOR with C2 compiler:

public static boolean compare(boolean left, boolean right) {
    return left ^ right;
}



  0x000000010e6c9a20: sub    $0x18,%rsp
  0x000000010e6c9a27: mov    %rbp,0x10(%rsp)                

  0x000000010e6c9a2c: xor    %edx,%esi
  0x000000010e6c9a2e: mov    %esi,%eax
  0x000000010e6c9a30: and    $0x1,%eax
  0x000000010e6c9a33: add    $0x10,%rsp
  0x000000010e6c9a37: pop    %rbp

It sure looks like it's almost the same as C1 compiler generated.

like image 181
Eugene Avatar answered Nov 13 '22 16:11

Eugene