How does(did) a Java runtime targeting an Intel processor without SSE2 deal with floating-point denormals, when strictfp
is set?
Even when the 387 FPU is set for 53-bit precision, it keeps an oversized exponent range that:
Strategies include re-computing the operation that resulted in a denormal value with emulated floating-point, or a permanent exponent offset along the lines of this technique to equip OCaml with 63-bit floats, borrowing a bit from the exponent in order to avoid double-rounding.
In any case, I see no way to avoid at least one conditional branch for each floating-point computation, unless the operation can statically be determined not to underflow/overflow. How exceptional (overflow/underflow) cases are dealt with is part of my question, but this cannot be separated from the question of the representation (the permanent exponent offset strategy seems to mean that only overflows need to be checked for, for instance).
It looks to me, from a very trivial test case, like the JVM round-trips every double
computation through memory to get the rounding it wants. It also seems to do something weird with a couple of magic constants. Here's what it did for me for a simple "compute 2^n naively" program:
0xb1e444b0: fld1
0xb1e444b2: jmp 0xb1e444dd ;*iload
; - fptest::calc@9 (line 6)
0xb1e444b7: nop
0xb1e444b8: fldt 0xb523a2c8 ; {external_word}
0xb1e444be: fmulp %st,%st(1)
0xb1e444c0: fmull 0xb1e44490 ; {section_word}
0xb1e444c6: fldt 0xb523a2bc ; {external_word}
0xb1e444cc: fmulp %st,%st(1)
0xb1e444ce: fstpl 0x10(%esp)
0xb1e444d2: inc %esi ; OopMap{off=51}
;*goto
; - fptest::calc@22 (line 6)
0xb1e444d3: test %eax,0xb3f8d100 ; {poll}
0xb1e444d9: fldl 0x10(%esp) ;*goto
; - fptest::calc@22 (line 6)
0xb1e444dd: cmp %ecx,%esi
0xb1e444df: jl 0xb1e444b8 ;*if_icmpge
; - fptest::calc@12 (line 6)
I believe 0xb523a2c8
and 0xb523a2bc
are _fpu_subnormal_bias1
and _fpu_subnormal_bias2
from the hotspot source code. _fpu_subnormal_bias1
looks to be 0x03ff8000000000000000
and _fpu_subnormal_bias2
looks to be 0x7bff8000000000000000
. _fpu_subnormal_bias1
has the effect of scaling the smallest normal double
to the smallest normal long double
; if the FPU rounds to 53 bits, the "right thing" will happen.
I'd speculate that the seemingly-pointless test
instruction is there so that the thread can be interrupted by marking that page unreadable in the event that a GC is necessary.
Here's the Java code:
import java.io.*;
public strictfp class fptest {
public static double calc(int k) {
double a = 2.0;
double b = 1.0;
for (int i = 0; i < k; i++) {
b *= a;
}
return b;
}
public static double intest() {
double d = 0;
for (int i = 0; i < 4100; i++) d += calc(i);
return d;
}
public static void main(String[] args) throws Exception {
for (int i = 0; i < 100; i++)
System.out.println(intest());
}
}
Digging further, the code for these operations is in plain sight in the OpenJDK code in hotspot/src/cpu/x86/vm/x86_63.ad
. Relevant snippets:
instruct strictfp_mulD_reg(regDPR1 dst, regnotDPR1 src) %{
predicate( UseSSE<=1 && Compile::current()->has_method() && Compile::current()
->method()->is_strict() );
match(Set dst (MulD dst src));
ins_cost(1); // Select this instruction for all strict FP double multiplies
format %{ "FLD StubRoutines::_fpu_subnormal_bias1\n\t"
"DMULp $dst,ST\n\t"
"FLD $src\n\t"
"DMULp $dst,ST\n\t"
"FLD StubRoutines::_fpu_subnormal_bias2\n\t"
"DMULp $dst,ST\n\t" %}
opcode(0xDE, 0x1); /* DE C8+i or DE /1*/
ins_encode( strictfp_bias1(dst),
Push_Reg_D(src),
OpcP, RegOpc(dst),
strictfp_bias2(dst) );
ins_pipe( fpu_reg_reg );
%}
instruct strictfp_divD_reg(regDPR1 dst, regnotDPR1 src) %{
predicate (UseSSE<=1);
match(Set dst (DivD dst src));
predicate( UseSSE<=1 && Compile::current()->has_method() && Compile::current()
->method()->is_strict() );
ins_cost(01);
format %{ "FLD StubRoutines::_fpu_subnormal_bias1\n\t"
"DMULp $dst,ST\n\t"
"FLD $src\n\t"
"FDIVp $dst,ST\n\t"
"FLD StubRoutines::_fpu_subnormal_bias2\n\t"
"DMULp $dst,ST\n\t" %}
opcode(0xDE, 0x7); /* DE F8+i or DE /7*/
ins_encode( strictfp_bias1(dst),
Push_Reg_D(src),
OpcP, RegOpc(dst),
strictfp_bias2(dst) );
ins_pipe( fpu_reg_reg );
%}
I see nothing for addition and subtraction, but I'd bet they just do an add/subtract with the FPU in 53-bit mode and then round-trip the result through memory. I'm a little curious whether there's a tricky overflow case that they get wrong, but I'm not curious enough to find out.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With