I've found that one of my applications published on the market produces weird results on some phones. On investigation, it turned out that a function which computes the distance between two GeoPoints sometimes returns a completely wrong value. The issue reproduces only on devices with the MediaTek MT6589 SoC (aka MTK6589), and AFAIK all such devices have Android 4.2 installed.
Update: I was also able to reproduce the bug on a Lenovo S6000 tablet with the MediaTek MT8125/8389 chip, and on a Fly IQ444 Quattro with MT6589 and Android 4.1 installed.
I created a test project which helps to reproduce the bug. It runs the computation repeatedly for 1'000 or 100'000 iterations. To exclude the possibility of threading issues, the computation is performed on the UI thread (with small pauses to keep the UI responsive). In the test project I used just a part of the original distance formula:
    private double calcX() {
        double t = 1.0;
        double X = 0.5 + t / 16384;
        return X;
    }
As you can check for yourself on web2.0calc.com, the value of X should be approximately 0.50006103515625.
However, on devices with the MT6589 chip the wrong value is often computed: 2.0.
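Since 16384 is 2^14, both 0.5 and 1/16384 are exactly representable as doubles, and so is their sum. The following is a minimal sketch (the class name ExactValueCheck is made up for illustration) that checks this on a desktop JVM: a correct implementation should return exactly 0.50006103515625, so a result of 2.0 cannot be explained by rounding.

```java
import java.math.BigDecimal;

public class ExactValueCheck {
    // Same expression as calcX() in the test project.
    static double calcX() {
        double t = 1.0;
        return 0.5 + t / 16384;
    }

    public static void main(String[] args) {
        // new BigDecimal(double) expands the exact binary value of the double,
        // so this comparison involves no decimal rounding.
        BigDecimal exact = new BigDecimal(calcX());
        BigDecimal expected = new BigDecimal("0.50006103515625");
        System.out.println("exact value: " + exact);
        System.out.println("matches expected: " + (exact.compareTo(expected) == 0));
    }
}
```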
The project is available at Google Code (an APK is available as well). The source of the test class is presented below:
    import android.app.Activity;
    import android.os.Bundle;
    import android.os.Handler;
    import android.text.Editable;
    import android.view.View;
    import android.widget.EditText;

    public class MtkTestActivity extends Activity {

        static final double A = 0.5;
        static final double B = 1;
        static final double D = 16384;

        // Constant expression, folded by the compiler; used as the reference value.
        static final double COMPUTED_CONST = A + B / D;

        /*
         * Main calculation where bug occurs
         */
        public double calcX() {
            double t = B;
            double X = A + t / D;
            return X;
        }

        // Runs SMALL_ITERATION computations, then re-posts itself so the UI stays responsive.
        class TestRunnable implements Runnable {

            static final double EP = 0.00000000001;
            static final double EXPECTED_LOW = COMPUTED_CONST - EP;
            static final double EXPECTED_HIGH = COMPUTED_CONST + EP;

            public void run() {
                for (int i = 0; i < SMALL_ITERATION; i++) {
                    // Note: this local A shadows the static field A.
                    double A = calcX();
                    if (A < EXPECTED_LOW || A > EXPECTED_HIGH) {
                        mFailedInCycle = true;
                        mFails++;
                        mEdit.getText().append("FAILED on " + mIteration + " iteration with: " + A + '\n');
                    }
                    mIteration++;
                }

                if (mIteration % 5000 == 0) {
                    if (mFailedInCycle) {
                        mFailedInCycle = false;
                    } else {
                        mEdit.getText().append("passed " + mIteration + " iterations\n");
                    }
                }

                if (mIteration < mIterationsCount) {
                    mHandler.postDelayed(new TestRunnable(), DELAY);
                } else {
                    mEdit.getText().append("\nFinished test with " + mFails + " fails");
                }
            }
        }

        public void onTestClick(View v) {
            startTest(IT_10K);
        }

        public void onTestClick100(View v) {
            startTest(IT_100K);
        }

        private void startTest(int iterationsCount) {
            Editable text = mEdit.getText();
            text.clear();
            text.append("\nStarting " + iterationsCount + " iterations test...");
            text.append("\n\nExpected result " + COMPUTED_CONST + "\n\n");
            mIteration = 0;
            mFails = 0;
            mFailedInCycle = false;
            mIterationsCount = iterationsCount;
            mHandler.postDelayed(new TestRunnable(), 100);
        }

        @Override
        protected void onCreate(Bundle savedInstanceState) {
            super.onCreate(savedInstanceState);
            setContentView(R.layout.activity_main);
            mHandler = new Handler(getMainLooper());
            mEdit = (EditText) findViewById(R.id.edtText1);
        }

        // Despite its name, IT_10K is 1'000 iterations.
        private static final int IT_10K = 1000;
        private static final int IT_100K = 100000;
        private static final int SMALL_ITERATION = 50;
        private static final int DELAY = 10;

        private int mIteration;
        private int mFails;
        private boolean mFailedInCycle;
        private Handler mHandler;
        private int mIterationsCount;
        private EditText mEdit;
    }
To fix the issue it's enough to change every double to float in the calcX() method.
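As a rough sketch of that workaround (the class and method names here are hypothetical, not taken from the test project), the float variant of the calculation could look like this. For these particular constants float loses nothing, because 0.5 + 2^-14 fits into float's 24-bit significand. A full distance formula would generally lose precision, though, since float carries only about 7 significant decimal digits.

```java
public class CalcFloat {
    // Float version of calcX(); every double in the method is replaced by float.
    static float calcXFloat() {
        float t = 1.0f;
        float x = 0.5f + t / 16384f;
        return x;
    }

    public static void main(String[] args) {
        float x = calcXFloat();
        System.out.println("float result: " + x);
        // The value is exactly representable in float, so an exact comparison is safe here.
        System.out.println("exact: " + (x == 0.50006103515625f));
    }
}
```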
Further investigation
Turning off JIT (by adding android:vmSafeMode="true" to the app manifest) fixes the bug as well.
Has anyone seen this bug before? Maybe this is a known issue?
P.S.: If anyone is able to reproduce this bug on a device with another chip, or can test it on any MediaTek chip with Android >= 4.3, I would highly appreciate it.
This was a JIT bug that was active in the JellyBean source from late 2012 through early 2013. In short, if two or more double-precision constants that differed in the high 32 bits but were identical in the low 32 bits were used in the same basic block, the JIT would think they were the same and inappropriately optimize one of them away.
I introduced the defect in: https://android-review.googlesource.com/#/c/47280/
and fixed it in: https://android-review.googlesource.com/#/c/57602/
The defect should not appear in any recent Android builds.
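To see why the constants from the question fit this description, their bit patterns can be split into high and low 32-bit words. The sketch below is only an illustration: the helper looksSameToLowWordCheck is a hypothetical stand-in for "comparing only the low words" and is not the actual JIT code.

```java
public class ConstantBits {
    // Upper 32 bits of the IEEE 754 representation.
    static int high32(double d) {
        return (int) (Double.doubleToRawLongBits(d) >>> 32);
    }

    // Lower 32 bits of the IEEE 754 representation.
    static int low32(double d) {
        return (int) Double.doubleToRawLongBits(d);
    }

    // Hypothetical illustration only: a comparison that looks at the low word alone.
    static boolean looksSameToLowWordCheck(double a, double b) {
        return low32(a) == low32(b);
    }

    public static void main(String[] args) {
        double[] constants = {0.5, 1.0, 16384.0};
        for (double c : constants) {
            System.out.println(String.format("%-8s high=0x%08X low=0x%08X", c, high32(c), low32(c)));
        }
        // All three constants have a zero low word, so a low-word-only check cannot tell them apart.
        System.out.println(looksSameToLowWordCheck(0.5, 16384.0));
    }
}
```

The high words printed here (0x3FE00000, 0x3FF00000, 0x40D00000) are the same values that appear as .word entries in the JIT dumps further down the thread.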
Regarding "Has anyone seen this bug before? Maybe this is a known issue?": reports like this show up on occasion on a couple of the Android mailing lists.
I believe what you are seeing is the effect of (1) different CPUs and their handling of floating point values, and (2) storage size differences that lead to different roundings and truncations.
For (1), something like the following is used in native code:
_controlfp(_PC_24, _MCW_PC);
_controlfp(_RC_NEAR, _MCW_RC);
For (2), use the common storage size, which is a float.
There's sometimes another related problem in the native world: a float is passed to a function, but the value at the function is always 0.0f (rather than the non-0 value used to invoke the function). You can clear that with -mfloat-abi=softfp. See Hard-float and JNI.
Unfortunately, you are at the mercy of the manufacturer when using their port of Android Java. Enjoy their tweaks, oversights and implementation bugs. At least it's not corrupting your VM.
I spent the last week investigating this issue and here's what I've found:
- The expression under test is X = A + b / D.
- I ran dalvikvm directly, passing parameters to it. This allowed me to set the JIT threshold and to capture the ARM code generated by the JIT.
- Passing -Xjitdisableopt:1 to Dalvik fixes the issue (this parameter disables the kLoadStoreElimination optimization). One may also add dalvik.vm.extra-opts=-Xjitdisableopt:1 to the build.prop file as a quick workaround which preserves JIT (root and reboot are required).
- I used libdvm.so (from a Fly IQ4410 with the MT6589 chip) on the emulator, and the bug reproduced there. But if I use libdvm.so compiled from the Android 4.2 sources, the bug disappears. It looks like there is an issue with the JIT-compiled code produced by some specific version of the libdvm library shipped with the affected devices.
- Also checked: libdvm.so from Fly IQ444 (MT6589, Android 4.1.2).

I've submitted a bug report #65750.
Here are the source and the JIT assembly output of the test used to reproduce the bug:
    public class Calc {
        static final double A = 0.5;
        static final double B = 1;
        static final double D = 16384;

        public double calcX() {
            double t = B;
            double X = A + t / D;
            return X;
        }
    }
JIT output for a usual run of Dalvik:
D/dalvikvm: Dumping LIR insns
D/dalvikvm: installed code is at 0x45deb000
D/dalvikvm: total size is 124 bytes
D/dalvikvm: 0x45deb000 (0000): data 0xc278(49784)
D/dalvikvm: 0x45deb002 (0002): data 0x457a(17786)
D/dalvikvm: 0x45deb004 (0004): data 0x0044(68)
D/dalvikvm: 0x45deb006 (0006): ldr r0, [r15pc, -#8]
D/dalvikvm: 0x45deb00a (000a): ldr r1, [r0, #0]
D/dalvikvm: 0x45deb00c (000c): adds r1, r1, #1
D/dalvikvm: 0x45deb00e (000e): str r1, [r0, #0]
D/dalvikvm: -------- entry offset: 0x0000
D/dalvikvm: L0x4579e28c:
D/dalvikvm: -------- dalvik offset: 0x0000 @ const-wide/high16 v0, (#16368), (#0)
D/dalvikvm: 0x45deb010 (0010): vldr d8, [r15, #96]
D/dalvikvm: -------- dalvik offset: 0x0002 @ const-wide/high16 v2, (#16352), (#0)
D/dalvikvm: 0x45deb014 (0014): vmov.f64 d9, d8
D/dalvikvm: -------- dalvik offset: 0x0004 @ const-wide/high16 v4, (#16592), (#0)
D/dalvikvm: 0x45deb018 (0018): vmov.f64 d10, d9
D/dalvikvm: -------- dalvik offset: 0x0006 @ div-double/2addr v0, v4, (#0)
D/dalvikvm: 0x45deb01c (001c): vdivd d8, d8, d10
D/dalvikvm: -------- dalvik offset: 0x0007 @ add-double/2addr v0, v2, (#0)
D/dalvikvm: 0x45deb020 (0020): vadd d8, d8, d9
D/dalvikvm: -------- dalvik offset: 0x0008 @ return-wide v0, (#0), (#0)
D/dalvikvm: 0x45deb024 (0024): vmov.f64 d11, d8
D/dalvikvm: 0x45deb028 (0028): vstr d11, [r6, #16]
D/dalvikvm: 0x45deb02c (002c): vstr d8, [r5, #0]
D/dalvikvm: 0x45deb030 (0030): vstr d10, [r5, #16]
D/dalvikvm: 0x45deb034 (0034): vstr d9, [r5, #8]
D/dalvikvm: 0x45deb038 (0038): blx_1 0x45dea028
D/dalvikvm: 0x45deb03a (003a): blx_2 see above
D/dalvikvm: 0x45deb03c (003c): b 0x45deb040 (L0x4579f068)
D/dalvikvm: 0x45deb03e (003e): undefined
D/dalvikvm: L0x4579f068:
D/dalvikvm: -------- reconstruct dalvik PC : 0x457b83f4 @ +0x0008
D/dalvikvm: 0x45deb040 (0040): ldr r0, [r15pc, #28]
D/dalvikvm: Exception_Handling:
D/dalvikvm: 0x45deb044 (0044): ldr r1, [r6, #108]
D/dalvikvm: 0x45deb046 (0046): blx r1
D/dalvikvm: -------- end of chaining cells (0x0048)
D/dalvikvm: 0x45deb060 (0060): .word (0x457b83f4)
D/dalvikvm: 0x45deb064 (0064): .word (0)
D/dalvikvm: 0x45deb068 (0068): .word (0x40d00000)
D/dalvikvm: 0x45deb06c (006c): .word (0)
D/dalvikvm: 0x45deb070 (0070): .word (0x3fe00000)
D/dalvikvm: 0x45deb074 (0074): .word (0)
D/dalvikvm: 0x45deb078 (0078): .word (0x3ff00000)
D/dalvikvm: End LCalc;calcX, 6 Dalvik instructions.
The most interesting part is:
vldr d8, [r15, #96] ; d8 := 1.0
vmov.f64 d9, d8 ; d9 := d8
vmov.f64 d10, d9 ; d10 := d9 // now d8, d9 and d10 contains 1.0 !!!
vdivd d8, d8, d10 ; d8 := d8 / d10 = 1.0
vadd d8, d8, d9 ; d8 := d8 + d9 = 2.0
vmov.f64 d11, d8
Well, the code produced by the JIT looks completely wrong. Instead of three constants, only one (1.0) is loaded, and as a result we get the computation X = 1.0 + 1.0 / 1.0, which not surprisingly evaluates to 2.0.
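The const-wide/high16 operands in the dump can be decoded by hand: the instruction supplies the top 16 bits of the 64-bit value and the rest is zero. A small sketch (class and method names are made up for illustration) that decodes the three operands and replays both the intended and the miscompiled register sequence:

```java
public class High16Decode {
    // const-wide/high16 carries only the top 16 bits; the lower 48 bits are zero.
    static double fromHigh16(int high16) {
        return Double.longBitsToDouble(((long) high16) << 48);
    }

    // What the bytecode asks for: v0 = v0 / v4; v0 = v0 + v2.
    static double intended() {
        double v0 = fromHigh16(16368); // 0x3FF0 -> 1.0
        double v2 = fromHigh16(16352); // 0x3FE0 -> 0.5
        double v4 = fromHigh16(16592); // 0x40D0 -> 16384.0
        v0 = v0 / v4;
        return v0 + v2;
    }

    // What the faulty machine code does: one load, then two register copies.
    static double miscompiled() {
        double d8 = fromHigh16(16368); // vldr d8       -> 1.0
        double d9 = d8;                // vmov d9, d8   -> 1.0 instead of 0.5
        double d10 = d9;               // vmov d10, d9  -> 1.0 instead of 16384.0
        d8 = d8 / d10;                 // vdivd
        return d8 + d9;                // vadd
    }

    public static void main(String[] args) {
        System.out.println("intended:    " + intended());
        System.out.println("miscompiled: " + miscompiled());
    }
}
```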
And here is the JIT output for Dalvik run with the kLoadStoreElimination optimization disabled (which fixes the bug):
D/dalvikvm: Dumping LIR insns
D/dalvikvm: installed code is at 0x45d64000
D/dalvikvm: total size is 124 bytes
D/dalvikvm: 0x45d64000 (0000): data 0x5260(21088)
D/dalvikvm: 0x45d64002 (0002): data 0x4572(17778)
D/dalvikvm: 0x45d64004 (0004): data 0x0044(68)
D/dalvikvm: 0x45d64006 (0006): ldr r0, [r15pc, -#8]
D/dalvikvm: 0x45d6400a (000a): ldr r1, [r0, #0]
D/dalvikvm: 0x45d6400c (000c): adds r1, r1, #1
D/dalvikvm: 0x45d6400e (000e): str r1, [r0, #0]
D/dalvikvm: -------- entry offset: 0x0000
D/dalvikvm: L0x45717274:
D/dalvikvm: -------- dalvik offset: 0x0000 @ const-wide/high16 v0, (#16368), (#0)
D/dalvikvm: 0x45d64010 (0010): vldr d8, [r15, #96]
D/dalvikvm: -------- dalvik offset: 0x0002 @ const-wide/high16 v2, (#16352), (#0)
D/dalvikvm: 0x45d64014 (0014): vldr d10, [r15, #76]
D/dalvikvm: 0x45d64018 (0018): vldr d9, [r15, #80]
D/dalvikvm: 0x45d6401c (001c): vstr d9, [r5, #8]
D/dalvikvm: -------- dalvik offset: 0x0004 @ const-wide/high16 v4, (#16592), (#0)
D/dalvikvm: 0x45d64020 (0020): vstr d10, [r5, #16]
D/dalvikvm: -------- dalvik offset: 0x0006 @ div-double/2addr v0, v4, (#0)
D/dalvikvm: 0x45d64024 (0024): vdivd d8, d8, d10
D/dalvikvm: -------- dalvik offset: 0x0007 @ add-double/2addr v0, v2, (#0)
D/dalvikvm: 0x45d64028 (0028): vadd d8, d8, d9
D/dalvikvm: 0x45d6402c (002c): vstr d8, [r5, #0]
D/dalvikvm: -------- dalvik offset: 0x0008 @ return-wide v0, (#0), (#0)
D/dalvikvm: 0x45d64030 (0030): vmov.f64 d11, d8
D/dalvikvm: 0x45d64034 (0034): vstr d11, [r6, #16]
D/dalvikvm: 0x45d64038 (0038): blx_1 0x45d63028
D/dalvikvm: 0x45d6403a (003a): blx_2 see above
D/dalvikvm: 0x45d6403c (003c): b 0x45d64040 (L0x45718050)
D/dalvikvm: 0x45d6403e (003e): undefined
D/dalvikvm: L0x45718050:
D/dalvikvm: -------- reconstruct dalvik PC : 0x457313f4 @ +0x0008
D/dalvikvm: 0x45d64040 (0040): ldr r0, [r15pc, #28]
D/dalvikvm: Exception_Handling:
D/dalvikvm: 0x45d64044 (0044): ldr r1, [r6, #108]
D/dalvikvm: 0x45d64046 (0046): blx r1
D/dalvikvm: -------- end of chaining cells (0x0048)
D/dalvikvm: 0x45d64060 (0060): .word (0x457313f4)
D/dalvikvm: 0x45d64064 (0064): .word (0)
D/dalvikvm: 0x45d64068 (0068): .word (0x40d00000)
D/dalvikvm: 0x45d6406c (006c): .word (0)
D/dalvikvm: 0x45d64070 (0070): .word (0x3fe00000)
D/dalvikvm: 0x45d64074 (0074): .word (0)
D/dalvikvm: 0x45d64078 (0078): .word (0x3ff00000)
D/dalvikvm: End LCalc;calcX, 6 Dalvik instructions
All three constants are loaded as expected and the correct evaluation is performed.
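For anyone who wants to repeat the experiment outside an Activity, a command-line driver along the following lines could be used. This is only a sketch under assumptions: the class name CalcDriver and the iteration count are invented here, and the exact dalvikvm invocation used in the investigation is not given above, so none is shown. The loop simply calls the method often enough for a JIT to consider it hot and counts results that differ from the compile-time constant.

```java
public class CalcDriver {
    static final double A = 0.5;
    static final double B = 1;
    static final double D = 16384;

    // Constant expression: evaluated by javac, not at run time.
    static final double EXPECTED = A + B / D;

    // Same body as Calc.calcX(); t is a non-final local, so the expression is computed at run time.
    static double calcX() {
        double t = B;
        double x = A + t / D;
        return x;
    }

    // Runs the calculation repeatedly and counts results that differ from the expected value.
    static int countFailures(int iterations) {
        int fails = 0;
        for (int i = 0; i < iterations; i++) {
            // The expected result is exactly representable, so exact comparison is fine.
            if (calcX() != EXPECTED) {
                fails++;
            }
        }
        return fails;
    }

    public static void main(String[] args) {
        int iterations = args.length > 0 ? Integer.parseInt(args[0]) : 100000;
        System.out.println("fails: " + countFailures(iterations) + " of " + iterations);
    }
}
```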
The problem you are facing could possibly be related to the processor hardware.
There are some notorious examples in computing history:
In 1994, some Intel Pentium processors had a flaw that produced floating-point calculation errors (the FDIV bug). The error showed up only from around the 4th digit after the decimal point. Intel ultimately put in place a replacement program to swap the defective CPUs for good ones.
The DEC VAX 11/785 (introduced 1984) had a design flaw in its (optional) floating point coprocessor. Due to a race condition in hardware, on some machines the floating point coprocessor sometimes returned an arbitrary value instead of the desired result. Digital Equipment Corporation put in place a program to replace the coprocessor (5 large printed circuit boards) for all customers with a hardware maintenance contract in place.
I'd suggest doing some more testing on a wider hardware base to better understand the problem. If the problem really is related to hardware, I'm guessing your best approach would be to find a way to work around it and document it for other developers.