We recently started seeing unit tests fail on our build machine (certain numerical calculations fell out of tolerance). Upon investigation we found that some of our developers could not reproduce the test failure. To cut a long story short, we eventually tracked the problem down to what appeared to be a rounding error, but that error was only occurring with x64 builds on the latest Haswell chips (to which our build server was recently upgraded). We narrowed it down and pulled out a single calculation from one of our tests:
#include "stdafx.h"
#include <cmath>

int _tmain(int argc, _TCHAR* argv[])
{
    double rate = 0.0021627412080263146;
    double T = 4.0246575342465754;
    double res = exp(-rate * T);
    printf("%0.20e\n", res);
    return 0;
}
When we compile this for x64 in VS2013 (with the default compiler switches, including /fp:precise), it gives different results on the older Sandy Bridge chip and the newer Haswell chip. The difference is in the 15th significant digit, which I think is outside the machine epsilon for double on both machines.
If we compile the same code in VS2010 or VS2012 (or, incidentally, VS2013 x86) we get the exact same answer on both chips.
In the past several years, we've gone through many versions of Visual Studio and many different Intel chips for testing, and no-one can recall us ever having to adjust our regression test expectations based on different rounding errors between chips.
This obviously led to a game of whack-a-mole between developers with the older and newer hardware as to what should be the expectation for the tests...
Is there a compiler option in VS2013 that we need to be using to somehow mitigate the discrepancy?
Update:
Results on Sandy Bridge developer PC:
VS2010-compiled-x64: 9.91333479983898980000e-001
VS2012-compiled-x64: 9.91333479983898980000e-001
VS2013-compiled-x64: 9.91333479983898980000e-001
Results on Haswell build server:
VS2010-compiled-x64: 9.91333479983898980000e-001
VS2012-compiled-x64: 9.91333479983898980000e-001
VS2013-compiled-x64: 9.91333479983899090000e-001
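For scale, the two printed values can be compared in units in the last place (ULPs). The following is only a minimal sketch: `ulp_distance` is a hypothetical helper, and it assumes IEEE-754 doubles and that the 17 significant digits printed above round-trip to the original values.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper: distance in ULPs between two finite doubles
// that have the same sign. Assumes 64-bit IEEE-754 doubles, where
// adjacent values have adjacent integer bit patterns.
std::int64_t ulp_distance(double a, double b)
{
    std::int64_t ia, ib;
    std::memcpy(&ia, &a, sizeof a);  // reinterpret the bits without UB
    std::memcpy(&ib, &b, sizeof b);
    return ia > ib ? ia - ib : ib - ia;
}
```

Calling `ulp_distance(9.91333479983898980000e-001, 9.91333479983899090000e-001)` should return 1 under those assumptions, i.e. the two results are neighbouring doubles.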
Update:
I used procexp to capture the list of DLLs loaded into the test program.
Sandy Bridge developer PC:
apisetschema.dll
ConsoleApplication8.exe
kernel32.dll
KernelBase.dll
locale.nls
msvcr120.dll
ntdll.dll
Haswell build server:
ConsoleApplication8.exe
kernel32.dll
KernelBase.dll
locale.nls
msvcr120.dll
ntdll.dll
The results you documented are affected by the value of the MXCSR register; the two bits that select the rounding mode are what matter here. To get the "happy" number you like, you need to force the processor to round down, like this:
#include "stdafx.h"
#include <cmath>
#include <float.h>

int _tmain(int argc, _TCHAR* argv[]) {
    unsigned prev;
    _controlfp_s(&prev, _RC_DOWN, _MCW_RC);
    double rate = 0.0021627412080263146;
    double T = 4.0246575342465754;
    double res = exp(-rate * T);
    printf("%0.20f\n", res);
    return 0;
}
Output: 0.99133347998389898000
Change _RC_DOWN to _RC_NEAR to put MXCSR in its normal rounding mode, the way the operating system programs it before it starts your program. That produces 0.99133347998389909000. In other words, your Haswell machines are in fact producing the expected value.
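To see which mode a given machine is actually running in, the program can query it at startup. This is only a sketch, assuming a toolchain that provides the standard `<cfenv>` header (the `_controlfp_s` call above is the MSVC-specific route); `rounding_mode_name` and `current_rounding_mode` are hypothetical helper names.

```cpp
#include <cfenv>
#include <string>

// Hypothetical helper: maps a <cfenv> rounding-mode constant to a
// readable name.
std::string rounding_mode_name(int mode)
{
    switch (mode) {
        case FE_TONEAREST:  return "nearest";
        case FE_DOWNWARD:   return "down";
        case FE_UPWARD:     return "up";
        case FE_TOWARDZERO: return "toward zero";
        default:            return "unknown";
    }
}

// Reports the rounding mode currently in effect for the calling thread.
std::string current_rounding_mode()
{
    return rounding_mode_name(std::fegetround());
}
```

Printing `current_rounding_mode()` as the first statement of the test program on both machines would show whether one of them is not in round-to-nearest by the time your code runs.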
Exactly how this happened can be very hard to diagnose; the control register is the worst possible global variable you can think of. The usual cause is an injected DLL that reprograms the FPU. A debugger can show the loaded DLLs, so compare the lists between the two machines to find a candidate.
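Until the culprit is found, a test can defend itself against a stray rounding mode by pinning it for the duration of a calculation. This is a sketch rather than a fix for the root cause: `ScopedRoundToNearest` is a hypothetical class name, and it assumes the standard `<cfenv>` functions are available in your toolchain.

```cpp
#include <cfenv>

// Hypothetical RAII guard: forces round-to-nearest while it is alive
// and restores whatever mode was in effect before on destruction.
class ScopedRoundToNearest {
public:
    ScopedRoundToNearest() : saved_(std::fegetround())
    {
        std::fesetround(FE_TONEAREST);
    }
    ~ScopedRoundToNearest() { std::fesetround(saved_); }

    // True if something had already changed the mode before this
    // guard was created, which is the symptom worth logging.
    bool was_modified() const { return saved_ != FE_TONEAREST; }

    ScopedRoundToNearest(const ScopedRoundToNearest&) = delete;
    ScopedRoundToNearest& operator=(const ScopedRoundToNearest&) = delete;

private:
    int saved_;  // rounding mode observed at construction
};
```

Creating a guard at the top of a test and logging `was_modified()` makes the test deterministic and also records on which machine the control register had been altered.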