I have some problems while porting some complex code to macOS/arm64 and ended up with the following trivial program to exhibit the different behavior w.r.t. macOS/x86_64 (using the native osx-arm64 clang version 14.0.6 from conda-forge, and cross-compiling for x86_64):
#include "assert.h"
#include "stdio.h"
int main()
{
double y[2] = {-0.01,0.9};
double r;
r = y[0]+0.03*y[1];
printf("r = %24.26e\n",r);
assert(r == 0.017);
}
The result on arm64 is
$ clang -arch arm64 test.c -o test; ./test
Assertion failed: (r == 0.017), function main, file test.c, line 9.
r = 1.69999999999999977517983751e-02
zsh: abort ./test
while the result on x86_64 is
$ clang -arch x86_64 test.c -o test; ./test
r = 1.70000000000000012212453271e-02
$
The test program has also been compiled and run on an actual x86_64 machine; it yields the same result as the x86_64 one above (cross-compiled on arm64 and run with Rosetta).
In fact, what matters is not that the arm64 result is not bitwise equal to 0.017 parsed and stored as an IEEE 754 number, but rather that the value of the expression differs between the two architectures.
In order to check for possibly different conventions (e.g. rounding mode), the following program has been compiled and run on both platforms:
#include <iostream>
#include <limits>
#define LOG(x) std::cout << #x " = " << x << '\n'
int main()
{
using l = std::numeric_limits<double>;
LOG(l::digits);
LOG(l::round_style);
LOG(l::epsilon());
LOG(l::min());
return 0;
}
It yields the same output on both platforms:
l::digits = 53
l::round_style = 1
l::epsilon() = 2.22045e-16
l::min() = 2.22507e-308
hence the problem seems to be elsewhere.
If it can help: under arm64 the result obtained with the expression is the same as the one obtained by calling refBLAS ddot with the vectors {1, 0.03} and y.
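For reference, here is a minimal sketch of what that ddot call computes (a simplification with unit strides and a made-up name, ddot_sketch; it is not the actual refBLAS source):

#include <stdio.h>

/* Simplified double-precision dot product in the spirit of refBLAS ddot
   (unit strides, no unrolling); not the actual Fortran implementation. */
static double ddot_sketch(int n, const double *x, const double *y)
{
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += x[i] * y[i];   /* the compiler may contract this into an fmadd on arm64 */
    return s;
}

int main(void)
{
    double x[2] = {1.0, 0.03};
    double y[2] = {-0.01, 0.9};
    printf("ddot = %.17e\n", ddot_sketch(2, x, y));
    return 0;
}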
The toolchain seems to be the cause. Using the default toolchain of macOS 11.6.1:
mottelet@portmottelet-cr-1 ~ % clang -v
Apple clang version 13.0.0 (clang-1300.0.29.30)
Target: arm64-apple-darwin20.6.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin
gives the same results for both architectures! So the problem seems to be in the actual toolchain I am using: version 1.5.2 of the conda package cxx-compiler (I need conda as a package manager because the application I am building has a lot of dependencies that conda provides).
Using -v shows a bunch of compilation flags; which one could be the culprit?
The results differ in the least significant bit due to different rounding given the compilers and architectures. You can use %a to see all of the bits in the double in hex. Then you get on arm64:
0x1.16872b020c49bp-6
and on x86_64:
0x1.16872b020c49cp-6
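For example, a small program to print both forms (same inputs as in the question; the %.17g line is only there because 17 significant digits are enough to round-trip a double):

#include <stdio.h>

int main(void)
{
    double y[2] = {-0.01, 0.9};
    double r = y[0] + 0.03 * y[1];
    printf("r = %a\n", r);      /* exact bits of the double as a hexadecimal float */
    printf("r = %.17g\n", r);   /* 17 significant digits round-trip a double */
    return 0;
}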
The IEEE 754 standard by itself does not guarantee exactly the same results across conforming implementations, in particular due to destination accuracy, decimal conversions, and instruction choices. Variations in the least significant bit, or more with multiple operations, can and should be expected.
In this case, the fmadd operation on the arm64 architecture is used, doing the multiply and add in a single operation with a single rounding. That gives a different result than the separate multiply and add XMM operations used on the x86_64 architecture.
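One way to test whether contraction is responsible is the standard FP_CONTRACT pragma, which asks the compiler not to fuse a*b+c into a single fma (a sketch; clang also exposes this as the -ffp-contract option, and honoring the pragma is up to the compiler):

#include <stdio.h>

#pragma STDC FP_CONTRACT OFF    /* forbid fusing a*b+c into a single fma */

int main(void)
{
    double y[2] = {-0.01, 0.9};
    double r = y[0] + 0.03 * y[1];
    printf("r = %a\n", r);      /* expected to match the separate multiply/add result */
    return 0;
}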
In the comments, Eric points out the C library function fma() to do a combined multiply-add. Indeed, if I use that call on the x86_64 architecture (as well as on arm64), I get the arm64 fmadd result.
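A small program that makes the comparison explicit (same inputs as in the question):

#include <math.h>
#include <stdio.h>

int main(void)
{
    double y[2] = {-0.01, 0.9};
    double separate = y[0] + 0.03 * y[1];     /* may or may not be contracted by the compiler */
    double fused    = fma(0.03, y[1], y[0]);  /* one rounding, like the arm64 fmadd */
    printf("separate = %a\n", separate);
    printf("fused    = %a\n", fused);
    return 0;
}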
You could potentially get different behavior on the same architecture if the compiler optimizes away the operation, as it may well do in this example since the operands are known at compile time. Then the compiler is doing the computation. The compiler could very well use separate multiply and add operations at compile time, giving a different result on arm64 than the fmadd operation would when it is not optimized out. Also, if you are cross-compiling, the optimized-out calculation could depend on the architecture of the machine you are compiling on, as opposed to the one you are running on.
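If you want to be sure the computation happens at run time while experimenting, one option is to make the inputs opaque to the optimizer, for example with volatile (a sketch):

#include <stdio.h>

int main(void)
{
    /* volatile forces the loads at run time, so the expression cannot be
       folded by the compiler; the generated instructions (separate mul/add,
       or a contracted fmadd) actually execute on the target */
    volatile double a = -0.01, b = 0.9;
    double r = a + 0.03 * b;
    printf("r = %a\n", r);
    return 0;
}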
Comparison for exact equality of floating point values is fraught with peril. Whenever you see yourself attempting that, you need to think more deeply about your intent.
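For instance, comparing against a tolerance instead of requiring bitwise equality (the bound below, a few ULPs, is only illustrative; an appropriate tolerance depends on the computation):

#include <float.h>
#include <math.h>
#include <stdio.h>

int main(void)
{
    double y[2] = {-0.01, 0.9};
    double r = y[0] + 0.03 * y[1];
    /* accept a few ULPs of error around 0.017 instead of exact equality */
    double tol = 4.0 * DBL_EPSILON * fabs(0.017);
    printf("%s\n", fabs(r - 0.017) <= tol ? "close enough" : "too far");
    return 0;
}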
It appears that clang behavior changed between 13.x and 14.x. When using -O, the result is computed at compile time and the target's floating point has nothing to do with it, so this is strictly a compiler issue.
Try on godbolt
The difference is easier to see in hex float output. clang 13 and earlier computes the value 0x1.16872b020c49cp-6, which is slightly greater than 0.017; clang 14 and later computes 0x1.16872b020c49bp-6, which is slightly less (the two differ by 1 in the least significant bit).
The same discrepancy exists between the two versions whether on arm64 or x86-64.
I am not sure offhand which one is better or worse. I guess you could git bisect if you really care, then look at the rationale for the corresponding commit and see whether it seems correct. For comparison, gcc in all versions tested gives the "old clang" value of 0x1.16872b020c49cp-6.