I found a problem with host/device floating-point consistency in OpenCL. The problem is that the floating-point values calculated by OpenCL do not lie within the same floating-point limits as those from my Visual Studio 2010 compiler when compiling for x86. However, when compiling for x64 they do lie within the same limits. I know it has to be something related to this: http://www.viva64.com/en/b/0074/
The source I used during testing was: http://www.codeproject.com/Articles/110685/Part-1-OpenCL-Portable-Parallelism
When I ran the program in x86 it found only 202 equal numbers, when the kernel and the C++ program each squared 1269760 numbers. However, in the 64-bit build all 1269760 numbers were right, in other words 100%.
Furthermore, I found that the error between the results calculated by OpenCL and by x86 C++ was 5.5385384e-014, which is a very small fraction but still not small enough compared to the epsilon of the number, which was 2.92212543378266922312416e-19.
That is because the error needs to be smaller than the epsilon for the program to recognize the two numbers as equal. Of course, normally one would never compare floats directly for equality, but it is good to know that the float limits differ. And yes, I tried to set flt:static, but got the same error.
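For clarity, by a tolerance-based comparison I mean something along these lines (a rough sketch with an arbitrary tolerance, not the code I actually used):

#include <math.h>
#include <stdio.h>

/* Relative-tolerance comparison of two floats. The 1e-6f tolerance used
 * below is arbitrary and only meant to illustrate the idea. */
static int nearly_equal(float a, float b, float rel_tol)
{
    float diff  = fabsf(a - b);
    float scale = fabsf(a) > fabsf(b) ? fabsf(a) : fabsf(b);
    return diff <= rel_tol * scale;
}

int main(void)
{
    float a = 0.11111111f;   /* two values differing only in the last bits */
    float b = 0.11111118f;

    printf("bitwise equal: %d\n", a == b);                     /* prints 0 */
    printf("nearly equal : %d\n", nearly_equal(a, b, 1e-6f));  /* prints 1 */
    return 0;
}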
So I want some sort of explanation for this behavior. Thanks in advance for all answers.
Since nothing changes in the GPU code as you switch your project from x86 to x64, it all comes down to how the multiplication is performed on the CPU. There are some subtle differences between floating-point handling in x86 and x64 mode, and the biggest one is that since any x64 CPU also supports SSE and SSE2, SSE is used by default for math operations in 64-bit mode on Windows.
The HD4770 GPU does all computations using single-precision floating-point units. Modern x64 CPUs, on the other hand, have two kinds of functional units that handle floating-point numbers:
- the x87 FPU, which performs computations internally with the much higher 80-bit extended precision;
- the SSE unit, which operates at the 32-bit or 64-bit precision of the operands.
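If you want to check which of the two your build targets, one option is a small sketch relying on MSVC-specific predefined macros:

#include <stdio.h>

/* Prints which floating-point code path the compiler targets.
 * _M_X64, _M_IX86 and _M_IX86_FP are MSVC predefined macros; on other
 * compilers the fallback message is printed. */
int main(void)
{
#if defined(_M_X64)
    printf("x64 build: SSE/SSE2 is used for floating-point math\n");
#elif defined(_M_IX86_FP) && (_M_IX86_FP >= 1)
    printf("x86 build with /arch:SSE or /arch:SSE2\n");
#elif defined(_M_IX86)
    printf("x86 build: classic x87 FPU code\n");
#else
    printf("not MSVC or unknown target\n");
#endif
    return 0;
}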
In 32-bit mode the compiler does not assume that SSE is available and generates the usual x87 FPU code to do the math. In this case operations like data[i] * data[i] are performed internally using the much higher 80-bit precision. A comparison of the kind if (results[i] == data[i] * data[i]) is performed as follows:
- data[i] is pushed onto the x87 FPU stack using FLD DWORD PTR data[i]
- data[i] * data[i] is computed using FMUL DWORD PTR data[i]
- result[i] is pushed onto the x87 FPU stack using FLD DWORD PTR result[i]
- both values are compared using FUCOMPP
Here comes the problem: data[i] * data[i] resides in an x87 FPU stack element in 80-bit precision, while result[i] comes from the GPU in 32-bit precision. The two numbers will most likely differ, since data[i] * data[i] has many more significant digits, whereas result[i] (viewed in 80-bit precision) has lots of trailing zeros!
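To see concretely why the extra precision matters, here is a small illustrative sketch (not part of the original code). long double stands in for the 80-bit x87 intermediate; even on MSVC, where long double is only 64 bits wide, it still carries more digits than a float, so the effect is visible:

#include <stdio.h>

int main(void)
{
    float f = 1.0f / 3.0f;   /* a value that is not exactly representable */

    long double wide   = (long double)f * f;   /* product kept at extended precision */
    float       narrow = f * f;                /* product rounded to 32-bit storage  */

    printf("wide   = %.25Lf\n", wide);
    printf("narrow = %.25f\n",  narrow);
    printf("equal? %s\n", ((long double)narrow == wide) ? "yes" : "no");   /* prints "no" */
    return 0;
}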
In 64-bit mode things happen in a different way. The compiler knows that your CPU is SSE capable and uses SSE instructions to do the math. The same comparison statement is performed in the following way on x64:
- data[i] is loaded into an SSE register using MOVSS XMM0, DWORD PTR data[i]
- data[i] * data[i] is computed using MULSS XMM0, DWORD PTR data[i]
- result[i] is loaded into another SSE register using MOVSS XMM1, DWORD PTR result[i]
- both values are compared using UCOMISS XMM1, XMM0
In this case the squaring is performed with the same 32-bit single precision as on the GPU. No intermediate results with 80-bit precision are generated, which is why the results are the same.
It is very easy to actually test this even without a GPU being involved. Just run the following simple program:
#include <stdlib.h>
#include <stdio.h>

/* Forces the product to be stored in (and hence rounded to) a 32-bit float
 * before it is returned. */
float mysqr(float f)
{
    f *= f;
    return f;
}

int main(void)
{
    int i, n;
    float f;

    srand(1);
    for (i = n = 0; n < 1000000; n++)
    {
        f = rand() / (float)RAND_MAX;
        if (mysqr(f) != f*f) i++;   /* rounded square vs. (possibly 80-bit) product */
    }
    printf("%d of %d squares differ\n", i, n);
    return 0;
}
mysqr is specifically written so that the intermediate 80-bit result gets converted to a 32-bit float before being returned. If you compile and run in 64-bit mode, the output is:
0 of 1000000 squares differ
If you compile and run in 32-bit mode, output is:
999845 of 1000000 squares differ
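For reference, one way to build the two variants from the command line (the file name here is just an example) is:

rem from a 32-bit Visual Studio 2010 command prompt:
cl sqrtest.c

rem from a 64-bit (x64) Visual Studio 2010 command prompt:
cl sqrtest.c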
In principle you should be able to change the floating-point model in 32-bit mode (Project Properties -> Configuration Properties -> C/C++ -> Code Generation -> Floating Point Model), but doing so changes nothing since, at least on VS2010, intermediate results are still kept in the FPU. What you can do is enforce a store and reload of the computed square so that it is rounded to 32-bit precision before it is compared with the result from the GPU. In the simple example above this is achieved by changing:
if (mysqr(f) != f*f) i++;
to
if (mysqr(f) != (float)(f*f)) i++;
After the change 32-bit code output becomes:
0 of 1000000 squares differ
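Applied to the host-side verification loop of the CodeProject sample, the same idea would look roughly like this (the array names, the counter and DATA_SIZE are assumed for illustration, not copied from the article):

/* Hypothetical host-side check; 'data', 'results' and DATA_SIZE only mirror
 * the naming style of the sample.  The explicit (float) cast forces the
 * CPU-side product to be rounded to 32-bit precision before the comparison. */
unsigned int correct = 0;
unsigned int i;
for (i = 0; i < DATA_SIZE; i++)
{
    if (results[i] == (float)(data[i] * data[i]))
        correct++;
}
printf("Computed %u correct values out of %u\n", correct, (unsigned int)DATA_SIZE);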