Division of floating point numbers on GPU different from that on CPU

Tags: cuda, gpu

When I divide two floating point numbers on the GPU, the result is 0.196405. When I divide them on the CPU, the result is 0.196404. The actual value, computed with a calculator, is 0.196404675. How do I make the division on the GPU and the CPU produce the same result?
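Here is a minimal sketch of the comparison I am doing (the operand values below are placeholders, not my real data):

    #include <cstdio>

    __global__ void divide(float a, float b, float *out)
    {
        *out = a / b;  // single-precision division on the device
    }

    int main()
    {
        float a = 0.57f, b = 2.9021f;  // placeholder operands

        float *d_out, gpu;
        cudaMalloc(&d_out, sizeof(float));
        divide<<<1, 1>>>(a, b, d_out);
        cudaMemcpy(&gpu, d_out, sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(d_out);

        printf("GPU: %.6f\nCPU: %.6f\n", gpu, a / b);  // may differ in the last digit
        return 0;
    }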

asked Dec 18 '12 by Programmer



3 Answers

As the comments to another answer suggest, there are many reasons why it is not realistic to expect the same results from floating point computations run on the CPU and GPU. It's much stronger than that: you can't assume that FP results will be the same when the same source code is compiled against a different target architecture (e.g. x86 or x64) or with different optimization levels, either.

In fact, if your code is multithreaded and the FP operations are performed in different orders from one run to the next, then the EXACT SAME EXECUTABLE running on the EXACT SAME SYSTEM may produce slightly different results from one run to the next.
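That effect is easy to reproduce with a float atomicAdd reduction (a minimal sketch; requires compute capability 2.0 or later): the order in which threads commit their additions is unspecified, so the rounded sum can vary from run to run on the same machine.

    #include <cstdio>

    __global__ void sum(const float *x, int n, float *out)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            atomicAdd(out, x[i]);  // commit order across threads is unspecified
    }

    int main()
    {
        const int n = 1 << 20;
        float *h = new float[n];
        for (int i = 0; i < n; ++i)
            h[i] = 1.0f / (i + 1);   // values of widely varying magnitude

        float *d_x, *d_out, result;
        cudaMalloc(&d_x, n * sizeof(float));
        cudaMalloc(&d_out, sizeof(float));
        cudaMemcpy(d_x, h, n * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemset(d_out, 0, sizeof(float));

        sum<<<(n + 255) / 256, 256>>>(d_x, n, d_out);
        cudaMemcpy(&result, d_out, sizeof(float), cudaMemcpyDeviceToHost);
        printf("%.9f\n", result);  // may vary slightly between runs

        cudaFree(d_x); cudaFree(d_out); delete[] h;
        return 0;
    }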

Some of the reasons include, but are not limited to:

  • floating point operations are not associative, so seemingly-benign reorderings (such as the race conditions from multithreading mentioned above) can change results (see the sketch after this list);
  • different architectures support different levels of precision and rounding under different conditions (e.g. via compiler flags, a control word, or per-instruction rounding modes);
  • different compilers interpret the language standards differently; and
  • some architectures support FMAD (fused multiply-add) and some do not (also shown in the sketch below).
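The first and last bullets are easy to demonstrate even on the host. A minimal sketch in plain C++ (compilable with g++ or nvcc; the constants are contrived purely to expose the rounding):

    #include <cstdio>
    #include <cmath>

    int main()
    {
        // Non-associativity: the small term c survives only if the two
        // large terms cancel first.
        float a = 1e8f, b = -1e8f, c = 1e-3f;
        printf("(a + b) + c = %.9g\n", (a + b) + c);  // ~0.001
        printf("a + (b + c) = %.9g\n", a + (b + c));  // 0: c was absorbed into b

        // Fused multiply-add: std::fmaf rounds once; the separate
        // multiply rounds the product before the add.
        float x = 1.000244140625f;   // 1 + 2^-12, exactly representable
        float z = -1.00048828125f;   // -(1 + 2^-11), exactly representable
        float p = x * x;             // product rounded to float here
        printf("p + z        = %.9g\n", p + z);              // 0
        printf("fmaf(x,x,z)  = %.9g\n", std::fmaf(x, x, z)); // 2^-24 ~ 5.96e-08
        return 0;
    }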

Note that for purposes of this discussion, the JIT compilers for CUDA (the magic that enables PTX code to be future-proof to GPU architectures that are not yet available) certainly should be expected to perturb FP results.

You have to write FP code that is robust despite the foregoing.

As I write this today, I believe that CUDA GPUs have a much better-designed architecture for floating point arithmetic than any contemporary CPU. GPUs include native IEEE 754 (2008) support for 16-bit floats and FMAD, have full-speed support for denormals, and enable rounding control on a per-instruction basis rather than through control words whose settings have side effects on all FP instructions and are expensive to change.
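For instance, CUDA exposes that per-instruction rounding control through device intrinsics. A minimal sketch (the __fdiv_* intrinsics are documented CUDA device functions; the 1/3 operands are just an illustration):

    #include <cstdio>

    __global__ void div_rounding(float a, float b, float *out)
    {
        // Each intrinsic selects a rounding mode for this one operation;
        // no global control word is touched.
        out[0] = __fdiv_rn(a, b);  // round to nearest even (IEEE default)
        out[1] = __fdiv_rz(a, b);  // round toward zero
        out[2] = __fdiv_ru(a, b);  // round toward +infinity
        out[3] = __fdiv_rd(a, b);  // round toward -infinity
    }

    int main()
    {
        float *d, h[4];
        cudaMalloc(&d, sizeof(h));
        div_rounding<<<1, 1>>>(1.0f, 3.0f, d);
        cudaMemcpy(h, d, sizeof(h), cudaMemcpyDeviceToHost);
        cudaFree(d);
        for (int i = 0; i < 4; ++i)
            printf("%.9f\n", h[i]);  // the four roundings of 1/3
        return 0;
    }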

In contrast, CPUs have an excess of per-thread state and poor performance except when using SIMD instructions, which mainstream compilers are terrible at exploiting for performance (since vectorizing scalar C code to take advantage of such instruction sets is much more difficult than building a compiler for a pseudo-scalar architecture such as CUDA). And if the Wikipedia history page is to be believed, Intel and AMD appear to have completely botched the addition of FMAD support in a way that defies description.

You can find an excellent discussion of floating point precision and IEEE support in NVIDIA GPUs here:

https://developer.nvidia.com/content/precision-performance-floating-point-and-ieee-754-compliance-nvidia-gpus

answered Oct 23 '22 by ArchaeaSoftware


You don't. You should never assume that floating point values will be exactly equal to what you expect after mathematical operations. They are only defined to be correct to a specified precision, and results will vary slightly from processor to processor, regardless of whether that processor is a CPU or a GPU. An x86 processor using the legacy x87 FPU, for instance, will do floating point computations with 80 bits of internal precision and then round the result to the requested precision. Equivalence comparisons for floating point numbers should always use a tolerance, since no guarantee can be made that any two processors (or even the same processor, through different but mathematically equivalent sequences of instructions) will produce the same result. E.g. floating-point numbers a and b should be considered equal if and only if |a - b| < t for some tolerance t, as in the sketch below.
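In code, that comparison might look like this (a minimal sketch; the helper name and tolerance value are illustrative, not from any particular library):

    #include <cmath>

    // Absolute-tolerance comparison: treat a and b as equal when they
    // differ by less than t. A relative tolerance, scaled by the
    // magnitudes involved, is often better when values vary widely.
    bool nearly_equal(float a, float b, float t)
    {
        return std::fabs(a - b) < t;
    }

    // e.g. nearly_equal(0.196405f, 0.196404f, 1e-5f) returns true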

answered Oct 23 '22 by reirab


Which GPU are you using for the computation?

With single-precision floating point you will normally see a difference of ±1 in about the sixth significant digit; this comes from rounding error on the GPU.

If you use double precision, you will usually match the CPU result to far more digits, though exact equality is still not guaranteed. The trade-off is that double-precision throughput is at best about half that of single precision, and memory usage doubles. NVIDIA GPUs have supported double-precision computation since compute capability 1.3, and it became substantially faster with the Fermi architecture.
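A minimal double-precision sketch (must be compiled for a device of compute capability 1.3 or higher; the operands are placeholders):

    #include <cstdio>

    __global__ void divide(double a, double b, double *out)
    {
        *out = a / b;  // IEEE 754 double-precision division on the device
    }

    int main()
    {
        double *d, gpu;
        cudaMalloc(&d, sizeof(double));
        divide<<<1, 1>>>(1.0, 5.0915, d);   // placeholder operands
        cudaMemcpy(&gpu, d, sizeof(double), cudaMemcpyDeviceToHost);
        cudaFree(d);
        printf("GPU: %.15f\nCPU: %.15f\n", gpu, 1.0 / 5.0915);
        return 0;
    }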

answered Oct 23 '22 by Sijo