Matrix multiplication on CPU (numpy) and GPU (gnumpy) give different results

I'm using gnumpy to speed up some computations in training a neural network by doing them on the GPU. I'm getting the desired speed-up, but I'm a little worried about the differences between the results of numpy (CPU) and gnumpy (GPU).

I have the following test script to illustrate the problem:

import gnumpy as gpu
import numpy as np

n = 400

a = np.random.uniform(low=0., high=1., size=(n, n)).astype(np.float32)
b = np.random.uniform(low=0., high=1., size=(n, n)).astype(np.float32)

ga = gpu.garray(a)
gb = gpu.garray(b)

ga = ga.dot(gb)
a  = a.dot(b)

print ga.as_numpy_array(dtype=np.float32) - a

which provides the output:

[[  1.52587891e-05  -2.28881836e-05   2.28881836e-05 ...,  -1.52587891e-05
    3.81469727e-05   1.52587891e-05]
 [ -5.34057617e-05  -1.52587891e-05   0.00000000e+00 ...,   1.52587891e-05
    0.00000000e+00   1.52587891e-05]
 [ -1.52587891e-05  -2.28881836e-05   5.34057617e-05 ...,   2.28881836e-05
    0.00000000e+00  -7.62939453e-06]
 ..., 
 [  0.00000000e+00   1.52587891e-05   3.81469727e-05 ...,   3.05175781e-05
    0.00000000e+00  -2.28881836e-05]
 [  7.62939453e-06  -7.62939453e-06  -2.28881836e-05 ...,   1.52587891e-05
    7.62939453e-06   1.52587891e-05]
 [  1.52587891e-05   7.62939453e-06   2.28881836e-05 ...,  -1.52587891e-05
    7.62939453e-06   3.05175781e-05]]

As you can see, the differences are on the order of 10^-5.

So the question is: should I be worried about these differences or is this the expected behaviour?
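For a rough sense of scale: each entry of the product is a sum of 400 terms and is around 100 in size, so an absolute difference of about 1e-5 is a relative difference of roughly 1e-7, close to the float32 machine epsilon (about 1.19e-7). A quick sketch (not part of the original test script) that compares the float32 product against a float64 reference of the same inputs:

import numpy as np

n = 400
a = np.random.uniform(low=0., high=1., size=(n, n)).astype(np.float32)
b = np.random.uniform(low=0., high=1., size=(n, n)).astype(np.float32)

c32 = a.dot(b)                                        # float32 product, as in the test script
c64 = a.astype(np.float64).dot(b.astype(np.float64))  # float64 reference of the same inputs

rel_err = np.max(np.abs(c32 - c64) / np.abs(c64))
print "float32 machine epsilon:", np.finfo(np.float32).eps
print "max relative error vs float64:", rel_err

If the GPU result is about as close to a float64 reference as the CPU result is, the 1e-5 absolute gap is just the two libraries rounding differently within float32, not one of them being wrong.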

Additional information:

  • GPU: GeForce GTX 770
  • numpy version: 1.6.1

I noticed the problem when I used gradient checking (with a finite-difference approximation) to verify that the small modifications I made when switching from numpy to gnumpy didn't break anything. As one might expect, gradient checking did not work at 32-bit precision (gnumpy does not support float64), but to my surprise the errors differed between CPU and GPU at the same precision.
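For context, the check itself is the usual central-difference comparison; a minimal sketch of the idea (with a placeholder loss function rather than my actual network):

import numpy as np

def loss(w):
    # placeholder loss; in practice this is the network's cost function
    return np.sum(np.sin(w) ** 2)

def analytic_grad(w):
    # hand-derived gradient of the placeholder loss
    return 2.0 * np.sin(w) * np.cos(w)

w = np.random.randn(5).astype(np.float32)
eps = 1e-2  # too small a step and the difference drowns in float32 rounding error

num_grad = np.zeros_like(w)
for i in range(w.size):
    w_plus, w_minus = w.copy(), w.copy()
    w_plus[i] += eps
    w_minus[i] -= eps
    num_grad[i] = (loss(w_plus) - loss(w_minus)) / (2 * eps)

denom = np.maximum(np.abs(num_grad) + np.abs(analytic_grad(w)), 1e-8)
print np.abs(num_grad - analytic_grad(w)) / denom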

The errors on CPU and GPU for a small test neural network are shown below:

[image: gradient checking errors]

Since the error magnitudes are similar, I guess that these differences are OK?

After reading the article referenced in BenC's comment, I'm quite sure that the differences can mostly be explained by one device using the fused multiply-add (FMA) instruction and the other not.

I implemented the example from the paper:

import gnumpy as gpu
import numpy as np

a = np.array([1.907607, -0.7862027, 1.147311, 0.9604002], dtype=np.float32)
b = np.array([-0.9355000, -0.6915108, 1.724470, -0.7097529], dtype=np.float32)

ga = gpu.garray(a)
gb = gpu.garray(b)

ga = ga.dot(gb)
a  = a.dot(b)

print "CPU", a
print "GPU", ga
print "DIFF", ga - a

>>>CPU 0.0559577
>>>GPU 0.0559577569366
>>>DIFF 8.19563865662e-08

...and the difference is similar to the one between the FMA and the serial algorithm in the paper (though for some reason both results differ from the exact result by more than in the paper).

The GPU I'm using (GeForce GTX 770) supports the FMA instruction, while the CPU does not (I have an Ivy Bridge Intel Xeon E3-1225 V2; Intel only introduced FMA3 with Haswell).

Other possible explanations include the different math libraries used in the background or differences in the sequence of operations caused by, for example, the different level of parallelization on CPU vs GPU.
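The order-of-operations effect is easy to reproduce on the CPU alone. In the sketch below (an illustration, not what either library actually does internally), the very same float32 inputs give slightly different sums depending only on how the additions are grouped:

import numpy as np

x = np.random.uniform(size=100000).astype(np.float32)

# naive left-to-right accumulation in float32
seq = np.float32(0.0)
for v in x:
    seq = np.float32(seq + v)

# numpy's own float32 reduction, which groups the additions differently
vec = x.sum(dtype=np.float32)

# higher-precision reference
ref = x.sum(dtype=np.float64)

print "sequential float32:", seq
print "numpy float32 sum :", vec
print "float64 reference :", ref

Both float32 results are correct to within rounding, yet they differ from each other for the same reason the CPU and GPU dot products do.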

asked Jan 09 '14 by Ottokar

2 Answers

I would recommend using np.allclose for testing whether two float arrays are nearly equal.

Whereas you are only looking at the absolute difference between the values in your two result arrays, np.allclose also considers their relative differences. Suppose, for example, that the values in your input arrays were 1000x greater: the absolute differences between the two results would also be 1000x greater, but that wouldn't mean the two dot products were any less precise.

np.allclose will return True only if the following condition is met for every corresponding pair of elements in your two test arrays, a and b:

abs(a - b) <= (atol + rtol * abs(b))

By default, rtol=1e-5 and atol=1e-8. These tolerances are a good 'rule of thumb', but whether they are small enough in your case will depend on your particular application. For example, if you're dealing with values < 1e-8, then an absolute difference of 1e-8 would be a total disaster!

If you try calling np.allclose on your two results with the default tolerances, you'll find that np.allclose returns True. My guess, then, is that these differences are probably small enough that they're not worth worrying about. It really depends on what you're doing with the results.
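For example, applied to the two results from your test script (a sketch reusing your variable names):

import gnumpy as gpu
import numpy as np

n = 400
a = np.random.uniform(low=0., high=1., size=(n, n)).astype(np.float32)
b = np.random.uniform(low=0., high=1., size=(n, n)).astype(np.float32)

c_cpu = a.dot(b)
c_gpu = gpu.garray(a).dot(gpu.garray(b)).as_numpy_array(dtype=np.float32)

# default tolerances: rtol=1e-5, atol=1e-8
print np.allclose(c_cpu, c_gpu)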

answered Sep 24 '22 by ali_m


The RTX cards do floating-point math at half precision because it's faster for image rendering. You must tell the GPU to use full precision when multiplying floating-point values for AI. Precision is extremely important when doing AI.

I experienced the same floating-point difference you did when trying to use CUDA with an RTX 2080 Ti.
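How you request full precision depends on the framework; as one hedged illustration (PyTorch is an assumption on my part, it is not mentioned in the question), recent PyTorch versions expose switches that force full-precision float32 matmuls instead of the faster reduced-precision paths:

import torch

# force full float32 matmuls instead of the faster reduced-precision path
torch.backends.cuda.matmul.allow_tf32 = False   # TF32 matmuls (Ampere and newer cards)
torch.backends.cudnn.allow_tf32 = False         # TF32 inside cuDNN convolutions
torch.set_float32_matmul_precision("highest")   # newer-style equivalent switch

a = torch.rand(400, 400, device="cuda")
b = torch.rand(400, 400, device="cuda")
c = a @ b  # computed in full float32 precision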

answered Sep 21 '22 by Alex Bedro