Matrix multiplication on CPU (numpy) and GPU (gnumpy) give different results

I'm using gnumpy to speed up some computations in training a neural network by doing them on the GPU. I'm getting the desired speed-up, but I'm a little worried about the differences between the results of numpy (CPU) and gnumpy (GPU).

I have the following test script to illustrate the problem:

import gnumpy as gpu
import numpy as np

n = 400

a = np.random.uniform(low=0., high=1., size=(n, n)).astype(np.float32)
b = np.random.uniform(low=0., high=1., size=(n, n)).astype(np.float32)

ga = gpu.garray(a)
gb = gpu.garray(b)

ga = ga.dot(gb)
a  = a.dot(b)

print ga.as_numpy_array(dtype=np.float32) - a

which provides the output:

[[  1.52587891e-05  -2.28881836e-05   2.28881836e-05 ...,  -1.52587891e-05
    3.81469727e-05   1.52587891e-05]
 [ -5.34057617e-05  -1.52587891e-05   0.00000000e+00 ...,   1.52587891e-05
    0.00000000e+00   1.52587891e-05]
 [ -1.52587891e-05  -2.28881836e-05   5.34057617e-05 ...,   2.28881836e-05
    0.00000000e+00  -7.62939453e-06]
 ..., 
 [  0.00000000e+00   1.52587891e-05   3.81469727e-05 ...,   3.05175781e-05
    0.00000000e+00  -2.28881836e-05]
 [  7.62939453e-06  -7.62939453e-06  -2.28881836e-05 ...,   1.52587891e-05
    7.62939453e-06   1.52587891e-05]
 [  1.52587891e-05   7.62939453e-06   2.28881836e-05 ...,  -1.52587891e-05
    7.62939453e-06   3.05175781e-05]]

As you can see, the differences are on the order of 10^-5.

So the question is: should I be worried about these differences or is this the expected behaviour?
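For a rough sense of scale: each entry of the product is a sum of 400 terms and is around 100 in size, so an absolute difference of about 1e-5 is a relative difference of roughly 1e-7, close to the float32 machine epsilon (about 1.19e-7). A quick sketch (not part of the original test script) that compares the float32 product against a float64 reference of the same inputs:

import numpy as np

n = 400
a = np.random.uniform(low=0., high=1., size=(n, n)).astype(np.float32)
b = np.random.uniform(low=0., high=1., size=(n, n)).astype(np.float32)

c32 = a.dot(b)                                        # float32 product, as in the test script
c64 = a.astype(np.float64).dot(b.astype(np.float64))  # float64 reference of the same inputs

rel_err = np.max(np.abs(c32 - c64) / np.abs(c64))
print "float32 machine epsilon:", np.finfo(np.float32).eps
print "max relative error vs float64:", rel_err

If the GPU result is about as close to a float64 reference as the CPU result is, the 1e-5 absolute gap is just the two libraries rounding differently within float32, not one of them being wrong.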

Additional information:

  • GPU: GeForce GTX 770
  • numpy version: 1.6.1

I noticed the problem when I used gradient checking (with a finite-difference approximation) to verify that the small modifications I made when switching from numpy to gnumpy didn't break anything. As one might expect, gradient checking did not work at 32-bit precision (gnumpy does not support float64), but to my surprise the errors differed between CPU and GPU at the same precision.
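For context, the check itself is the usual central-difference comparison; a minimal sketch of the idea (with a placeholder loss function rather than my actual network):

import numpy as np

def loss(w):
    # placeholder loss; in practice this is the network's cost function
    return np.sum(np.sin(w) ** 2)

def analytic_grad(w):
    # hand-derived gradient of the placeholder loss
    return 2.0 * np.sin(w) * np.cos(w)

w = np.random.randn(5).astype(np.float32)
eps = 1e-2  # too small a step and the difference drowns in float32 rounding error

num_grad = np.zeros_like(w)
for i in range(w.size):
    w_plus, w_minus = w.copy(), w.copy()
    w_plus[i] += eps
    w_minus[i] -= eps
    num_grad[i] = (loss(w_plus) - loss(w_minus)) / (2 * eps)

denom = np.maximum(np.abs(num_grad) + np.abs(analytic_grad(w)), 1e-8)
print np.abs(num_grad - analytic_grad(w)) / denom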

The errors on CPU and GPU for a small test neural network are shown below:

[image: gradient checking errors]

Since the error magnitudes are similar, I guess that these differences are OK?

After reading the article referenced in BenC's comment, I'm quite sure that the differences can mostly be explained by one device using the fused multiply-add (FMA) instruction and the other not.

I implemented the example from the paper:

import gnumpy as gpu
import numpy as np

a = np.array([1.907607, -0.7862027, 1.147311, 0.9604002], dtype=np.float32)
b = np.array([-0.9355000, -0.6915108, 1.724470, -0.7097529], dtype=np.float32)

ga = gpu.garray(a)
gb = gpu.garray(b)

ga = ga.dot(gb)
a  = a.dot(b)

print "CPU", a
print "GPU", ga
print "DIFF", ga - a

>>>CPU 0.0559577
>>>GPU 0.0559577569366
>>>DIFF 8.19563865662e-08

...and the difference is similar to the one between the FMA and the serial algorithm in the paper (though for some reason both results differ from the exact result by more than in the paper).

The GPU I'm using (GeForce GTX 770) supports the FMA instruction, while the CPU does not (I have an Ivy Bridge Intel Xeon E3-1225 V2; Intel only introduced FMA3 with Haswell).

Other possible explanations include the different math libraries used in the background or differences in the sequence of operations caused by, for example, the different level of parallelization on CPU vs GPU.
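The order-of-operations effect is easy to reproduce on the CPU alone. In the sketch below (an illustration, not what either library actually does internally), the very same float32 inputs give slightly different sums depending only on how the additions are grouped:

import numpy as np

x = np.random.uniform(size=100000).astype(np.float32)

# naive left-to-right accumulation in float32
seq = np.float32(0.0)
for v in x:
    seq = np.float32(seq + v)

# numpy's own float32 reduction, which groups the additions differently
vec = x.sum(dtype=np.float32)

# higher-precision reference
ref = x.sum(dtype=np.float64)

print "sequential float32:", seq
print "numpy float32 sum :", vec
print "float64 reference :", ref

Both float32 results are correct to within rounding, yet they differ from each other for the same reason the CPU and GPU dot products do.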

asked Jan 09 '14 by Ottokar

2 Answers

I would recommend using np.allclose for testing whether two float arrays are nearly equal.

Whereas you are only looking at the absolute difference between the values in your two result arrays, np.allclose also considers their relative differences. Suppose, for example, that the values in your input arrays were 1000x greater: the absolute differences between the two results would also be 1000x greater, but that wouldn't mean the two dot products were any less precise.

np.allclose will return True only if the following condition is met for every corresponding pair of elements in your two test arrays, a and b:

abs(a - b) <= (atol + rtol * abs(b))

By default, rtol=1e-5 and atol=1e-8. These tolerances are a good 'rule of thumb', but whether they are small enough in your case will depend on your particular application. For example, if you're dealing with values < 1e-8, then an absolute difference of 1e-8 would be a total disaster!

If you try calling np.allclose on your two results with the default tolerances, you'll find that np.allclose returns True. My guess, then, is that these differences are probably small enough that they're not worth worrying about. It really depends on what you're doing with the results.
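For example, applied to the two results from your test script (a sketch reusing your variable names):

import gnumpy as gpu
import numpy as np

n = 400
a = np.random.uniform(low=0., high=1., size=(n, n)).astype(np.float32)
b = np.random.uniform(low=0., high=1., size=(n, n)).astype(np.float32)

c_cpu = a.dot(b)
c_gpu = gpu.garray(a).dot(gpu.garray(b)).as_numpy_array(dtype=np.float32)

# default tolerances: rtol=1e-5, atol=1e-8
print np.allclose(c_cpu, c_gpu)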

answered Sep 24 '22 by ali_m


The RTX cards do floating-point math at half precision because it's faster for image rendering. You must tell the GPU to use full precision when multiplying floating-point values for AI. Precision is extremely important when doing AI.

I experienced the same floating-point difference you did when trying to use CUDA with an RTX 2080 Ti.
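How you request full precision depends on the framework; as one hedged illustration (PyTorch is an assumption on my part, it is not mentioned in the question), recent PyTorch versions expose switches that force full-precision float32 matmuls instead of the faster reduced-precision paths:

import torch

# force full float32 matmuls instead of the faster reduced-precision path
torch.backends.cuda.matmul.allow_tf32 = False   # TF32 matmuls (Ampere and newer cards)
torch.backends.cudnn.allow_tf32 = False         # TF32 inside cuDNN convolutions
torch.set_float32_matmul_precision("highest")   # newer-style equivalent switch

a = torch.rand(400, 400, device="cuda")
b = torch.rand(400, 400, device="cuda")
c = a @ b  # computed in full float32 precision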

answered Sep 21 '22 by Alex Bedro