 

SIMD matmul program gives different numerical results

I am trying to program matrix multiplication in C using SIMD intrinsics. I was pretty sure of my implementation, but when I execute it, I get some numerical errors starting from the 5th digit of the resulting matrix's coefficients.

REAL_T is just a float, defined with a typedef.

/* This is my matmul version with SIMD, using single-precision floats */
void matmul(int n, REAL_T *A, REAL_T *B, REAL_T *C){
  int i,j,k;
  __m256 vA, vB, vC, vRes;
  for (i=0; i<n; i++){
    for (j=0; j<n; j++){  
      for (k=0; k<n; k= k+8){
        vA = _mm256_load_ps(&A[i*n+k]);
        vB = _mm256_loadu_ps(&B[k*n+j]);
        vC = _mm256_mul_ps(vA, vB);
        vC = _mm256_hadd_ps(vC, vC);
        vC = _mm256_hadd_ps(vC, vC);
        /* To get the resulting coefficient after the 2 hadds,
        I have to take the first and the last element of the
        resulting vector vC */
        C[i*n+j] += ((float)(vC[0])) + ((float)(vC[7]));
      } /* for k */
    } /* for j */
  } /* for i */
}
/* End of program */
/* And this is the sequential version */
void matmul(int n, REAL_T *A, REAL_T *B, REAL_T *C){
  int i,j,k;
  for (i=0; i<n; i++){ 
    for (j=0; j<n; j++){
      for (k=0; k<n; k++){
        C[i*n+j] +=  A[i*n+k] *  B[k*n+j];  
      } /* for k */
    } /* for j */
  } /* for i */  
}
/*End of program*/
/* The matrices are initialized as follows */
  for (i = 0; i < n; i++)
    for (j = 0; j < n; j++){
      *(A+i*n+j) = 1 / ((REAL_T) (i+j+1));
      *(B+i*n+j) = 1.0;
      *(C+i*n+j) = 1.0;
    }
/*End of initialization*/

The tested matrices are of size 512*512. For the sequential version, the top-left square of the resulting matrix gives:

+6.916512e+01  +6.916512e+01  
+5.918460e+01  +5.918460e+01  

+7.946186e+00  +7.946186e+00  
+7.936391e+00  +7.936391e+00  

However, for the simd version, the square is:

+6.916510e+01  +6.916510e+01  
+5.918463e+01  +5.918463e+01  

+7.946147e+00  +7.946147e+00  
+7.936355e+00  +7.936355e+00 

As shown, there is a numerical discrepancy between the two versions. Any help would be really appreciated!

asked Apr 02 '19 by abdul rahman taleb


1 Answer

This looks normal; adding numbers in a different order produces different rounding in the temporaries.

FP math is not associative; optimizing as if it were will change the results.¹ See Is floating point addition and multiplication associative? / Are floating point operations in C associative?
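
As a concrete illustration (a minimal standalone C demo of my own, not code from the original answer), grouping the same three floats differently already changes the result:

#include <stdio.h>

int main(void) {
    float a = 1e8f, b = -1e8f, c = 1.0f;
    /* (a + b) cancels exactly to 0, so c survives. */
    printf("%g\n", (a + b) + c);  /* prints 1 */
    /* (b + c) rounds back to -1e8f: c is smaller than one ulp of b. */
    printf("%g\n", a + (b + c));  /* prints 0 */
    return 0;
}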

The amount of change depends on the data. Differences only in the 5th decimal place seem reasonable for float.


Unless you were taking special numerical precautions like adding up the small numbers first, the sequential-order result isn't "more correct"; the two versions just have different errors.

In fact, using multiple accumulators generally increases precision for large lists, assuming your numbers all have similar magnitude. (Ideally use multiple SIMD vectors, each composed of multiple elements, to hide FP-add or FMA latency.) https://en.wikipedia.org/wiki/Pairwise_summation is a numerical technique that takes this to the next level: summing subsets of the list in a tree, to avoid adding single array elements to a much larger value. See, for example, How to avoid less precise sum for numpy-arrays with multiple columns.
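
For instance, a minimal sketch of pairwise summation in plain C (illustrative only, not part of the original answer; the function name and base-case size are mine):

/* Pairwise summation sketch: split the range in half recursively so
   that no single accumulator grows much larger than the terms added
   into it. Error grows like O(log N) instead of O(N). */
float pairwise_sum(const float *a, int n) {
    if (n <= 8) {               /* small base case: plain serial loop */
        float s = 0.0f;
        for (int i = 0; i < n; i++) s += a[i];
        return s;
    }
    int half = n / 2;
    return pairwise_sum(a, half) + pairwise_sum(a + half, n - half);
}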

Using a fixed number of accumulators (e.g. 8x __m256 = 64 float accumulators) might reduce expected error by a factor of 64, instead of from N to log N for full pairwise summation.
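
For example, here is a hedged sketch of that idea with 4 __m256 accumulators (32 partial float sums); the function name and the assumption that n is a multiple of 32 are mine, not from the original answer:

#include <immintrin.h>

/* Sum a float array with 4 independent __m256 accumulators (32 partial
   sums) to hide FP-add latency and spread rounding error across lanes.
   Assumes n is a multiple of 32. */
float sum_multi_acc(const float *a, int n) {
    __m256 acc0 = _mm256_setzero_ps();
    __m256 acc1 = _mm256_setzero_ps();
    __m256 acc2 = _mm256_setzero_ps();
    __m256 acc3 = _mm256_setzero_ps();
    for (int i = 0; i < n; i += 32) {
        acc0 = _mm256_add_ps(acc0, _mm256_loadu_ps(a + i));
        acc1 = _mm256_add_ps(acc1, _mm256_loadu_ps(a + i + 8));
        acc2 = _mm256_add_ps(acc2, _mm256_loadu_ps(a + i + 16));
        acc3 = _mm256_add_ps(acc3, _mm256_loadu_ps(a + i + 24));
    }
    /* Combine the accumulators pairwise, then reduce once at the end. */
    __m256 acc = _mm256_add_ps(_mm256_add_ps(acc0, acc1),
                               _mm256_add_ps(acc2, acc3));
    __m128 lo = _mm256_castps256_ps128(acc);
    __m128 hi = _mm256_extractf128_ps(acc, 1);
    __m128 s  = _mm_add_ps(lo, hi);                 /* 8 -> 4 */
    s = _mm_add_ps(s, _mm_movehl_ps(s, s));         /* 4 -> 2 */
    s = _mm_add_ss(s, _mm_shuffle_ps(s, s, 0x55));  /* 2 -> 1 */
    return _mm_cvtss_f32(s);
}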


Footnote 1: Associativity is necessary for parallelization, SIMD, and multiple accumulators. See Associativity gives us parallelizability. But what does commutativity give?

On a machine with, for example, 4-cycle-latency, 2-per-clock-throughput FMA and a SIMD width of 8 floats (i.e. a Skylake system with AVX2), the potential speedup is 4*2 = 8 from multiple accumulators, times 8 from SIMD width, times the number of cores, vs. a pure sequential version, even for problems where the result might be less accurate instead of just different.

Most people consider a factor of 8*8 = 64 worth it! (And you can in theory also parallelize for another factor of maybe 4 on a quad-core, assuming perfect scaling for large matrices).

You're already using float instead of double for performance.

See also Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? for more about using multiple accumulators to hide FMA latency in a reduction, exposing that other factor of 8 speedup.

Also, do not use hadd inside an inner-most loop. Sum vertically, and use an efficient reduction at the end of the loop. (Fastest way to do horizontal float vector sum on x86). You really, really want to avoid having the compiler extract your vectors to scalar at every step; that defeats most of the benefit of SIMD! Besides that, hadd is not worth using even for the horizontal sum of a single vector: it costs 2 shuffles + a regular add on all existing CPUs.
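
To make that concrete, here is a minimal sketch of the inner loop restructured that way (my illustration, not code from the answer). It assumes a pre-transposed copy of B, here called Bt with Bt[j*n+k] == B[k*n+j], so that both operands are contiguous; it also assumes n is a multiple of 8 and FMA support (AVX2-era hardware, e.g. compile with -mfma):

#include <immintrin.h>

/* Vertical accumulation: one vector FMA per 8 elements of the dot
   product, and a single horizontal reduction per C element. */
void matmul_vertical(int n, const float *A, const float *Bt, float *C) {
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            __m256 vsum = _mm256_setzero_ps();
            for (int k = 0; k < n; k += 8) {
                __m256 vA = _mm256_loadu_ps(&A[i*n + k]);
                __m256 vB = _mm256_loadu_ps(&Bt[j*n + k]);
                vsum = _mm256_fmadd_ps(vA, vB, vsum);  /* no hadd here */
            }
            /* Horizontal sum once, outside the k loop. */
            __m128 lo = _mm256_castps256_ps128(vsum);
            __m128 hi = _mm256_extractf128_ps(vsum, 1);
            __m128 s  = _mm_add_ps(lo, hi);
            s = _mm_add_ps(s, _mm_movehl_ps(s, s));
            s = _mm_add_ss(s, _mm_shuffle_ps(s, s, 0x55));
            C[i*n + j] += _mm_cvtss_f32(s);
        }
    }
}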

answered Nov 15 '22 by Peter Cordes