How do I perform an 8 x 8 matrix operation using SSE?

My initial attempt looked like this (suppose we want to multiply):

  __m128 mat[n]; /* rows */
  __m128 vec[n] = {1,1,1,1};
  float outvector[n];
   for (int row=0;row<n;row++) {
       for(int k =3; k < 8; k = k+ 4)
       {
           __m128 mrow = mat[k];
           __m128 v = vec[row];
           __m128 sum = _mm_mul_ps(mrow,v);
           sum= _mm_hadd_ps(sum,sum); /* adds adjacent-two floats */
       }
           _mm_store_ss(&outvector[row],_mm_hadd_ps(sum,sum));
 }

But this clearly doesn't work. How do I approach this?

I should load 4 floats at a time...

The other question is: if my array is very big (say n = 1000), how can I make it 16-byte aligned? Is that even possible?

user1012451 asked Nov 27 '11 13:11


1 Answer

OK, I'll use a row-major matrix convention. Each row of [m] needs two __m128 elements to hold its 8 floats. The 8x1 vector v is a column vector, also stored as two __m128 elements. Since you're using the haddps instruction, I'll assume SSE3 is available. To compute r = [m] * v:

#include <pmmintrin.h> /* SSE3: _mm_hadd_ps */

/* r = [m] * v, where m[i][0] holds columns 0-3 of row i and m[i][1] holds columns 4-7. */
void mul (__m128 r[2], const __m128 m[8][2], const __m128 v[2])
{
    __m128 t0, t1, t2, t3, r0, r1, r2, r3;

    /* rows 0-3, columns 0-3: element-wise products */
    t0 = _mm_mul_ps(m[0][0], v[0]);
    t1 = _mm_mul_ps(m[1][0], v[0]);
    t2 = _mm_mul_ps(m[2][0], v[0]);
    t3 = _mm_mul_ps(m[3][0], v[0]);

    /* two levels of hadd leave one partial dot product per lane */
    t0 = _mm_hadd_ps(t0, t1);
    t2 = _mm_hadd_ps(t2, t3);
    r0 = _mm_hadd_ps(t0, t2);

    /* rows 0-3, columns 4-7 */
    t0 = _mm_mul_ps(m[0][1], v[1]);
    t1 = _mm_mul_ps(m[1][1], v[1]);
    t2 = _mm_mul_ps(m[2][1], v[1]);
    t3 = _mm_mul_ps(m[3][1], v[1]);

    t0 = _mm_hadd_ps(t0, t1);
    t2 = _mm_hadd_ps(t2, t3);
    r1 = _mm_hadd_ps(t0, t2);

    /* rows 4-7, columns 0-3 */
    t0 = _mm_mul_ps(m[4][0], v[0]);
    t1 = _mm_mul_ps(m[5][0], v[0]);
    t2 = _mm_mul_ps(m[6][0], v[0]);
    t3 = _mm_mul_ps(m[7][0], v[0]);

    t0 = _mm_hadd_ps(t0, t1);
    t2 = _mm_hadd_ps(t2, t3);
    r2 = _mm_hadd_ps(t0, t2);

    /* rows 4-7, columns 4-7 */
    t0 = _mm_mul_ps(m[4][1], v[1]);
    t1 = _mm_mul_ps(m[5][1], v[1]);
    t2 = _mm_mul_ps(m[6][1], v[1]);
    t3 = _mm_mul_ps(m[7][1], v[1]);

    t0 = _mm_hadd_ps(t0, t1);
    t2 = _mm_hadd_ps(t2, t3);
    r3 = _mm_hadd_ps(t0, t2);

    /* combine the low-column and high-column partial sums */
    r[0] = _mm_add_ps(r0, r1); /* results for rows 0-3 */
    r[1] = _mm_add_ps(r2, r3); /* results for rows 4-7 */
}
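
One way to sanity-check a routine like this is to compare it against a plain scalar loop. The sketch below is not the code above; it folds the same two-level haddps reduction into a loop and takes ordinary float arrays (loaded with unaligned loads) so it is easy to call. The names mul8x8, max_abs_error and SSE3_FN, as well as the test values, are made up for illustration. The target attribute is a GCC/Clang feature, used here only so the file builds without -msse3.

```c
#include <pmmintrin.h>  /* SSE3: _mm_hadd_ps (also pulls in the SSE headers) */

/* Assumption: GCC or Clang. Other compilers need their own SSE3 switch. */
#if defined(__GNUC__)
#define SSE3_FN __attribute__((target("sse3")))
#else
#define SSE3_FN
#endif

/* Hypothetical loop form: out = a * x for a row-major 8x8 matrix a. */
SSE3_FN void mul8x8(float out[8], const float a[64], const float x[8])
{
    const __m128 v[2] = { _mm_loadu_ps(x), _mm_loadu_ps(x + 4) };

    for (int g = 0; g < 2; g++) {          /* rows 0-3, then rows 4-7 */
        __m128 acc = _mm_setzero_ps();
        for (int h = 0; h < 2; h++) {      /* columns 0-3, then 4-7 */
            const float *p = a + 32 * g + 4 * h;  /* row 4g, column 4h */
            __m128 t0 = _mm_mul_ps(_mm_loadu_ps(p +  0), v[h]);
            __m128 t1 = _mm_mul_ps(_mm_loadu_ps(p +  8), v[h]);
            __m128 t2 = _mm_mul_ps(_mm_loadu_ps(p + 16), v[h]);
            __m128 t3 = _mm_mul_ps(_mm_loadu_ps(p + 24), v[h]);
            /* Two hadd levels leave one partial dot product per lane. */
            t0 = _mm_hadd_ps(t0, t1);
            t2 = _mm_hadd_ps(t2, t3);
            acc = _mm_add_ps(acc, _mm_hadd_ps(t0, t2));
        }
        _mm_storeu_ps(out + 4 * g, acc);
    }
}

/* Returns the largest absolute difference between the SSE result and a
   scalar reference, using arbitrary test data. */
SSE3_FN float max_abs_error(void)
{
    float a[64], x[8], out[8];
    float worst = 0.0f;

    for (int j = 0; j < 8; j++)
        x[j] = 0.5f * (float)(j + 1);
    for (int i = 0; i < 64; i++)
        a[i] = 0.25f * (float)i - 3.0f;

    mul8x8(out, a, x);

    for (int i = 0; i < 8; i++) {
        float ref = 0.0f;
        for (int j = 0; j < 8; j++)
            ref += a[8 * i + j] * x[j];
        float d = out[i] - ref;
        if (d < 0.0f) d = -d;
        if (d > worst) worst = d;
    }
    return worst;
}
```

The SSE version adds the products in a different order than the scalar loop, so with arbitrary data it is safer to compare against a small tolerance than to expect bit-identical results.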

As for alignment, a variable of type __m128 should be automatically 16-byte aligned on the stack. With dynamic memory, this is not a safe assumption: some malloc / new implementations may only return memory guaranteed to be 8-byte aligned.

The intrinsics header provides _mm_malloc and _mm_free. The align parameter should be 16 in this case. Memory obtained from _mm_malloc must be released with _mm_free, not free.
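
As a minimal sketch of how that might look for a large array such as n = 1000, the code below allocates a float buffer with _mm_malloc, checks the address, and reads it back with aligned loads. The function names is_aligned16 and sum_ones_aligned are hypothetical, and rounding the length up to a multiple of 4 is just one way to handle a tail. It also assumes GCC or Clang, where xmmintrin.h makes _mm_malloc available; other compilers may declare it in a different header.

```c
#include <stddef.h>     /* size_t, NULL */
#include <stdint.h>     /* uintptr_t */
#include <xmmintrin.h>  /* SSE: __m128, _mm_load_ps; _mm_malloc on GCC/Clang */

/* Returns 1 when p sits on a 16-byte boundary. */
int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* Hypothetical demo: allocate n floats (rounded up to a multiple of 4),
   set them to 1.0f, and add them up with aligned loads.
   Returns -1.0f if the allocation fails or comes back misaligned. */
float sum_ones_aligned(size_t n)
{
    size_t padded = (n + 3) & ~(size_t)3;
    float *buf = (float *)_mm_malloc(padded * sizeof(float), 16);
    if (buf == NULL)
        return -1.0f;
    if (!is_aligned16(buf)) {
        _mm_free(buf);
        return -1.0f;
    }

    for (size_t i = 0; i < padded; i++)
        buf[i] = (i < n) ? 1.0f : 0.0f;   /* padding contributes nothing */

    __m128 acc = _mm_setzero_ps();
    for (size_t i = 0; i < padded; i += 4)
        acc = _mm_add_ps(acc, _mm_load_ps(buf + i));  /* aligned load */

    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    _mm_free(buf);                        /* pair _mm_malloc with _mm_free */
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
```

Because the base address is 16-byte aligned and each step advances by four floats (16 bytes), every _mm_load_ps in the loop stays on an aligned address.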

Brett Hale answered Oct 23 '22 03:10