Efficient 4x4 matrix multiplication (C vs assembly)

I'm looking for a faster and trickier way to multiply two 4x4 matrices in C. My current research is focused on x86-64 assembly with SIMD extensions. So far, I've created a function which is about 6x faster than a naive C implementation, exceeding my expectations for the performance improvement. Unfortunately, this stays true only when no optimization flags are used for compilation (GCC 4.7). With -O2, C becomes faster and my effort becomes meaningless.

I know that modern compilers make use of complex optimization techniques to achieve almost perfect code, usually faster than an ingenious piece of hand-crafted assembly. But in a minority of performance-critical cases, a human may still try to fight the compiler for clock cycles, especially when some mathematics backed by a modern ISA can be exploited (as in my case).

My function looks as follows (AT&T syntax, GNU Assembler):

    .text
    .globl matrixMultiplyASM
    .type matrixMultiplyASM, @function
    matrixMultiplyASM:
    movaps   (%rdi), %xmm0    # fetch the first matrix (use four registers)
    movaps 16(%rdi), %xmm1
    movaps 32(%rdi), %xmm2
    movaps 48(%rdi), %xmm3
    xorq %rcx, %rcx           # reset (forward) loop iterator
    .ROW:
    movss (%rsi), %xmm4       # Compute four values (one row) in parallel:
    shufps $0x0, %xmm4, %xmm4 # 4x 4FP mul's, 3x 4FP add's, 6x mov's per row,
    mulps %xmm0, %xmm4        # expressed in four sequences of 5 instructions,
    movaps %xmm4, %xmm5       # executed 4 times for 1 matrix multiplication.
    addq $0x4, %rsi

    movss (%rsi), %xmm4       # movss + shufps comprise the _mm_set1_ps intrinsic
    shufps $0x0, %xmm4, %xmm4 #
    mulps %xmm1, %xmm4
    addps %xmm4, %xmm5
    addq $0x4, %rsi           # manual pointer arithmetic simplifies addressing

    movss (%rsi), %xmm4
    shufps $0x0, %xmm4, %xmm4
    mulps %xmm2, %xmm4        # actual computation happens here
    addps %xmm4, %xmm5        #
    addq $0x4, %rsi

    movss (%rsi), %xmm4       # one mulps operand fetched per sequence
    shufps $0x0, %xmm4, %xmm4 #  |
    mulps %xmm3, %xmm4        # the other is already waiting in %xmm[0-3]
    addps %xmm4, %xmm5
    addq $0x4, %rsi           # 5 preceding comments stride among the 4 blocks

    movaps %xmm5, (%rdx,%rcx) # store the resulting row, actually, a column
    addq $0x10, %rcx          # (matrices are stored in column-major order)
    cmpq $0x40, %rcx
    jne .ROW
    ret
    .size matrixMultiplyASM, .-matrixMultiplyASM

It calculates a whole column of the resultant matrix per iteration, by processing four floats packed in 128-bit SSE registers. The full vectorisation is possible with a bit of math (operation reordering and aggregation) and the mulps/addps instructions for parallel multiplication/addition of packed 4-float operands. The code reuses the registers meant for passing parameters (%rdi, %rsi, %rdx: GNU/Linux ABI), benefits from (inner) loop unrolling and holds one matrix entirely in XMM registers to reduce memory reads. As you can see, I have researched the topic and taken my time to implement it as well as I can.
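For comparison, here is what one iteration of the hand-written loop does, expressed with SSE intrinsics. This is only a sketch of the mapping between the assembly and intrinsics (the column_product name and the flat, 16-byte-aligned, column-major float[16] layout are assumptions for illustration, not part of the measured code):

    #include <xmmintrin.h>

    /* Sketch: one output column, as in one .ROW iteration of the assembly. */
    static inline void column_product(const float *a, const float *b_col, float *c_col)
    {
        __m128 a0 = _mm_load_ps(a +  0);   /* the four columns of the first matrix, */
        __m128 a1 = _mm_load_ps(a +  4);   /* kept in registers like %xmm0-%xmm3    */
        __m128 a2 = _mm_load_ps(a +  8);
        __m128 a3 = _mm_load_ps(a + 12);

        __m128 acc =          _mm_mul_ps(_mm_set1_ps(b_col[0]), a0);   /* movss+shufps+mulps */
        acc = _mm_add_ps(acc, _mm_mul_ps(_mm_set1_ps(b_col[1]), a1));
        acc = _mm_add_ps(acc, _mm_mul_ps(_mm_set1_ps(b_col[2]), a2));
        acc = _mm_add_ps(acc, _mm_mul_ps(_mm_set1_ps(b_col[3]), a3));

        _mm_store_ps(c_col, acc);          /* movaps %xmm5, (%rdx,%rcx) */
    }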

The naive C implementation that beats my code looks like this:

    void matrixMultiplyNormal(mat4_t *mat_a, mat4_t *mat_b, mat4_t *mat_r) {
        for (unsigned int i = 0; i < 16; i += 4)
            for (unsigned int j = 0; j < 4; ++j)
                mat_r->m[i + j] = (mat_b->m[i + 0] * mat_a->m[j +  0])
                                + (mat_b->m[i + 1] * mat_a->m[j +  4])
                                + (mat_b->m[i + 2] * mat_a->m[j +  8])
                                + (mat_b->m[i + 3] * mat_a->m[j + 12]);
    }

I have investigated the optimised assembly output of the C code above which, while storing floats in XMM registers, does not involve any parallel operations – just scalar calculations, pointer arithmetic and conditional jumps. The compiler's code seems less deliberate, but it is still slightly more effective than my vectorised version, which I expected to be about 4x faster. I'm sure that the general idea is correct – programmers do similar things with rewarding results. But what is wrong here? Are there any register allocation or instruction scheduling issues I am not aware of? Do you know any x86-64 assembly tools or tricks to support my battle against the machine?

asked Aug 28 '13 by Krzysztof Abramowicz


2 Answers

4x4 matrix multiplication is 64 multiplications and 48 additions. Using SSE this can be reduced to 16 multiplications and 12 additions (and 16 broadcasts). The following code will do this for you. It only requires SSE (#include <xmmintrin.h>). The arrays A, B, and C need to be 16-byte aligned. Using horizontal instructions such as hadd (SSE3) and dpps (SSE4.1) will be less efficient (especially dpps). I don't know if loop unrolling will help.

    void M4x4_SSE(float *A, float *B, float *C) {
        __m128 row1 = _mm_load_ps(&B[0]);
        __m128 row2 = _mm_load_ps(&B[4]);
        __m128 row3 = _mm_load_ps(&B[8]);
        __m128 row4 = _mm_load_ps(&B[12]);
        for(int i=0; i<4; i++) {
            __m128 brod1 = _mm_set1_ps(A[4*i + 0]);
            __m128 brod2 = _mm_set1_ps(A[4*i + 1]);
            __m128 brod3 = _mm_set1_ps(A[4*i + 2]);
            __m128 brod4 = _mm_set1_ps(A[4*i + 3]);
            __m128 row = _mm_add_ps(
                        _mm_add_ps(
                            _mm_mul_ps(brod1, row1),
                            _mm_mul_ps(brod2, row2)),
                        _mm_add_ps(
                            _mm_mul_ps(brod3, row3),
                            _mm_mul_ps(brod4, row4)));
            _mm_store_ps(&C[4*i], row);
        }
    }
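For example, a caller could guarantee the required 16-byte alignment like this (a minimal sketch; alignas needs C11, and _mm_malloc or __attribute__((aligned(16))) would work just as well):

    #include <stdalign.h>                            /* C11 alignas */

    void example(void)
    {
        alignas(16) float A[16], B[16], C[16];       /* satisfies _mm_load_ps / _mm_store_ps */
        /* ... fill A and B ... */
        M4x4_SSE(A, B, C);                           /* C = A * B for flat 4x4 float arrays */
    }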
answered by Z boson


There is a way to accelerate the code and outplay the compiler. It does not involve any sophisticated pipeline analysis or deep code micro-optimisation (which doesn't mean that it couldn't further benefit from these). The optimisation uses three simple tricks:

  1. The function is now 32-byte aligned (which significantly boosted performance),

  2. The main loop runs in reverse, which reduces the exit comparison to a test against zero (based on EFLAGS),

  3. Instruction-level address arithmetic proved to be faster than "external" pointer calculation (even though it requires twice as many additions in 3 of 4 cases). It shortened the loop body by four instructions and reduced data dependencies within its execution path. See the related question.

Additionally, the code uses local-label (relative jump) syntax, which suppresses the symbol-redefinition error that occurs when GCC tries to inline it (after being placed within an asm statement and compiled with -O3).

    .text
    .align 32                           # 1. function entry alignment
    .globl matrixMultiplyASM            #    (for a faster call)
    .type matrixMultiplyASM, @function
    matrixMultiplyASM:
    movaps   (%rdi), %xmm0
    movaps 16(%rdi), %xmm1
    movaps 32(%rdi), %xmm2
    movaps 48(%rdi), %xmm3
    movq $48, %rcx                      # 2. loop reversal
    1:                                  #    (for simpler exit condition)
    movss (%rsi, %rcx), %xmm4           # 3. extended address operands
    shufps $0, %xmm4, %xmm4             #    (faster than pointer calculation)
    mulps %xmm0, %xmm4
    movaps %xmm4, %xmm5
    movss 4(%rsi, %rcx), %xmm4
    shufps $0, %xmm4, %xmm4
    mulps %xmm1, %xmm4
    addps %xmm4, %xmm5
    movss 8(%rsi, %rcx), %xmm4
    shufps $0, %xmm4, %xmm4
    mulps %xmm2, %xmm4
    addps %xmm4, %xmm5
    movss 12(%rsi, %rcx), %xmm4
    shufps $0, %xmm4, %xmm4
    mulps %xmm3, %xmm4
    addps %xmm4, %xmm5
    movaps %xmm5, (%rdx, %rcx)
    subq $16, %rcx                      # one 'sub' (vs 'add' & 'cmp')
    jge 1b                              # SF=OF, idiom: jump if positive
    ret
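For completeness, a minimal sketch of the C-side glue, assuming the listing above is assembled separately (or emitted from a top-level asm statement). The mat4_t name is borrowed from the question, and its m array is assumed to be 16-byte aligned, which the movaps loads and stores rely on:

    /* Assumed layout: a 16-byte-aligned, column-major 4x4 float matrix. */
    typedef struct {
        float m[16] __attribute__((aligned(16)));
    } mat4_t;

    /* Implemented in the assembly listing above. */
    void matrixMultiplyASM(const mat4_t *a, const mat4_t *b, mat4_t *r);

    void multiply(const mat4_t *a, const mat4_t *b, mat4_t *r)
    {
        /* System V AMD64 ABI: a -> %rdi, b -> %rsi, r -> %rdx */
        matrixMultiplyASM(a, b, r);
    }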

This is the fastest x86-64 implementation I have seen so far. I will appreciate, up-vote and accept any answer providing a faster piece of assembly for this purpose!

answered by Krzysztof Abramowicz