I'm working on optimizing a 4D (128-bit) matrix-vector multiplication using ARM NEON assembler.
If I load the matrix and the vector into the NEON registers and transform them, I don't get a big performance boost, because the switch to the NEON registers costs 20 cycles. Furthermore, I reload the matrix for each multiplication, even though it has not changed.
There is enough register space to transform several vectors at a time, and that does increase performance.
But..
I'm wondering how fast this operation would be if I did the loop over all vertices (incrementing the pointers) inside the assembler. I'm at the very beginning with NEON assembler, though, and don't know how to do this. Can someone give me a hand with that?
What I want to achieve:
Existing C version of the loop:
void TransformVertices(ESMatrix* m, GLfloat* vertices, GLfloat* normals, int count)
{
    GLfloat* pVertex = vertices;
    int i;
    // iterate through vertices, one at a time
    for (i = 0; i < count; i++)
    {
        Matrix4Vector4Mul((float*)m, (float*)pVertex, (float*)pVertex);
        pVertex += 4;
    }
    //LoadMatrix( (const float*) m);
    //// two at a time
    //for (i = 0; i < count; i += 2)
    //{
    //    Matrix4Vector4Mul2((float*)m, (float*)pVertex, (float*)(pVertex + 4));
    //    pVertex += 8;
    //}
}
NEON version that does only one transformation at a time:
void Matrix4Vector4Mul(const float* m, const float* vIn, float* vOut)
{
    asm volatile
    (
        "vldmia %1, {q1-q4}      \n\t"   // load the four matrix rows
        "vldmia %2, {q5}         \n\t"   // load the input vector
        "vmul.f32 q0, q1, d10[0] \n\t"   // row0 * v.x
        "vmla.f32 q0, q2, d10[1] \n\t"   // += row1 * v.y
        "vmla.f32 q0, q3, d11[0] \n\t"   // += row2 * v.z
        "vmla.f32 q0, q4, d11[1] \n\t"   // += row3 * v.w
        "vstmia %0, {q0}"
        : // no output
        : "r" (vOut), "r" (m), "r" (vIn)
        : "memory", "q0", "q1", "q2", "q3", "q4", "q5"
    );
}
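For reference, a looped variant of the kind asked about above might look roughly like this: the matrix is loaded once before the loop, and the vertex pointer and counter are advanced inside the asm. This is an untested sketch (the function name and register choices are my own, and it assumes count > 0):

```c
// Hypothetical sketch: loop over all vertices inside one asm block.
// The matrix stays in q8-q11 for the whole batch.
void Matrix4Vector4MulN(const float* m, float* vertices, int count)
{
    asm volatile
    (
        "vldmia %2, {q8-q11}       \n\t"   // load matrix rows once
        "1:                        \n\t"
        "vldmia %0, {q0}           \n\t"   // load current vertex
        "vmul.f32  q12, q8,  d0[0] \n\t"   // row0 * v.x
        "vmla.f32  q12, q9,  d0[1] \n\t"   // += row1 * v.y
        "vmla.f32  q12, q10, d1[0] \n\t"   // += row2 * v.z
        "vmla.f32  q12, q11, d1[1] \n\t"   // += row3 * v.w
        "vstmia %0!, {q12}         \n\t"   // store in place, advance pointer
        "subs %1, %1, #1           \n\t"   // count--
        "bne 1b                    \n\t"
        : "+r" (vertices), "+r" (count)
        : "r" (m)
        : "memory", "cc", "q0", "q8", "q9", "q10", "q11", "q12"
    );
}
```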
C version of the transformation:
void Matrix4Vector4Mul(const float* m, const float* vIn, float* vOut)
{
    Vertex4D* v1 = (Vertex4D*)vIn;
    Vertex4D vOut1;
    Vertex4D* l0;
    Vertex4D* l1;
    Vertex4D* l2;
    Vertex4D* l3;
    // 4x4 matrix with members m00 - m33
    ESMatrix* m1 = (ESMatrix*)m;

    l0 = (Vertex4D*)&m1->m00;
    vOut1.x = l0->x * v1->x;
    vOut1.y = l0->y * v1->x;
    vOut1.z = l0->z * v1->x;
    vOut1.w = l0->w * v1->x;

    l1 = (Vertex4D*)&m1->m10;
    vOut1.x += l1->x * v1->y;
    vOut1.y += l1->y * v1->y;
    vOut1.z += l1->z * v1->y;
    vOut1.w += l1->w * v1->y;

    l2 = (Vertex4D*)&m1->m20;
    vOut1.x += l2->x * v1->z;
    vOut1.y += l2->y * v1->z;
    vOut1.z += l2->z * v1->z;
    vOut1.w += l2->w * v1->z;

    l3 = (Vertex4D*)&m1->m30;
    vOut1.x += l3->x * v1->w;
    vOut1.y += l3->y * v1->w;
    vOut1.z += l3->z * v1->w;
    vOut1.w += l3->w * v1->w;

    *(vOut)     = vOut1.x;
    *(vOut + 1) = vOut1.y;
    *(vOut + 2) = vOut1.z;
    *(vOut + 3) = vOut1.w;
}
Performance: (Transform > 90 000 Vertices | Android 4.0.4 SGS II)
C-Version: 190 FPS
NEON-Version: 162 FPS ( .. slower -.- )
--- Loading the matrix only ONCE (separate asm) and then transforming two vertices at a time ---
NEON-Version: 217 FPS ( + 33 % NEON | + 14 % C-Code )
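The two-at-a-time variant referenced in the commented-out loop above might be sketched like this. It assumes a prior LoadMatrix() asm block left the matrix rows in q8-q11 and that nothing clobbered them in between; gcc gives no such guarantee, so this is fragile and untested (register choices are my own):

```c
// Hypothetical sketch: two vertices per call, matrix assumed to be
// resident in q8-q11 from an earlier LoadMatrix() asm block.
// The two accumulators are interleaved to hide vmla latency.
static void Matrix4Vector4Mul2(float* vA, float* vB)
{
    asm volatile
    (
        "vldmia %0, {q0}           \n\t"   // vertex A
        "vldmia %1, {q1}           \n\t"   // vertex B
        "vmul.f32  q12, q8,  d0[0] \n\t"
        "vmul.f32  q13, q8,  d2[0] \n\t"
        "vmla.f32  q12, q9,  d0[1] \n\t"
        "vmla.f32  q13, q9,  d2[1] \n\t"
        "vmla.f32  q12, q10, d1[0] \n\t"
        "vmla.f32  q13, q10, d3[0] \n\t"
        "vmla.f32  q12, q11, d1[1] \n\t"
        "vmla.f32  q13, q11, d3[1] \n\t"
        "vstmia %0, {q12}          \n\t"
        "vstmia %1, {q13}          \n\t"
        :
        : "r" (vA), "r" (vB)
        : "memory", "q0", "q1", "q12", "q13"
    );
}
```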
Did you try playing with compiler flags?
-mcpu=cortex-a9 -mtune=cortex-a9 -mfloat-abi=softfp -mfpu=neon -O3
does a pretty good job for me in this case (gcc 4.4.3, distributed with Android NDK 8b). Try to keep the source tight by declaring internal functions static inline, and either move the matrix (the m[X][0] values) into static globals or simply merge Matrix4Vector4Mul into the loop and keep the matrix in local variables instead of passing it to a function every time - gcc doesn't get smart there otherwise.
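That restructuring could look roughly like this in plain C (a sketch: a flat float[16] standing in for ESMatrix with the same m00..m33 memory layout, normals omitted):

```c
/* Multiply merged into the loop; the matrix is copied into locals
   once, so gcc can keep it in registers instead of reloading it
   through the pointer on every iteration. */
void TransformVertices(const float* m, float* vertices, int count)
{
    const float m00 = m[0],  m01 = m[1],  m02 = m[2],  m03 = m[3];
    const float m10 = m[4],  m11 = m[5],  m12 = m[6],  m13 = m[7];
    const float m20 = m[8],  m21 = m[9],  m22 = m[10], m23 = m[11];
    const float m30 = m[12], m31 = m[13], m32 = m[14], m33 = m[15];
    int i;

    for (i = 0; i < count; i++, vertices += 4)
    {
        const float x = vertices[0], y = vertices[1];
        const float z = vertices[2], w = vertices[3];
        /* same access pattern as the original Matrix4Vector4Mul */
        vertices[0] = m00 * x + m10 * y + m20 * z + m30 * w;
        vertices[1] = m01 * x + m11 * y + m21 * z + m31 * w;
        vertices[2] = m02 * x + m12 * y + m22 * z + m32 * w;
        vertices[3] = m03 * x + m13 * y + m23 * z + m33 * w;
    }
}
```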
When I do this, I get the following for the main loop:
a4: ed567a03 vldr s15, [r6, #-12]
a8: ee276aa0 vmul.f32 s12, s15, s1
ac: ee676aa8 vmul.f32 s13, s15, s17
b0: ed564a04 vldr s9, [r6, #-16]
b4: ee277a88 vmul.f32 s14, s15, s16
b8: ed165a02 vldr s10, [r6, #-8]
bc: ee677a80 vmul.f32 s15, s15, s0
c0: ed565a01 vldr s11, [r6, #-4]
c4: e2833001 add r3, r3, #1
c8: ee046a89 vmla.f32 s12, s9, s18
cc: e1530004 cmp r3, r4
d0: ee446aaa vmla.f32 s13, s9, s21
d4: ee047a8a vmla.f32 s14, s9, s20
d8: ee447aa9 vmla.f32 s15, s9, s19
dc: ee056a22 vmla.f32 s12, s10, s5
e0: ee456a01 vmla.f32 s13, s10, s2
e4: ee057a21 vmla.f32 s14, s10, s3
e8: ee457a02 vmla.f32 s15, s10, s4
ec: ee056a8b vmla.f32 s12, s11, s22
f0: ee456a83 vmla.f32 s13, s11, s6
f4: ee057aa3 vmla.f32 s14, s11, s7
f8: ee457a84 vmla.f32 s15, s11, s8
fc: ed066a01 vstr s12, [r6, #-4]
100: ed466a04 vstr s13, [r6, #-16]
104: ed067a03 vstr s14, [r6, #-12]
108: ed467a02 vstr s15, [r6, #-8]
10c: e2866010 add r6, r6, #16
110: 1affffe3 bne a4 <TransformVertices+0xa4>
That is 4 loads, 4 multiplies, 12 multiply-accumulates and 4 stores, which matches what you are doing in Matrix4Vector4Mul.
If you are still not satisfied with the compiler-generated code, pass the compiler '-S' to get assembly output and use that as a starting point for further improvement, instead of starting from scratch.
You should also check that vertices is aligned to the cache-line size (32 bytes on Cortex-A9) to get a nice data flow.
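One way to get that alignment for heap buffers (an assumed helper using POSIX posix_memalign, available on Android/Linux):

```c
#include <stdlib.h>

/* Allocate `count` 4-float vertices on a 32-byte boundary
   (the Cortex-A9 cache-line size). */
float* AllocVertices(int count)
{
    void* p = NULL;
    if (posix_memalign(&p, 32, (size_t)count * 4 * sizeof(float)) != 0)
        return NULL;
    return (float*)p;
}
```

For static buffers, `__attribute__((aligned(32)))` achieves the same thing with gcc.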
For vectorization, there are gcc options like -ftree-vectorizer-verbose=9
that print information about what was vectorized. Also search the gcc documentation for this option to see how you can direct gcc, or what you need to modify, to get your multiplications vectorized. This might sound like a lot to dig into, but it will be more fruitful for you in the long run than 'hand vectorizing'.
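As an illustration of the shape the auto-vectorizer handles well (names are mine): independent iterations and restrict-qualified pointers, so gcc can prove the output doesn't alias the input. Note that gcc will only use NEON for float loops with -ffast-math (or -funsafe-math-optimizations), because NEON arithmetic is not fully IEEE-compliant.

```c
/* Trivially vectorizable loop: compile with
   -O3 -mfpu=neon -ffast-math -ftree-vectorizer-verbose=9
   and gcc reports whether it vectorized it. */
void ScaleVertices(float* restrict out, const float* restrict in,
                   float s, int n)
{
    int i;
    for (i = 0; i < n; i++)
        out[i] = in[i] * s;
}
```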