I read a bit about using SSE intrinsics and tried my luck with implementing quaternion rotation with doubles. Below are the normal and SSE functions I wrote,
void quat_rot(quat_t a, REAL* restrict b){
///////////////////////////////////////////
// Multiply vector b by quaternion a //
///////////////////////////////////////////
REAL cross_temp[3],result[3];
cross_temp[0]=a.el[2]*b[2]-a.el[3]*b[1]+a.el[0]*b[0];
cross_temp[1]=a.el[3]*b[0]-a.el[1]*b[2]+a.el[0]*b[1];
cross_temp[2]=a.el[1]*b[1]-a.el[2]*b[0]+a.el[0]*b[2];
result[0]=b[0]+2.0*(a.el[2]*cross_temp[2]-a.el[3]*cross_temp[1]);
result[1]=b[1]+2.0*(a.el[3]*cross_temp[0]-a.el[1]*cross_temp[2]);
result[2]=b[2]+2.0*(a.el[1]*cross_temp[1]-a.el[2]*cross_temp[0]);
b[0]=result[0];
b[1]=result[1];
b[2]=result[2];
}
With SSE
inline void cross_p(__m128d *a, __m128d *b, __m128d *c){
const __m128d SIGN_NP = _mm_set_pd(0.0, -0.0);
__m128d l1 = _mm_mul_pd( _mm_unpacklo_pd(a[1], a[1]), b[0] );
__m128d l2 = _mm_mul_pd( _mm_unpacklo_pd(b[1], b[1]), a[0] );
__m128d m1 = _mm_sub_pd(l1, l2);
m1 = _mm_shuffle_pd(m1, m1, 1);
m1 = _mm_xor_pd(m1, SIGN_NP);
l1 = _mm_mul_pd( a[0], _mm_shuffle_pd(b[0], b[0], 1) );
__m128d m2 = _mm_sub_sd(l1, _mm_unpackhi_pd(l1, l1));
c[0] = m1;
c[1] = m2;
}
void quat_rotSSE(quat_t a, REAL* restrict b){
///////////////////////////////////////////
// Multiply vector b by quaternion a //
///////////////////////////////////////////
__m128d axb[2];
__m128d aa[2];
aa[0] = _mm_load_pd(a.el+1);
aa[1] = _mm_load_sd(a.el+3);
__m128d bb[2];
bb[0] = _mm_load_pd(b);
bb[1] = _mm_load_sd(b+2);
cross_p(aa, bb, axb);
__m128d w = _mm_set1_pd(a.el[0]);
axb[0] = _mm_add_pd(axb[0], _mm_mul_pd(w, bb[0]));
axb[1] = _mm_add_sd(axb[1], _mm_mul_sd(w, bb[1]));
cross_p(aa, axb, axb);
_mm_store_pd(b, _mm_add_pd(bb[0], _mm_add_pd(axb[0], axb[0])));
_mm_store_sd(b+2, _mm_add_pd(bb[1], _mm_add_sd(axb[1], axb[1])));
}
The rotation is basically done using the function,
I then run the following test to check how much time each function takes to do a set of rotations,
int main(int argc, char *argv[]){
REAL a[] __attribute__ ((aligned(16))) = {0.2, 1.3, 2.6};
quat_t q = {{0.1, 0.7, -0.3, -3.2}};
REAL sum = 0.0;
for(int i = 0; i < 4; i++) sum += q.el[i] * q.el[i];
sum = sqrt(sum);
for(int i = 0; i < 4; i++) q.el[i] /= sum;
int N = 1000000000;
for(int i = 0; i < N; i++){
quat_rotSSE(q, a);
}
printf("rot = ");
for(int i = 0; i < 3; i++) printf("%f, ", a[i]);
printf("\n");
return 0;
}
I compiled using gcc 4.6.3 with -O3 -std=c99 -msse3.
The timings for the normal function, using the unix time
, was 18.841s and 21.689s for the SSE one.
Am I missing something, why is the SSE implementation 15% slower than the normal one? In which cases would an SSE implementation be faster for double precision?
EDIT: Taking advice from the comments, I tried several things,
restrict
on the cross_p
function and added an __m128d to hold the second cross product. This had no difference in the assembly produced.movapd
.The assembly code generated for the SSE function is only 4 lines less than the normal one.
EDIT: Added links to the assembly generated,
quat_rot
quat_rotSSE
SSE (and SIMD in general) works really well when you're performing the same operations on a large number of elements, where there's no dependencies between operations. For example, if you had an array of double and needed to do array[i] = (array[i] * K + L)/M + N;
for each element then SSE/SIMD would help.
If you're not performing the same operations on a large number of elements, then SSE doesn't help. For example, if you had one double and needed to do foo = (foo * K + L)/M + N;
then SSE/SIMD isn't going to help.
Basically, SSE is the wrong tool for the job. You need to change the job into something where SSE is the right tool. For example, rather than multiplying one vector by one quaternion; try multiplying an array of 1000 vectors by a quaternion, or perhaps multiplying an array of 1000 vectors by an array of 1000 quaternions.
EDIT: Added everything below here!
Note that this typically means modifying data structures to suit. For example, rather than having an array of structures it's often better to have a structure of arrays.
For a better example, imagine your code uses an array of quaternions, like this:
for(i = 0; i < quaternionCount; i++) {
cross_temp[i][0] = a[i][2] * b[i][2] - a[i][3] * b[i][1] + a[i][0] * b[i][0];
cross_temp[i][1] = a[i][3] * b[i][0] - a[i][1] * b[i][2] + a[i][0] * b[i][1];
cross_temp[i][2] = a[i][1] * b[i][1] - a[i][2] * b[i][0] + a[i][0] * b[i][2];
b[i][0] = b[i][0] + 2.0 * (a[i][2] * cross_temp[i][2] - a[i][3] * cross_temp[i][1]);
b[i][1] = b[i][1] + 2.0 * (a[i][3] * cross_temp[i][0] - a[i][1] * cross_temp[i][2]);
b[i][2] = b[i][2] + 2.0 * (a[i][1] * cross_temp[i][1] - a[i][2] * cross_temp[i][0]);
}
The first step would be to transform it into a quaternion of arrays, and do this:
for(i = 0; i < quaternionCount; i++) {
cross_temp[0][i] = a[2][i] * b[2][i] - a[3][i] * b[1][i] + a[0][i] * b[0][i];
cross_temp[1][i] = a[3][i] * b[0][i] - a[1][i] * b[2][i] + a[0][i] * b[1][i];
cross_temp[2][i] = a[1][i] * b[1][i] - a[2][i] * b[0][i] + a[0][i] * b[2][i];
b[0][i] = b[0][i] + 2.0 * (a[2][i] * cross_temp[2][i] - a[3][i] * cross_temp[1][i]);
b[1][i] = b[1][i] + 2.0 * (a[3][i] * cross_temp[0][i] - a[1][i] * cross_temp[2][i]);
b[2][i] = b[2][i] + 2.0 * (a[1][i] * cross_temp[1][i] - a[2][i] * cross_temp[0][i]);
}
Then, because 2 adjacent doubles fit into a single SSE register you want to unroll the loop by 2:
for(i = 0; i < quaternionCount; i += 2) {
cross_temp[0][i] = a[2][i] * b[2][i] - a[3][i] * b[1][i] + a[0][i] * b[0][i];
cross_temp[0][i+1] = a[2][i+1] * b[2][i+1] - a[3][i+1] * b[1][i+1] + a[0][i+1] * b[0][i+1];
cross_temp[1][i] = a[3][i] * b[0][i] - a[1][i] * b[2][i] + a[0][i] * b[1][i];
cross_temp[1][i+1] = a[3][i+1] * b[0][i+1] - a[1][i+1] * b[2][i+1] + a[0][i+1] * b[1][i+1];
cross_temp[2][i] = a[1][i] * b[1][i] - a[2][i] * b[0][i] + a[0][i] * b[2][i];
cross_temp[2][i+1] = a[1][i+1] * b[1][i+1] - a[2][i+1] * b[0][i+1] + a[0][i+1] * b[2][i+1];
b[0][i] = b[0][i] + 2.0 * (a[2][i] * cross_temp[2][i] - a[3][i] * cross_temp[1][i]);
b[0][i+1] = b[0][i+1] + 2.0 * (a[2][i+1] * cross_temp[2][i+1] - a[3][i+1] * cross_temp[1][i+1]);
b[1][i] = b[1][i] + 2.0 * (a[3][i] * cross_temp[0][i] - a[1][i] * cross_temp[2][i]);
b[1][i+1] = b[1][i+1] + 2.0 * (a[3][i+1] * cross_temp[0][i+1] - a[1][i+1] * cross_temp[2][i+1]);
b[2][i] = b[2][i] + 2.0 * (a[1][i] * cross_temp[1][i] - a[2][i] * cross_temp[0][i]);
b[2][i+1] = b[2][i+1] + 2.0 * (a[1][i+1] * cross_temp[1][i+1] - a[2][i+1] * cross_temp[0][i+1]);
}
Now, you want to break this down into individual operations. For example, the first 2 lines of the inner loop would become:
cross_temp[0][i] = a[2][i] * b[2][i];
cross_temp[0][i] -= a[3][i] * b[1][i];
cross_temp[0][i] += a[0][i] * b[0][i];
cross_temp[0][i+1] = a[2][i+1] * b[2][i+1];
cross_temp[0][i+1] -= a[3][i+1] * b[1][i+1];
cross_temp[0][i+1] += a[0][i+1] * b[0][i+1];
Now re-order that:
cross_temp[0][i] = a[2][i] * b[2][i];
cross_temp[0][i+1] = a[2][i+1] * b[2][i+1];
cross_temp[0][i] -= a[3][i] * b[1][i];
cross_temp[0][i+1] -= a[3][i+1] * b[1][i+1];
cross_temp[0][i] += a[0][i] * b[0][i];
cross_temp[0][i+1] += a[0][i+1] * b[0][i+1];
Once you've done all of this, think about converting to SSE. The first 2 lines of code is one load (that loads both a[2][i]
and a[2][i+1]
into an SSE register) followed by one multiplication (and not 2 separate loads and 2 separate multiplications). Those 6 lines might become (pseudo-code):
load SSE_register1 with both a[2][i] and a[2][i+1]
multiply SSE_register1 with both b[2][i] and b[2][i+1]
load SSE_register2 with both a[3][i] and a[3][i+1]
multiply SSE_register2 with both b[1][i] and b[1][i+1]
load SSE_register2 with both a[0][i] and a[0][i+1]
multiply SSE_register2 with both b[0][i] and b[0][i+1]
SE_register1 = SE_register1 - SE_register2
SE_register1 = SE_register1 + SE_register3
Each line of pseudo-code here is a single SSE instruction/intrinsic; and every SSE instruction/intrinsic is doing 2 operations in parallel.
If every instruction does 2 operations in parallel, then (in theory) it could be up to twice as fast as the original "one operation per instruction" code.
Some ideas that would perhaps enable full optimization of your code.
restrict
specifications to cross_p
to avoid the
compiler reloading of parameters several times__m128d
variable that will receive the result of the second call to cross_p
Then look into the assembler (gcc option -S) to see what is produced by all of this.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With