I’ve been looking into the Accelerate framework that was made available in iOS 4. Specifically, I made some attempts to use the Cblas routines in my linear algebra library in C. Now I can’t get the use of these functions to give me any performance gain over very basic routines. Specifically, the case of 4x4 matrix multiplication. Wherever I couldn’t make use of affine or homogeneous properties of the matrices, I’ve been using this routine (abridged):
float *mat4SetMat4Mult(const float *m0, const float *m1, float *target) {
target[0] = m0[0] * m1[0] + m0[4] * m1[1] + m0[8] * m1[2] + m0[12] * m1[3];
target[1] = ...etc...
...
target[15] = m0[3] * m1[12] + m0[7] * m1[13] + m0[11] * m1[14] + m0[15] * m1[15];
return target;
}
The equivalent function call for Cblas is:
cblas_sgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
4, 4, 4, 1.f, m0, 4, m1, 4, 0.f, target, 4);
Comparing the two, by making them run through a large number of pre-computed matrices filled with random numbers (each function gets the exact same input every time), the Cblas routine performs about 4x slower, when timed with the C clock() function.
This doesn’t seem right to me, and I’m left with the feeling that I’m doing something wrong somewhere. Do I have to to enable the device’s NEON unit and SIMD functionality somehow? Or shouldn’t I hope for better performance with such small matrices?
Very much appreciated,
Bastiaan
The Apple WWDC2010 presentations say that Accelerate should still give a speedup for even a 3x3 matrix operation, so I would have assumed you should see a slight improvement for 4x4. But something you need to consider is that Accelerate & NEON are designed to greatly speed up integer operations but not necessarily floating-point operations. You didn't mention your CPU processor, and it seems that Accelerate will use either NEON or VFP for floating-point operations depending on your CPU. If it uses NEON instructions for 32-bit float operations then it should run fast, but if it uses VFP for 32-bit float or 64-bit double operations, then it will run very slowly (since VFP is not actually SIMD). So you should make sure that you are using 32-bit float operations with Accelerate, and make sure it will use NEON instead of VFP.
And another issue is that even if it does use NEON, there is no guarantee that your C compiler will generate faster NEON code than your simple C function does without NEON instructions, because C compilers such as GCC often generate terrible SIMD code, potentially running slower than standard code. Thats why its always important to test the speed of the generated code, and possibly to manually look at the generated assembly code to see if your compiler generated bad code.
The BLAS and LAPACK libraries are designed for use with what I would consider "medium to large matrices" (from tens to tens of thousands on a side). They will deliver correct results for smaller matrices, but the performance will not be as good as it could be.
There are several reasons for this:
What this means for you: if you would like to have dedicated small matrix operations provided, please go to bugreport.apple.com and file a bug requesting this feature.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With