
Why is Apple's Accelerate framework sometimes slow?


I am experimenting with C and Swift 3.0 code that uses Apple's vecLib and Accelerate frameworks, both as a dynamic library linked into a C project and in a Swift playground.

When I call one of the framework's SIMD wrappers, such as vvcospif(), on only 1 to 4 elements, it is slower than the plain standard cos(x * PI) when the call sits inside a loop that runs about 1,000 times, for example.

I know the difference between vvcospif() and cos(): vvcospif() is the exact counterpart of cos(x * PI).

Here is an example for a playground; you can copy the code and run it:

import Cocoa
import Accelerate

// Baseline: cosine interpolation using the standard cos().
func cosine_interpolate(alpha: Float, a: Float, b: Float) -> Float {
    let ft: Float = alpha * 3.1415927
    let f: Float = (1 - cos(ft)) * 0.5

    return a + f * (b - a)
}

var start = Date()

var interp: Float = 0

// Time 1,000 calls of the baseline version.
for _ in 0..<1000 {
    interp = cosine_interpolate(alpha: 0.25, a: 1.0, b: 0.75)
}

var end = Date()
var timeInterval: Double = end.timeIntervalSince(start)

print("cosine_interpolate in \(timeInterval) seconds")

// Same interpolation, but cos(pi * alpha) comes from vvcospif() called on a single element.
func fast_cosine_interpolate(alpha: Float, a: Float, b: Float) -> Float {
    var x: Float = alpha
    var count: Int32 = 1

    var result: Float = 0
    vvcospif(&result, &x, &count)

    // (1 - cos(pi * x)) / 2 equals sin^2(pi * x / 2)
    let SINSIN_HALF_X: Float = (1 - result) * 0.5

    return a + SINSIN_HALF_X * (b - a)
}

start = Date()

// Time 1,000 calls of the vvcospif() version.
for _ in 0..<1000 {
    interp = fast_cosine_interpolate(alpha: 0.25, a: 1.0, b: 0.75)
}

end = Date()
timeInterval = end.timeIntervalSince(start)

print("fast_cosine_interpolate in \(timeInterval) seconds")

My question is:

Why is vvcospif() slow in this example?

Is it because vvcospif() is a wrapper that goes through the Objective-C runtime, so that converting data structures or copying memory from Intel SIMD -> Objective-C -> Swift runtime is slower than a tiny cos()?

I also see the same performance issue in C code:

#include <Accelerate/Accelerate.h>

vvcospif(resultVector, inputVector, &count);

when inputVector and resultVector are small arrays with 1 or 2 elements, or just a float variable, and the call runs in a loop about 1,000,000 times.

cos(x * PI) takes about 20 ms,

and

vvcospif() on a single float or a float array[2] takes about 80 ms! Where is the acceleration? :)

Yes, in Xcode I build with -O -whole-module-optimization (optimization with whole-module optimization enabled).

menangen asked Dec 18 '22 13:12



1 Answer

For a more detailed discussion with examples, see "Introduction to Fast Bezier (and Trying the Accelerate.framework)".

The first, fundamental problem is that non-inlined function calls are extremely expensive. You don't want function calls if you can possibly help it in performance-critical code. Within a module, the compiler can often inline functions for you, and parts of stdlib can be inlined for you. But when you start crossing module barriers, Swift generally cannot optimize out the call.

The point of SIMD functions is that you set up all your data in the right format, and then call them just one time. That way the cost of the function call is made up by the SIMD optimized code you're calling.

But remember, you don't have to call into Accelerate to get SIMD optimizations. The compiler is perfectly capable of noticing you've written a loop and turning it into an inline SIMD algorithm itself (and it does this all the time). So for many simple problems, the compiler is going to win anyway. Think about it: if calling vvcospif with a count of 1 were faster than calling cos, wouldn't they just implement cos that way?
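As a rough illustration of the kind of loop meant here, the sketch below blends two arrays in plain Swift with no framework call. The function name blend and its parameters are made up for this example, and whether the optimizer actually emits SIMD instructions for it depends on the build settings; it is not guaranteed.

```swift
/// Linear blend of two equally sized arrays, written as an ordinary loop.
/// `blend` is a hypothetical helper, not part of any framework.
func blend(_ a: [Float], _ b: [Float], weight f: Float) -> [Float] {
    precondition(a.count == b.count, "arrays must have the same length")
    var out = [Float](repeating: 0, count: a.count)
    // Simple element-wise arithmetic: a candidate for compiler auto-vectorization
    // in an optimized build (an assumption, not something this code enforces).
    for i in 0..<a.count {
        out[i] = a[i] + f * (b[i] - a[i])
    }
    return out
}
```

The loop body is the same a + f * (b - a) step used in the question, applied to whole arrays instead of one value per call.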

I haven't played with your code much, but if you want to improve its performance with Accelerate, you want to think about how to arrange all your input data so you can call vvcospif one time with a large N. It's quite possible in that case that it will be much faster than a loop (since cos is not trivial).
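A minimal sketch of that rearrangement, assuming the alpha values can be collected into one array up front. The name cosineInterpolateBatch is hypothetical, and the #else branch is only there so the snippet still builds where Accelerate is unavailable.

```swift
import Foundation
#if canImport(Accelerate)
import Accelerate
#endif

/// Computes a + f * (b - a), with f = (1 - cos(pi * alpha)) / 2,
/// for every alpha in a single batch. Hypothetical helper for illustration.
func cosineInterpolateBatch(alphas: [Float], a: Float, b: Float) -> [Float] {
    var cosines = [Float](repeating: 0, count: alphas.count)
#if canImport(Accelerate)
    var count = Int32(alphas.count)
    // One call for the whole array: vvcospif writes cos(pi * x) for each element.
    vvcospif(&cosines, alphas, &count)
#else
    // Fallback without Accelerate: plain per-element loop.
    for i in alphas.indices {
        cosines[i] = cos(alphas[i] * Float.pi)
    }
#endif
    // Finish the interpolation element by element.
    return cosines.map { a + (1 - $0) * 0.5 * (b - a) }
}
```

Calling it once with all the alpha values replaces the per-element calls from the question. Whether it actually wins should be measured in an optimized build with a realistically large array.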

If you want an example of Accelerate in practice, and how you need to organize your data, see PinchText. This code computes offsets for a page full of a few thousand glyphs based on up to 10 touches in real time, with animations (see PinchText.mov for what the result looks like). In particular, look at adjustViewPositions:count:forTouchPoint:. Notice how count is large, and the data is transformed step by step with no loops. Even throwing a (very expensive) ObjC method call into that method doesn't matter very much, because it's only made one time. Getting rid of function calls in loops is a huge part of performance programming.

Rob Napier answered Jan 11 '23 10:01
