
Maximum speed from IOS/iPad/iPhone

Tags:

xcode

ios

opencv

iphone

I built a compute-intensive app using OpenCV for iOS. Of course it was slow, but it was something like 200 times slower than my PC prototype, so I started optimizing. From the initial 15 seconds I got it down to 0.4 seconds. I wonder whether I have found everything, and what others may want to share. What I did:

1) Replaced the "double" data types inside OpenCV with "float". A double is 64 bits and a 32-bit CPU cannot handle it easily, so float gave me some speed. OpenCV uses double very often.
2) Added "-mfpu=neon" to the compiler options. A side effect is that the simulator build no longer works, so everything can be tested on real hardware only.
3) Replaced the sin() and cos() implementations with 90-value lookup tables. The speedup was huge! This is somewhat the opposite of the PC, where such optimizations do not give any speedup. The code worked in degrees and converted each value to radians for sin() and cos(); that conversion was removed too, but it was the lookup tables that did the job.
4) Enabled "thumb optimizations". Some blog posts recommend exactly the opposite, but that is because Thumb usually makes things slower on armv6. armv7 is free of those problems, and Thumb just makes things faster and smaller.
5) To make sure the Thumb optimizations and -mfpu=neon work at their best and do not introduce crashes, I removed the armv6 target completely. All my code is compiled for armv7, and this is also listed as a requirement in the App Store. This means the minimum iPhone is the 3GS. I think it is OK to drop the older ones. Older devices have slower CPUs anyway, and a CPU-intensive app gives a bad user experience on an old device.
6) Of course I use the -O3 flag.
7) I deleted "dead code" from OpenCV. Often when optimizing OpenCV I see code which is clearly not needed for my project. For example, there is often an extra "if()" that checks whether the pixel size is 8 bit or 32 bit, and I know that I need 8 bit only. Removing it shrinks the code and gives the optimizer a better chance to remove something more or to replace values with constants. The code also fits into the cache better.

Any other tricks and ideas? For me, enabling Thumb and replacing trigonometry with lookups gave the biggest boosts, which surprised me. Maybe you know something more that makes apps fly?

304

asked Jun 27 '12 04:06

Tõnu Samuel

2 Answers

If you are doing a lot of floating point calculations, it would benefit you greatly to use Apple's Accelerate framework. It is designed to use the floating point hardware to do calculations on vectors in parallel.

I will also address your points one by one:

1) This is not because of the CPU's word size. As of the armv7 era (Apple replaced the floating point hardware), only 32-bit floating point operations run on the fast NEON unit, which has no double-precision support; 64-bit operations take a much slower path instead. In exchange, 32-bit operations got much faster.

2) NEON is the name of the SIMD (vector) instruction set of the new floating point hardware.

3) Yes, this is a well known method. An alternative is to use Apple's framework that I mentioned above. It provides sin and cos functions that calculate 4 values in parallel. The algorithms are fine tuned in assembly and NEON so they give the maximum performance while using minimal battery.

4) The new armv7 implementation of thumb doesn't have the drawbacks of armv6. The disabling recommendation only applies to v6.

5) Yes, considering 80% of users are on iOS 5.0 or above now (armv6 devices ended support at 4.2.1), that is perfectly acceptable for most situations.

6) Release builds are optimized automatically, but check the level: Xcode's default for the Release configuration is -Os, not -O3.

7) Yes, this won't have as large an effect as the above methods though.

My recommendation is to check out Accelerate. That way you can make sure you are leveraging the full power of the floating point processor.

answered Nov 15 '22 22:11

borrrden

This is some feedback on the previous posts. It expands on what I meant about dead code in point 7, which was meant to be a slightly wider idea. I need formatting, so I cannot use a comment. OpenCV had code like this:

for( kk = 0; kk < (int)(descriptors->elem_size/sizeof(vec[0])); kk++ ) {
    vec[kk] = 0;
}

I wanted to see what it looks like in assembly. To make sure I could find it in the assembly output, I wrapped it like this:

__asm__("#start");
for( kk = 0; kk < (int)(descriptors->elem_size/sizeof(vec[0])); kk++ ) {
    vec[kk] = 0;
}
__asm__("#stop");

Then I chose "Product -> Generate Output -> Assembly file", and what I got is:

    @ InlineAsm Start
    #start
    @ InlineAsm End
Ltmp1915:
    ldr r0, [sp, #84]
    movs    r1, #0
    ldr r0, [r0, #16]
    ldr r0, [r0, #28]
    cmp r0, #4
    mov r0, r4
    blo LBB14_71
LBB14_70:
Ltmp1916:
    ldr r3, [sp, #84]
    movs    r2, #0
Ltmp1917:
    str r2, [r0], #4
    adds    r1, #1
Ltmp1918:
Ltmp1919:
    ldr r2, [r3, #16]
    ldr r2, [r2, #28]
    lsrs    r2, r2, #2
    cmp r2, r1
    bgt LBB14_70
LBB14_71:
Ltmp1920:
    add.w   r0, r4, #8
    @ InlineAsm Start
    #stop
    @ InlineAsm End

A lot of code. I printf-ed the value of (int)(descriptors->elem_size/sizeof(vec[0])) and it was always 64. So I hardcoded it to 64 and generated the assembly again:

    @ InlineAsm Start
    #start
    @ InlineAsm End
Ltmp1915:
    vldr.32 s16, LCPI14_7
    mov r0, r4
    movs    r1, #0
    mov.w   r2, #256
    blx _memset
    @ InlineAsm Start
    #stop
    @ InlineAsm End

As you can see, the optimizer now got the idea and the code became much shorter: the whole loop was replaced with a single memset call over 256 bytes (64 elements of 4 bytes each). The point is that the compiler does not always know which inputs are constant, such as the webcam frame size or the pixel depth, but in my contexts they usually are constant, and all I care about is speed.
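
In source form, the change amounts to something like the sketch below. `Descriptors` is a hypothetical stand-in for the OpenCV structure in the post, and `DESCRIPTOR_LENGTH`, `clear_generic` and `clear_fixed` are illustrative names. The assert is an addition, not part of the original change: it checks the hardcoded assumption in debug builds and compiles away when NDEBUG is defined.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the OpenCV structure used in the post. */
typedef struct {
    size_t elem_size;                  /* size of one descriptor in bytes */
} Descriptors;

enum { DESCRIPTOR_LENGTH = 64 };       /* value observed via printf in the post */

/* General version: the trip count is re-read through the pointer, so the
   compiler has to keep the loop and the loads. */
void clear_generic(float *vec, const Descriptors *descriptors) {
    for (int kk = 0; kk < (int)(descriptors->elem_size / sizeof(vec[0])); kk++) {
        vec[kk] = 0;
    }
}

/* Specialized version: the trip count is a compile-time constant, which
   lets the optimizer turn the loop into a single block clear. */
void clear_fixed(float *vec, const Descriptors *descriptors) {
    /* Debug-only check that the hardcoded length still matches reality. */
    assert(descriptors->elem_size == DESCRIPTOR_LENGTH * sizeof(vec[0]));
    (void)descriptors;                 /* unused when NDEBUG is defined */
    for (int kk = 0; kk < DESCRIPTOR_LENGTH; kk++) {
        vec[kk] = 0;
    }
}
```

Keeping the check in debug builds is one way to notice if a later change (a different descriptor size, for example) silently breaks the assumption.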

I also tried Accelerate as suggested, replacing the three lines with:

__asm__("#start");
vDSP_vclr(vec,1,64);
__asm__("#stop");

The assembly now looks like this:

    @ InlineAsm Start
    #start
    @ InlineAsm End
Ltmp1917:
    str r1, [r7, #-140]
Ltmp1459:
Ltmp1918:
    movs    r1, #1
    movs    r2, #64
    blx _vDSP_vclr
Ltmp1460:
Ltmp1919:
    add.w   r0, r4, #8
    @ InlineAsm Start
    #stop
    @ InlineAsm End

I am unsure whether this is faster than the memset (bzero) version, though. In my context this part does not take much time, and the two variants seemed to run at the same speed.

One more thing I learned is to use the GPU. More about it here: http://www.sunsetlakesoftware.com/2012/02/12/introducing-gpuimage-framework

answered Nov 15 '22 22:11

Tõnu Samuel
