ARM Thumb/Thumb-2 performance

Tags:

cortex-m3

I am working on an ARM Cortex-M3 controller which has the Thumb-2 instruction set.

Thumb mode compresses instructions to a 16-bit size, so code size is reduced. But why is performance said to be lower in normal Thumb mode?

For Thumb-2, performance is said to be improved, according to these two links:

Wikipedia
Arm.com

Improve performance in cases where a single 16-bit instruction restricts functions available to the compiler.

A stated aim for Thumb-2 was to achieve code density similar to Thumb with performance similar to the ARM instruction set on 32-bit memory.

What exactly is this performance? Can someone give a few examples related to it?

asked Apr 06 '13 03:04

2 Answers

Compared with the 32-bit ARM instruction set, the 16-bit Thumb instruction set (not talking about the Thumb-2 extensions yet) takes less space because the instructions are half the size. In general, though, there is a performance drop, because it takes more instructions to do the same thing as in ARM. The instruction set has fewer features, and most instructions only operate on registers r0-r7. In an apples-to-apples comparison, needing more instructions to do the same thing is slower.

The Thumb-2 extensions take formerly undefined Thumb instructions and use them to create 32-bit Thumb instructions. Understand that there is more than one set of Thumb-2 extensions. ARMv6-M adds only a handful of 32-bit instructions (mostly barriers and system register access). ARMv7-M adds something like 150 instructions to the Thumb instruction set; I don't know what ARMv8 or the future hold. So assuming ARMv7-M, they have bridged the gap between what you can do in Thumb and what you can do in ARM. Thumb-2 is a reduced ARM instruction set, as Thumb is, but not as reduced. So it might still take more instructions to do the same thing in Thumb-2 (meaning Thumb plus the Thumb-2 extensions) than in ARM.

This gives a taste of the issue: a single instruction in ARM and its equivalent in Thumb.

ARM

and r8,r9,r10

THUMB

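push {r0,r1}    @ free two low registers to use as scratch
mov r0,r9       @ r0 = r9
mov r1,r10      @ r1 = r10
and r0,r1       @ r0 = r9 & r10 (the 16-bit and also sets the flags)
mov r8,r0       @ r8 = result
pop {r0,r1}     @ restore the scratch registers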

Now, a compiler wouldn't do that. The compiler knows it is targeting Thumb and does things differently by choosing other registers. You still have fewer registers and fewer features per instruction:

mov r0,r1
and r0,r2

It still takes two instructions/execution cycles to AND two registers together, without modifying the operands, and put the result in a third register. Thumb-2 has a three-register AND, so you are back to a single instruction using the Thumb-2 extensions. That Thumb-2 instruction also opens up the high registers (r8 and up, apart from some restrictions on sp and pc) for any of those three operands, where the 16-bit Thumb AND is limited to r0-r7.
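A quick way to check this on your own toolchain is to compile a small C function once per instruction set and read the generated assembly. The sketch below is only an illustration: the function name `and_keep_operands` and the file name `and.c` are made up, `arm-none-eabi-gcc` is assumed as the cross compiler, and the assembly in the comment is hand-written to show the instruction forms, not captured compiler output.

```c
#include <stdint.h>

/*
 * Hypothetical experiment. Assumed build commands (GCC cross toolchain):
 *   arm-none-eabi-gcc -O2 -S -marm   -mcpu=arm7tdmi  and.c   (ARM)
 *   arm-none-eabi-gcc -O2 -S -mthumb -mcpu=arm7tdmi  and.c   (16-bit Thumb)
 *   arm-none-eabi-gcc -O2 -S -mthumb -mcpu=cortex-m3 and.c   (Thumb-2)
 *
 * Instruction forms for "c = a & b" with a and b kept intact:
 *   ARM:      and   r2, r0, r1      one instruction
 *   Thumb:    mov   r2, r0          copy first, because the 16-bit and
 *             and   r2, r1          overwrites its first operand
 *   Thumb-2:  and.w r2, r0, r1      one 32-bit instruction
 */
uint32_t and_keep_operands(uint32_t a, uint32_t b)
{
    uint32_t c = a & b;   /* result goes to a third variable        */
    return c + a + b;     /* a and b are still needed after the AND */
}
```

What the compiler actually emits depends on its version and on register allocation, so treat the comment as a guide to what to look for in the three `.s` files rather than as the expected output.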

Look at the ARMv5 Architecture Reference Manual: under each Thumb instruction it shows you the equivalent ARM instruction. Then go to that ARM instruction and compare what you can do with it that you can't do with the Thumb instruction. It is a one-way path. The Thumb instructions (not Thumb-2) have a one-to-one relationship with an ARM instruction: every Thumb instruction has an equivalent ARM instruction, but not every ARM instruction has an equivalent Thumb instruction. From this exercise you should be able to see the limitations on compilers when they target the Thumb instruction set.

Then get the ARMv7-M Architecture Reference Manual and look at the instruction set. Compare the "all Thumb variants" encodings (the ones that include ARMv4T) with the ones that are limited to ARMv6 and/or ARMv7, and see the expansion of features between Thumb and Thumb-2, as well as the Thumb-2-only instructions that have no Thumb counterpart. This should clarify what the compilers have to work with in Thumb versus Thumb-2.

You can then go so far as to compare Thumb plus Thumb-2 with the full-blown ARM instructions (in the ARMv7-A/R Architecture Reference Manual) and see that Thumb-2 gets a lot closer to ARM. You still lose, for example, conditionals on every instruction, so conditional execution in Thumb becomes comparisons with branching over code, where in ARM you can sometimes have an if-then-else without branching.
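To make the last point concrete, here is a rough sketch of an if-then-else in C together with the two instruction sequences it could map to. The function name `select_if_equal` is made up for illustration, and the assembly in the comment is hand-written to show the instruction forms, not captured compiler output.

```c
/*
 * Hypothetical example. Arguments arrive in r0-r3 and the result is
 * returned in r0 (standard ARM procedure call convention).
 *
 * ARM: no branch, both arms are predicated on the flags.
 *       cmp   r0, r1
 *       moveq r0, r2        @ executed only if a == b
 *       movne r0, r3        @ executed only if a != b
 *       bx    lr
 *
 * 16-bit Thumb: compare, then branch over code.
 *       cmp   r0, r1
 *       beq   1f
 *       mov   r0, r3        @ else arm
 *       bx    lr
 *   1:  mov   r0, r2        @ then arm
 *       bx    lr
 */
int select_if_equal(int a, int b, int x, int y)
{
    if (a == b)
        return x;
    else
        return y;
}
```

The Thumb version is not only longer; whichever arm is taken, it also executes a branch, which costs extra cycles on cores where a taken branch refills the pipeline.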

answered Sep 28 '22 08:09

old_timer

Thumb-2 introduced variable-length instructions to the original Thumb; instructions can now be a mixture of 16-bit and 32-bit. That means you retain the size advantage of the original Thumb in everyday code and have access to almost the full ARM feature set in more complex code, without the ARM interworking overhead previously incurred by Thumb.

Aside from the aforementioned access to the full register set from register operations, Thumb-2 added back branchless conditional execution in the form of the IF-THEN (IT) block. The original Thumb removed the trademark ARM feature of conditional execution on nearly all instructions; in Thumb-2 this is achieved by placing an IT instruction, which carries the conditions, in front of up to four succeeding instructions.
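As a rough illustration of an IT block, the sketch below pairs a small C function with a Thumb-2 sequence that could implement it. The function name `abs_diff` is made up, and the assembly in the comment is hand-written in unified syntax to show the IT form, not captured compiler output.

```c
#include <stdint.h>

/*
 * Hypothetical example. a is in r0, b is in r1, the result is in r0.
 *
 *       cmp   r0, r1
 *       ite   gt              @ next instruction: GT, the one after: LE
 *       subgt r0, r0, r1      @ then: a - b
 *       suble r0, r1, r0      @ else: b - a
 *       bx    lr
 */
int32_t abs_diff(int32_t a, int32_t b)
{
    /* Assumes the difference fits in int32_t (no overflow handling). */
    return (a > b) ? (a - b) : (b - a);
}
```

The `ite` line itself does no arithmetic; it only tells the core which condition applies to each of the following instructions, so the whole if-then-else runs without a branch.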

In addition, the instruction set itself has been vastly expanded; for example, the Cortex-M4F implements the DSP extension as well as the FPv4-SP floating-point extension. In fact, I believe even NEON can be encoded in Thumb-2.

answered Sep 28 '22 09:09

Tony K

