
ARM assembly loop

for (int i = 0; i < 10000; i++)
  a[i] = b[i] + c[i]

What does the ARM assembly for this high level language look like?

Edit: I'm also assuming the base address of a is in R8, the base address of b is in R9, and the base address of c is in R10, and that a, b and c are all int arrays.

Much appreciated

I tried:

MOV  R0, #0  ; Init r0 (i = 0)

Loop:

        a[i] = b[i] + c[i]   //How to fix this? 

        ADD  R0, R0, #1 ;Increment it

        CMP  R0, #1000 ;Check the limit

        BLE  Loop  ;Loop if not finished
CyberShot asked Aug 16 '12 02:08


4 Answers

Assuming this high-level language doesn't have anything conflicting with C, you can use an ARM C compiler to generate assembly code from your snippet. For example, if you have the following in test.c,

void test() {
        register int i asm("r0");
        register int *a asm("r8");
        register int *b asm("r9");
        register int *c asm("r10");

        for (i = 0; i < 10000; i++) {
                a[i] = b[i] + c[i];
        }
}

you can run

arm-linux-androideabi-gcc -O0 -S test.c

to create a test.s file, which will contain the assembly code for your test function as well as some extra stuff. You can see how your loop was compiled into assembly below.

<snipped>
.L3:
        mov     r2, r8
        mov     r3, r0
        mov     r3, r3, asl #2
        add     r3, r2, r3
        mov     r1, r9
        mov     r2, r0
        mov     r2, r2, asl #2
        add     r2, r1, r2
        ldr     r1, [r2, #0]
        mov     ip, sl
        mov     r2, r0
        mov     r2, r2, asl #2
        add     r2, ip, r2
        ldr     r2, [r2, #0]
        add     r2, r1, r2
        str     r2, [r3, #0]
        mov     r3, r0
        add     r3, r3, #1
        mov     r0, r3
.L2:
        mov     r2, r0
        ldr     r3, .L5
        cmp     r2, r3
        ble     .L3
        sub     sp, fp, #12
        ldmfd   sp!, {r8, r9, sl, fp}
        bx      lr
<snipped>

The drawback of this approach is that you have to trust the compiler to generate code that suits your study, which is not always the case (the listing above was built with -O0, so it is far from optimal). What you get in return is fast answers to questions like this one, instead of waiting for people :)
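To connect the listing back to the source: the part that answers the "how do I write a[i]" question is the address arithmetic. Each `mov rX, r0, asl #2` turns the index into a byte offset, and the following `add` applies it to the base register. The sketch below spells that out in C. It is only an illustration: the function name add_arrays_explicit is made up here, and it assumes a 4-byte int, as on 32-bit ARM.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical name; mirrors the address arithmetic of the -O0 listing. */
void add_arrays_explicit(int *a, const int *b, const int *c, int n)
{
    assert(sizeof(int) == 4);   /* assumption: 4-byte int, hence the shift by 2 */

    for (int i = 0; i < n; i++) {
        size_t offset = (size_t)i << 2;                           /* mov rX, r0, asl #2 */
        int *pa = (int *)((char *)a + offset);                    /* add r3, r2, r3     */
        const int *pb = (const int *)((const char *)b + offset);  /* add r2, r1, r2     */
        const int *pc = (const int *)((const char *)c + offset);  /* add r2, ip, r2     */

        *pa = *pb + *pc;                                          /* ldr, ldr, add, str */
    }
}
```

Calling add_arrays_explicit(a, b, c, 10000) does the same work as the original loop. An optimizing build would normally fold the shift and add into the load and store addressing instead of computing them separately.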

-- extra --

GCC lets you place variables in specific registers; see the related documentation.

You can get an ARM instruction cheat sheet here.

Newer versions of GCC create better ARM code, as one would expect. The snippet above was generated by version 4.4.3, and I can confirm that Linaro's 4.7.1 does better. So if you take my approach, use the most recent toolchain you can get.

auselen answered Oct 03 '22 17:10


http://www.peter-cockerell.net/aalp/html/ch-5.html

;Print characters 32..126 using a FOR loop-type construct

;R0 holds the character
MOV  R0, #32  ;Init the character
.loop
SWI  WriteC  ;Print it
ADD  R0, R0, #1 ;Increment it
CMP  R0, #126 ;Check the limit
BLE  loop  ;Loop if not finished
;
alpera answered Oct 03 '22 18:10


for (int i = 0; i < 10000; i++)
  a[i] = b[i] + c[i]



mov r0,#0x2700
orr r0,#0x0010
top:
ldr r1,[r9],#4
ldr r2,[r10],#4
add r1,r1,r2
str r1,[r8],#4
subs r0,#1
bne top
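
A few notes on reading this, with a C model as a sketch. The count is built in two steps because 10000 (0x2710) does not fit ARM's immediate encoding (an 8-bit value rotated by an even amount), while 0x2700 and 0x0010 both do. The loop counts down to zero so that `subs` sets the flags and no separate `cmp` is needed, and the post-indexed loads and stores advance r8, r9 and r10 as they go. The function name below is hypothetical.

```c
#include <assert.h>

/* Illustrative C model of the count-down loop; the name is hypothetical. */
void add_arrays_countdown(int *a, const int *b, const int *c, unsigned n)
{
    assert(n > 0);            /* like the assembly, the body runs before the test */

    do {
        /* ldr r1,[r9],#4 / ldr r2,[r10],#4 / add / str r1,[r8],#4 */
        *a++ = *b++ + *c++;
    } while (--n != 0);       /* subs r0,#1 / bne top */
}
```

With n = 0x2700 | 0x0010 this performs the 10000 iterations of the original loop. Unlike the indexed version, the base pointers are not preserved, so save r8 to r10 first if they are needed afterwards.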
old_timer answered Oct 03 '22 19:10


To build upon @old_timer's answer: you could also unroll the loop to do 4 operations per iteration, although whether you get a performance benefit depends on whether the memory access or the pipeline stall around the branch is the bigger effect.

mov r11,#0x2700
orr r11,#0x0010
top:
ldmia r9!, {r0-r3}
ldmia r10!, {r4-r7}
add r0,r0,r4
add r1,r1,r5
add r2,r2,r6
add r3,r3,r7
stmia r8!, {r0-r3}
subs r11,#4
bne top
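
As a rough C model of the unrolled loop (a sketch only, with a hypothetical function name): each `ldmia` loads four consecutive words and writes the advanced pointer back, so one pass handles four elements and the counter drops by 4. This only terminates cleanly when the element count is a multiple of 4, which holds for 10000.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative model of the 4-way unrolled loop; the name is hypothetical. */
void add_arrays_unrolled4(int *a, const int *b, const int *c, size_t n)
{
    assert(n > 0 && n % 4 == 0);   /* subs r11,#4 / bne only hits zero for multiples of 4 */

    do {
        /* ldmia r9!, {r0-r3} and ldmia r10!, {r4-r7}: all eight loads happen first */
        int b0 = b[0], b1 = b[1], b2 = b[2], b3 = b[3];
        int c0 = c[0], c1 = c[1], c2 = c[2], c3 = c[3];

        /* four adds, then stmia r8!, {r0-r3} */
        a[0] = b0 + c0;
        a[1] = b1 + c1;
        a[2] = b2 + c2;
        a[3] = b3 + c3;

        a += 4; b += 4; c += 4;    /* the "!" write-back on each base register */
        n -= 4;
    } while (n != 0);
}
```

For a count that is not a multiple of 4, a short scalar loop would have to handle the leftover elements.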

If you have a NEON unit handy, you could do it that way too, in which case it will parallelize the loads, stores and adds, in effect reducing the problem to 5 instructions that perform two iterations of the loop at once.

A C compiler will not generate code this tight by default (or parallelize for NEON), as it must assume that the buffers used for reading and writing (r8, r9 and r10) can potentially overlap; hence a write through r8 might immediately be read in the next iteration of the loop through r9 or r10. You can use the restrict (__restrict in C++) modifier to tell the compiler that this is not the case.
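
A minimal sketch of the difference, using C99 and hypothetical function names. The first version must behave correctly even when the output overlaps an input; the second promises the compiler that it never does, which is what permits reordering loads ahead of stores.

```c
#include <stddef.h>

/* No promise about overlap: a store to a[i] may change a later b[] or c[] element. */
void add_arrays_plain(int *a, const int *b, const int *c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + c[i];
}

/* restrict: the caller guarantees that a, b and c do not overlap. */
void add_arrays_restrict(int *restrict a, const int *restrict b,
                         const int *restrict c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + c[i];
}
```

To see why the compiler has to be careful, take buf = {1, 0, 0, 0} and ones = {1, 1, 1}, and call add_arrays_plain(buf + 1, buf, ones, 3). Each iteration reads the value the previous one stored, so buf ends up as {1, 2, 3, 4}. Passing overlapping buffers like that to the restrict version would be undefined behaviour.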

marko answered Oct 03 '22 17:10