for (int i = 0; i < 10000; i++)
a[i] = b[i] + c[i]
What does the ARM assembly for this high level language look like?
Edit: I'm also assuming the base address of A is in R8, the base address of B is in R9, and the base address of C is in R10 and A,B,C are all int arrays
Much appreciated
I tried:
MOV R0, #0 ; Init r0 (i = 0)
Loop:
a[i] = b[i] + c[i] //How to fix this?
ADD R0, R0, #1 ;Increment it
CMP R0, #1000 ;Check the limit
BLE Loop ;Loop if not finished
Assuming the high-level language is compatible with C, you can use an ARM C compiler to generate assembly code from your snippet. For example, if you have the following in test.c,
void test() {
register int i asm("r0");
register int *a asm("r8");
register int *b asm("r9");
register int *c asm("r10");
for (i = 0; i < 10000; i++) {
a[i] = b[i] + c[i];
}
}
you can run
arm-linux-androideabi-gcc -O0 -S test.c
to create a test.s file, which will contain the assembly code for your test function along with some extra directives. The part of the output that corresponds to your loop is shown below.
<snipped>
.L3:
mov r2, r8
mov r3, r0
mov r3, r3, asl #2
add r3, r2, r3
mov r1, r9
mov r2, r0
mov r2, r2, asl #2
add r2, r1, r2
ldr r1, [r2, #0]
mov ip, sl
mov r2, r0
mov r2, r2, asl #2
add r2, ip, r2
ldr r2, [r2, #0]
add r2, r1, r2
str r2, [r3, #0]
mov r3, r0
add r3, r3, #1
mov r0, r3
.L2:
mov r2, r0
ldr r3, .L5
cmp r2, r3
ble .L3
sub sp, fp, #12
ldmfd sp!, {r8, r9, sl, fp}
bx lr
<snipped>
The drawback of this approach is that you have to trust the compiler to generate good code for your study, which might not always be the case (especially at -O0, as used above). What you get in return is fast answers to questions like this one instead of waiting for people :)
-- extra --
GCC allows you to put variables into certain registers, see related documentation.
You can get arm instruction cheat sheet here.
Newer versions of GCC generate better ARM code, as one would expect. The snippet above was generated by version 4.4.3, and Linaro's 4.7.1 does better in my tests. So if you take this approach, use the most recent toolchain you can get.
http://www.peter-cockerell.net/aalp/html/ch-5.html
;Print characters 32..126 using a FOR loop-type construct
;R0 holds the character
MOV R0, #32 ;Init the character
.loop
SWI WriteC ;Print it
ADD R0, R0, #1 ;Increment it
CMP R0, #126 ;Check the limit
BLE loop ;Loop if not finished
;
for (int i = 0; i < 10000; i++)
a[i] = b[i] + c[i]
mov r0,#0x2700
orr r0,#0x0010
top:
ldr r1,[r9],#4
ldr r2,[r10],#4
add r1,r1,r2
str r1,[r8],#4
subs r0,#1
bne top
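For comparison, here is a rough C sketch of what that loop does. The function name add_arrays_countdown and the parameter names are made up for illustration; a, b and c stand in for r8, r9 and r10, and count stands in for r0. The counter is built with mov plus orr because 10000 (0x2710) does not fit ARM's 8-bit rotated immediate encoding, so it is assembled as 0x2700 | 0x0010.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper mirroring the assembly loop above:
   a ~ r8, b ~ r9, c ~ r10, count ~ r0. */
void add_arrays_countdown(int *a, const int *b, const int *c, size_t count)
{
    /* Like the subs/bne pair, the test happens after the body,
       so the loop must not be entered with count == 0. */
    assert(count > 0);
    do {
        *a++ = *b++ + *c++;     /* ldr, ldr, add, str with post-increment */
    } while (--count != 0);     /* subs r0,#1 ; bne top */
}
```

Counting down to zero lets the subtraction set the flags itself, which is why the assembly needs no separate cmp.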
To build upon @alpera's answer: you could also unroll the loop to do 4 operations at once. Whether you get a performance benefit depends on whether the memory access or the pipeline stall around the branch is the bigger effect.
mov r11,#0x2700
orr r11,#0x0010
top:
ldmia r9!, {r0-r3}
ldmia r10!, {r4-r7}
add r0,r0,r4
add r1,r1,r5
add r2,r2,r6
add r3,r3,r7
stmia r8!, {r0-r3}
subs r11,#4
bne top
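The same unrolling, sketched in C so the structure is easier to see. The function name add_arrays_unrolled4 is hypothetical, and the sketch assumes (as the assembly does) that the element count is a positive multiple of 4, which holds for 10000.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper mirroring the unrolled assembly:
   a ~ r8, b ~ r9, c ~ r10, count ~ r11. */
void add_arrays_unrolled4(int *a, const int *b, const int *c, size_t count)
{
    /* The assembly has no cleanup loop, so count must be a
       positive multiple of 4. */
    assert(count > 0 && count % 4 == 0);
    do {
        /* ldmia r9!, {r0-r3} */
        int b0 = b[0], b1 = b[1], b2 = b[2], b3 = b[3];
        /* ldmia r10!, {r4-r7} */
        int c0 = c[0], c1 = c[1], c2 = c[2], c3 = c[3];
        /* four adds, then stmia r8!, {r0-r3} */
        a[0] = b0 + c0;
        a[1] = b1 + c1;
        a[2] = b2 + c2;
        a[3] = b3 + c3;
        a += 4; b += 4; c += 4;   /* the "!" write-back on each base register */
        count -= 4;               /* subs r11,#4 */
    } while (count != 0);         /* bne top */
}
```

All four loads from b and c happen before any store to a, which is exactly the reordering that is only safe when the arrays do not overlap.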
If you have a NEON unit handy, you could do it that way too. In that case it will parallelize the loads, stores and adds, in effect reducing the problem to 5 instructions that perform two iterations of the loop at once.
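A minimal sketch of the NEON idea using the C intrinsics from arm_neon.h rather than hand-written assembly. It assumes 64-bit vectors holding two 32-bit ints, to match the "two iterations at once" description; the function name add_arrays_pairs is made up, and the scalar fallback is only there so the code still builds on targets without NEON.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper: adds b and c into a, two elements per step
   when NEON is available. */
void add_arrays_pairs(int32_t *a, const int32_t *b, const int32_t *c,
                      size_t count)
{
    size_t i = 0;
#if defined(__ARM_NEON)
    for (; i + 2 <= count; i += 2) {
        int32x2_t vb = vld1_s32(b + i);     /* load two ints from b */
        int32x2_t vc = vld1_s32(c + i);     /* load two ints from c */
        vst1_s32(a + i, vadd_s32(vb, vc));  /* add both lanes, store to a */
    }
#endif
    /* Scalar path: handles a leftover element, or every element
       when NEON is not available. */
    for (; i < count; i++)
        a[i] = b[i] + c[i];
}
```

Switching to the 128-bit variants (int32x4_t with vld1q_s32, vaddq_s32 and vst1q_s32) would process four elements per step instead.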
A C compiler will not generate code this tight by default (or parallelize for NEON), as it must assume that the buffers used for reading and writing (r8, r9 and r10) can potentially overlap. A write through r8 might then be read back in the next iteration of the loop through r9 or r10. You can use the restrict (__restrict in C++) modifier to tell the compiler that this is not the case.
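As a sketch, this is what the restrict-qualified version of the loop could look like in C99. The function name add_arrays_restrict is hypothetical. The qualifiers are a promise from the caller that the three arrays do not overlap; whether the compiler then unrolls or vectorizes still depends on the compiler version and optimization flags.

```c
#include <stddef.h>

/* Hypothetical helper: restrict tells the compiler that a, b and c
   never alias, so loads from b and c may be reordered past stores to a. */
void add_arrays_restrict(int *restrict a, const int *restrict b,
                         const int *restrict c, size_t count)
{
    for (size_t i = 0; i < count; i++)
        a[i] = b[i] + c[i];
}
```

Calling this with overlapping arrays breaks the promise and gives undefined behaviour, so it is only appropriate when the buffers really are distinct.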