ARM Cortex A8 Benchmarks: can someone help me make sense of these numbers?

Question

I'm writing several real-time DSP algorithms on Android, so I decided to program the ARM directly in assembly to optimize everything as much as possible and keep the math as lightweight as possible. At first I was getting benchmark timings that didn't make much sense, so I started reading about pipeline hazards, dual-issue capabilities and so on. I'm still puzzled by some of the numbers, so I'm posting them here in the hope that someone can shed some light on why I get what I get. In particular, I'm interested in why NEON takes different amounts of time to run calculations on different datatypes even though the documentation claims each operation takes exactly one cycle. My findings are as follows.

I'm using a very simple loop for benchmarking, and I run it for 2,000,000 iterations. Here's my function:

hzrd_test:

    @use the received argument as the number of loop iterations
    mov r3 , r0

    @come up with some simple values
    mov r0, #1
    mov r1, #2

    @Initialize some NEON registers (Q0-Q11)
    vmov.32 d0, r0, r1
    vmov.32 d1, r0, r1
    vmov.32 d2, r0, r1

    ...

    vmov.32 d21, r0, r1
    vmov.32 d22, r0, r1
    vmov.32 d23, r0, r1

hzrd_loop:

    @do some math
    vadd.s32 q0, q0, q1
    vadd.s32 q1, q0, q1
    vadd.s32 q2, q0, q1
    vadd.s32 q3, q0, q1
    vadd.s32 q4, q0, q1
    vadd.s32 q5, q0, q1
    vadd.s32 q6, q0, q1
    vadd.s32 q7, q0, q1
    vadd.s32 q8, q0, q1
    vadd.s32 q9, q0, q1
    vadd.s32 q10, q0, q1
    vadd.s32 q11, q0, q1

    @decrement loop counter, branch to loop again or return
    subs r3, r3, #1
    bne hzrd_loop

    @return
    mov r0, r3
    mov pc, lr

Note the operation and datatype used in the loop: vector add (vadd) on signed 32-bit integers (s32). The loop completes within a certain time (see the results below). According to this ARM Cortex-A8 document and the following pages, almost all elementary arithmetic operations in NEON should complete in one cycle, but here's what I'm getting:

vmul.f32 ~62ms
vmul.u32 ~125ms
vmul.s32 ~125ms

vadd.f32 ~63ms
vadd.u32 ~29ms
vadd.s32 ~30ms

I get these numbers by simply replacing the operation and datatype of every instruction in the above loop. Is there a reason vadd.u32 is twice as fast as vadd.f32 and vmul.f32 is twice as fast as vmul.u32?
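
To make the "twice as fast" observation concrete, here is a rough sketch in Python that turns the timings above into an average time per NEON instruction and a ratio against the fastest run. It assumes the loop overhead (the subs and bne) is negligible and that each run executes 2,000,000 iterations of 12 NEON instructions; the function names are made up for illustration.

```python
# Timings reported above, in milliseconds.
timings_ms = {
    "vmul.f32": 62, "vmul.u32": 125, "vmul.s32": 125,
    "vadd.f32": 63, "vadd.u32": 29, "vadd.s32": 30,
}

ITERATIONS = 2_000_000   # loop count stated in the question
OPS_PER_ITERATION = 12   # NEON instructions in the loop body (q0..q11)


def ns_per_instruction(total_ms):
    """Average nanoseconds per NEON instruction, ignoring loop overhead (assumption)."""
    return total_ms * 1e6 / (ITERATIONS * OPS_PER_ITERATION)


def relative_cost(timings):
    """Each timing divided by the fastest timing in the set."""
    fastest = min(timings.values())
    return {op: ms / fastest for op, ms in timings.items()}


ratios = relative_cost(timings_ms)
for op, ms in timings_ms.items():
    print(f"{op}: {ns_per_instruction(ms):.2f} ns/instruction, "
          f"{ratios[op]:.2f}x the fastest run")
```

The ratios come out close to 1, 2 and 4 rather than exactly on them, which is about what you would expect from millisecond-resolution timings.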

Cheers! = )

Jake 'Alquimista' LEE · Accepted Answer

Wow, your results are VERY accurate:

A 32-bit integer Q multiply costs 4 cycles, while the float version takes 2.
A 32-bit integer Q add costs 1 cycle, while the float version takes 2.
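
As a quick cross-check, the sketch below (Python, with made-up names) combines the timings from the question with the cycle counts above and works out the clock frequency each measurement would imply. It assumes 2,000,000 iterations of 12 NEON instructions per run and ignores loop overhead; the device's actual clock speed is not stated in the question, so this only tests whether the four measurements agree with each other.

```python
INSTRUCTIONS = 2_000_000 * 12  # iterations x NEON instructions per iteration

# op: (measured milliseconds from the question, cycles per instruction from this answer)
cases = {
    "vmul.f32": (62, 2),
    "vmul.u32": (125, 4),
    "vadd.f32": (63, 2),
    "vadd.u32": (29, 1),
}


def implied_clock_mhz(total_ms, cycles_per_instruction):
    """Clock frequency (MHz) needed to retire all instructions in the measured time."""
    total_cycles = INSTRUCTIONS * cycles_per_instruction
    seconds = total_ms / 1000.0
    return total_cycles / seconds / 1e6


clocks = {op: implied_clock_mhz(ms, cyc) for op, (ms, cyc) in cases.items()}
for op, mhz in clocks.items():
    print(f"{op}: implies a clock of about {mhz:.0f} MHz")
```

All four cases land in roughly the same 760 to 830 MHz range, which is why the timings line up so well with the 1, 2 and 4 cycle figures.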

Nice experiment.

Maybe you already know this, but be careful when coding for NEON:

do not access memory with ARM instructions while NEON is doing heavy work
do not mix VFP instructions with NEON ones (except for the shared ones)
do not access the S registers
do not transfer data from NEON registers to ARM registers

All of the above will cause HUGE hiccups.

Good Luck!

PS: I'd rather optimize for the A9 instead (slightly different cycle timings), since pretty much all new devices are coming with an A9. And the A9 timing chart from ARM is much more readable. :-)

Tags: assembly, benchmarking, arm, neon, cortex-a8

Phonon
