Optimizing for ARM: Why different CPUs affect different algorithms differently (and drastically)

Question

I was benchmarking code on Windows Mobile devices and noticed that some algorithms did significantly better on some hosts and significantly worse on others, even after taking the difference in clock speeds into account.

The statistics for reference (all results are generated from the same binary, compiled by Visual Studio 2005 targeting ARMv4):

Intel XScale PXA270

Algorithm A: 22642 ms
Algorithm B: 29271 ms

ARM1136EJ-S core (embedded in an MSM7201A chip)

Algorithm A: 24874 ms
Algorithm B: 29504 ms

ARM926EJ-S core (embedded in an OMAP 850 chip)

Algorithm A: 70215 ms
Algorithm B: 31652 ms (!)
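
One way to take clock speed out of the picture is to compare the B/A ratio on each core, since both algorithms run at the same clock on a given device. A minimal sketch using only the timings above (the struct and function names are made up for illustration):

```c
#include <assert.h>
#include <stdio.h>

/* One row of the benchmark table above (times in milliseconds). */
struct result {
    const char *cpu;
    double a_ms;
    double b_ms;
};

/* Timings copied from the measurements listed above. */
const struct result results[3] = {
    { "XScale PXA270", 22642.0, 29271.0 },
    { "ARM1136EJ-S",   24874.0, 29504.0 },
    { "ARM926EJ-S",    70215.0, 31652.0 }
};

/* Ratio of B's time to A's time; the clock frequency cancels out. */
double ratio_b_to_a(const struct result *r)
{
    assert(r->a_ms > 0.0);   /* guard against division by zero */
    return r->b_ms / r->a_ms;
}

/* Print the ratio for every core in the table. */
void print_ratios(void)
{
    int i;
    for (i = 0; i < 3; i++)
        printf("%-14s B/A = %.2f\n", results[i].cpu, ratio_b_to_a(&results[i]));
}
```

The ratio is above 1 on the first two cores (B is slower than A) and below 0.5 on the ARM926EJ-S (B is more than twice as fast as A), so the difference cannot be a clock-speed effect alone.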

I looked into floating point as a possible cause. While algorithm B does use floating-point code, it does not use it in the inner loop, and none of the cores seems to have an FPU.

So my question is: what mechanism might be causing this difference? Suggestions on how to fix or avoid the bottleneck in question would be appreciated.

Thanks in advance.

Stephen Canon · Accepted Answer

One possible cause is that the 926 has a shorter pipeline (5 stages vs. 8 stages for the 1136, iirc), so branch mispredictions are less costly on the 926.

That said, there are a lot of architectural differences between those processors, too many to say for sure why you see this effect without knowing something about the instructions that you're actually executing.
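
To check whether branch behaviour is a factor at all, one option is a small microbenchmark that runs the same loop over the same bytes, once in shuffled order and once sorted. This is only a sketch and not the algorithms from the question: every name in it (fill_random, sum_branchy, run_benchmark, N, PASSES) is made up for illustration, and it assumes the C library on the device provides clock() and qsort(). If clock() is missing, a platform timer such as GetTickCount() could be substituted.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define N      4096   /* assumed small enough to stay in the data cache */
#define PASSES 2000   /* arbitrary repeat count; tune for the device */

/* Fill buf with pseudo-random bytes from a fixed-seed LCG (repeatable). */
void fill_random(unsigned char *buf, size_t n)
{
    unsigned long state = 12345UL;
    size_t i;
    for (i = 0; i < n; i++) {
        state = state * 1103515245UL + 12345UL;
        buf[i] = (unsigned char)(state >> 16);
    }
}

/* qsort comparator for unsigned bytes. */
int cmp_byte(const void *a, const void *b)
{
    return (int)*(const unsigned char *)a - (int)*(const unsigned char *)b;
}

/* Sum the elements >= 128. The if is written as a data-dependent branch,
 * but an ARM compiler may turn it into a conditionally executed add, so
 * check the disassembly before trusting the numbers. */
unsigned long sum_branchy(const unsigned char *buf, size_t n)
{
    unsigned long sum = 0;
    size_t i;
    for (i = 0; i < n; i++) {
        if (buf[i] >= 128)
            sum += buf[i];
    }
    return sum;
}

/* Time the same loop on shuffled and on sorted copies of the same data. */
void run_benchmark(void)
{
    static unsigned char shuffled[N], sorted[N];
    /* Calling through a volatile pointer keeps the compiler from inlining
     * the function and hoisting the work out of the timing loops. */
    unsigned long (*volatile fn)(const unsigned char *, size_t) = sum_branchy;
    unsigned long sum_shuffled = 0, sum_sorted = 0;
    clock_t t0, t1, t2;
    int pass;

    fill_random(shuffled, N);
    memcpy(sorted, shuffled, N);
    qsort(sorted, N, 1, cmp_byte);

    t0 = clock();
    for (pass = 0; pass < PASSES; pass++)
        sum_shuffled += fn(shuffled, N);
    t1 = clock();
    for (pass = 0; pass < PASSES; pass++)
        sum_sorted += fn(sorted, N);
    t2 = clock();

    /* Same bytes in a different order must give the same total. */
    assert(sum_shuffled == sum_sorted);

    printf("shuffled: %.0f ms\n", (double)(t1 - t0) * 1000.0 / CLOCKS_PER_SEC);
    printf("sorted:   %.0f ms\n", (double)(t2 - t1) * 1000.0 / CLOCKS_PER_SEC);
}
```

Call run_benchmark() from the program's entry point on each device. If the shuffled and sorted timings differ a lot on one core but not on another, branch handling is a plausible contributor; if they are close everywhere, the cause is more likely one of the other architectural differences.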

Tags: optimization, arm

Combuster
