Optimizing for ARM: why different CPUs affect different algorithms differently (and drastically)

I was benchmarking code performance on Windows Mobile devices and noticed that some algorithms ran significantly faster on some hosts and significantly slower on others, even after accounting for the difference in clock speeds.

The statistics for reference (all results are generated from the same binary, compiled by Visual Studio 2005 targeting ARMv4):

Intel XScale PXA270

  • Algorithm A: 22642 ms
  • Algorithm B: 29271 ms

ARM1136EJ-S core (embedded in a MSM7201A chip)

  • Algorithm A: 24874 ms
  • Algorithm B: 29504 ms

ARM926EJ-S core (embedded in an OMAP 850 chip)

  • Algorithm A: 70215 ms
  • Algorithm B: 31652 ms (!)

I looked into floating point as a possible cause: algorithm B does use floating-point code, but not in its inner loop, and none of these cores appears to have an FPU.
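(On cores without an FPU, every float operation is emulated in software, so even floating point outside the inner loop costs real time. One way to rule it out entirely is fixed-point arithmetic. A minimal, illustrative Q16.16 sketch, with hypothetical names:)

```c
#include <stdint.h>

typedef int32_t fx16_16;                 /* Q16.16 fixed-point value */

#define FX_ONE (1 << 16)                 /* 1.0 in Q16.16 */

/* Convert an integer to Q16.16. */
static fx16_16 fx_from_int(int32_t n)
{
    return n << 16;
}

/* Multiply two Q16.16 values via a 64-bit intermediate
   to keep the fractional bits before shifting back down. */
static fx16_16 fx_mul(fx16_16 a, fx16_16 b)
{
    return (fx16_16)(((int64_t)a * b) >> 16);
}
```

(Whether this helps depends on how much emulated floating point the algorithm actually executes; profiling first is the safer move.)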

So my question is: what mechanism may be causing this difference, preferably with suggestions on how to fix or avoid the bottleneck in question?

Thanks in advance.

Combuster asked Feb 20 '26 09:02

1 Answer

One possible cause is that the 926 has a shorter pipeline (5 stages vs. 8 stages for the 1136, iirc), so branch mispredictions are less costly on the 926.

That said, there are a lot of architectural differences between those processors, too many to say for sure why you see this effect without knowing something about the instructions that you're actually executing.

Stephen Canon answered Feb 21 '26 23:02