In a real-time application¹ on an ARM Cortex-M3 (similar to the STM32F101), I need to poll a bit of an internal peripheral's register until it's zero, in as tight a loop as possible. I use bit-banding to access the appropriate bit. The (working) C code is
while (*(volatile uint32_t*)kMyBit != 0);
That code is copied into on-chip executable RAM. After some manual optimization² the polling loop is down to the following, which I timed³ at 6 cycles:
0x00600200 681A LDR r2,[r3,#0x00]
0x00600202 2A00 CMP r2,#0x00
0x00600204 D1FC BNE 0x00600200
How can the polling uncertainty be lowered? A 5-cycle loop would meet my goal: sampling that same bit as close as possible to 15.5 cycles after it went to zero.
My spec calls for reliably detecting a low pulse lasting at least 6.5 CPU clock cycles; reliably classifying it as short if it lasts less than 12.5 cycles; and reliably classifying it as long if it lasts more than 18.5 cycles. The pulses have no defined phase relationship with the CPU clock, which is my only accurate timing reference. That requires an at-most-5-clock polling loop. Actually, I'm emulating code that ran on a decades-old 8-bit CPU that could poll with a 5-clock loop, and what that did has become the spec.
I tried offsetting the code alignment by inserting NOPs before the loop, in the many variants I tried, but never observed a change.
I tried swapping the CMP and LDR, but still get 6 cycles:
0x00600200 681A LDR r2,[r3,#0x00]
; we loop here
0x00600202 2A00 CMP r2,#0x00
0x00600204 681A LDR r2,[r3,#0x00]
0x00600206 D1FC BNE 0x00600202
This one is 8 cycles:
0x00600200 681A LDR r2,[r3,#0x00]
0x00600202 681A LDR r2,[r3,#0x00]
0x00600204 2A00 CMP r2,#0x00
0x00600206 D1FB BNE 0x00600200
But this one is 9 cycles:
0x00600200 681A LDR r2,[r3,#0x00]
0x00600202 2A00 CMP r2,#0x00
0x00600204 681A LDR r2,[r3,#0x00]
0x00600206 D1FB BNE 0x00600200
¹ Measuring how much time the bit is low, in a context where no interrupt occurs.
² The initial compiler-generated code used r12 as the destination register, and that added 4 code bytes to the loop, costing 1 cycle.
³ The numbers given were obtained with a supposedly cycle-accurate real-time STIce emulator and its emulator-trigger feature on a read at the register's address. Previously I tried the "States" counter with a breakpoint in the loop, but the result depends on the breakpoint's location. Single-stepping is even worse: it always gives 4 cycles for LDR, when that's at least sometimes down to 3.
If I understand the question correctly, it's not necessarily the loop's cycle count that needs to be reduced, but the number of cycles between consecutive samples (i.e. LDR instructions). But there can be more than one LDR per iteration. You can try something like this:
    ldrb r1, [r0]
loop:
    cbz r1, out
    ldrb r2, [r0]
    cbz r2, out
    ldrb r1, [r0]
    b loop
out:
The spacing between the two LDRB instructions varies, so the samples aren't uniformly spaced. This may delay exit from the loop slightly, but from the problem description I can't tell whether that matters.
I happen to have access to a cycle-accurate M7 model, and once the pipeline stabilises, your original loop runs on the M7 in 3 cycles per iteration (meaning an LDR every 3 cycles), while the proposed loop above runs in 4 cycles, but now with two LDRs per iteration (so an LDR every 2 cycles). The sampling rate is definitely improved.
To give credit: unrolling with CBZ as a break was proposed by @Peter Cordes in a comment.
Admittedly, the M3 will be slower, but it's still worth a shot if it's the sampling rate you're after.
Also, you can check whether LDRB instead of LDR (as in the code above) changes anything, although I don't expect it to.
UPD: I have another two-LDR loop version, which completes in 3 cycles on the M7; you can try it out of interest (the CBZ breaks also allow easy balancing of the paths after the loop):
    ldr r1, [r0]
loop:
    ldr r2, [r0]
    cbz r1, out_slow
    cbz r2, out_fast
    ldr r1, [r0]
    b loop
out_fast:
    /* NOPs as required */
out_slow: