In a real-time application¹ on an ARM Cortex-M3 (similar to the STM32F101), I need to poll a bit of an internal peripheral's register until it's zero, in as tight a loop as possible. I use bit-banding to access the appropriate bit. The (working) C code is
while (*(volatile uint32_t*)kMyBit != 0);
That code is copied into on-chip executable RAM. After some manual optimization² the polling loop is down to the following, which I timed³ at 6 cycles:
0x00600200 681A LDR r2,[r3,#0x00]
0x00600202 2A00 CMP r2,#0x00
0x00600204 D1FC BNE 0x00600200
How can the polling uncertainty be lowered? A 5-cycle loop would meet my goal: sampling that same bit as close as possible to 15.5 cycles after it went to zero.
My spec calls for reliably detecting a low pulse lasting at least 6.5 CPU clock cycles; reliably classifying it as short if it lasts less than 12.5 cycles; and reliably classifying it as long if it lasts more than 18.5 cycles. The pulses have no defined phase relationship with the CPU clock, which is my only accurate timing reference. That requires an at-most-5-clock polling loop. Actually, I'm emulating code that ran on a decades-old 8-bit CPU that could poll with a 5-clock loop, and what that did has become the spec.
I tried offsetting the code alignment by inserting NOPs before the loop, in the many variants I tried, but never observed a change.
I tried swapping the CMP and LDR, but still get 6 cycles:
0x00600200 681A LDR r2,[r3,#0x00]
; we loop here
0x00600202 2A00 CMP r2,#0x00
0x00600204 681A LDR r2,[r3,#0x00]
0x00600206 D1FC BNE 0x00600202
This one is 8 cycles:
0x00600200 681A LDR r2,[r3,#0x00]
0x00600202 681A LDR r2,[r3,#0x00]
0x00600204 2A00 CMP r2,#0x00
0x00600206 D1FB BNE 0x00600200
But this one is 9 cycles:
0x00600200 681A LDR r2,[r3,#0x00]
0x00600202 2A00 CMP r2,#0x00
0x00600204 681A LDR r2,[r3,#0x00]
0x00600206 D1FB BNE 0x00600200
¹ Measuring how much time the bit is low, in a context where no interrupt occurs.
² The initial compiler-generated code used r12 as the destination register, and that added 4 code bytes to the loop, costing 1 cycle.
³ The numbers given were obtained with a supposedly cycle-accurate real-time STIce emulator and its emulator-trigger feature on a read at the register's address. Previously I tried the "States" counter with a breakpoint in the loop, but the result depends on the breakpoint's location. Single-stepping is even worse: it always gives 4 cycles for LDR, when that's at least sometimes down to 3.
If I understand the question correctly, it's not necessarily the loop's cycle count that needs to be reduced, but the number of cycles between consecutive samples (i.e. LDR instructions). But there can be more than one LDR per iteration. You can try something like this:
    ldrb r1, [r0]
loop:
    cbz r1, out
    ldrb r2, [r0]
    cbz r2, out
    ldrb r1, [r0]
    b loop
out:
The spacing between the two LDRB instructions varies, so the samples aren't uniformly spaced. This may delay exit from the loop slightly, but from the problem description I can't tell whether that matters.
I happen to have access to a cycle-accurate M7 model, and once the pipeline stabilises, your original loop runs on the M7 in 3 cycles per iteration (meaning an LDR every 3 cycles), while the proposed loop above runs in 4 cycles, but now with two LDRs per iteration (so an LDR every 2 cycles). The sampling rate is definitely improved.
To give credit: unrolling with CBZ as a break was proposed by @Peter Cordes in a comment.
Admittedly, the M3 will be slower, but it's still worth a shot if it's the sampling rate you're after.
Also, you can check whether LDRB instead of LDR (as in the code above) changes anything, although I don't expect it to.
UPD: I have another two-LDR loop version, which completes in 3 cycles on the M7; you can try it out of interest (the CBZ breaks also allow easy balancing of the paths after the loop):
    ldr r1, [r0]
loop:
    ldr r2, [r0]
    cbz r1, out_slow
    cbz r2, out_fast
    ldr r1, [r0]
    b loop
out_fast:
    /* NOPs as required */
out_slow: