I am writing a Linux kernel driver (for ARM), and in an IRQ handler I need to check which interrupt bits are set.
bit
0/16    endpoint 0 IN/OUT interrupt (very likely; of the two, IN is the more likely)
1/17    endpoint 1 IN/OUT interrupt
...
15/31   endpoint 15 IN/OUT interrupt
Note that more than one bit can be set at a time.
So this is the code:
int i;
u32 intr = read_interrupt_register();

/* ep0 IN */
if (likely(intr & (1 << 0))) {
    handle_ep0_in();
}
/* ep0 OUT */
if (likely(intr & (1 << 16))) {
    handle_ep0_out();
}
for (i = 1; i < 16; ++i) {
    if (unlikely(intr & (1 << i))) {
        handle_ep_in(i);
    }
    if (unlikely(intr & (1 << (i + 16)))) {
        handle_ep_out(i);
    }
}
(1 << 0) and (1 << 16) would be computed at compile time, whereas (1 << i) and (1 << (i + 16)) would not. There would also be an integer comparison, increment, and addition on every pass through the loop.
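As far as I understand it, a compiler at -O2 will usually strength-reduce the variable shifts itself, keeping running masks instead of recomputing 1 << i and 1 << (i + 16) on each pass; written out by hand, that transformation would look roughly like this:

/* hypothetical hand-written strength reduction: the masks are shifted
 * by one each iteration instead of being recomputed from i */
u32 mask_in  = 1u << 1;
u32 mask_out = 1u << 17;
for (i = 1; i < 16; ++i) {
    if (unlikely(intr & mask_in)) {
        handle_ep_in(i);
    }
    if (unlikely(intr & mask_out)) {
        handle_ep_out(i);
    }
    mask_in  <<= 1;
    mask_out <<= 1;
}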
Because it is an IRQ handler, the work should be done in the shortest time possible. This made me think about whether I need to optimize it a bit.
1. Split the loop; it seems to make no difference...
/* ep0 IN */
if (likely(intr & (1 << 0))) {
    handle_ep0_in();
}
/* ep0 OUT */
if (likely(intr & (1 << 16))) {
    handle_ep0_out();
}
for (i = 1; i < 16; ++i) {
    if (unlikely(intr & (1 << i))) {
        handle_ep_in(i);
    }
}
for (i = 17; i < 32; ++i) {
    if (unlikely(intr & (1 << i))) {
        handle_ep_out(i - 16);
    }
}
2. Shift intr instead of the mask it is compared against?
/* ep0 IN */
if (likely(intr & (1 << 0))) {
    handle_ep0_in();
}
/* ep0 OUT */
if (likely(intr & (1 << 16))) {
    handle_ep0_out();
}
for (i = 1; i < 16; ++i) {
    intr >>= 1;
    if (unlikely(intr & 1)) {
        handle_ep_in(i);
    }
}
intr >>= 1; /* skip bit 16 (ep0 OUT), already handled above */
for (i = 1; i < 16; ++i) {
    intr >>= 1;
    if (unlikely(intr & 1)) {
        handle_ep_out(i);
    }
}
3. Fully unroll the loop (not shown). That would make the code a bit messy.
4. Any other better ways?
5. Or will the compiler actually generate the most optimized code anyway?
Edit: I was looking for a way to tell gcc to unroll that particular loop, but according to my searching it seems that it isn't possible...
A loop can be unrolled by replicating the loop body a number of times and then changing the termination logic to cover the multiple iterations of the loop body. A rolled loop and a four-way unrolled loop may each take four cycles per pass, but the unrolled loop is doing four times as much work!
But why would unrolled loops be faster in the first place? One reason for their increased performance is that they lead to fewer instructions being executed overall: the per-iteration loop overhead (counter update, comparison, and branch) is paid once for several copies of the loop body.
Improved floating-point performance: loop unrolling can improve performance by providing the compiler more instructions to schedule across the unrolled iterations. This reduces the number of NOPs generated and also provides the compiler with a greater opportunity to generate parallel instructions.
The UNROLL pragma specifies to the compiler how many times a loop should be unrolled. The UNROLL pragma is useful for helping the compiler utilize SIMD instructions. It is also useful in cases where better utilization of software pipeline resources is needed over a non-unrolled loop.
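For what it's worth, newer gcc releases did gain a per-loop unroll hint: GCC 8 and later accept #pragma GCC unroll, which applies to the loop that immediately follows it. A minimal sketch of how it could be used here:

/* GCC 8+ only: ask the compiler to unroll the following loop up to
 * 15 times (the loop runs 15 iterations, so this unrolls it fully) */
#pragma GCC unroll 15
for (i = 1; i < 16; ++i) {
    if (unlikely(intr & (1 << i))) {
        handle_ep_in(i);
    }
}

On older compilers, -funroll-loops enables unrolling for the whole translation unit, though without per-loop control.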
If we can assume that the number of set bits in intr is low (as is usually the case with interrupt masks), we can optimize a little and write a loop that executes only once per set bit:
void handle(u32 intr)
{
    while (intr) {
        // find the index of the lowest bit set in intr
        // (__builtin_ffs returns 1 + that index, or 0 if no bit is set):
        int bit_id = __builtin_ffs(intr) - 1;
        // call the handler (bits 0..15 are IN, bits 16..31 are OUT):
        if (bit_id >= 16)
            handle_ep_out(bit_id - 16);
        else
            handle_ep_in(bit_id);
        // clear that bit with the classic bit-hack:
        intr &= intr - 1;
    }
}
On most ARM architectures __builtin_ffs will compile down to a CLZ instruction and some arithmetic around it. It should do so for anything but ARM7 and older cores.
Also: when writing interrupt handlers on embedded devices, the size of the function makes a difference for performance as well, because the instructions have to be loaded into the instruction cache. Lean code usually executes faster. A bit of overhead is okay if it saves accesses to memory that is unlikely to be in the cache.
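Since this is kernel code, the same per-set-bit walk can also be written with the kernel's own bitmap iterator; a sketch, assuming the register value is first copied into an unsigned long (for_each_set_bit() from <linux/bitops.h> visits each set bit exactly once, lowest index first):

unsigned long pending = read_interrupt_register();
unsigned int bit;

/* iterate over set bits only; bits 0..15 are IN, 16..31 are OUT */
for_each_set_bit(bit, &pending, 32) {
    if (bit >= 16)
        handle_ep_out(bit - 16);
    else
        handle_ep_in(bit);
}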
I would probably go for option 5 myself: code for readability and let gcc's insane optimisation level -O3 do what it can.
I've seen code generated at that level that I can't even understand.
Any hand-crafted optimisation in C (other than possibly unrolling and using constants rather than runtime bit shifts, as in option 3) is unlikely to outperform what the compiler itself can do.
I think you'll find that the unrolling may not be as messy as you think:
if ( likely(intr & 0x00000001)) handle_ep0_in();
if ( likely(intr & 0x00010000)) handle_ep0_out();
if (unlikely(intr & 0x00000002)) handle_ep_in(1);
if (unlikely(intr & 0x00020000)) handle_ep_out(1);
:
if (unlikely(intr & 0x00008000)) handle_ep_in(15);
if (unlikely(intr & 0x80000000)) handle_ep_out(15);
In fact, you can make it a lot less messy with macros (untested, but you should get the general idea):
// Since mask is a constant, "mask << 16" should be too.
#define chkintr(mask, num) \
    if (unlikely(intr & (mask      ))) handle_ep_in (num); \
    if (unlikely(intr & (mask << 16))) handle_ep_out (num);
// Special case for high probability bit.
if (likely(intr & 0x00000001UL)) handle_ep0_in();
if (likely(intr & 0x00010000UL)) handle_ep0_out();
chkintr (0x0002UL, 1); chkintr (0x0004UL, 2); chkintr (0x0008UL, 3);
chkintr (0x0010UL, 4); chkintr (0x0020UL, 5); chkintr (0x0040UL, 6);
chkintr (0x0080UL, 7); chkintr (0x0100UL, 8); chkintr (0x0200UL, 9);
chkintr (0x0400UL, 10); chkintr (0x0800UL, 11); chkintr (0x1000UL, 12);
chkintr (0x2000UL, 13); chkintr (0x4000UL, 14); chkintr (0x8000UL, 15);
The only step up from there is hand-coding assembly language and there's still the good possibility that gcc may be able to outperform you :-)