
Loop unroll (with bitwise operations)


Tags: c, bit-manipulation, linux-kernel, loop-unrolling

I am writing a Linux kernel driver (for ARM), and in an IRQ handler I need to check the interrupt bits.

bit
 0/16  End point 0 In/Out interrupt
       (very likely, while In is more likely)
 1/17  End point 1 In/Out interrupt
 ...
15/31  End point 15 In/Out interrupt

Note that more than one bit can be set at a time.

So this is the code:

int i;
u32 intr = read_interrupt_register();

/* ep0 IN */
if(likely(intr & (1 << 0))){
    handle_ep0_in();
}

/* ep0 OUT */
if(likely(intr & (1 << 16))){
    handle_ep0_out();
}

for(i=1;i<16;++i){
    if(unlikely(intr & (1 << i))){
        handle_ep_in(i);
    }
    if(unlikely(intr & (1 << (i + 16)))){
        handle_ep_out(i);
    }
}

(1 << 0) and (1 << 16) would be calculated at compile time, but (1 << i) and (1 << (i + 16)) would not. The loop also adds an integer comparison and an addition per iteration.

Because this is an IRQ handler, the work should be done in the shortest possible time. This made me wonder whether I need to optimize it a bit.

Possible ways?

1. Split the loop. This seems to make no difference...

/* ep0 IN */
if(likely(intr & (1 << 0))){
    handle_ep0_in();
}

/* ep0 OUT */
if(likely(intr & (1 << 16))){
    handle_ep0_out();
}

for(i=1;i<16;++i){
    if(unlikely(intr & (1 << i))){
        handle_ep_in(i);
    }
}
for(i=17;i<32;++i){
    if(unlikely(intr & (1 << i))){
        handle_ep_out(i - 16);
    }
}

2. Shift intr instead of the mask it is compared against?

/* ep0 IN */
if(likely(intr & (1 << 0))){
    handle_ep0_in();
}

/* ep0 OUT */
if(likely(intr & (1 << 16))){
    handle_ep0_out();
}

for(i=1;i<16;++i){
    intr >>= 1;
    if(unlikely(intr & 1)){
        handle_ep_in(i);
    }
}
intr >>= 1;
for(i=1;i<16;++i){
    intr >>= 1;
    if(unlikely(intr & 1)){
        handle_ep_out(i);
    }
}

3. Fully unroll the loop (not shown). That would make the code a bit messy.

4. Are there any better ways?

5. Or will the compiler already generate the most optimized code by itself?

Edit: I was looking for a way to tell GCC to unroll that particular loop, but from what I have found so far it does not seem to be possible...


asked Sep 13 '12 07:09

Alvin Wong

2 Answers

If we can assume that the number of set bits in intr is low (as is usually the case for interrupt masks), we can optimize a little and write a loop that executes only once per set bit:

void handle (unsigned int intr)
{
  while (intr)
  {
    // find index of lowest bit set in intr (ffs is 1-based, hence the -1):
    int bit_id = __builtin_ffs(intr)-1;

    // call handler: bits 0..15 are IN, bits 16..31 are OUT
    if (bit_id >= 16)
      handle_ep_out (bit_id-16);
    else
      handle_ep_in (bit_id);

    // clear the lowest set bit, i.e. the one just handled
    intr &= intr - 1;
  }
}

On most ARM cores __builtin_ffs will compile down to a CLZ instruction and some arithmetic around it. CLZ was introduced with ARMv5, so this should hold for anything newer than ARM7-class (ARMv4) cores; older cores need a slower software fallback.
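To check the dispatch logic without hardware, the loop can be exercised in user space. This is only a sketch under stated assumptions: `dispatch`, `log_call`, `call_log` and `n_calls` are hypothetical names, the two handlers are stubs that merely record their calls, and bits 16 to 31 are treated as OUT endpoints as in the bit table from the question. The lowest set bit is cleared with `intr &= intr - 1`.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_CALLS 32

/* Hypothetical stand-ins for the driver's handlers: they only log the call. */
struct call { char dir; int ep; };
static struct call call_log[MAX_CALLS];
static int n_calls;

static void log_call(char dir, int ep)
{
    assert(n_calls < MAX_CALLS);  /* a 32-bit mask yields at most 32 calls */
    call_log[n_calls].dir = dir;
    call_log[n_calls].ep = ep;
    n_calls++;
}

static void handle_ep_in(int ep)  { log_call('I', ep); }
static void handle_ep_out(int ep) { log_call('O', ep); }

/* Visit every set bit exactly once, lowest bit first. */
static void dispatch(uint32_t intr)
{
    while (intr) {
        /* __builtin_ffs returns a 1-based index; intr is non-zero here. */
        int bit_id = __builtin_ffs((int)intr) - 1;

        if (bit_id >= 16)
            handle_ep_out(bit_id - 16);   /* bits 16..31: OUT endpoints */
        else
            handle_ep_in(bit_id);         /* bits 0..15: IN endpoints */

        intr &= intr - 1;                 /* clear the lowest set bit */
    }
}
```

Calling `dispatch(0x00010001u)` with an empty log should record an IN call for endpoint 0 followed by an OUT call for endpoint 0. Note that this visits bits in ascending order, so all pending IN endpoints are served before any OUT endpoint, which differs from the interleaved order of the loop in the question.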

Also: when writing interrupt handlers on embedded devices, the size of the function matters for performance as well, because the instructions have to be loaded into the instruction cache. Lean code usually executes faster. A little extra computation is fine if it saves accesses to memory that is unlikely to be in the cache.

answered Sep 20 '22 07:09

Nils Pipenbrinck

I would probably go for option 5 myself. Code for readability and let gcc's insane optimisation level -O3 do what it can.

I've seen code generated at that level that I can't even understand.

Any hand-crafted optimisation in C (other than possibly unrolling and using constants rather than runtime bit shifts, a la option 3) is unlikely to outperform what the compiler itself can do.

I think you'll find that the unrolling may not be as messy as you think:

if (  likely(intr & 0x00000001)) handle_ep0_in();
if (  likely(intr & 0x00010000)) handle_ep0_out();

if (unlikely(intr & 0x00000002)) handle_ep_in(1);
if (unlikely(intr & 0x00020000)) handle_ep_out(1);

:

if (unlikely(intr & 0x00008000)) handle_ep_in(15);
if (unlikely(intr & 0x80000000)) handle_ep_out(15);

In fact, you can make it a lot less messy with macros (untested, but you should get the general idea):

// Since mask is a constant, "mask << 16" is a constant too.

#define chkintr(mask, num) \
    do { \
        if (unlikely(intr & ((mask)      ))) handle_ep_in  (num); \
        if (unlikely(intr & ((mask) << 16))) handle_ep_out (num); \
    } while (0)

// Special case for the high-probability bits (endpoint 0).

if (likely(intr & 0x00000001UL)) handle_ep0_in();
if (likely(intr & 0x00010000UL)) handle_ep0_out();

chkintr (0x0002UL,  1);  chkintr (0x0004UL,  2);  chkintr (0x0008UL,  3);
chkintr (0x0010UL,  4);  chkintr (0x0020UL,  5);  chkintr (0x0040UL,  6);
chkintr (0x0080UL,  7);  chkintr (0x0100UL,  8);  chkintr (0x0200UL,  9);
chkintr (0x0400UL, 10);  chkintr (0x0800UL, 11);  chkintr (0x1000UL, 12);
chkintr (0x2000UL, 13);  chkintr (0x4000UL, 14);  chkintr (0x8000UL, 15);

The only step up from there is hand-coding assembly language and there's still the good possibility that gcc may be able to outperform you :-)
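On the question's point about asking GCC to unroll one particular loop: GCC 8 and later accept `#pragma GCC unroll n` placed directly before a loop, which keeps the source rolled while requesting unrolled code. The following is a minimal user-space sketch that assumes such a compiler; `dispatch_unrolled`, `in_seen` and `out_seen` are hypothetical names, and the handlers are stubs that only record which endpoints were served.

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors the kernel's usual branch-hint definitions. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Hypothetical stubs: one bit per endpoint that was handled. */
static uint32_t in_seen, out_seen;

static void handle_ep_in(int ep)
{
    assert(ep >= 0 && ep < 16);
    in_seen |= 1u << ep;
}

static void handle_ep_out(int ep)
{
    assert(ep >= 0 && ep < 16);
    out_seen |= 1u << ep;
}

static void dispatch_unrolled(uint32_t intr)
{
    int i;

    /* Endpoint 0 is the common case, so it keeps its own likely() checks. */
    if (likely(intr & (1u << 0)))
        handle_ep_in(0);
    if (likely(intr & (1u << 16)))
        handle_ep_out(0);

    /* Request full unrolling of the 15 iterations (GCC 8+ assumed).
     * 1u keeps the shift unsigned, so bit 31 is well defined. */
#pragma GCC unroll 15
    for (i = 1; i < 16; ++i) {
        if (unlikely(intr & (1u << i)))
            handle_ep_in(i);
        if (unlikely(intr & (1u << (i + 16))))
            handle_ep_out(i);
    }
}
```

The pragma is a request to the optimizer, so it is worth inspecting the generated assembly at the optimisation level the driver is actually built with, and measuring on the target whether the larger code really beats the rolled loop.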

answered Sep 20 '22 07:09

paxdiablo
