I have a performance issue in a bottleneck section in my code. Basically it's a simple nested loop.
Profiling the issue reveals that the program spends a lot of time just incrementing both of the loop counters (++) and testing for termination (i/j < 8).
Watching the assembly output I see that both counters don't get registers and accessing them costs a lot of cycles. Using the "register" keyword doesn't convince the compiler to actually put them in registers. is there something which can be done to to optimize the counters access time?
Here's the assembly output. The C source is just a simple nested loop with i/j counters.
2738 0.2479 2459 0.1707 : 1e6c: jne 1dd1 <process_hooks+0x121>
1041 0.0942 1120 0.0778 : 1e72: addl $0x1,0xffffffd4(%ebp)
2130 0.1928 2102 0.1459 : 1e76: cmpl $0x8,0xffffffd4(%ebp)
2654 0.2403 2337 0.1622 : 1e7a: jne 1da0 <process_hooks+0xf0>
809 0.0732 814 0.0565 : 1e80: jmp 1ce2 <process_hooks+0x32>
As requested, here's the C code as well. Compiler is gcc btw:
for (byte_index=0; byte_index < MASK_SIZE / NBBY; byte_index++)
{
if (check_byte(mask,byte_index))
{
for (bit_index=0; bit_index < NBBY; bit_index++)
{
condition_index = byte_index*NBBY + bit_index;
if (check_bit(condition_mask,condition_index))
{
.
.
.
}
}
}
}
Thanks
do-while is fastest to run the first iteration as there is no checking of a condition at the start.
There are two possible reasons it doesn't get put in a register:
If you are taking the address of the variable or declaring it volatile, it wont be kept in a register. It doesn't look like you are doing that, but it might happen in the ...
section.
This is quite likely. gcc seems to have a poor allocator (based on the comments of its developers). In addition, register allocation is fickle and hard to reason about. You will probably be able to tweak it to get some benefits using register allocator optimizations. If you like, you can set them for that function only.
gcc 4.4 has a new register allocator, which is supposed to be better, but also allows you to choose the allocation algorithm. That will provide extra things to tweak.
You can also try telling gcc to try harder, with the hot attribute.
Finally, you can also tweak things using gcc's --param
flags. They expose internal compiler settings, so this should probably not be embarked upon lightly.
When getting a performance bottle neck in a loop counter you should consider unrolling the loop.
EDIT: As always, when optimizing, make sure you benchmark and convince yourself that you get the desired result.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With