I have a bottleneck, which looks like this:
void function(int type) {
for (int i = 0; i < m; i++) {
// do some stuff A
switch (type) {
case 0:
// do some stuff 0
break;
[...]
case n:
// do some stuff n
break;
}
// do some stuff B
}
}
n and m are large enough.
m millions, sometimes hundreds of millions.
n is the 2 ^ 7 - 2 ^ 10 (128 - 1024)
Chunks of code A and B are sufficiently large.
I rewrote the code (via macros) as follows:
void function(int type) {
switch (type) {
case 0:
for (int i = 0; i < m; i++) {
// do some stuff A
// do some stuff 0
// do some stuff B
}
break;
[...]
case n:
for (int i = 0; i < m; i++) {
// do some stuff A
// do some stuff n
// do some stuff B
}
break;
}
}
As a result, it looks like this in IDA for this function:
Is there a way to remove the switch from the loop:
A possible solution seems to me the presence of goto variable. Something like this:
void function(int type) {
label* typeLabel;
switch (type) {
case 0:
typeLabel = &label_1;
break;
[...]
case n:
typeLabel = &label_n;
break;
}
for (int i = 0; i < m; i++) {
// do some stuff A
goto *typeLabel;
back:
// do some stuff B
}
goto end;
label_1:
// do some stuff 0
goto back;
[...]
label_n:
// do some stuff n
goto back;
end:
}
The matter is also complicated by the fact that all of this will be carried out on different Android devices with different speeds.
Architecture as ARM, and x86.
Perhaps this can be done assembler inserts rather than pure C?
EDIT:
I run some tests. n = 45,734,912
loop-within-switch: 891,713 μs
switch-within-loop: 976,085 μs
loop-within-switch 9.5% faster from switch-within-loop
For example: simple realisation without switch takes 1,746,947 μs
At the moment, the best solution I can see is:
Generate with macros n
functions, which will look like this:
void func_n() {
for (int i = 0; i < m; i++) {
// do some stuff A
// do some stuff n
// do some stuff B
}
}
Then make an array of pointers to them, and called from the main function:
void main(int type) {
func* table[n];
// fill table array with pointers to func_0 .. func_n
table[type](); // call appropriate func
}
This allows the optimizer to optimize the compiler function func_0
.. func_n
. Moreover, they will not be so big.
Realistically, a static array of labels is likely the fastest sane option (array of pointers being the sanest fast option). But, let's get creative.
(Note that this should have been a comment, but I need the space).
Let's build on the fact that if a certain outcome of a branch happens, the predictor will likely predict the same outcome in the future. Especially if it happens more than once. The code would look something like:
for (int i = 0; i < m; i++)
{
// do some stuff A
if (type < n/2)
{
if (type < n/4)
{
if (type < n/8)
{
if (type == 0) // do some stuff 0
else // do some stuff 1
}
else
{
...
}
}
else
{
...
}
}
else
{
...
// do some stuff n
}
// do some stuff B
}
Basically, you binary search what to do, in log(n) steps. That is a log(n) possible jumps, but after only one or two iterations, the branch predictor will predict them all correctly, and will speculatively execute the proper instructions without problem. Depending on the CPU, this could be faster than a goto *labelType; back:
as some are unable to prefetch instructions when the jump address is calculated dynamically.
So, ideally, your code would look like:
void function(int type) {
for (int i = 0; i < m; i++) {
// do some stuff A
// do some stuff [type]
// do some stuff B
}
}
With all the other 0..n "stuffs" being junk in the current function invocation. Well, let's make it like that:
void function(int type) {
prepare(type);
for (int i = 0; i < m; i++) {
// do some stuff A
reserved:
doNothing(); doNothing(); doNothing(); doNothing(); doNothing();
// do some stuff B
}
}
The doNothing()
calls are there just to reserve the space in the function. Best implementation would be goto B
. The prepare(type)
function will look in the lookup table for all the 0..n
implementations, take the type
one, and copy it over all those goto B
s. Then, when you are actually executing your loop, you have the optimal code where there are no needless jumps.
Just be sure to have some final goto B
instruction in the stuff
implementation - copying a smaller one over a larger one could cause problems otherwise. Alternatively, before exiting function
you can restore all the placeholder goto B;
instructions. It's a small cost, since you're only doing it once per invocation, not per iteration.
prepare()
would be much easier to implement in assembly than in C, but it is doable. You just need the start/end addresses of all stuff_i
implementations (in your post, these are label_[i]
and label_[i+1]
), and memcpy
that into reserved
.
Maybe the compiler will even let you do:
memcpy((uint8_t*)reserved, (uint8_t*)label_1, (uint8_t*)label_2 - (uint8_t*)label_1);
Likely not, though. You can, however, get the proper locations using setjmp
or something like __builtin_return_address
/ _ReturnAddress
within a function call.
Note that this will require write access to the instruction memory. Getting that is OS specific, and likely requires su/admin privileges.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With