Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

C How extract predefined huge switch from huge loop without loss performance?

Tags:

c

x86

arm

c89

I have a bottleneck, which looks like this:

void function(int type) {
    for (int i = 0; i < m; i++) {
        // do some stuff A
        switch (type) {
        case 0:
            // do some stuff 0
            break;
        [...]
        case n:
            // do some stuff n
            break;
        }
        // do some stuff B
    }
}

n and m are large enough.

m millions, sometimes hundreds of millions.

n is the 2 ^ 7 - 2 ^ 10 (128 - 1024)

Chunks of code A and B are sufficiently large.

I rewrote the code (via macros) as follows:

void function(int type) {
    switch (type) {
    case 0:
        for (int i = 0; i < m; i++) {
            // do some stuff A
            // do some stuff 0
            // do some stuff B
        }
        break;
    [...]
    case n:
        for (int i = 0; i < m; i++) {
            // do some stuff A
            // do some stuff n
            // do some stuff B
        }
        break;
    }   
}

As a result, it looks like this in IDA for this function: IDA graph

Is there a way to remove the switch from the loop:

  1. without creating a bunch of copies of the loop
  2. not create huge function with macros
  3. without losing performance?

A possible solution seems to me the presence of goto variable. Something like this:

void function(int type) {
    label* typeLabel;
    switch (type) {
    case 0:
        typeLabel = &label_1;
        break;
    [...]
    case n:
        typeLabel = &label_n;
        break;
    }

    for (int i = 0; i < m; i++) {
        // do some stuff A
        goto *typeLabel;
        back:
        // do some stuff B
    }

    goto end;

    label_1:
    // do some stuff 0
    goto back;
    [...]
    label_n:
    // do some stuff n
    goto back;

    end:
}

The matter is also complicated by the fact that all of this will be carried out on different Android devices with different speeds.

Architecture as ARM, and x86.

Perhaps this can be done assembler inserts rather than pure C?

EDIT:

I run some tests. n = 45,734,912

loop-within-switch: 891,713 μs

switch-within-loop: 976,085 μs

loop-within-switch 9.5% faster from switch-within-loop

For example: simple realisation without switch takes 1,746,947 μs

like image 586
Enyby Avatar asked Feb 09 '23 01:02

Enyby


2 Answers

At the moment, the best solution I can see is:

Generate with macros n functions, which will look like this:

void func_n() {
    for (int i = 0; i < m; i++) {
        // do some stuff A
        // do some stuff n
        // do some stuff B
    }
}

Then make an array of pointers to them, and called from the main function:

void main(int type) {
    func* table[n];
    // fill table array with pointers to func_0 .. func_n

    table[type](); // call appropriate func
}

This allows the optimizer to optimize the compiler function func_0 .. func_n. Moreover, they will not be so big.

like image 70
Enyby Avatar answered Feb 11 '23 23:02

Enyby


Realistically, a static array of labels is likely the fastest sane option (array of pointers being the sanest fast option). But, let's get creative.

(Note that this should have been a comment, but I need the space).

Option 1: Exploit the branch predictor

Let's build on the fact that if a certain outcome of a branch happens, the predictor will likely predict the same outcome in the future. Especially if it happens more than once. The code would look something like:

for (int i = 0; i < m; i++) 
{
    // do some stuff A
    if (type < n/2) 
    {
        if (type < n/4) 
        {
            if (type < n/8) 
            {
                if (type == 0) // do some stuff 0
                else           // do some stuff 1
            } 
            else 
            {
                ...
            }
        } 
        else 
        {
             ...
        }
    } 
    else 
    {
        ...
        // do some stuff n
    }

    // do some stuff B
}

Basically, you binary search what to do, in log(n) steps. That is a log(n) possible jumps, but after only one or two iterations, the branch predictor will predict them all correctly, and will speculatively execute the proper instructions without problem. Depending on the CPU, this could be faster than a goto *labelType; back: as some are unable to prefetch instructions when the jump address is calculated dynamically.

Option 2: JIT load the proper 'stuff'

So, ideally, your code would look like:

void function(int type) {
    for (int i = 0; i < m; i++) {
        // do some stuff A
        // do some stuff [type]
        // do some stuff B
    }
}

With all the other 0..n "stuffs" being junk in the current function invocation. Well, let's make it like that:

void function(int type) {
    prepare(type);
    for (int i = 0; i < m; i++) {
        // do some stuff A
        reserved:
        doNothing(); doNothing(); doNothing(); doNothing(); doNothing();
        // do some stuff B
    }
}

The doNothing() calls are there just to reserve the space in the function. Best implementation would be goto B. The prepare(type) function will look in the lookup table for all the 0..n implementations, take the type one, and copy it over all those goto Bs. Then, when you are actually executing your loop, you have the optimal code where there are no needless jumps.

Just be sure to have some final goto B instruction in the stuff implementation - copying a smaller one over a larger one could cause problems otherwise. Alternatively, before exiting function you can restore all the placeholder goto B; instructions. It's a small cost, since you're only doing it once per invocation, not per iteration.

prepare() would be much easier to implement in assembly than in C, but it is doable. You just need the start/end addresses of all stuff_i implementations (in your post, these are label_[i] and label_[i+1]), and memcpy that into reserved.

Maybe the compiler will even let you do:

memcpy((uint8_t*)reserved, (uint8_t*)label_1, (uint8_t*)label_2 - (uint8_t*)label_1);

Likely not, though. You can, however, get the proper locations using setjmp or something like __builtin_return_address / _ReturnAddress within a function call.

Note that this will require write access to the instruction memory. Getting that is OS specific, and likely requires su/admin privileges.

like image 26
mtijanic Avatar answered Feb 11 '23 22:02

mtijanic