I need to perform a bitwise AND on 32-kbit-wide data (1024 32-bit words). One of the two operands is a fixed bitmask.
I'm performing this AND 32 bits at a time. Simplified, my algorithm will look something like this:
(I've left memory management, variable scope concerns, etc. out of this example.)
#include <stdint.h>

uint32_t getnextdatachunk(void);      // provided elsewhere
void do_something_with_answer(void);  // provided elsewhere

const uint32_t mask[1024] = {
    0b00110110100101100111001011000111,
    0b10001110100101111010010100100100,
    0b11101010010000110001101010010101,
    0b10001110100101111010010100100100,
    (...) // 1019 more lines!
    0b00110110100101100111001011000111};

uint32_t answer[1024] = {0};
uint32_t workingdata = 0;
uint16_t i = 0;

int main(void)
{
    for (i = 0; i < 1024; i++)
    {
        workingdata = getnextdatachunk();
        answer[i] = workingdata & mask[i];
    }

    do_something_with_answer();

    return 0;
}
Here's the thing: If you look at the example bitmask, mask[1] == mask[3] and mask[0] == mask[1023].
In my actual bitmask most of the values are repeats; there are only 20 different values in the entire 1024-value array. Also, in my final application I have 16 different bitmasks, each with similar internal repetition.
I am looking for a good method to avoid having to store and iterate through so much unnecessary data.
One method I've considered is similar to a lookup table, where my array only contains a single instance of each desired chunk of bitmask:
const uint32_t mask[20] = {
    0b00110110100101100111001011000111,
    0b10001110100101111010010100100100,
    (...) // only 17 more lines!
    0b11101010010000110001101010010101};

uint32_t answer[1024] = {0};
uint32_t workingdata = 0;
uint16_t i = 0;

int main(void)
{
    for (i = 0; i < 1024; i++)
    {
        workingdata = getnextdatachunk();
        switch (i)
        {
            // the mask indexes are precalculated:
            case 0:
                answer[i] = workingdata & mask[5];
                break;
            case 1:
                answer[i] = workingdata & mask[2];
                break;
            case 2:
                answer[i] = workingdata & mask[2];
                break;
            case 3:
                answer[i] = workingdata & mask[0];
                break;
            case (...): // 1020 more cases!
                (...);
                break;
            default:
                break;
        }
    }

    do_something_with_answer();

    return 0;
}
Or, with a more compact switch statement:
switch (i)
{
    // the mask indexes are precalculated; each group of fall-through labels shares one mask:
    case 0: case 3: case 4: case 5: case 18: case 35: case 67: (...) case 1019:
        answer[i] = workingdata & mask[0];
        break;
    case 1: case 15: case 16: case 55: case 89: case 91: (...) case 1004:
        answer[i] = workingdata & mask[1];
        break;
    case (...): // Only 18 more cases!
        (...);
        break;
    default:
        break;
}
Both of these solutions obscure what's actually happening, which I'd really like to avoid.
Ideally, I would like to keep the original structure and have gcc's optimizer do away with all the unnecessary data. How can I keep my code well-written and still be efficient?
Let's invent a system of points, and pretend that a data fetch from L1 cache costs 4 points, a fetch from L2 cache costs 8 points and an unpredictable branch costs 12 points. Note that these points are chosen to crudely represent "cycles for an average but unknown 80x86 CPU".
The original code with a single 1024-entry table would have a total cost of 4 points per iteration (assuming it's done often enough for performance to matter, and therefore that the mask data is used often enough to stay in the L1 cache).
With a switch statement, the compiler is going to (hopefully - a long series of branches would be a performance nightmare) convert it into a jump table and do something like goto table[i]; so it probably counts as a fetch from a table (4 points) followed by one unpredictable branch (12 points), or a total of 16 points per iteration.
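To make that concrete, here is a rough sketch of what such a jump table boils down to, written with GCC's "labels as values" extension. This is not what the compiler literally emits, almost all of the 1024 cases are elided, and jump_table, case_0, case_1, done are illustrative names; it is only meant to show where the table fetch and the indirect branch in the cost estimate come from:

// inside the loop body; sketch only
static void *jump_table[1024] = { &&case_0, &&case_1, &&case_2, (...) };

goto *jump_table[i];    // one data fetch from the table + one hard-to-predict branch

case_0:
    answer[i] = workingdata & mask[5];
    goto done;
case_1:
    answer[i] = workingdata & mask[2];
    goto done;
(...)                   // 1022 more labels
done: ;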
Note that for 64-bit code, the jump table that the compiler generates will be 1024 entries where each entry is 64 bits (an 8 KiB table); this is twice as large as the mask table for the first option (1024 entries where each entry is 32 bits, or 4 KiB). However, the L1 data caches in a lot of CPUs are only 32 KiB to 64 KiB, so an 8 KiB jump table means everything else that goes into the L1 data cache (the source data being ANDed, the resulting "answer" data, the mask table itself, anything on the CPU's stack) competes with the jump table for space, and 64-byte (8-entry) pieces of your jump table will sometimes be evicted from the cache to make room. This means that sometimes you'll be paying for "L1 miss, L2 hit". Let's assume this happens 5% of the time, so the real cost ends up being "(95 * (4+12) + 5 * (8+12)) / 100 = 16.2" points per iteration.
Given that you'd expect performance to be better for the first option ("16.2 points per iteration" is significantly larger than "4 points per iteration"), that you'd expect executable size to be better for the first option (even without considering the code for each case of the switch, a 4 KiB table is half the size of an 8 KiB table), and that the first option has simpler (more maintainable) code, I can't see a single reason why you'd want to use the second option.
To optimise this code, I'd try to work on larger pieces. For a simple example, can you do something like this:
uint64_t mask[512] = { ....

uint64_t workingdata;
uint64_t temp;

for (i = 0; i < 512; i++)
{
    // widen to 64 bits before shifting (shifting a uint32_t by 32 is undefined),
    // and fetch the two chunks in separate statements so their order is well defined
    workingdata = (uint64_t)getnextdatachunk() << 32;
    workingdata |= getnextdatachunk();

    temp = workingdata & mask[i];

    // answer[] stays uint32_t[1024]; split each 64-bit result back into two halves
    answer[i*2] = (uint32_t)temp;
    answer[i*2+1] = (uint32_t)(temp >> 32);
}
If you can do something like this, then it might (at best) double the performance; but if you can do "64 bits per iteration for half as many iterations" you might also be able to use SIMD intrinsics to do "128 bits per iteration for a quarter as many iterations" or "256 bits per iteration for an eighth as many iterations", and might be able to make it almost 8 times faster.
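As an illustration of the SIMD idea, here's a minimal sketch using SSE2 intrinsics to AND 128 bits per iteration. It assumes the source data has already been buffered into a plain data[1024] array (the getnextdatachunk() plumbing is left out), and the function name and parameters are just placeholders:

#include <emmintrin.h>   // SSE2 intrinsics
#include <stdint.h>

// AND 1024 32-bit words of data against the mask, 4 words (128 bits) at a time
void and_with_mask_sse2(const uint32_t *data, const uint32_t *mask, uint32_t *answer)
{
    for (int i = 0; i < 1024; i += 4)
    {
        __m128i d = _mm_loadu_si128((const __m128i *)&data[i]);
        __m128i m = _mm_loadu_si128((const __m128i *)&mask[i]);
        _mm_storeu_si128((__m128i *)&answer[i], _mm_and_si128(d, m));
    }
}

With suitably aligned buffers the unaligned loads/stores could be swapped for their aligned counterparts, and a modern compiler will often auto-vectorize the original plain loop into something very similar anyway.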
Of course the step beyond that is to buffer enough of the source data to make using multiple threads (multiple CPUs) efficient (e.g. so that the synchronization costs can be effectively amortized). With 4 CPUs pounding away in parallel doing 256 bits per iteration each you'd get a (theoretical best case) speedup of "32 times faster than the original 1024 iterations at 32-bit per iteration with single CPU version".
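As a very rough sketch of that last step, assuming a whole block of source data has already been buffered into a data[1024] array (the buffering, the getnextdatachunk() plumbing and error handling are left out, and the names are just placeholders), the work could be split across POSIX threads like this:

#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

#define CHUNKS      1024
#define NUM_THREADS 4

extern const uint32_t mask[CHUNKS];
extern uint32_t data[CHUNKS];     // pre-buffered source data (assumed to exist)
extern uint32_t answer[CHUNKS];

// each thread ANDs one contiguous quarter of the buffered block
static void *and_worker(void *arg)
{
    uintptr_t t = (uintptr_t)arg;
    uint32_t start = (uint32_t)(t * (CHUNKS / NUM_THREADS));
    uint32_t end   = start + (CHUNKS / NUM_THREADS);
    for (uint32_t i = start; i < end; i++)
        answer[i] = data[i] & mask[i];
    return NULL;
}

void and_all_chunks(void)
{
    pthread_t threads[NUM_THREADS];
    for (uintptr_t t = 0; t < NUM_THREADS; t++)
        pthread_create(&threads[t], NULL, and_worker, (void *)t);
    for (int t = 0; t < NUM_THREADS; t++)
        pthread_join(threads[t], NULL);
}

For a single 4 KiB block the thread creation and join costs would dwarf the ANDs themselves, so in practice you'd keep a pool of worker threads alive and hand each one a large batch of buffered blocks - which is exactly the amortization point made above.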
I personally believe the right approach really depends on your use case; you have two different options here.
For selecting a proper code design you need to consider a lot of different factors. For example, if your code will run on an embedded device, I would probably go with the smaller-code-size approach; but if it will run on a normal PC, I would probably go with the first one.