how to optimize C++/C code for a large number of integers

Tags: c++, performance, c, optimization

I have written the code below. It checks the most significant bit of each byte. As long as that bit is 0, it concatenates the byte's value with the previous bytes and stores the result in the variable var1. Here data points to the bytes of an integer. An integer in my implementation is a uint64_t and can occupy up to 8 bytes.

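uint64_t func(char* data)
{
    uint64_t var1 = 0; int i=0;
    // keep going while the top bit of the current byte is 0
    while ((data[i] >> 7) == 0)
    {
        // append the 7 payload bits below the bits gathered so far
        var1 = (var1 << 7) | (data[i]);
        i++;
    }
    return var1;
}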

I call func() repeatedly, on the order of a trillion times for trillions of integers, so it runs slowly. Is there a way to optimize this code?

EDIT: Thanks to Joe Z.; it is indeed a form of uleb128 unpacking.

asked Jul 08 '13 07:07

Rose Beck

2 Answers

I have only tested this minimally; I am happy to fix glitches with it. With modern processors, you want to bias your code heavily toward easily predicted branches. And, if you can safely read the next 10 bytes of input, there's nothing to be saved by guarding their reads by conditional branches. That leads me to the following code:

// fast uleb128 decode
// assumes you can read all 10 bytes at *data safely.
// assumes standard uleb128 format, with LSB first, and 
// ... bit 7 indicating "more data in next byte"

uint64_t unpack( const uint8_t *const data )
{
    uint64_t value = ((data[0] & 0x7F   ) <<  0)
                   | ((data[1] & 0x7F   ) <<  7)
                   | ((data[2] & 0x7F   ) << 14)
                   | ((data[3] & 0x7F   ) << 21)
                   | ((data[4] & 0x7Full) << 28)
                   | ((data[5] & 0x7Full) << 35)
                   | ((data[6] & 0x7Full) << 42)
                   | ((data[7] & 0x7Full) << 49)
                   | ((data[8] & 0x7Full) << 56)
                   | ((data[9] & 0x7Full) << 63);

    if ((data[0] & 0x80) == 0) value &= 0x000000000000007Full; else
    if ((data[1] & 0x80) == 0) value &= 0x0000000000003FFFull; else
    if ((data[2] & 0x80) == 0) value &= 0x00000000001FFFFFull; else
    if ((data[3] & 0x80) == 0) value &= 0x000000000FFFFFFFull; else
    if ((data[4] & 0x80) == 0) value &= 0x00000007FFFFFFFFull; else
    if ((data[5] & 0x80) == 0) value &= 0x000003FFFFFFFFFFull; else
    if ((data[6] & 0x80) == 0) value &= 0x0001FFFFFFFFFFFFull; else
    if ((data[7] & 0x80) == 0) value &= 0x00FFFFFFFFFFFFFFull; else
    if ((data[8] & 0x80) == 0) value &= 0x7FFFFFFFFFFFFFFFull;

    return value;
}

The basic idea is that small values are common (and so most of the if-statements won't be reached), but assembling the 64-bit value that needs to be masked is something that can be efficiently pipelined. With a good branch predictor, I think the above code should work pretty well. You might also try removing the else keywords (without changing anything else) to see if that makes a difference. Branch predictors are subtle beasts, and the exact character of your data also matters. If nothing else, you should be able to see that the else keywords are optional from a logic standpoint, and are there only to guide the compiler's code generation and provide an avenue for optimizing the hardware's branch predictor behavior.
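The claim that the else keywords are logically optional can be checked directly: each mask is a subset of the next wider one, so once the narrowest applicable mask has been applied, any later, wider mask leaves the value unchanged. A minimal sketch of the else-free variant follows; the name unpack_no_else is hypothetical, and the same 10-readable-bytes assumption applies.

```c
#include <stdint.h>

/* Else-free variant of the unrolled decode (hypothetical name).
   Assumes all 10 bytes at *data are safe to read. */
uint64_t unpack_no_else(const uint8_t *const data)
{
    uint64_t value = ((uint64_t)(data[0] & 0x7F) <<  0)
                   | ((uint64_t)(data[1] & 0x7F) <<  7)
                   | ((uint64_t)(data[2] & 0x7F) << 14)
                   | ((uint64_t)(data[3] & 0x7F) << 21)
                   | ((uint64_t)(data[4] & 0x7F) << 28)
                   | ((uint64_t)(data[5] & 0x7F) << 35)
                   | ((uint64_t)(data[6] & 0x7F) << 42)
                   | ((uint64_t)(data[7] & 0x7F) << 49)
                   | ((uint64_t)(data[8] & 0x7F) << 56)
                   | ((uint64_t)(data[9] & 0x7F) << 63);

    /* Every test runs independently. The first byte without a
       continuation bit applies the narrowest mask; later masks are
       wider and therefore cannot clear any further bits. */
    if ((data[0] & 0x80) == 0) value &= 0x000000000000007Full;
    if ((data[1] & 0x80) == 0) value &= 0x0000000000003FFFull;
    if ((data[2] & 0x80) == 0) value &= 0x00000000001FFFFFull;
    if ((data[3] & 0x80) == 0) value &= 0x000000000FFFFFFFull;
    if ((data[4] & 0x80) == 0) value &= 0x00000007FFFFFFFFull;
    if ((data[5] & 0x80) == 0) value &= 0x000003FFFFFFFFFFull;
    if ((data[6] & 0x80) == 0) value &= 0x0001FFFFFFFFFFFFull;
    if ((data[7] & 0x80) == 0) value &= 0x00FFFFFFFFFFFFFFull;
    if ((data[8] & 0x80) == 0) value &= 0x7FFFFFFFFFFFFFFFull;

    return value;
}
```

Whether this runs faster or slower than the else chain is something only a benchmark on your own data can tell.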

Ultimately, whether or not this approach is effective depends on the distribution of your dataset. If you try out this function, I would be interested to know how it turns out. This particular function focuses on standard uleb128, where the value gets sent LSB first, and bit 7 == 1 means that the data continues.
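To make the format concrete, here is a small sketch of a uleb128 encoder together with a plain loop decoder that can serve as a reference when testing a faster routine. The names uleb128_encode and uleb128_decode_ref are hypothetical, and the decoder assumes the input is well formed and readable up to its terminating byte.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode `value` as unsigned LEB128 into `out`, which must have room
   for up to 10 bytes. Returns the number of bytes written. */
size_t uleb128_encode(uint64_t value, uint8_t *out)
{
    size_t n = 0;
    do {
        uint8_t byte = (uint8_t)(value & 0x7F); /* low 7 bits go first */
        value >>= 7;
        if (value != 0)
            byte |= 0x80;                       /* bit 7 set: more bytes follow */
        out[n++] = byte;
    } while (value != 0);
    return n;
}

/* Straightforward loop decoder. Stores the number of bytes consumed
   in *len when len is not NULL. Stops after at most 10 bytes. */
uint64_t uleb128_decode_ref(const uint8_t *data, size_t *len)
{
    uint64_t value = 0;
    unsigned shift = 0;
    size_t i = 0;
    for (;;) {
        uint8_t byte = data[i++];
        value |= (uint64_t)(byte & 0x7F) << shift;
        if ((byte & 0x80) == 0 || shift >= 63)
            break;
        shift += 7;
    }
    if (len != NULL)
        *len = i;
    return value;
}
```

For example, the value 624485 encodes to the three bytes 0xE5 0x8E 0x26: the first two have bit 7 set, and the last one has it clear.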

There are SIMD approaches, but none of them lend themselves readily to 7-bit data.

Also, if you can mark this inline in a header, then that may also help. It all depends on how many places this gets called from, and whether those places are in a different source file. In general, though, inlining when possible is highly recommended.
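As a rough illustration of the header approach, the sketch below uses static inline so that every source file including the header gets its own copy that the compiler is free to inline. The header name uleb128.h and the function name unpack_inline are hypothetical, and the short loop body is only a stand-in for whichever decode routine you settle on.

```c
/* uleb128.h (hypothetical header name) */
#ifndef ULEB128_H
#define ULEB128_H

#include <stdint.h>

/* static inline: each translation unit that includes this header gets
   its own definition, so calls can be inlined without relying on
   link-time optimization. Assumes the bytes up to and including the
   terminating byte are readable. */
static inline uint64_t unpack_inline(const uint8_t *const data)
{
    uint64_t value = 0;
    for (unsigned i = 0; i < 10; ++i) {
        value |= (uint64_t)(data[i] & 0x7F) << (7 * i);
        if ((data[i] & 0x80) == 0)   /* no continuation bit: last byte */
            break;
    }
    return value;
}

#endif /* ULEB128_H */
```

The inline keyword is only a hint; the compiler still decides whether to inline each call, so it is worth checking the generated code or measuring.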

answered Sep 16 '22 17:09

Joe Z

Your code is problematic

uint64_t func(const unsigned char* pos)
{
    uint64_t var1 = 0; int i=0;
    while ((pos[i] >> 7) == 0) 
    {
        var1 = (var1 << 7) | (pos[i]);
        i++;
    }
    return var1;    
}

First a minor thing: i should be unsigned.

Second: You don't assert that you don't read beyond the boundary of pos. E.g. if all values of your pos array are 0, then you will reach pos[size] where size is the size of the array, hence you invoke undefined behaviour. You should pass the size of your array to the function and check that i is smaller than this size.

Third: If pos[i] has its most significant bit equal to zero for i = 0, ..., k with k > 10, then earlier work gets discarded (as you shift the old value out of var1).

The third point actually helps us:

uint64_t func(const unsigned char* pos, size_t size)
{
    size_t i(0);
    while ( i < size && (pos[i] >> 7) == 0 )
    {
       ++i;
    }
    // At this point, i is either equal to size or
    // i is the index of the first pos value you don't want to use.
    // Therefore we want to use the values
    // pos[i-10], pos[i-9], ..., pos[i-1]
    // if i is less than 10, we obviously need to ignore some of the values
    const size_t start = (i >= 10) ? (i - 10) : 0;
    uint64_t var1 = 0;
    for ( size_t j(start); j < i; ++j )
    {
       var1 <<= 7;
       var1 += pos[j];
    }
    return var1; 
}

In conclusion: We separated the logic and got rid of all discarded entries. The speed-up depends on the actual data you have. If lots of entries are discarded, then you save a lot of writes to var1 with this approach.

Another thing: Usually, if one function is called massively, the best optimization you can do is to call it less. Perhaps you can come up with an additional condition that makes calling this function unnecessary.

Keep in mind that if you actually use 10 values, the first value ends up being truncated.

A 64-bit integer holds 9 values with their full 7 bits of information (63 bits), leaving exactly one bit for the tenth. You might want to switch to a 128-bit type such as uint128_t, where one is available.
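The arithmetic is 10 × 7 = 70 bits, which is 6 more than a uint64_t can hold. A small sketch shows the effect; the helper name concat7 is hypothetical and mirrors the shift-and-add loop used above.

```c
#include <stddef.h>
#include <stdint.h>

/* Concatenate `count` 7-bit groups, most significant group first,
   using the same shift-and-add step as the loop above. */
uint64_t concat7(const uint8_t *groups, size_t count)
{
    uint64_t v = 0;
    for (size_t j = 0; j < count; ++j) {
        v <<= 7;          /* bits shifted past bit 63 are lost */
        v += groups[j];
    }
    return v;
}
```

With nine groups whose first entry is 0x7F and the rest 0, the result is 0x7F00000000000000, so all 7 bits of the first group survive. With ten such groups the first entry is shifted left by 63 bits in total, and the result is 0x8000000000000000: only its lowest bit remains.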

answered Sep 16 '22 17:09

stefan
