I've got a case where I need to compress a lot of values that are usually small. Thus I compress them with a variable-length byte encoding (ULEB128, to be specific):
size_t
compress_unsigned_int(unsigned int n, char* data)
{
    size_t size = 0;
    while (n > 127)
    {
        ++size;
        *data++ = (n & 127) | 128;
        n >>= 7;
    }
    *data++ = n;
    return ++size;
}
Is there a more efficient way to do this (maybe using SSE)?
Edit: After this compression, the result is stored into data, taking size bytes. Then the compression function is called on the next unsigned int.
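For reference, decoding is the straightforward inverse: read bytes until one arrives with the continuation bit (0x80) clear. A minimal sketch of a matching decoder (the name decompress_unsigned_int and its out-parameter interface are assumptions, not part of the original code):

size_t
decompress_unsigned_int(const char* data, unsigned int* n)
{
    // Accumulate 7 data bits per byte, low-order group first,
    // until a byte without the continuation bit ends the value.
    size_t size = 0;
    unsigned int value = 0;
    unsigned int shift = 0;
    unsigned char byte;
    do {
        byte = (unsigned char)data[size++];
        value |= (unsigned int)(byte & 127) << shift;
        shift += 7;
    } while (byte & 128);
    *n = value;
    return size;
}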
You might find a fast implementation in Google Protocol Buffers:
http://code.google.com/p/protobuf/
Look at the CodedOutputStream::WriteVarintXXX methods.
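If you link protobuf into your project, you can call its array fast path directly. A minimal sketch, assuming the protobuf headers are available (the wrapper name encode_with_protobuf is mine; WriteVarint32ToArray is the static helper those WriteVarintXXX methods essentially bottom out in):

#include <google/protobuf/io/coded_stream.h>

// Sketch: encode one value with protobuf's static array helper.
// Assumes the buffer has at least 5 free bytes.
size_t encode_with_protobuf(unsigned int value, unsigned char* buffer)
{
    using google::protobuf::io::CodedOutputStream;
    unsigned char* end = CodedOutputStream::WriteVarint32ToArray(value, buffer);
    return end - buffer;  // number of bytes written
}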
The first method might be rewritten as:
char *start = data;
while (n >= 0x80)
{
    /* Storing into a char truncates to 8 bits, so (n | 0x80)
       leaves exactly the low 7 data bits plus the continuation bit. */
    *data++ = (n | 0x80);
    n >>= 7;
}
*data++ = n;
return data - start;
According to my test, the Google Protocol Buffers implementation is the fastest, followed by the other implementations. However, my test is rather artificial; it is better to benchmark each approach in your application and choose the best one. The optimizations presented here work better for specific ranges of values.
Here is the code of my test application. (Note: I've removed the code from compress_unsigned_int_google_buf. You can find the implementation in coded_stream.cc from the Google Protocol Buffers source, in the method CodedOutputStream::WriteVarint32FallbackToArrayInline.)
size_t compress_unsigned_int(unsigned int n, char* data)
{
    size_t size = 0;
    while (n > 127)
    {
        ++size;
        *data++ = (n & 127) | 128;
        n >>= 7;
    }
    *data++ = n;
    return ++size;
}
size_t compress_unsigned_int_improved(unsigned int n, char* data)
{
    size_t size;

    if (n < 0x00000080U) {
        size = 1;
        goto b1;
    }
    if (n < 0x00004000U) {
        size = 2;
        goto b2;
    }
    if (n < 0x00200000U) {
        size = 3;
        goto b3;
    }
    if (n < 0x10000000U) {
        size = 4;
        goto b4;
    }
    size = 5;

    *data++ = (n & 0x7f) | 0x80;
    n >>= 7;
b4:
    *data++ = (n & 0x7f) | 0x80;
    n >>= 7;
b3:
    *data++ = (n & 0x7f) | 0x80;
    n >>= 7;
b2:
    *data++ = (n & 0x7f) | 0x80;
    n >>= 7;
b1:
    *data = n;
    return size;
}
// NOTE: this version emits the most-significant 7-bit group first,
// so its byte order is the reverse of the LEB128 encoder above
// (see the note in the answer below).
size_t compress_unsigned_int_more_improved(unsigned int n, char *data)
{
    if (n < (1U << 14)) {
        if (n < (1U << 7)) {
            data[0] = n;
            return 1;
        } else {
            data[0] = (n >> 7) | 0x80;
            data[1] = n & 0x7f;
            return 2;
        }
    } else if (n < (1U << 28)) {
        if (n < (1U << 21)) {
            data[0] = (n >> 14) | 0x80;
            data[1] = ((n >> 7) & 0x7f) | 0x80;
            data[2] = n & 0x7f;
            return 3;
        } else {
            data[0] = (n >> 21) | 0x80;
            data[1] = ((n >> 14) & 0x7f) | 0x80;
            data[2] = ((n >> 7) & 0x7f) | 0x80;
            data[3] = n & 0x7f;
            return 4;
        }
    } else {
        data[0] = (n >> 28) | 0x80;
        data[1] = ((n >> 21) & 0x7f) | 0x80;
        data[2] = ((n >> 14) & 0x7f) | 0x80;
        data[3] = ((n >> 7) & 0x7f) | 0x80;
        data[4] = n & 0x7f;
        return 5;
    }
}
size_t compress_unsigned_int_simple(unsigned int n, char *data)
{
    char *start = data;
    while (n >= 0x80)
    {
        *data++ = (n | 0x80);
        n >>= 7;
    }
    *data++ = n;
    return data - start;
}
inline size_t compress_unsigned_int_google_buf(unsigned int value, unsigned char* target) {
    // Implementation removed; it can be found in Google Protocol Buffers
    // (coded_stream.cc, CodedOutputStream::WriteVarint32FallbackToArrayInline).
}
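For the benchmark below to compile and run without the protobuf sources, you could substitute a stand-in with the same signature. The following is my own sketch of the unrolled style such implementations typically use, not the actual protobuf code:

// Hypothetical stand-in (not the protobuf source): an unrolled varint32
// encoder with the same signature, so the benchmark links and runs.
inline size_t compress_unsigned_int_google_buf_standin(unsigned int value, unsigned char* target)
{
    target[0] = (unsigned char)(value | 0x80);
    if (value >= (1U << 7)) {
        target[1] = (unsigned char)((value >> 7) | 0x80);
        if (value >= (1U << 14)) {
            target[2] = (unsigned char)((value >> 14) | 0x80);
            if (value >= (1U << 21)) {
                target[3] = (unsigned char)((value >> 21) | 0x80);
                if (value >= (1U << 28)) {
                    target[4] = (unsigned char)(value >> 28);
                    return 5;
                } else {
                    target[3] &= 0x7f;  // last byte: clear continuation bit
                    return 4;
                }
            } else {
                target[2] &= 0x7f;
                return 3;
            }
        } else {
            target[1] &= 0x7f;
            return 2;
        }
    } else {
        target[0] &= 0x7f;
        return 1;
    }
}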
#include <iostream>
#include <cstring>
#include <tchar.h>
#include <Windows.h>

using namespace std;

int _tmain(int argc, _TCHAR* argv[])
{
    char data[20];
    unsigned char udata[20];
    size_t size = 0;
    __int64 timer;

    cout << "Plain copy: ";
    timer = GetTickCount64();
    size = 0;
    for (int i = 0; i < 536870900; i++)
    {
        memcpy(data, &i, sizeof(i));
        size += sizeof(i);
    }
    cout << GetTickCount64() - timer << " Size: " << size << endl;

    cout << "Original: ";
    timer = GetTickCount64();
    size = 0;
    for (int i = 0; i < 536870900; i++)
    {
        size += compress_unsigned_int(i, data);
    }
    cout << GetTickCount64() - timer << " Size: " << size << endl;

    cout << "Improved: ";
    timer = GetTickCount64();
    size = 0;
    for (int i = 0; i < 536870900; i++)
    {
        size += compress_unsigned_int_improved(i, data);
    }
    cout << GetTickCount64() - timer << " Size: " << size << endl;

    cout << "More Improved: ";
    timer = GetTickCount64();
    size = 0;
    for (int i = 0; i < 536870900; i++)
    {
        size += compress_unsigned_int_more_improved(i, data);
    }
    cout << GetTickCount64() - timer << " Size: " << size << endl;

    cout << "Simple: ";
    timer = GetTickCount64();
    size = 0;
    for (int i = 0; i < 536870900; i++)
    {
        size += compress_unsigned_int_simple(i, data);
    }
    cout << GetTickCount64() - timer << " Size: " << size << endl;

    cout << "Google Buffers: ";
    timer = GetTickCount64();
    size = 0;
    for (int i = 0; i < 536870900; i++)
    {
        size += compress_unsigned_int_google_buf(i, udata);
    }
    cout << GetTickCount64() - timer << " Size: " << size << endl;

    return 0;
}
On my machine with the Visual C++ compiler, I got the following results:
Plain copy: 358 ms
Original: 2497 ms
Improved: 2215 ms
More Improved: 2231 ms
Simple: 2059 ms
Google Buffers: 968 ms
The first thing you want to do is test any possible solution against your current code.
I think you may want to try to get rid of data dependencies, to allow the processor to do more work at the same time.

What are data dependencies? As data flows through your function, the current value of n depends on the previous value of n, which depends on the value before that... which is a long chain of data dependencies. In the code below, n is never modified, so the processor can "skip ahead" and do a couple of different things at the same time without having to wait for the new n to be computed.
// NOTE: This code is actually incorrect, as caf noted:
// the byte order is reversed.
size_t
compress_unsigned_int(unsigned int n, char *data)
{
    if (n < (1U << 14)) {
        if (n < (1U << 7)) {
            data[0] = n;
            return 1;
        } else {
            data[0] = (n >> 7) | 0x80;
            data[1] = n & 0x7f;
            return 2;
        }
    } else if (n < (1U << 28)) {
        if (n < (1U << 21)) {
            data[0] = (n >> 14) | 0x80;
            data[1] = ((n >> 7) & 0x7f) | 0x80;
            data[2] = n & 0x7f;
            return 3;
        } else {
            data[0] = (n >> 21) | 0x80;
            data[1] = ((n >> 14) & 0x7f) | 0x80;
            data[2] = ((n >> 7) & 0x7f) | 0x80;
            data[3] = n & 0x7f;
            return 4;
        }
    } else {
        data[0] = (n >> 28) | 0x80;
        data[1] = ((n >> 21) & 0x7f) | 0x80;
        data[2] = ((n >> 14) & 0x7f) | 0x80;
        data[3] = ((n >> 7) & 0x7f) | 0x80;
        data[4] = n & 0x7f;
        return 5;
    }
}
I tested the performance by executing it in a tight loop from 0..UINT_MAX. On my system, the execution times are:
(Lower is better)
Original: 100%
caf's unrolled version: 79%
My version: 57%
Some minor tweaking may produce better results, but I doubt you'll get much more improvement unless you go to assembly. If your integers tend to fall in specific ranges, you can use profile-guided optimization to get the compiler to emit the right branch prediction for each branch. This might get you a few extra percentage points of speed. (EDIT: I got 8% by reordering the branches, but it was a perverse optimization because it relied on the fact that each number in 0...UINT_MAX appears with equal frequency. I don't recommend it.)
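If profiling shows that most of your values fit in one byte, you can also hint the hot branch by hand. A minimal sketch, assuming GCC/Clang's __builtin_expect (with MSVC you would rely on profile-guided optimization instead); the name compress_unsigned_int_hinted is hypothetical:

// Sketch: hand-hinted fast path using GCC/Clang's __builtin_expect.
// The hint asks the compiler to lay out the one-byte case as the
// fall-through path.
#define LIKELY(x) __builtin_expect(!!(x), 1)

size_t compress_unsigned_int_hinted(unsigned int n, char *data)
{
    if (LIKELY(n < (1U << 7))) {  // assumed common case: one byte
        data[0] = n;
        return 1;
    }
    // General multi-byte LEB128 path for larger values.
    size_t size = 0;
    while (n > 127) {
        *data++ = (n & 127) | 128;
        n >>= 7;
        ++size;
    }
    *data = n;
    return size + 1;
}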
SSE won't help. SSE is designed to operate on multiple pieces of data of the same width at the same time, and it is notoriously difficult to get SIMD to accelerate anything with a variable-length encoding. (It's not necessarily impossible, but you'd have to be pretty clever to figure it out.)