I've got a case where I need to compress a lot of values that are usually small. Thus I compress them with a variable-length byte encoding (ULEB128, to be specific):
size_t
compress_unsigned_int(unsigned int n, char* data)
{
    size_t size = 0;
    while (n > 127)
    {
        ++size;
        *data++ = (n & 127) | 128;
        n >>= 7;
    }
    *data++ = n;
    return ++size;
}
Is there a more efficient way to do this (maybe using SSE)?
Edit: After this compression, the result is stored into data, taking size bytes. Then the compression function is called on the next unsigned int.
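For reference, decoding is the straightforward inverse: read bytes until one arrives with the continuation bit (0x80) clear. A minimal sketch of a matching decoder (the name decompress_unsigned_int and its out-parameter interface are assumptions, not part of the original code):

size_t
decompress_unsigned_int(const char* data, unsigned int* n)
{
    // Accumulate 7 data bits per byte, low-order group first,
    // until a byte without the continuation bit ends the value.
    size_t size = 0;
    unsigned int value = 0;
    unsigned int shift = 0;
    unsigned char byte;
    do {
        byte = (unsigned char)data[size++];
        value |= (unsigned int)(byte & 127) << shift;
        shift += 7;
    } while (byte & 128);
    *n = value;
    return size;
}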
You might find a fast implementation in Google Protocol Buffers:
http://code.google.com/p/protobuf/
Look at the CodedOutputStream::WriteVarintXXX methods.
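If you link protobuf into your project, you can call its array fast path directly. A minimal sketch, assuming the protobuf headers are available (the wrapper name encode_with_protobuf is mine; WriteVarint32ToArray is the static helper those WriteVarintXXX methods essentially bottom out in):

#include <google/protobuf/io/coded_stream.h>

// Sketch: encode one value with protobuf's static array helper.
// Assumes the buffer has at least 5 free bytes.
size_t encode_with_protobuf(unsigned int value, unsigned char* buffer)
{
    using google::protobuf::io::CodedOutputStream;
    unsigned char* end = CodedOutputStream::WriteVarint32ToArray(value, buffer);
    return end - buffer;  // number of bytes written
}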
The first method might be rewritten as:
char *start = data;
while (n >= 0x80)
{
    /* Storing into a char truncates to 8 bits, so (n | 0x80)
       leaves exactly the low 7 data bits plus the continuation bit. */
    *data++ = (n | 0x80);
    n >>= 7;
}
*data++ = n;
return data - start;
According to my test, the Google Protocol Buffers implementation is the fastest, followed by the other implementations. However, my test is rather artificial; it is better to benchmark each approach in your application and choose the best one. The optimizations presented here work better for specific ranges of values.
Here is the code of my test application. (Note: I've removed the code from compress_unsigned_int_google_buf. You can find the implementation in coded_stream.cc from the Google Protocol Buffers source, in the method CodedOutputStream::WriteVarint32FallbackToArrayInline.)
size_t compress_unsigned_int(unsigned int n, char* data)
{
    size_t size = 0;
    while (n > 127)
    {
        ++size;
        *data++ = (n & 127) | 128;
        n >>= 7;
    }
    *data++ = n;
    return ++size;
}
size_t compress_unsigned_int_improved(unsigned int n, char* data)
{
    size_t size;

    if (n < 0x00000080U) {
        size = 1;
        goto b1;
    }
    if (n < 0x00004000U) {
        size = 2;
        goto b2;
    }
    if (n < 0x00200000U) {
        size = 3;
        goto b3;
    }
    if (n < 0x10000000U) {
        size = 4;
        goto b4;
    }
    size = 5;

    *data++ = (n & 0x7f) | 0x80;
    n >>= 7;
b4:
    *data++ = (n & 0x7f) | 0x80;
    n >>= 7;
b3:
    *data++ = (n & 0x7f) | 0x80;
    n >>= 7;
b2:
    *data++ = (n & 0x7f) | 0x80;
    n >>= 7;
b1:
    *data = n;
    return size;
}
// NOTE: this version emits the most-significant 7-bit group first,
// so its byte order is the reverse of the LEB128 encoder above
// (see the note in the answer below).
size_t compress_unsigned_int_more_improved(unsigned int n, char *data)
{
    if (n < (1U << 14)) {
        if (n < (1U << 7)) {
            data[0] = n;
            return 1;
        } else {
            data[0] = (n >> 7) | 0x80;
            data[1] = n & 0x7f;
            return 2;
        }
    } else if (n < (1U << 28)) {
        if (n < (1U << 21)) {
            data[0] = (n >> 14) | 0x80;
            data[1] = ((n >> 7) & 0x7f) | 0x80;
            data[2] = n & 0x7f;
            return 3;
        } else {
            data[0] = (n >> 21) | 0x80;
            data[1] = ((n >> 14) & 0x7f) | 0x80;
            data[2] = ((n >> 7) & 0x7f) | 0x80;
            data[3] = n & 0x7f;
            return 4;
        }
    } else {
        data[0] = (n >> 28) | 0x80;
        data[1] = ((n >> 21) & 0x7f) | 0x80;
        data[2] = ((n >> 14) & 0x7f) | 0x80;
        data[3] = ((n >> 7) & 0x7f) | 0x80;
        data[4] = n & 0x7f;
        return 5;
    }
}
size_t compress_unsigned_int_simple(unsigned int n, char *data)
{
    char *start = data;
    while (n >= 0x80)
    {
        *data++ = (n | 0x80);
        n >>= 7;
    }
    *data++ = n;
    return data - start;
}
inline size_t compress_unsigned_int_google_buf(unsigned int value, unsigned char* target) {
    // Implementation removed; it can be found in Google Protocol Buffers
    // (coded_stream.cc, CodedOutputStream::WriteVarint32FallbackToArrayInline).
}
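For the benchmark below to compile and run without the protobuf sources, you could substitute a stand-in with the same signature. The following is my own sketch of the unrolled style such implementations typically use, not the actual protobuf code:

// Hypothetical stand-in (not the protobuf source): an unrolled varint32
// encoder with the same signature, so the benchmark links and runs.
inline size_t compress_unsigned_int_google_buf_standin(unsigned int value, unsigned char* target)
{
    target[0] = (unsigned char)(value | 0x80);
    if (value >= (1U << 7)) {
        target[1] = (unsigned char)((value >> 7) | 0x80);
        if (value >= (1U << 14)) {
            target[2] = (unsigned char)((value >> 14) | 0x80);
            if (value >= (1U << 21)) {
                target[3] = (unsigned char)((value >> 21) | 0x80);
                if (value >= (1U << 28)) {
                    target[4] = (unsigned char)(value >> 28);
                    return 5;
                } else {
                    target[3] &= 0x7f;  // last byte: clear continuation bit
                    return 4;
                }
            } else {
                target[2] &= 0x7f;
                return 3;
            }
        } else {
            target[1] &= 0x7f;
            return 2;
        }
    } else {
        target[0] &= 0x7f;
        return 1;
    }
}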
#include <iostream>
#include <cstring>
#include <tchar.h>
#include <Windows.h>

using namespace std;

int _tmain(int argc, _TCHAR* argv[])
{
    char data[20];
    unsigned char udata[20];
    size_t size = 0;
    __int64 timer;

    cout << "Plain copy: ";
    timer = GetTickCount64();
    size = 0;
    for (int i = 0; i < 536870900; i++)
    {
        memcpy(data, &i, sizeof(i));
        size += sizeof(i);
    }
    cout << GetTickCount64() - timer << " Size: " << size << endl;

    cout << "Original: ";
    timer = GetTickCount64();
    size = 0;
    for (int i = 0; i < 536870900; i++)
    {
        size += compress_unsigned_int(i, data);
    }
    cout << GetTickCount64() - timer << " Size: " << size << endl;

    cout << "Improved: ";
    timer = GetTickCount64();
    size = 0;
    for (int i = 0; i < 536870900; i++)
    {
        size += compress_unsigned_int_improved(i, data);
    }
    cout << GetTickCount64() - timer << " Size: " << size << endl;

    cout << "More Improved: ";
    timer = GetTickCount64();
    size = 0;
    for (int i = 0; i < 536870900; i++)
    {
        size += compress_unsigned_int_more_improved(i, data);
    }
    cout << GetTickCount64() - timer << " Size: " << size << endl;

    cout << "Simple: ";
    timer = GetTickCount64();
    size = 0;
    for (int i = 0; i < 536870900; i++)
    {
        size += compress_unsigned_int_simple(i, data);
    }
    cout << GetTickCount64() - timer << " Size: " << size << endl;

    cout << "Google Buffers: ";
    timer = GetTickCount64();
    size = 0;
    for (int i = 0; i < 536870900; i++)
    {
        size += compress_unsigned_int_google_buf(i, udata);
    }
    cout << GetTickCount64() - timer << " Size: " << size << endl;

    return 0;
}
On my machine with the Visual C++ compiler, I got the following results:
Plain copy: 358 ms
Original: 2497 ms
Improved: 2215 ms
More Improved: 2231 ms
Simple: 2059 ms
Google Buffers: 968 ms
The first thing you want to do is test any possible solution against your current code.
I think you may want to try to get rid of data dependencies, to allow the processor to do more work at the same time.

What are data dependencies? As data flows through your function, the current value of n depends on the previous value of n, which depends on the value before that... which is a long chain of data dependencies. In the code below, n is never modified, so the processor can "skip ahead" and do a couple of different things at the same time without having to wait for the new n to be computed.
// NOTE: This code is actually incorrect, as caf noted:
// the byte order is reversed.
size_t
compress_unsigned_int(unsigned int n, char *data)
{
    if (n < (1U << 14)) {
        if (n < (1U << 7)) {
            data[0] = n;
            return 1;
        } else {
            data[0] = (n >> 7) | 0x80;
            data[1] = n & 0x7f;
            return 2;
        }
    } else if (n < (1U << 28)) {
        if (n < (1U << 21)) {
            data[0] = (n >> 14) | 0x80;
            data[1] = ((n >> 7) & 0x7f) | 0x80;
            data[2] = n & 0x7f;
            return 3;
        } else {
            data[0] = (n >> 21) | 0x80;
            data[1] = ((n >> 14) & 0x7f) | 0x80;
            data[2] = ((n >> 7) & 0x7f) | 0x80;
            data[3] = n & 0x7f;
            return 4;
        }
    } else {
        data[0] = (n >> 28) | 0x80;
        data[1] = ((n >> 21) & 0x7f) | 0x80;
        data[2] = ((n >> 14) & 0x7f) | 0x80;
        data[3] = ((n >> 7) & 0x7f) | 0x80;
        data[4] = n & 0x7f;
        return 5;
    }
}
I tested the performance by executing it in a tight loop from 0..UINT_MAX. On my system, the execution times are:
(Lower is better)
Original: 100%
caf's unrolled version: 79%
My version: 57%
Some minor tweaking may produce better results, but I doubt you'll get much more improvement unless you go to assembly. If your integers tend to fall in specific ranges, you can use profile-guided optimization to get the compiler to emit the right branch prediction for each branch. This might get you a few extra percentage points of speed. (EDIT: I got 8% by reordering the branches, but it was a perverse optimization because it relied on the fact that each number in 0...UINT_MAX appears with equal frequency. I don't recommend it.)
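If profiling shows that most of your values fit in one byte, you can also hint the hot branch by hand. A minimal sketch, assuming GCC/Clang's __builtin_expect (with MSVC you would rely on profile-guided optimization instead); the name compress_unsigned_int_hinted is hypothetical:

// Sketch: hand-hinted fast path using GCC/Clang's __builtin_expect.
// The hint asks the compiler to lay out the one-byte case as the
// fall-through path.
#define LIKELY(x) __builtin_expect(!!(x), 1)

size_t compress_unsigned_int_hinted(unsigned int n, char *data)
{
    if (LIKELY(n < (1U << 7))) {  // assumed common case: one byte
        data[0] = n;
        return 1;
    }
    // General multi-byte LEB128 path for larger values.
    size_t size = 0;
    while (n > 127) {
        *data++ = (n & 127) | 128;
        n >>= 7;
        ++size;
    }
    *data = n;
    return size + 1;
}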
SSE won't help. SSE is designed to operate on multiple pieces of data of the same width at the same time, and it is notoriously difficult to get SIMD to accelerate anything with a variable-length encoding. (It's not necessarily impossible, but you'd have to be pretty clever to figure it out.)