I am trying to write very efficient Hamming-distance code. Inspired by Wojciech Muła's extremely clever SSE3 popcount implementation, I coded an AVX2 equivalent solution, this time using 256-bit registers. I was expecting at least a 30%-40% improvement based on the doubled parallelism of the involved operations, but to my surprise, the AVX2 code is a tad slower (around 2%)!
Can someone enlighten me as to the possible reasons why I'm not obtaining the expected performance boost?
Unrolled, SSE3 Hamming distance of two 64-byte blocks:
INT32 SSE_PopCount(const UINT32* __restrict pA, const UINT32* __restrict pB) {
    __m128i paccum = _mm_setzero_si128();

    __m128i a = _mm_loadu_si128 (reinterpret_cast<const __m128i*>(pA));
    __m128i b = _mm_loadu_si128 (reinterpret_cast<const __m128i*>(pB));
    __m128i err = _mm_xor_si128 (a, b);
    __m128i lo = _mm_and_si128 (err, low_mask);
    __m128i hi = _mm_srli_epi16 (err, 4);
    hi = _mm_and_si128 (hi, low_mask);
    __m128i popcnt1 = _mm_shuffle_epi8(lookup, lo);
    __m128i popcnt2 = _mm_shuffle_epi8(lookup, hi);
    paccum = _mm_add_epi8(paccum, popcnt1);
    paccum = _mm_add_epi8(paccum, popcnt2);

    a = _mm_loadu_si128 (reinterpret_cast<const __m128i*>(pA + 4));
    b = _mm_loadu_si128 (reinterpret_cast<const __m128i*>(pB + 4));
    err = _mm_xor_si128 (a, b);
    lo = _mm_and_si128 (err, low_mask);
    hi = _mm_srli_epi16 (err, 4);
    hi = _mm_and_si128 (hi, low_mask);
    popcnt1 = _mm_shuffle_epi8(lookup, lo);
    popcnt2 = _mm_shuffle_epi8(lookup, hi);
    paccum = _mm_add_epi8(paccum, popcnt1);
    paccum = _mm_add_epi8(paccum, popcnt2);

    a = _mm_loadu_si128 (reinterpret_cast<const __m128i*>(pA + 8));
    b = _mm_loadu_si128 (reinterpret_cast<const __m128i*>(pB + 8));
    err = _mm_xor_si128 (a, b);
    lo = _mm_and_si128 (err, low_mask);
    hi = _mm_srli_epi16 (err, 4);
    hi = _mm_and_si128 (hi, low_mask);
    popcnt1 = _mm_shuffle_epi8(lookup, lo);
    popcnt2 = _mm_shuffle_epi8(lookup, hi);
    paccum = _mm_add_epi8(paccum, popcnt1);
    paccum = _mm_add_epi8(paccum, popcnt2);

    a = _mm_loadu_si128 (reinterpret_cast<const __m128i*>(pA + 12));
    b = _mm_loadu_si128 (reinterpret_cast<const __m128i*>(pB + 12));
    err = _mm_xor_si128 (a, b);
    lo = _mm_and_si128 (err, low_mask);
    hi = _mm_srli_epi16 (err, 4);
    hi = _mm_and_si128 (hi, low_mask);
    popcnt1 = _mm_shuffle_epi8(lookup, lo);
    popcnt2 = _mm_shuffle_epi8(lookup, hi);
    paccum = _mm_add_epi8(paccum, popcnt1);
    paccum = _mm_add_epi8(paccum, popcnt2);

    paccum = _mm_sad_epu8(paccum, _mm_setzero_si128());
    UINT64 result = paccum.m128i_u64[0] + paccum.m128i_u64[1];
    return (INT32)result;
}
Equivalent version using AVX2's 256-bit registers (only two iterations are needed to cover the same 64 bytes):
INT32 AVX_PopCount(const UINT32* __restrict pA, const UINT32* __restrict pB) {
    __m256i paccum = _mm256_setzero_si256();

    __m256i a = _mm256_loadu_si256 (reinterpret_cast<const __m256i*>(pA));
    __m256i b = _mm256_loadu_si256 (reinterpret_cast<const __m256i*>(pB));
    __m256i err = _mm256_xor_si256 (a, b);
    __m256i lo = _mm256_and_si256 (err, low_mask256);
    __m256i hi = _mm256_srli_epi16 (err, 4);
    hi = _mm256_and_si256 (hi, low_mask256);
    __m256i popcnt1 = _mm256_shuffle_epi8(lookup256, lo);
    __m256i popcnt2 = _mm256_shuffle_epi8(lookup256, hi);
    paccum = _mm256_add_epi8(paccum, popcnt1);
    paccum = _mm256_add_epi8(paccum, popcnt2);

    a = _mm256_loadu_si256 (reinterpret_cast<const __m256i*>(pA + 8));
    b = _mm256_loadu_si256 (reinterpret_cast<const __m256i*>(pB + 8));
    err = _mm256_xor_si256 (a, b);
    lo = _mm256_and_si256 (err, low_mask256);
    hi = _mm256_srli_epi16 (err, 4);
    hi = _mm256_and_si256 (hi, low_mask256);
    popcnt1 = _mm256_shuffle_epi8(lookup256, lo);
    popcnt2 = _mm256_shuffle_epi8(lookup256, hi);
    paccum = _mm256_add_epi8(paccum, popcnt1);
    paccum = _mm256_add_epi8(paccum, popcnt2);

    paccum = _mm256_sad_epu8(paccum, _mm256_setzero_si256());
    UINT64 result = paccum.m256i_u64[0] + paccum.m256i_u64[1] + paccum.m256i_u64[2] + paccum.m256i_u64[3];
    return (INT32)result;
}
I have already verified the assembly emitted by the compiler and it looks good, with the expected direct translation of each intrinsic to a machine instruction. The only thing I noticed is that in the AVX2 version, the last step, where the population counts of the 4 quad-words are accumulated, generates more complex code than the SSE3 version (where only 2 quad-words need to be accumulated), yet I would still expect faster overall throughput.
AVX2 code generated for quad-word accumulation
vextractf128 xmm0, ymm2, 1
psrldq xmm0, 8
movd ecx, xmm2
movd eax, xmm0
vextractf128 xmm0, ymm2, 1
psrldq xmm2, 8
add eax, ecx
movd ecx, xmm0
add eax, ecx
movd ecx, xmm2
add eax, ecx
SSE3 code generated for quad-word accumulation
movd ecx, xmm2
psrldq xmm2, 8
movd eax, xmm2
add eax, ecx
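Side note: the four 64-bit partial sums produced by _mm256_sad_epu8 could also be folded mostly inside the vector registers before a single scalar extraction. The following is only a sketch of that idea (HorizontalSum256 is an illustrative name, not what my compiler emits), assuming AVX2's _mm256_extracti128_si256 is available:
static inline INT32 HorizontalSum256(__m256i v) {
    __m128i lo  = _mm256_castsi256_si128(v);            // 64-bit partial sums 0 and 1
    __m128i hi  = _mm256_extracti128_si256(v, 1);       // 64-bit partial sums 2 and 3
    __m128i sum = _mm_add_epi64(lo, hi);                // add the two 128-bit halves
    sum = _mm_add_epi64(sum, _mm_srli_si128(sum, 8));   // fold the upper quad-word onto the lower
    return (INT32)_mm_cvtsi128_si64(sum);               // one scalar move at the end
}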
My test program calls each routine 1 million times with different input values, but reuses two static buffers to hold the data for the pA and pB parameters. In my limited understanding of CPU architecture, this locality (reusing the same memory buffers over and over) should warm up the CPU caches nicely, so the test should not be limited by memory bandwidth; but other than memory bandwidth, I can't understand why there is no performance improvement.
Test routine
int _tmain(int argc, _TCHAR* argv[]) {
    lookup = _mm_setr_epi8(
        /* 0 */ 0, /* 1 */ 1, /* 2 */ 1, /* 3 */ 2,
        /* 4 */ 1, /* 5 */ 2, /* 6 */ 2, /* 7 */ 3,
        /* 8 */ 1, /* 9 */ 2, /* a */ 2, /* b */ 3,
        /* c */ 2, /* d */ 3, /* e */ 3, /* f */ 4
    );
    low_mask = _mm_set1_epi8(0xf);

    lookup256 = _mm256_setr_epi8(
        /* 0 */ 0, /* 1 */ 1, /* 2 */ 1, /* 3 */ 2,
        /* 4 */ 1, /* 5 */ 2, /* 6 */ 2, /* 7 */ 3,
        /* 8 */ 1, /* 9 */ 2, /* a */ 2, /* b */ 3,
        /* c */ 2, /* d */ 3, /* e */ 3, /* f */ 4,
        /* 0 */ 0, /* 1 */ 1, /* 2 */ 1, /* 3 */ 2,
        /* 4 */ 1, /* 5 */ 2, /* 6 */ 2, /* 7 */ 3,
        /* 8 */ 1, /* 9 */ 2, /* a */ 2, /* b */ 3,
        /* c */ 2, /* d */ 3, /* e */ 3, /* f */ 4
    );
    low_mask256 = _mm256_set1_epi8(0xf);

    std::default_random_engine generator;
    generator.seed(37);
    std::uniform_int_distribution<UINT32> distribution(0, ULONG_MAX);
    auto dice = std::bind( distribution, generator);

    UINT32 a[16];
    UINT32 b[16];

    int count;
    count = 0;
    {
        cout << "AVX PopCount\r\n";
        boost::timer::auto_cpu_timer t;
        for( int i = 0; i < 1000000; i++ ) {
            for( int j = 0; j < 16; j++ ) {
                a[j] = dice();
                b[j] = dice();
            }
            count += AVX_PopCount(a, b);
        }
    }
    cout << count << "\r\n";

    std::default_random_engine generator2;
    generator2.seed(37);
    std::uniform_int_distribution<UINT32> distribution2(0, ULONG_MAX);
    auto dice2 = std::bind( distribution2, generator2);

    count = 0;
    {
        cout << "SSE PopCount\r\n";
        boost::timer::auto_cpu_timer t;
        for( int i = 0; i < 1000000; i++ ) {
            for( int j = 0; j < 16; j++ ) {
                a[j] = dice2();
                b[j] = dice2();
            }
            count += SSE_PopCount(a, b);
        }
    }
    cout << count << "\r\n";

    getch();
    return 0;
}
Test machine is an Intel Core i7-4790 (Haswell), and I'm using Visual Studio 2012 Pro.
In addition to the minor issues noted in the comments (compiling with /arch:AVX), your primary problem is the generation of the random input arrays on each iteration. That is your bottleneck, so your test is not effectively evaluating the two methods. Note - I'm not using boost, but GetTickCount works for this purpose. Consider just:
int count;
count = 0;
{
    cout << "AVX PopCount\r\n";
    unsigned int Tick = GetTickCount();
    for (int i = 0; i < 1000000; i++) {
        for (int j = 0; j < 16; j++) {
            a[j] = dice();
            b[j] = dice();
        }
        count += AVX_PopCount(a, b);
    }
    Tick = GetTickCount() - Tick;
    cout << Tick << "\r\n";
}
which produces the output:
AVX PopCount
2309
256002470
So 2309 ms to complete... but what happens if we get rid of your AVX routine altogether and just make the input arrays?
int count;
count = 0;
{
    cout << "Just making arrays...\r\n";
    unsigned int Tick = GetTickCount();
    for (int i = 0; i < 1000000; i++) {
        for (int j = 0; j < 16; j++) {
            a[j] = dice();
            b[j] = dice();
        }
    }
    Tick = GetTickCount() - Tick;
    cout << Tick << "\r\n";
}
produces output:
Just making arrays...
2246
How about that. Not surprising, really, since you're generating 32 random numbers per iteration, which can be quite expensive, and then performing only some rather fast integer math and shuffles.
So...
Now let's add a factor of 100 more iterations and get the random generator out of the tight loop. Compiling here with optimizations disabled will run your code as expected and will not throw away the "useless" iterations - presumably the code we care about here is already (manually) optimized!
for (int j = 0; j < 16; j++) {
    a[j] = dice();
    b[j] = dice();
}

int count;
count = 0;
{
    cout << "AVX PopCount\r\n";
    unsigned int Tick = GetTickCount();
    for (int i = 0; i < 100000000; i++) {
        count += AVX_PopCount(a, b);
    }
    Tick = GetTickCount() - Tick;
    cout << Tick << "\r\n";
}
cout << count << "\r\n";

count = 0;
{
    cout << "SSE PopCount\r\n";
    unsigned int Tick = GetTickCount();
    for (int i = 0; i < 100000000; i++) {
        count += SSE_PopCount(a, b);
    }
    Tick = GetTickCount() - Tick;
    cout << Tick << "\r\n";
}
cout << count << "\r\n";
produces output:
AVX PopCount
3744
730196224
SSE PopCount
5616
730196224
So congratulations - you can pat yourself on the back: your AVX routine really is about a third faster than the SSE routine (tested on a Haswell i7 here). The lesson is to be sure that you are actually profiling what you think you are profiling!
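As an aside, if you would rather keep compiler optimizations enabled for this kind of benchmark, accumulating into a volatile sink is one way to stop the optimizer from throwing the loop away. A minimal sketch, not part of the measurements above:
volatile int sink = 0;                       // volatile store the compiler must preserve
unsigned int Tick = GetTickCount();
for (int i = 0; i < 100000000; i++) {
    sink = sink + AVX_PopCount(a, b);        // the result is observable via 'sink', so the call is kept
}
Tick = GetTickCount() - Tick;
cout << Tick << " (" << sink << ")\r\n";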
You should consider using the dedicated _mm_popcnt_u64 intrinsic (the POPCNT instruction) instead of hacking it in SSE or AVX. I tested all methods for popcounting thoroughly, including an SSE and AVX version (which ultimately led to my more or less famous question about popcount). _mm_popcnt_u64 outperforms SSE and AVX considerably, especially when you use a compiler which works around the Intel popcount (false dependency) bug discovered in my question. Without the bug, my Haswell is able to popcount at 26 GB/s, which almost hits the bus bandwidth.
The reason why _mm_popcnt_u64 is faster is simply that it popcounts 64 bits at once (already 1/4 of the AVX version) while requiring only one cheap processor instruction. It costs only a few cycles (latency 3, throughput 1 on Intel). Even if every AVX instruction you use required only one cycle, you would still get worse results due to the sheer number of instructions necessary for popcounting 256 bits.
Try this, it should be fastest:
#include <cstdint>
#include <nmmintrin.h>   // _mm_popcnt_u64

int popcount256(const uint64_t* u) {
    return _mm_popcnt_u64(u[0])
         + _mm_popcnt_u64(u[1])
         + _mm_popcnt_u64(u[2])
         + _mm_popcnt_u64(u[3]);
}
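Since the actual goal is the Hamming distance of two 64-byte (512-bit) blocks, the same idea would simply be applied after an XOR of the inputs. A minimal sketch along those lines (hamming512 is only an illustrative name, not something from the code above):
int hamming512(const uint64_t* a, const uint64_t* b) {
    int d = 0;
    for (int i = 0; i < 8; ++i)                  // 8 x 64 bits = 512 bits = 64 bytes
        d += (int)_mm_popcnt_u64(a[i] ^ b[i]);   // XOR marks differing bits, POPCNT counts them
    return d;
}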
I know this does not answer your core question of why AVX is slower, but since your ultimate goal is a fast popcount, the AVX <-> SSE comparison is irrelevant, as both are inferior to the built-in popcount.