I've written an algorithm that does multiple single precision operations in parallel using Intel intrinsic functions. The result of each iteration of my algorithm is the number of nonzero entries in a single 256 bit vector (__m256
).
For example:
00000000 FFFFFFFF 00000000 00000000 00000000 FFFFFFFF FFFFFFFF FFFFFFFF
where the result of the iteration is 4.
What is the fastest way to count the number nonzero entries in the vector?
Currently I'm doing something like this:
float results[8];
_mm256_storeu_ps(results, result_vector);
int count = 0;
for (uint32_t idx = 0; idx < 8; ++idx)
{
if (results[idx] != 0)
{
++count;
}
}
This approach works just fine but I wonder if there is a more efficient way to do it, perhaps one that doesn't involve a store.
The hardware popcnt
instruction is your best bet here. It's fast, and vmovmskps
is also very efficient for giving you the high bit of each element as an integer bitmask. (compare / movemask is a standard way to branch on a vector compare result, or use it to index a lookup table of shuffle masks).
movemask / popcnt can be useful when left-packing, to increment a destination pointer by the number of elements you stored (after shuffling).
#include <immintrin.h>
// use only with compare-results.
// or to count elements with their sign-bit set
unsigned count_true(__m256 v) {
unsigned mask = _mm256_movemask_ps(v);
return _mm_popcnt_u32(mask);
}
popcnt
has a separate feature-bit from AVX, so in theory there could be a CPU (or virtual machine) with AVX but not hardware popcnt
, but in practice I wouldn't worry about it. (popcnt
was introduced with SSE4.2, and AVX implies SSE4.2)
Even if you want the result in a vector register for something, vmovmskps / popcnt / movd is probably a better sequence than horizontally adding the 0
/ -1
elements with integer adds. That would take 3 shuffle/add steps to reduce 8 elements down to 1, and you'd have a negative sum.
I mostly mention this because treating compare results as integer 0
/ -1
is useful in some cases. e.g. to conditionally increment a vector of counters, cmpps
/ psubd
does the trick. (0 + x = x
, so the false elements are unchanged.)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With