I've written an algorithm that does multiple single precision operations in parallel using Intel intrinsic functions. The result of each iteration of my algorithm is the number of nonzero entries in a single 256 bit vector (__m256).
For example:
 00000000  FFFFFFFF  00000000  00000000  00000000  FFFFFFFF  FFFFFFFF  FFFFFFFF
where the result of the iteration is 4.
What is the fastest way to count the number nonzero entries in the vector?
Currently I'm doing something like this:
float results[8];
_mm256_storeu_ps(results, result_vector);
int count = 0;
for (uint32_t idx = 0; idx < 8; ++idx)
{
    if (results[idx] != 0)
    {            
        ++count;
    }
}
This approach works just fine but I wonder if there is a more efficient way to do it, perhaps one that doesn't involve a store.
The hardware popcnt instruction is your best bet here.  It's fast, and vmovmskps is also very efficient for giving you the high bit of each element as an integer bitmask.  (compare / movemask is a standard way to branch on a vector compare result, or use it to index a lookup table of shuffle masks).
movemask / popcnt can be useful when left-packing, to increment a destination pointer by the number of elements you stored (after shuffling).
#include <immintrin.h>
// use only with compare-results.
// or to count elements with their sign-bit set
unsigned count_true(__m256 v) {
    unsigned mask = _mm256_movemask_ps(v);
    return _mm_popcnt_u32(mask);
}
popcnt has a separate feature-bit from AVX, so in theory there could be a CPU (or virtual machine) with AVX but not hardware popcnt, but in practice I wouldn't worry about it.  (popcnt was introduced with SSE4.2, and AVX implies SSE4.2)
Even if you want the result in a vector register for something, vmovmskps / popcnt / movd is probably a better sequence than horizontally adding the 0 / -1 elements with integer adds.  That would take 3 shuffle/add steps to reduce 8 elements down to 1, and you'd have a negative sum.
I mostly mention this because treating compare results as integer 0 / -1 is useful in some cases.  e.g. to conditionally increment a vector of counters, cmpps / psubd does the trick.  (0 + x = x, so the false elements are unchanged.)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With