Quickly count number of zero-valued bytes in an array

What's a speedy method to count the number of zero-valued bytes in a large, contiguous array? (Or conversely, the number of non-zero bytes.) By large, I mean 2^16 bytes or larger. The array's position and length may have any byte alignment.

Naive way:

int countZeroBytes(const unsigned char* values, int length)
{
    int zeroCount = 0;
    for (int i = 0; i < length; ++i)
        if (values[i] == 0)
            ++zeroCount;

    return zeroCount;
}

For my problem, I usually just maintain zeroCount and update it based on specific changes to values. However, I'd like to have a fast, general method of re-computing zeroCount after an arbitrary bulk change to values occurs. I'm sure there's a bit-twiddly method of accomplishing this more quickly, but alas, I'm but a novice twiddler.
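
For the incremental case, the bookkeeping is just a compare of the old and new values on each write (a minimal sketch; the Chunk struct and setByte name are illustrative, not from the question):

typedef struct {
    unsigned char* values;
    int zeroCount;
} Chunk;

//update one byte while keeping zeroCount consistent with the array
void setByte(Chunk* c, int i, unsigned char v) {
    c->zeroCount += (v == 0) - (c->values[i] == 0); //adjust by the delta
    c->values[i] = v;
}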

EDIT: A few people have asked about the nature of the data being zero-checked, so I'll describe it. (It'd be nice if solutions were still general, though.)

Basically, envision a world composed of voxels (e.g. Minecraft), with procedurally generated terrain segregated into cubic chunks, effectively pages of memory indexed as three-dimensional arrays. Each voxel is fly-weighted as a unique byte corresponding to a unique material (air, stone, water, etc.). Many chunks contain only air or water, while others contain varying combinations of 2-4 voxel types in large quantities (dirt, sand, etc.), with roughly 2-10% of voxels being random outliers. Voxels existing in large quantities tend to be highly clustered along every axis.

It seems as though a zero-byte-counting method would be useful in a number of unrelated scenarios, though. Hence, the desire for a general solution.

asked Jan 04 '14 by Philip Guin


2 Answers

This is a special case of How to count character occurrences using SIMD, with c=0 as the char (byte) value to count matches for. See that Q&A for a well-optimized, manually-vectorized AVX2 implementation of char_count(char const* vector, size_t size, char c); with a much tighter inner loop than this, which avoids reducing each vector of 0/-1 matches to scalar separately.
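
For reference, here is a minimal sketch of that tighter-loop idea (my own sketch assuming AVX2, not the linked answer's exact code): subtract the 0/-1 compare results into byte accumulators, and only do a horizontal reduction once every 255 vectors, so the hot loop is just a load, a compare, and a subtract.

#include <immintrin.h>
#include <stddef.h>

//Sketch only: assumes AVX2 and that size is a multiple of 32
//(a scalar tail loop would handle any remainder).
size_t count_byte_avx2(const char* data, size_t size, char c) {
    const __m256i needle = _mm256_set1_epi8(c);
    __m256i totals = _mm256_setzero_si256();    //four 64-bit partial sums
    size_t i = 0;
    while (i < size) {
        //each byte lane of acc can count at most 255 matches before
        //overflowing, so reduce at least once every 255 iterations
        __m256i acc = _mm256_setzero_si256();
        size_t chunk_end = (i + 255*32 < size) ? i + 255*32 : size;
        for (; i < chunk_end; i += 32) {
            __m256i v  = _mm256_loadu_si256((const __m256i*)&data[i]);
            __m256i eq = _mm256_cmpeq_epi8(v, needle); //-1 where data[i]==c
            acc = _mm256_sub_epi8(acc, eq);            //acc -= -1 is acc += 1
        }
        //widen the byte counts to 64-bit lanes and accumulate
        totals = _mm256_add_epi64(totals,
                                  _mm256_sad_epu8(acc, _mm256_setzero_si256()));
    }
    __m128i s = _mm_add_epi64(_mm256_castsi256_si128(totals),
                              _mm256_extracti128_si256(totals, 1));
    s = _mm_add_epi64(s, _mm_unpackhi_epi64(s, s));
    return (size_t)_mm_cvtsi128_si64(s);
}

To count zeros, call it as count_byte_avx2(values, n, 0).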


This is O(n), so the best you can do is decrease the constant factor. One quick fix is to remove the branch. This gives a result as fast as my SSE version below if the zeros are randomly distributed, likely because GCC vectorizes this loop. However, for long runs of zeros, or for a random density of zeros below 1%, the SSE version below is still faster.

int countZeroBytes_fix(char* values, int length) {
    int zeroCount = 0;
    for(int i=0; i<length; i++) {
        zeroCount += values[i] == 0;
    }
    return zeroCount;
}

I originally thought that the density of zeros would matter. That turns out not to be the case, at least with SSE. Using SSE is a lot faster, largely independent of the density.

Edit: actually, it does depend on the density; the density of zeros just has to be smaller than I expected. 1/64 zeros (1.5% zeros) means on average one zero per four SSE registers, so the branch predicts poorly. However, at 1/1024 zeros (0.1% zeros) the branch pays off and the SSE version is faster (see the table of times).

SIMD is even faster if the data has long runs of zeros.

You can pack 16 bytes into an SSE register and compare all 16 bytes at once with zero using _mm_cmpeq_epi8. Then, to handle runs of zeros, you can call _mm_movemask_epi8 on the result; most of the time the mask will be all zeros (no matches) or all ones (16 matches), and neither case needs a horizontal sum. You could get a speedup of up to 16x this way (for the first half ones and second half zeros I got over a 12x speedup).

Here is a table of times in seconds for 2^16 bytes (with a repeat of 10000).

                     1.5% zeros  50% zeros  0.1% zeros 1st half 1, 2nd half 0
countZeroBytes       0.8s        0.8s       0.8s        0.95s
countZeroBytes_fix   0.16s       0.16s      0.16s       0.16s
countZeroBytes_SSE   0.2s        0.15s      0.10s       0.07s

You can see the results for the last-half-zeros case at http://coliru.stacked-crooked.com/a/67a169ddb03d907a

#include <stdio.h>
#include <stdlib.h>
#include <emmintrin.h>                 // SSE2
#include <omp.h>

int countZeroBytes(char* values, int length) {
    int zeroCount = 0;
    for(int i=0; i<length; i++) {
        if (!values[i])
            ++zeroCount;
    }
    return zeroCount;
}

//Note: assumes length is a multiple of 16 (a scalar loop would handle the tail)
int countZeroBytes_SSE(char* values, int length) {
    int zeroCount = 0;
    __m128i zero16 = _mm_set1_epi8(0);
    __m128i and16 = _mm_set1_epi8(1);
    for(int i=0; i<length; i+=16) {
        __m128i values16 = _mm_loadu_si128((__m128i*)&values[i]);
        __m128i cmp = _mm_cmpeq_epi8(values16, zero16);
        int mask = _mm_movemask_epi8(cmp);
        if(mask) {
            if(mask == 0xffff) zeroCount += 16; //all 16 bytes are zero
            else {
                cmp = _mm_and_si128(and16, cmp); //change -1 values to 1
                //horizontal sum of 16 bytes: _mm_sad_epu8 leaves two partial
                //sums in the low and high 64-bit halves; shuffle and add them
                __m128i sum1 = _mm_sad_epu8(cmp,zero16);
                __m128i sum2 = _mm_shuffle_epi32(sum1,2);
                __m128i sum3 = _mm_add_epi16(sum1,sum2);
                zeroCount += _mm_cvtsi128_si32(sum3);
            }
        }
    }
    return zeroCount;
}

int main() {
    const int n = 1<<16;
    const int repeat = 10000;
    char *values = (char*)_mm_malloc(n, 16);
    for(int i=0; i<n; i++) values[i] = rand()%64;  //1.5% zeros
    //for(int i=0; i<n/2; i++) values[i] = 1;
    //for(int i=n/2; i<n; i++) values[i] = 0;
    
    int zeroCount = 0;
    double dtime;
    dtime = omp_get_wtime();
    for(int i=0; i<repeat; i++) zeroCount = countZeroBytes(values,n);
    dtime = omp_get_wtime() - dtime;
    printf("zeroCount %d, time %f\n", zeroCount, dtime);
    dtime = omp_get_wtime();
    for(int i=0; i<repeat; i++) zeroCount = countZeroBytes_SSE(values,n);
    dtime = omp_get_wtime() - dtime;
    printf("zeroCount %d, time %f\n", zeroCount, dtime);       
}
answered Sep 20 '22 by 16 revs, 2 users 98%


I've come up with this OpenMP implementation, which may take advantage of the array being in the local cache of each processor by reading it in parallel.

nzeros_total = 0;
#pragma omp parallel for reduction(+:nzeros_total)
for (i = 0; i < NDATA; i++)
{
    if (v[i] == 0)
        nzeros_total++;
}
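
Wrapped up as a self-contained function (a minimal sketch; the count_zeros_omp name and signature are mine, not from the answer):

//compile with -fopenmp (GCC/Clang) or /openmp (MSVC); without OpenMP the
//pragma is ignored and this degrades gracefully to the plain serial loop
long count_zeros_omp(const char* v, long n) {
    long nzeros_total = 0;
    #pragma omp parallel for reduction(+:nzeros_total)
    for (long i = 0; i < n; i++) {
        if (v[i] == 0)
            nzeros_total++;
    }
    return nzeros_total;
}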

A quick benchmark: running the naive implementation (the same one the OP wrote in the question) and the OpenMP implementation 1000 times each, taking the best time for each method, on an array of 65536 ints with a 50% probability of each element being zero, under Windows 7 on a quad-core CPU, compiled with VStudio 2012 Ultimate, yields these numbers:

               DEBUG               RELEASE
Naive method:  580 microseconds.   341 microseconds.
OpenMP method: 159 microseconds.    99 microseconds.

NOTE: I tried #pragma loop(hint_parallel(4)), but apparently it didn't make the naive version perform any better, so my guess is that the compiler was already applying this optimization, or that it couldn't be applied at all. Also, #pragma loop(no_vector) didn't make the naive version perform worse.

answered Sep 23 '22 by mcleod_ideafix