Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Fast counting the number of equal bytes between two arrays [duplicate]

Tags:

c++

c

simd

sse

sse2

I wrote the function int compare_16bytes(__m128i lhs, __m128i rhs) in order to compare two 16 byte numbers using SSE instructions: this function returns how many bytes are equal after performing the comparison.

Now I would like use the above function in order to compare two byte arrays of arbitrary length: the length may not be a multiple of 16 bytes, so I need deal with this problem. How could I complete the implementation of the function below? How could I improve the function below?

int fast_compare(const char* s, const char* t, int length)
{
    int result = 0;

    const char* sPtr = s;
    const char* tPtr = t;

    while(...)
    {
        const __m128i* lhs = (const __m128i*)sPtr;
        const __m128i* rhs = (const __m128i*)tPtr;

        // compare the next 16 bytes of s and t
        result += compare_16bytes(*lhs,*rhs);

        sPtr += 16;
        tPtr += 16;
    }

    return result;
}
like image 300
enzom83 Avatar asked Mar 09 '13 17:03

enzom83


2 Answers

As @Mysticial says in the comments above, do the compare and sum vertically and then just sum horizontally at the end of the main loop:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <emmintrin.h>

// reference implementation
int fast_compare_ref(const char *s, const char *t, int length)
{
    int result = 0;
    int i;

    for (i = 0; i < length; ++i)
    {
        if (s[i] == t[i])
            result++;
    }
    return result;
}

// optimised implementation
int fast_compare(const char *s, const char *t, int length)
{
    int result = 0;
    int i;

    __m128i vsum = _mm_set1_epi32(0);
    for (i = 0; i < length - 15; i += 16)
    {
        __m128i vs, vt, v, vh, vl, vtemp;

        vs = _mm_loadu_si128((__m128i *)&s[i]); // load 16 chars from input
        vt = _mm_loadu_si128((__m128i *)&t[i]);
        v = _mm_cmpeq_epi8(vs, vt);             // compare
        vh = _mm_unpackhi_epi8(v, v);           // unpack compare result into 2 x 8 x 16 bit vectors
        vl = _mm_unpacklo_epi8(v, v);
        vtemp = _mm_madd_epi16(vh, vh);         // accumulate 16 bit vectors into 4 x 32 bit partial sums
        vsum = _mm_add_epi32(vsum, vtemp);
        vtemp = _mm_madd_epi16(vl, vl);
        vsum = _mm_add_epi32(vsum, vtemp);
    }

    // get sum of 4 x 32 bit partial sums
    vsum = _mm_add_epi32(vsum, _mm_srli_si128(vsum, 8));
    vsum = _mm_add_epi32(vsum, _mm_srli_si128(vsum, 4));
    result = _mm_cvtsi128_si32(vsum);

    // handle any residual bytes ( < 16)
    if (i < length)
    {
        result += fast_compare_ref(&s[i], &t[i], length - i);
    }

    return result;
}

// test harness
int main(void)
{
    const int n = 1000000;
    char *s = malloc(n);
    char *t = malloc(n);
    int i, result_ref, result;

    srand(time(NULL));

    for (i = 0; i < n; ++i)
    {
        s[i] = rand();
        t[i] = rand();
    }

    result_ref = fast_compare_ref(s, t, n);
    result = fast_compare(s, t, n);

    printf("result_ref = %d, result = %d\n", result_ref, result);;

    return 0;
}

Compile and run the above test harness:

$ gcc -Wall -O3 -msse3 fast_compare.c -o fast_compare
$ ./fast_compare
result_ref = 3955, result = 3955
$ ./fast_compare
result_ref = 3947, result = 3947
$ ./fast_compare
result_ref = 3945, result = 3945

Note that there is one possibly non-obvious trick in the above SSE code where we use _mm_madd_epi16 to unpack and accumulate 16 bit 0/-1 values to 32 bit partial sums. We take advantage of the fact that -1*-1 = 1 (and 0*0 = 0 of course) - we're not really doing a multiply here, just unpacking and summing in one instruction.


UPDATE: as noted in the comments below, this solution is not optimal - I just took a fairly optimal 16 bit solution and added 8 bit to 16 bit unpacking to make it work for 8 bit data. However for 8 bit data there are more efficient methods, e.g. using psadbw/_mm_sad_epu8. I'll leave this answer here for posterity, and for anyone who might want to do this kind of thing with 16 bit data, but really one of the other answers which doesn't require unpacking the input data should be the accepted answer.

like image 181
Paul R Avatar answered Oct 31 '22 00:10

Paul R


Using partial sums in 16 x uint8 elements may give even better performance.
I have divided the loop into inner loop and outer loop.
The inner loop sum uint8 elements (each uint8 element can sum up to 255 "1"s).
Small trick: _mm_cmpeq_epi8 set equal elements to 0xFF, and (char)0xFF = -1, so you can subtract the result from the sum (subtract -1 for adding 1).

Here is my optimized version for fast_compare:

int fast_compare2(const char *s, const char *t, int length)
{
    int result = 0;
    int inner_length = length;
    int i;
    int j = 0;

    //Points beginning of 4080 elements block.
    const char *s0 = s;
    const char *t0 = t;


    __m128i vsum = _mm_setzero_si128();

    //Outer loop sum result of 4080 sums.
    for (i = 0; i < length; i += 4080)
    {
        __m128i vsum_uint8 = _mm_setzero_si128(); //16 uint8 sum elements (each uint8 element can sum up to 255).
        __m128i vh, vl, vhl, vhl_lo, vhl_hi;

        //Points beginning of 4080 elements block.
        s0 = s + i;
        t0 = t + i;

        if (i + 4080 <= length)
        {
            inner_length = 4080;
        }
        else
        {
            inner_length = length - i;
        }

        //Inner loop - sum up to 4080 (compared) results.
        //Each uint8 element can sum up to 255. 16 uint8 elements can sum up to 255*16 = 4080 (compared) results.
        //////////////////////////////////////////////////////////////////////////
        for (j = 0; j < inner_length-15; j += 16)
        {
              __m128i vs, vt, v;

              vs = _mm_loadu_si128((__m128i *)&s0[j]); // load 16 chars from input
              vt = _mm_loadu_si128((__m128i *)&t0[j]);
              v = _mm_cmpeq_epi8(vs, vt);             // compare - set to 0xFF where equal, and 0 otherwise.

              //Consider this: (char)0xFF = (-1)
              vsum_uint8 = _mm_sub_epi8(vsum_uint8, v); //Subtract the comparison result - subtract (-1) where equal.
        }
        //////////////////////////////////////////////////////////////////////////

        vh = _mm_unpackhi_epi8(vsum_uint8, _mm_setzero_si128());        // unpack result into 2 x 8 x 16 bit vectors
        vl = _mm_unpacklo_epi8(vsum_uint8, _mm_setzero_si128());
        vhl = _mm_add_epi16(vh, vl);    //Sum high and low as uint16 elements.

        vhl_hi = _mm_unpackhi_epi16(vhl, _mm_setzero_si128());   //unpack sum of vh an vl into 2 x 4 x 32 bit vectors
        vhl_lo = _mm_unpacklo_epi16(vhl, _mm_setzero_si128());   //unpack sum of vh an vl into 2 x 4 x 32 bit vectors

        vsum = _mm_add_epi32(vsum, vhl_hi);
        vsum = _mm_add_epi32(vsum, vhl_lo);
    }

    // get sum of 4 x 32 bit partial sums
    vsum = _mm_add_epi32(vsum, _mm_srli_si128(vsum, 8));
    vsum = _mm_add_epi32(vsum, _mm_srli_si128(vsum, 4));
    result = _mm_cvtsi128_si32(vsum);

    // handle any residual bytes ( < 16)
    if (j < inner_length)
    {
        result += fast_compare_ref(&s0[j], &t0[j], inner_length - j);
    }

    return result;
}
like image 35
Rotem Avatar answered Oct 31 '22 00:10

Rotem