Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Is this function a good candidate for SIMD on Intel?

I'm trying to optimize the following function (simplified a bit, but it's the loop where my program spends a lot of its time):

int f(int len, unsigned char *p) {
  int i = 0;
  while (i < len && p[i] >= 32 && p[i] <= 127) {
      i++;
  }
  return i;
}

I think it could be optimized with vector instructions, but from a bit of research, it looks like SSE is not meant for working at the byte level. This program targets only 64-bit Intel CPUs on OSX. Is there a clever bit-twiddling trick I'm not seeing that would let me work on 64 bits at a time? llvm with -O3 doesn't do any clever optimizations.

UPDATE:

The SIMD code is generally fastest on my benchmark (depending on the size of the input), but for some reason the app overall is slower with SIMD than with the naïve code or the bit-twiddling trick. For context, the application is finding the length of subsequences of ASCII strings in a stream of input to a terminal emulator. ASCII strings get special "fast path" treatment. I could only mark one answer as correct but both were great. I did make one small improvement to the bit twiddling, removing an if statement by doing this:

        while (i < len - 8) {
            uint64_t bytes = *(uint64_t *)(p + i);
            uint64_t middleBits = bytes & 0x6060606060606060;
            uint64_t highBits = bytes & 0x8080808080808080;
            middleBits |= (middleBits >> 1);
            middleBits &= ~(highBits >> 2);
            if ((middleBits & 0x2020202020202020) != 0x2020202020202020) {
                break;
            }
            i += 8;
        }
like image 488
George Avatar asked Mar 06 '14 08:03

George


2 Answers

I am not sure if this is the answer to your question nor if this will considerably speed up your code, but this is an idea coming on my mind. Since 32 equals to 2^5, if a byte is between 32 and 128 it must have 6th or 7th bit set and 8th bit cleared. You can extend testing to 64-bit integers giving me a code like:

// check whether each byte is in range 32 - 128.
unsigned bytesInRange(unsigned long long x) {
    unsigned long long y, z;
    if ((x & 0x8080808080808080LL) != 0) return(0);
    y = x >> 1;
    z = x | y;
    if ((z & 0x2020202020202020LL) == 0x2020202020202020LL) return(1);
    return(0);
}

int f(int len, unsigned char *p) {
  int i = 0;
  int len8 = len / 8;
  unsigned long long *q = (unsigned long long *) p;
  while (i < len8 && bytesInRange(q[i])) {
    i++;
  }

  i = i * 8;
  while (i < len && p[i] >= 32 && p[i] <= 127) {
    i++;
  }
  return i;
}

For architectures where alignment is required it needs to be checked before the first loop.

like image 158
Marian Avatar answered Sep 22 '22 12:09

Marian


You can vectorize the compares using _mm_cmplt_epi8 and _mm_cmpgt_epi8 (msvc intrinsics).

You can then use a movemask on the result of ANDing the compare results. If the result of the movemask is 0xFFFF then all comparisons passed. Otherwise you need to run the tail loop to find out the correct position that failed the test. You could figure this out from the mask, but depending upon the value of 'len' it may not be worth the effort.

The original unvectorized loop for the tail is also required if 'len' is not a multiple of 16. It may or may not be faster -- you'd need to profile it to be sure.

scrap that - the compares operate on signed values and it doesnt work..

Working version below.

union UmmU8 {
    __m128i mm_;
    struct {
        unsigned char u8_;
    };
};

int f(int len, unsigned char *p) {
    int i = 0;
    __m128i A;
    __m128i B;
    __m128i C;
    UmmU8* pu = (UmmU8*)p;    
    int const len16 = len / 16;
    while (i < len16) {
        A = pu[i].mm_;
        B = _mm_slli_epi32(A, 1);
        C = _mm_slli_epi32(A, 2);
        B = _mm_or_si128(B, C);
        A = _mm_andnot_si128(A, B);

        int mask = _mm_movemask_epi8(A);
        if (mask == 0xFFFF) {
            ++i;
        }
        else {
            if (mask == 0) {
                return i * 16;
            }
            break;
        }
    }
    i *= 16;
    while (i < len && p[i] >= 32 && p[i] <= 127) {
        i++;
    }
    return i;
}

Since I don't have a 64 OS on this PC, I cant do a proper perf test. However, a profiling run gave:

  • naive loop: 30.44
  • 64bit integer: 15.22 (on 32bit OS)
  • SSE impl: 5.21

So the SSE version is a lot faster than the naive loop version. I'd expect the 64bit in version to perform significantly better on 64bit system - there may be little difference between the SSE and 64bit versions.

like image 30
Pete Avatar answered Sep 21 '22 12:09

Pete