
linear search through uint64[] with SSE

I'm trying to implement a linear search through an array of uint64 using SSE instructions. I got things working for uint16 and uint32, but I get compiler errors for the uint64 code (Linux, gcc; see specs at the end).

I'm trying to compare 2x2 64-bit numbers and then somehow translate the result into an index into my array. This works well with uint32 (credit goes to http://schani.wordpress.com/2010/04/30/linear-vs-binary-search/):

#include <xmmintrin.h>
#include <smmintrin.h>

typedef ham_u64_t vec2uint64 __attribute__ ((vector_size (16)));
typedef ham_u32_t vec4uint32 __attribute__ ((vector_size (16)));
typedef float     vec4float  __attribute__ ((vector_size (16)));
typedef ham_u16_t vec8uint16 __attribute__ ((vector_size (16)));
typedef ham_u8_t  vec16uint8 __attribute__ ((vector_size (16)));

// ...

vec4uint32 v1 = _mm_loadu_si128((const __m128i *)&data[start + i + 0]);
vec4uint32 v2 = _mm_loadu_si128((const __m128i *)&data[start + i + 4]);
vec4uint32 v3 = _mm_loadu_si128((const __m128i *)&data[start + i + 8]);
vec4uint32 v4 = _mm_loadu_si128((const __m128i *)&data[start + i + 12]);

vec4uint32 cmp0 = _mm_cmpeq_epi32(key4, v1);
vec4uint32 cmp1 = _mm_cmpeq_epi32(key4, v2);
vec4uint32 cmp2 = _mm_cmpeq_epi32(key4, v3);
vec4uint32 cmp3 = _mm_cmpeq_epi32(key4, v4);

vec8uint16 pack01 = __builtin_ia32_packssdw128(cmp0, cmp1);
vec8uint16 pack23 = __builtin_ia32_packssdw128(cmp2, cmp3);
vec16uint8 pack0123 = __builtin_ia32_packsswb128(pack01, pack23);

int res = __builtin_ia32_pmovmskb128(pack0123);
if (res > 0) {
  int czt = __builtin_ctz(~res + 1);
  return (start + i + czt);
}

Here's what I came up with so far for uint64. The comparison works, I just don't know what to do with the results, and the __builtin_ia32_packssdw() call does not compile:

vec2uint64 v1 = _mm_loadu_si128((const __m128i *)&data[start + i + 0]);
vec2uint64 v2 = _mm_loadu_si128((const __m128i *)&data[start + i + 2]);

vec2uint64 cmp0 = _mm_cmpeq_epi64(key2, v1);
vec2uint64 cmp1 = _mm_cmpeq_epi64(key2, v2);

vec4uint32 pack01 = __builtin_ia32_packssdw(cmp0, cmp1); // error
vec4uint32 pack23 = _mm_set1_epi32(0);
vec16uint8 pack0123 = __builtin_ia32_packsswb128(pack01, pack23);

int res = __builtin_ia32_pmovmskb128(pack0123);
if (res > 0) {
  int czt = __builtin_ctz(~res + 1);
  return (start + i + czt);
}

The error says:

error: cannot convert 'vec1uint64 {aka __vector(2) long unsigned int}'
to '__vector(2) int' for argument '1' to '__vector(4) short int
__builtin_ia32_packssdw(__vector(2) int, __vector(2) int)'

(The typedefs for vec2uint64 are at the top, in the code for uint32.)

My environment:

Linux ws4484 3.5.0-48-generic #72~precise1-Ubuntu SMP Tue Mar 11 20:09:08 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5)

My question is not just how I can fix the compiler error, but also whether somebody has a better idea for getting the array index of the match, maybe without the whole packing step?

Thanks in advance!

cruppstahl asked Jan 11 '23 19:01


1 Answer

I suggest NOT using the GCC built-ins (__builtin_ia32_*) and implicit vector types. They only make sense if you avoid the non-GCC intrinsics (e.g. _mm_cmpeq_epi32) entirely and want to stick to GCC. With the standard intrinsics and __m128i you can do what you want like this:

__m128i key2 = _mm_set1_epi64x(key);
__m128i v1 = _mm_loadu_si128((const __m128i *)&data[start + i + 0]);
__m128i v2 = _mm_loadu_si128((const __m128i *)&data[start + i + 2]);

__m128i cmp0 = _mm_cmpeq_epi64(key2, v1);
__m128i cmp1 = _mm_cmpeq_epi64(key2, v2);

__m128i low2  = _mm_shuffle_epi32(cmp0,0xD8);  
__m128i high2 = _mm_shuffle_epi32(cmp1,0xD8);      
__m128i pack = _mm_unpacklo_epi64(low2,high2);

__m128i pack01 = _mm_packs_epi32(pack, _mm_setzero_si128());
__m128i pack0123 = _mm_packs_epi16(pack01, _mm_setzero_si128());

int res =  _mm_movemask_epi8(pack0123);

You can probably find a more efficient version that avoids the packing, but then you would have to use a different function than __builtin_ctz.

For 32-bit ints I suggest:
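One possible way to avoid the packing is sketched below. It is an assumption on my part, not something taken from the code above: _mm_movemask_pd collects the top bit of each 64-bit lane, so each comparison result yields a 2-bit mask directly. Because the two 2-bit masks are merged into a single 4-bit mask, __builtin_ctz still works in this variant. The function name find_u64, the -1 "not found" return value, and the scalar tail loop are hypothetical choices for illustration; the target attribute is GCC/Clang-specific and only there so the block compiles without -msse4.1.

```c
#include <stddef.h>
#include <stdint.h>
#include <smmintrin.h>  /* SSE4.1: _mm_cmpeq_epi64 */

/* Hypothetical helper: index of the first element equal to key, or -1. */
__attribute__((target("sse4.1")))
static ptrdiff_t find_u64(const uint64_t *data, size_t n, uint64_t key)
{
    const __m128i key2 = _mm_set1_epi64x((long long)key);
    size_t i = 0;

    /* Compare 4 elements (2 vectors of 2 x 64 bit) per iteration. */
    for (; i + 4 <= n; i += 4) {
        __m128i v1 = _mm_loadu_si128((const __m128i *)&data[i + 0]);
        __m128i v2 = _mm_loadu_si128((const __m128i *)&data[i + 2]);

        __m128i cmp0 = _mm_cmpeq_epi64(key2, v1);
        __m128i cmp1 = _mm_cmpeq_epi64(key2, v2);

        /* One mask bit per 64-bit lane; bits 0-1 from cmp0, bits 2-3 from cmp1. */
        int res = _mm_movemask_pd(_mm_castsi128_pd(cmp0))
                | (_mm_movemask_pd(_mm_castsi128_pd(cmp1)) << 2);

        if (res != 0)
            return (ptrdiff_t)(i + (size_t)__builtin_ctz((unsigned)res));
    }

    /* Scalar tail for the last n % 4 elements. */
    for (; i < n; i++)
        if (data[i] == key)
            return (ptrdiff_t)i;

    return -1;
}
```

Whether this is actually faster than the packing version depends on the CPU and the surrounding loop, so it is worth measuring both.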

__m128i key4 = _mm_set1_epi32(key);
__m128i v1 = _mm_loadu_si128((const __m128i *)&data[start + i + 0]);
__m128i v2 = _mm_loadu_si128((const __m128i *)&data[start + i + 4]);
__m128i v3 = _mm_loadu_si128((const __m128i *)&data[start + i + 8]);
__m128i v4 = _mm_loadu_si128((const __m128i *)&data[start + i + 12]);

__m128i cmp0 = _mm_cmpeq_epi32(key4, v1);
__m128i cmp1 = _mm_cmpeq_epi32(key4, v2);
__m128i cmp2 = _mm_cmpeq_epi32(key4, v3);
__m128i cmp3 = _mm_cmpeq_epi32(key4, v4);

__m128i pack01 = _mm_packs_epi32(cmp0, cmp1);
__m128i pack23 = _mm_packs_epi32(cmp2, cmp3);
__m128i pack0123 = _mm_packs_epi16(pack01, pack23);

int res = _mm_movemask_epi8(pack0123);
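For completeness, here is a minimal sketch of how the 32-bit snippet above could be wrapped into a full search function. The name find_u32, the -1 "not found" return value, and the scalar tail loop are assumptions for illustration, not part of the original code. After the two packing steps every element is reduced to one byte (0xFF for a match, 0x00 otherwise), so bit k of the movemask result corresponds to element k of the 16-element block and __builtin_ctz gives the offset of the first match. Only SSE2 is needed here.

```c
#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 */

/* Hypothetical helper: index of the first element equal to key, or -1. */
static ptrdiff_t find_u32(const uint32_t *data, size_t n, uint32_t key)
{
    const __m128i key4 = _mm_set1_epi32((int)key);
    size_t i = 0;

    /* Compare 16 elements (4 vectors of 4 x 32 bit) per iteration. */
    for (; i + 16 <= n; i += 16) {
        __m128i v1 = _mm_loadu_si128((const __m128i *)&data[i + 0]);
        __m128i v2 = _mm_loadu_si128((const __m128i *)&data[i + 4]);
        __m128i v3 = _mm_loadu_si128((const __m128i *)&data[i + 8]);
        __m128i v4 = _mm_loadu_si128((const __m128i *)&data[i + 12]);

        __m128i cmp0 = _mm_cmpeq_epi32(key4, v1);
        __m128i cmp1 = _mm_cmpeq_epi32(key4, v2);
        __m128i cmp2 = _mm_cmpeq_epi32(key4, v3);
        __m128i cmp3 = _mm_cmpeq_epi32(key4, v4);

        /* Narrow 32 -> 16 -> 8 bits; signed saturation keeps -1 as all ones. */
        __m128i pack01 = _mm_packs_epi32(cmp0, cmp1);
        __m128i pack23 = _mm_packs_epi32(cmp2, cmp3);
        __m128i pack0123 = _mm_packs_epi16(pack01, pack23);

        int res = _mm_movemask_epi8(pack0123);
        if (res != 0)
            return (ptrdiff_t)(i + (size_t)__builtin_ctz((unsigned)res));
    }

    /* Scalar tail for the last n % 16 elements. */
    for (; i < n; i++)
        if (data[i] == key)
            return (ptrdiff_t)i;

    return -1;
}
```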
Z boson answered Jan 17 '23 14:01