
Getting GCC to generate a PTEST instruction when using vector extensions

When using the GCC vector extensions for C, how can I check that all the values in a vector are zero?

For instance:

#include <stdint.h>

typedef uint32_t v8ui __attribute__ ((vector_size (32)));

v8ui*
foo(v8ui *mem) {
    v8ui v;
    for ( v = (v8ui){ 1, 1, 1, 1, 1, 1, 1, 1 };
          v[0] || v[1] || v[2] || v[3] || v[4] || v[5] || v[6] || v[7];
          mem++)
        v &= *(mem);

    return mem;
}

SSE4.1 has the PTEST instruction (VPTEST in its AVX form), which can perform a test like the one used as the for condition, but the code generated by GCC just unpacks the vector and checks the elements one by one:

.L2:
        vandps  (%rax), %ymm1, %ymm1
        vmovdqa %xmm1, %xmm0
        addq    $32, %rax
        vmovd   %xmm0, %edx
        testl   %edx, %edx
        jne     .L2
        vpextrd $1, %xmm0, %edx
        testl   %edx, %edx
        jne     .L2
        vpextrd $2, %xmm0, %edx
        testl   %edx, %edx
        jne     .L2
        vpextrd $3, %xmm0, %edx
        testl   %edx, %edx
        jne     .L2
        vextractf128    $0x1, %ymm1, %xmm0
        vmovd   %xmm0, %edx
        testl   %edx, %edx
        jne     .L2
        vpextrd $1, %xmm0, %edx
        testl   %edx, %edx
        jne     .L2
        vpextrd $2, %xmm0, %edx
        testl   %edx, %edx
        jne     .L2
        vpextrd $3, %xmm0, %edx
        testl   %edx, %edx
        jne     .L2
        vzeroupper
        ret

Is there any way to get GCC to generate an efficient test for that without resorting to intrinsics?

Update: For reference, code using an unportable GCC builtin for (V)PTEST:

#include <stdint.h>

typedef uint32_t v8ui __attribute__ ((vector_size (32)));
/* Four 64-bit lanes: the operand type that __builtin_ia32_ptestz256 takes. */
typedef long long int v4si __attribute__ ((vector_size (32)));

const v8ui ones = { 1, 1, 1, 1, 1, 1, 1, 1 };

v8ui*
foo(v8ui *mem) {
    v8ui v;
    for ( v = ones;
          !__builtin_ia32_ptestz256((v4si)v,
                                    (v4si)ones);
          mem++)
        v &= *(mem);

    return mem;
}
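
As a reading aid for the loop condition above: the ptestz form of (V)PTEST yields 1 when the bitwise AND of its two operands is zero in every bit, and 0 otherwise. Since v starts as all ones and is only ever ANDed, testing it against ones covers every bit that can still be set. The following is a minimal scalar sketch of that result, not what GCC emits; the name model_ptestz and the lane-array interface are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar model of the ptestz result:
   returns 1 when (a[i] & b[i]) == 0 for every lane, 0 otherwise. */
static int
model_ptestz(const uint32_t *a, const uint32_t *b, size_t lanes)
{
    uint32_t any = 0;

    for (size_t i = 0; i < lanes; i++)
        any |= a[i] & b[i];   /* collect every bit that survives the AND */

    return any == 0;
}
```

With this model, the loop keeps running while model_ptestz(v, ones, 8) is 0, that is, while at least one element of v is still 1.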

salva asked Apr 06 '15 13:04

2 Answers

gcc 4.9.2 with -O3 -mavx2 (in 64-bit mode) didn't realize it could use ptest for this, whether the condition is written with || or with |.

The | version extracts the vector elements with vmovd and vpextrd, and combines them with seven "or" instructions between 32-bit registers. So it's pretty bad, and doesn't take advantage of any simplification that would still produce the same logical truth value.

The || version is just as bad, and does the same extract-an-element-at-a-time, but does a test / jne for every one.

So at this point, you can't count on GCC recognizing tests like this and doing anything remotely efficient. (pcmpeq / movmsk / test is another sequence that wouldn't be bad, but gcc doesn't generate that either.)
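
For illustration, here is roughly what the pcmpeq / movmsk sequence looks like when written by hand for a 128-bit vector. This is only a sketch: it uses SSE2 intrinsics (so it does not answer the "no intrinsics" part of the question), the type v4ui and the helper name all_zero_v4ui are made up for this example, and a plain memcmp fallback is included so the code still builds on targets without SSE2.

```c
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

typedef uint32_t v4ui __attribute__ ((vector_size (16)));

/* Hypothetical helper: returns 1 when every element of v is zero. */
static int
all_zero_v4ui(v4ui v)
{
#if defined(__SSE2__)
    __m128i x  = (__m128i)v;                               /* same 16 bytes, reinterpreted */
    __m128i eq = _mm_cmpeq_epi32(x, _mm_setzero_si128());  /* pcmpeqd: all-ones in each zero lane */
    return _mm_movemask_epi8(eq) == 0xFFFF;                /* pmovmskb: one mask bit per byte */
#else
    /* Fallback for targets without SSE2: compare against 16 zero bytes. */
    static const unsigned char zero[sizeof (v4ui)] = { 0 };
    return memcmp(&v, zero, sizeof v) == 0;
#endif
}
```

The final comparison against 0xFFFF stands in for the "test" step; the exact instructions a compiler picks for it may differ.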

Peter Cordes answered Oct 01 '22 17:10


Wouldn't vptest help? If you are looking at performance, sometimes you'll be surprised by what the native type can provide. Here is some code that uses vanilla memcmp() and also the vptest instruction (employed via the corresponding intrinsic). I did not time the functions.

#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <immintrin.h>

typedef uint32_t v8ui __attribute__ ((vector_size (32)));

v8ui*
foo1(v8ui *mem)
{   
    v8ui v = (v8ui){ 1, 1, 1, 1, 1, 1, 1, 1 };

    if (memcmp(mem, &v, sizeof (v8ui)) == 0) {
            printf("Ones\n");
    } else {
            printf("NOT Ones\n");
    }

    return mem;
}

v8ui*
foo2(v8ui *mem)
{   
    v8ui v = (v8ui){ 1, 1, 1, 1, 1, 1, 1, 1 };
    __m256i a, b;

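    a = _mm256_loadu_si256((__m256i *)&v);
    /* Load the 32 bytes mem points to; &mem would be the address of the pointer itself. */
    b = _mm256_loadu_si256((__m256i *)mem);

    /* testz(x, x) returns 1 only when x is all zero, so XOR the inputs to test for equality. */
    a = _mm256_xor_si256(a, b);
    if (!_mm256_testz_si256(a, a)) {
            printf("NOT Ones\n");
    } else {
            printf("Ones\n");
    }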

    return mem;
}

int
main()
{
    v8ui v = (v8ui){ 1, 1, 1, 1, 1, 1, 1, 1 };
    foo1(&v);
    foo2(&v);
}

Compile flags:

gcc -mavx2 foo.c

Doh! Only now did I see that you wanted to get GCC to generate the vptest instruction without using the intrinsics. I'll leave the code around anyway.

pavan answered Oct 01 '22 18:10