
SSE intrinsics check zero flag

Tags: c++, intrinsics

I was wondering if it was possible to check the processor's flags register by the means of Intel's SSE intrinsic functions?

For example:

int idx = _mm_cmpistri(mmrange, mmstr, 0x14);
int zero = _mm_cmpistrz(mmrange, mmstr, 0x14);

In this example the compiler is able to fold those two intrinsics into a single instruction (pcmpistri) and then test the flags register with a conditional jump (jz).
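A minimal, self-contained sketch of that first pair of intrinsics, for anyone who wants to look at the generated code themselves. The names ScanResult, load_padded, scan_outside_range and the SSE42_FN macro are hypothetical and not part of the original code; the macro assumes GCC or Clang, where SSE4.2 has to be enabled per function or with -msse4.2.

```cpp
#include <immintrin.h>
#include <cstring>

// Assumption: GCC/Clang need SSE4.2 enabled per function (or -msse4.2 globally).
#if defined(__GNUC__) || defined(__clang__)
#define SSE42_FN __attribute__((target("sse4.2")))
#else
#define SSE42_FN
#endif

// Hypothetical result type, not from the original post.
struct ScanResult {
    int idx;   // index of the first byte outside the range, or 16 if none
    int zero;  // 1 if the 16-byte string chunk contains a null byte
};

// Hypothetical helper: load up to 16 bytes of s, zero-padded.
SSE42_FN static __m128i load_padded(const char* s) {
    char buf[16] = {0};
    std::size_t n = std::strlen(s);
    std::memcpy(buf, s, n < 16 ? n : 16);
    return _mm_loadu_si128(reinterpret_cast<const __m128i*>(buf));
}

SSE42_FN static ScanResult scan_outside_range(const char* range, const char* str) {
    __m128i mmrange = load_padded(range);
    __m128i mmstr = load_padded(str);
    // 0x14 = unsigned bytes | ranges | negative polarity | least significant index
    int idx = _mm_cmpistri(mmrange, mmstr, 0x14);
    // Same operands and immediate, so a compiler can reuse the ZF set by pcmpistri.
    int zero = _mm_cmpistrz(mmrange, mmstr, 0x14);
    return {idx, zero};
}
```

With the range "az", scan_outside_range("az", "abc1ef") should give idx 3 (the '1') and zero 1 (the chunk contains the terminator). Whether the two intrinsics really collapse into one pcmpistri depends on the compiler and flags, so it is worth checking the assembly.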

However in the following example the compiler doesn't manage to optimize the code properly:

__m128i mmmask = _mm_cmpistrm(mmoldchar, mmstr, 0x40);
int zero = _mm_cmpistrz(mmoldchar, mmstr, 0x40);

Here, the compiler generates both a pcmpistrm and a pcmpistri instruction. In my opinion the second instruction is redundant, because pcmpistrm sets the flags in the processor's flags register in the same way as pcmpistri.

So, to come back to my question, is there a way to either read the flags register directly or to instruct the compiler to only generate a pcmpistrm instruction?

Philipp Neufeld asked Oct 28 '25 17:10


1 Answer

Looks like just an MSVC missed-optimization bug, not anything inherent.

gcc6.2 and icc17 successfully use both results from one PCMPISTRM in a test function I wrote that branches on the zero result (on the Godbolt compiler explorer):

#include <immintrin.h>
__m128i foo(__m128i mmoldchar, __m128i mmstr)
{      
  __m128i mmmask = _mm_cmpistrm(mmoldchar, mmstr, 0x40);
  int zero = _mm_cmpistrz(mmoldchar, mmstr, 0x40);
  if(zero)
    return mmmask;
  else
    return _mm_setzero_si128();
}

    ##gcc6.2 -O3 -march=nehalem
    pcmpistrm       xmm0, xmm1, 64
    je      .L5
    pxor    xmm0, xmm0
    ret
.L5:
    ret

OTOH, clang3.9 fails to CSE (common-subexpression-eliminate) the two intrinsics, and uses a separate PCMPISTRI for the zero check:

foo:
    movdqa  xmm2, xmm0
    pcmpistri       xmm2, xmm1, 64
    pxor    xmm0, xmm0
    jne     .LBB0_2
    pcmpistrm       xmm2, xmm1, 64
.LBB0_2:
    ret

Note that according to Agner Fog's instruction tables, PCMPISTRM has good throughput but high latency, so there's lots of room to do two in parallel if latency is the bottleneck. Jumping through hoops like using __readeflags() might actually be worse.
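If the extra instruction does turn out to matter with a particular compiler, one possible workaround is to not ask for the flag at all. This is only a sketch: it assumes byte-sized elements (bit 0 of the immediate clear, as with 0x40) and relies on the documented behaviour that _mm_cmpistrz reports whether the second operand contains a null element. The names has_null_byte, foo_no_flags and SSE42_FN are hypothetical, not from the question.

```cpp
#include <immintrin.h>

// Assumption: GCC/Clang need SSE4.2 enabled per function (or -msse4.2 globally).
#if defined(__GNUC__) || defined(__clang__)
#define SSE42_FN __attribute__((target("sse4.2")))
#else
#define SSE42_FN
#endif

// Hypothetical helper: 1 if any of the 16 bytes in v is zero.
// For byte elements this matches what _mm_cmpistrz(a, v, imm) reports,
// but uses plain SSE2 (pcmpeqb + pmovmskb) instead of the flags.
static int has_null_byte(__m128i v) {
    __m128i eq = _mm_cmpeq_epi8(v, _mm_setzero_si128());
    return _mm_movemask_epi8(eq) != 0;
}

// Same logic as foo() above, without the second string-compare intrinsic.
SSE42_FN static __m128i foo_no_flags(__m128i mmoldchar, __m128i mmstr) {
    __m128i mmmask = _mm_cmpistrm(mmoldchar, mmstr, 0x40);
    int zero = has_null_byte(mmstr);
    return zero ? mmmask : _mm_setzero_si128();
}
```

Whether this is actually faster than a second PCMPISTRI is something to measure rather than assume; for 16-bit elements the compare would need _mm_cmpeq_epi16 instead.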

Peter Cordes answered Oct 31 '25 07:10