
XNOR two 64-bit registers in 8-bit blocks

I have two 64-bit values and I want to XNOR them as follows:

RAX: 01000001 | 01000010 | 01000011 | 01000001 | 01000101 | 01000110 | 01000111 | 01000001     XNOR
RBX: 01000001 | 01000001 | 01000001 | 01000001 | 01000001 | 01000001 | 01000001 | 01000001
-------------------------------------------------------------------------------------------
RCX:    1          0          0          1          0          0          0          1



XNOR does the following:
    1 XNOR 1 | 1
    1 XNOR 0 | 0
    0 XNOR 1 | 0 
    0 XNOR 0 | 1

so that whenever the XNOR of a block results in exactly 0xFF, a 1 is output at the corresponding block position in the RCX register.

Is there an x86-64 instruction, or an arithmetic/logic expression, that solves the problem above?

asked Oct 15 '22 by 0asm


1 Answer

The "in 8-bit blocks" part of this is what makes it very different from bitwise XNOR. And you want to horizontally reduce the XNOR result with AND, in 8-bit chunks. This is what SIMD is all about.

The specific operation you want is compare for equality. Fortunately, x86 SSE2 (or MMX) pcmpeqb xmm0, xmm1 does exactly that, producing 0xFF (-1) in elements that compare equal, and 0x00 in other elements. You can movq xmm0, src to set up for it, doing an 8-byte zero-extending load into a 16-byte XMM register.

You can get the result (from the low 8 bytes of XMM0) into RCX with movq rcx, xmm0; then bsf rcx, rcx will find the position of the lowest set bit, or test rcx, rcx will let you branch on whether any bits were set.
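
A minimal sketch of that path (the no_match label is just a placeholder), assuming the two 8-byte values were loaded into XMM0 and XMM1 with movq:

    pcmpeqb  xmm0, xmm1      ; 0xFF in each byte where the inputs match, 0x00 elsewhere
    movq     rcx, xmm0       ; low 8 bytes of the compare result -> RCX
    bsf      rcx, rcx        ; bit index of the lowest set bit = 8 * (first matching byte)
    jz       no_match        ; bsf set ZF because RCX was zero: no byte matched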

If you want RCX = 0x0100000100000001 (i.e. 1 bit at the bottom of each byte), you can use SSSE3 pabsb xmm0, xmm0 before MOVQ to do packed absolute value of bytes, mapping 0xFF -> 1 and leaving 0 unchanged. Unlike SSE2, this is not baseline for x86-64, but CPUs that lack it are pretty thoroughly obsolete (like AMD Phenom II being the latest).
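
For example, a minimal sketch assuming the two values are already in XMM0 and XMM1:

    pcmpeqb  xmm0, xmm1      ; 0xFF per equal byte, 0x00 per unequal byte
    pabsb    xmm0, xmm0      ; SSSE3: maps 0xFF -> 0x01, leaves 0x00 unchanged
    movq     rcx, xmm0       ; RCX = 0x0100000100000001 for the question's example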


The normal way to get SIMD compare results into an integer reg is pmovmskb. It's as efficient as movq r, x but lets you get all 16 byte elements without even using 64-bit registers.

    movq     xmm0, [rdi]       ; 8-byte load.  Use movdqu for all 16 bytes
    movq     xmm1, [rsi]
    pcmpeqb  xmm0, xmm1
    pmovmskb ecx, xmm0

    cmp      ecx, 0xffff
    je       all_were_equal

    test     cl, cl        ; low 8 bytes of compare result -> low 8 bits of RCX
    jnz      some_were_equal

This takes the high bit of each byte, i.e. it gives you a compare bitmap. You can bsf ecx, ecx to find which (if any) of the 16 bytes was the first match. (If your inputs were zero-extended 8-byte values, the upper 8 bytes will always match, so CH will be all-ones from the upper half of the pmovmskb result.)
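
For example, a short sketch continuing from the compare above:

    pmovmskb ecx, xmm0       ; compare bitmap, one bit per byte of the XMM compare result
    bsf      ecx, ecx        ; ECX = index of the first matching byte (ZF=1 if the bitmap was zero)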

Of course instead of bit-scanning the compare result, you can simply branch on it. The common ways are:

  • test ecx, ecx / jnz to jump if any elements compared true
  • cmp ecx, 0xffff / je to jump if all matched.

Related: Compare 16 byte strings with SSE for doing this with intrinsics.


You could do this with MMX movq mm0, [rdi] / pcmpeqb mm0, [rsi], but MMX has worse throughput than SSE2 on some recent CPUs (e.g. fewer execution ports on Skylake), and you'd need a slow-ish emms when you're done, to empty the MMX state so x87 instructions work again.

Still, you'd be saving a movq if your data comes in 8-byte chunks so you can't simply do 16 bytes at a time. And the MMX instructions are more compact (in machine-code size), as you can see from their encodings in Intel's manuals. So if 8-byte chunks are a really good fit, and you can sink the EMMS out of a large enough loop, MMX is worth considering. (Or if you definitely never use x87 instructions at all, not even by calling library functions, you can skip EMMS.)
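
A sketch of the MMX version under those assumptions (same pointer registers as the SSE2 example above; pmovmskb with an MMX source register is an SSE addition, so it's baseline on x86-64):

    movq     mm0, [rdi]      ; 8-byte load straight into an MMX register
    pcmpeqb  mm0, [rsi]      ; memory source operand saves the second movq load
    pmovmskb ecx, mm0        ; 8-bit compare bitmap in ECX
    ; ... branch or bit-scan ECX as shown above ...
    emms                     ; empty the MMX state before any x87 code runs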

answered Nov 23 '22 by Peter Cordes