
Convert _mm_clmulepi64_si128 to vmull_{high}_p64

I have the following Intel PCLMULQDQ intrinsic (a carryless multiply):

__m128i a, b;   // Set to some value
__m128i r = _mm_clmulepi64_si128(a, b, 0x10);

The 0x10 tells me the multiply is:

r = a[63:0] * b[127:64]

I need to convert it to NEON (or more correctly, use the Crypto extension):

poly64_t a, b;   // Set to some value
poly128_t r = vmull_p64(...);   // or vmull_high_p64(...)

I think vmull_p64 works on the low 64 bits, while vmull_high_p64 operates on the high 64 bits. I think I need to shift one of the 128-bit values to mimic _mm_clmulepi64_si128(a, b, 0x10). The docs for PMULL, PMULL2 (vector) are not too clear, and I'm not sure what the result will be because I don't understand PMULL2's arrangement specifier. The ARM ACLE 2.0 is not too helpful either:

poly128_t vmull_p64 (poly64_t, poly64_t);

Performs widening polynomial multiplication on double-words low part. Available on ARMv8 AArch32 and AArch64.

poly128_t vmull_high_p64 (poly64x2_t, poly64x2_t);

Performs widening polynomial multiplication on double-words high part. Available on ARMv8 AArch32 and AArch64.

How do I convert _mm_clmulepi64_si128 to vmull_{high}_p64?


For anyone contemplating the investment in NEON, PMULL and PMULL2... the 64-bit multiplier and polynomial support is worth it. Benchmarks show GCC code for GMAC went from 12.7 cpb at 90 MB/s (C/C++) to 1.6 cpb at 670 MB/s (NEON and PMULL{2}).

asked Mar 12 '23 by jww


2 Answers

Since you clarified the source of your confusion with a comment:

A full multiply produces a result twice as wide as the inputs. An add can produce at most one carry-out bit, but a mul produces a whole upper half.

A multiply is exactly equivalent to shifts + adds, and those shifts bring bits from one operand up to as high as 2N - 1 (when the inputs are N bits wide). See Wikipedia's example.

In a normal integer multiply (with carry in the add steps) like x86's mul instruction, carry from the partial sums can set the high bit, so the result is exactly twice as wide.

XOR is add without carry, so a carryless multiply is the same shift-and-add algorithm, but with XOR instead of add-with-carry. In a carryless multiply there's no carry, so the top bit of the full-width result is always zero. Intel even makes this explicit in the Operation section of the x86 insn ref manual for pclmulqdq: DEST[127] ← 0;. That section documents precisely all the shifting and XORing that produces the result.
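To make that concrete, here's the shift-and-XOR algorithm as a portable C sketch (my own illustration, not from either manual), producing the 128-bit result in two 64-bit halves:

#include <stdint.h>

/* Portable 64x64 -> 128 carryless multiply: the same partial products as a
   schoolbook multiply, combined with XOR instead of add-with-carry.  Note
   bit 63 of *hi (bit 127 of the result) can never be set; the highest
   partial product is only (a << 63). */
static void clmul64(uint64_t a, uint64_t b, uint64_t *lo, uint64_t *hi)
{
    uint64_t rlo = 0, rhi = 0;
    for (int i = 0; i < 64; i++) {
        if ((b >> i) & 1) {
            rlo ^= a << i;                /* low half of the partial product */
            if (i > 0)
                rhi ^= a >> (64 - i);     /* bits shifted past bit 63 */
        }
    }
    *lo = rlo;
    *hi = rhi;
}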


The PMULL[2] docs seem pretty clear to me. The destination has to be a .8H vector (which means eight 16-bit (Halfword) elements). The sources for PMULL have to be .8B vectors (8 one-Byte elements), while the sources for PMULL2 have to be .16B (16 one-Byte elements, of which only the upper 8 of each source are used).

If this was ARM32 NEON, where the upper half of each 16B vector register is an odd-numbered narrower register, PMULL2 wouldn't be useful for anything.


There's no "operation" section to describe exactly which bits multiply with which other bits, though. Fortunately, the paper linked in comments nicely summarizes the available instructions for ARMv7, and ARMv8 32 and 64 bit. The .8B / .8H organization specifiers seem to be bogus, because PMULL does perform a single 64x64 -> 128 carryless mul like SSE's pclmul instruction. The ARMv7 VMULL.P8 NEON insn does do a packed 8x8->16, but makes it clear that PMULL (and ARMv8 AArch32 VMULL.P8) are different.

It's too bad the ARM doc doesn't say any of this; it seems horribly lacking, esp. re the misleading .8B vector organization stuff. That paper shows an example using the expected .1q and .1d (and .2d) organizations, so maybe the assembler doesn't care what you think your data means, as long as it's the right size.


To do a high-with-low multiply, you need to shift one of them.

For example, if you need all four combinations (a0*b0, a1*b0, a0*b1, a1*b1), like you do to build a 128x128 -> 128 multiply out of 64x64 -> 128 multiplies (with Karatsuba), you could do it like this:

pmull   a0b0.8H, a.8B,  b.8B
pmull2  a1b1.8H, a.16B, b.16B
// swap a's top and bottom half, which I assume can be done efficiently somehow
pmull   a1b0.8H, swapped_a.8B,  b.8B
pmull2  a0b1.8H, swapped_a.16B, b.16B

So it looks like ARM's design choice to include lower-lower and upper-upper, but not cross-multiply instructions (or a selector constant like x86 does) doesn't cause much inefficiency. And since ARM instructions can't just tack on extra immediates the way x86's variable-length machine encoding can, that probably wasn't an option.


Another version of the same thing, with a real shuffle instruction and Karatsuba afterwards (the instruction sequence is copied from Implementing GCM on ARMv8, but the register names are still made up: the paper reuses the same temporary register along the way, while I've named them the way I might for a C intrinsics version). This makes the operation of the extended-precision multiply pretty clear. The compiler can reuse dead registers for us.

pmull    a0b0.1q, a.1d, b.1d
pmull2   a1b1.1q, a.2d, b.2d
ext.16b  swapped_b, b, b, #8
pmull    a0b1.1q, a.1d, swapped_b.1d
pmull2   a1b0.1q, a.2d, swapped_b.2d
eor.16b  xor_cross_muls, a0b1, a1b0
ext.16b  cross_low,      zero, xor_cross_muls, #8
eor.16b  result_low,     a0b0, cross_low
ext.16b  cross_high,     xor_cross_muls, zero, #8
eor.16b  result_high,    a1b1, cross_high
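For reference, here's roughly how that listing looks with intrinsics — a sketch with my own names, compiled with -march=armv8-a+crypto; the casts between vector types are the GCC extension the sample program in the other answer also uses:

#include <arm_neon.h>

/* 128x128 -> 256 carryless multiply, following the listing above:
   *lo receives the low 128 bits of the product and *hi the high 128. */
static void clmul128(poly64x2_t a, poly64x2_t b, uint64x2_t *lo, uint64x2_t *hi)
{
    uint64x2_t zero = vdupq_n_u64(0);

    uint64x2_t a0b0 = (uint64x2_t)vmull_p64(vgetq_lane_p64(a, 0),
                                            vgetq_lane_p64(b, 0));   /* pmull  */
    uint64x2_t a1b1 = (uint64x2_t)vmull_high_p64(a, b);              /* pmull2 */

    poly64x2_t swapped_b = vextq_p64(b, b, 1);    /* ext #8: swap b's halves */

    uint64x2_t a0b1 = (uint64x2_t)vmull_p64(vgetq_lane_p64(a, 0),
                                            vgetq_lane_p64(swapped_b, 0));
    uint64x2_t a1b0 = (uint64x2_t)vmull_high_p64(a, swapped_b);

    uint64x2_t cross = veorq_u64(a0b1, a1b0);
    uint64x2_t cross_low  = vextq_u64(zero, cross, 1);   /* cross << 64 */
    uint64x2_t cross_high = vextq_u64(cross, zero, 1);   /* cross >> 64 */

    *lo = veorq_u64(a0b0, cross_low);
    *hi = veorq_u64(a1b1, cross_high);
}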
answered Mar 14 '23 by Peter Cordes


How do I convert _mm_clmulepi64_si128 to vmull_{high}_p64?

Here are results from the sample program below. The conversions are:

  1. _mm_clmulepi64_si128(a, b, 0x00) ↔ vmull_p64(vgetq_lane_u64(a, 0), vgetq_lane_u64(b, 0))

  2. _mm_clmulepi64_si128(a, b, 0x01) ↔ vmull_p64(vgetq_lane_u64(a, 1), vgetq_lane_u64(b, 0))

  3. _mm_clmulepi64_si128(a, b, 0x10) ↔ vmull_p64(vgetq_lane_u64(a, 0), vgetq_lane_u64(b, 1))

  4. _mm_clmulepi64_si128(a, b, 0x11) ↔ vmull_p64(vgetq_lane_u64(a, 1), vgetq_lane_u64(b, 1))

For case (4), _mm_clmulepi64_si128(a, b, 0x11), the following also holds:

  5. _mm_clmulepi64_si128(a, b, 0x11) ↔ vmull_high_p64((poly64x2_t)a, (poly64x2_t)b)

I'm guessing cases (1) through (4) can spill out into memory if you're not careful, because vgetq_lane_u64 returns a scalar, non-vector type. I'm also guessing case (5) has a better propensity to stay in the Q registers because it operates on vector types.
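If you want to keep x86's selector-constant interface, a thin wrapper can decode the immediate the way pclmulqdq does: bit 0 of the imm8 selects a's half and bit 4 selects b's half. A minimal sketch (the name is mine, and it's a macro rather than a function so the lane index stays the compile-time constant vgetq_lane_u64 requires):

#define CLMULEPI64(a, b, imm8)                                      \
    ((uint64x2_t)vmull_p64(                                         \
        (poly64_t)vgetq_lane_u64((a), ((imm8) >> 0) & 1),           \
        (poly64_t)vgetq_lane_u64((b), ((imm8) >> 4) & 1)))

/* e.g. the question's case: r = CLMULEPI64(a, b, 0x10);  // a[63:0] * b[127:64] */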


x86_64 and _mm_clmulepi64_si128:

$ ./mul-sse-neon.exe
IS_X86: true
****************************************
clmulepi64(a, b, 0x00)
a[0]: 0x2222222222222222, a[1]: 0x4444444444444444
b[0]: 0x3333333333333333, b[1]: 0x5555555555555555
r[0]: 0x606060606060606, r[1]: 0x606060606060606
****************************************
clmulepi64(a, b, 0x01)
a[0]: 0x2222222222222222, a[1]: 0x4444444444444444
b[0]: 0x3333333333333333, b[1]: 0x5555555555555555
r[0]: 0xc0c0c0c0c0c0c0c, r[1]: 0xc0c0c0c0c0c0c0c
****************************************
clmulepi64(a, b, 0x10)
a[0]: 0x2222222222222222, a[1]: 0x4444444444444444
b[0]: 0x3333333333333333, b[1]: 0x5555555555555555
r[0]: 0xa0a0a0a0a0a0a0a, r[1]: 0xa0a0a0a0a0a0a0a
****************************************
clmulepi64(a, b, 0x11)
a[0]: 0x2222222222222222, a[1]: 0x4444444444444444
b[0]: 0x3333333333333333, b[1]: 0x5555555555555555
r[0]: 0x1414141414141414, r[1]: 0x1414141414141414

ARM64 and vmull_p64:

$ ./mul-sse-neon.exe 
IS_ARM: true
****************************************
vmull_p64(a, b, 0x00)
a[0]: 0x2222222222222222, a[1]: 0x4444444444444444
b[0]: 0x3333333333333333, b[1]: 0x5555555555555555
r[0]: 0x606060606060606, r[1]: 0x606060606060606
****************************************
vmull_p64(a, b, 0x01)
a[0]: 0x2222222222222222, a[1]: 0x4444444444444444
b[0]: 0x3333333333333333, b[1]: 0x5555555555555555
r[0]: 0xc0c0c0c0c0c0c0c, r[1]: 0xc0c0c0c0c0c0c0c
****************************************
vmull_p64(a, b, 0x10)
a[0]: 0x2222222222222222, a[1]: 0x4444444444444444
b[0]: 0x3333333333333333, b[1]: 0x5555555555555555
r[0]: 0xa0a0a0a0a0a0a0a, r[1]: 0xa0a0a0a0a0a0a0a
****************************************
vmull_p64(a, b, 0x11)
a[0]: 0x2222222222222222, a[1]: 0x4444444444444444
b[0]: 0x3333333333333333, b[1]: 0x5555555555555555
r[0]: 0x1414141414141414, r[1]: 0x1414141414141414
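The results match the x86 output above case for case.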

The sample program mul-sse-neon.cc:

#define IS_ARM (__arm__ || __arm32__ || __aarch32__ || __arm64__ || __aarch64__)
#define IS_X86 (__i386__ || __i586__ || __i686__ || __amd64__ || __x86_64__)

#if (IS_ARM)
# include <arm_neon.h>
# if defined(__ARM_ACLE) || defined(__GNUC__)
#  include <arm_acle.h>
# endif
#endif

#if (IS_X86)
# include <emmintrin.h>
# if defined(__GNUC__)
#  include <x86intrin.h>
# endif
#endif

#if (IS_ARM)
typedef uint64x2_t word128;
#elif (IS_X86)
typedef __m128i word128;
#else
# error "Need a word128"
#endif

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

void print_val(const word128* value, const char* label);

/* gcc -DNDEBUG -g3 -O0 -march=native mul-sse-neon.cc -o mul-sse-neon.exe */
/* gcc -DNDEBUG -g3 -O0 -march=armv8-a+crc+crypto mul-sse-neon.cc -o mul-sse-neon.exe */

int main(int argc, char* argv[])
{
#if (IS_ARM)
    printf("IS_ARM: true\n");
#elif (IS_X86)
    printf("IS_X86: true\n");
#endif

    word128 a,b, r;
    a[0] = 0x2222222222222222, a[1] = 0x4444444444444444;
    b[0] = 0x3333333333333333, b[1] = 0x5555555555555555;

#if (IS_ARM)

    printf("****************************************\n");
    printf("vmull_p64(a, b, 0x00)\n");
    r = (uint64x2_t)vmull_p64(vgetq_lane_u64(a, 0), vgetq_lane_u64(b,0));
    print_val(&a, "a"); print_val(&b, "b"); print_val(&r, "r");

    printf("****************************************\n");
    printf("vmull_p64(a, b, 0x01)\n");
    r = (uint64x2_t)vmull_p64(vgetq_lane_u64(a, 1), vgetq_lane_u64(b, 0));
    print_val(&a, "a"); print_val(&b, "b"); print_val(&r, "r");

    printf("****************************************\n");
    printf("vmull_p64(a, b, 0x10)\n");
    r = (uint64x2_t)vmull_p64(vgetq_lane_u64(a, 0), vgetq_lane_u64(b, 1));
    print_val(&a, "a"); print_val(&b, "b"); print_val(&r, "r");

    printf("****************************************\n");
    printf("vmull_p64(a, b, 0x11)\n");
    r = (uint64x2_t)vmull_p64(vgetq_lane_u64(a, 1), vgetq_lane_u64(b,1));
    print_val(&a, "a"); print_val(&b, "b"); print_val(&r, "r");

#elif (IS_X86)

    printf("****************************************\n");
    printf("clmulepi64(a, b, 0x00)\n");
    r = _mm_clmulepi64_si128(a, b, 0x00);
    print_val(&a, "a"); print_val(&b, "b"); print_val(&r, "r");

    printf("****************************************\n");
    printf("clmulepi64(a, b, 0x01)\n");
    r = _mm_clmulepi64_si128(a, b, 0x01);
    print_val(&a, "a"); print_val(&b, "b"); print_val(&r, "r");

    printf("****************************************\n");
    printf("clmulepi64(a, b, 0x10)\n");
    r = _mm_clmulepi64_si128(a, b, 0x10);
    print_val(&a, "a"); print_val(&b, "b"); print_val(&r, "r");

    printf("****************************************\n");
    printf("clmulepi64(a, b, 0x11)\n");
    r = _mm_clmulepi64_si128(a, b, 0x11);
    print_val(&a, "a"); print_val(&b, "b"); print_val(&r, "r");

#endif

    return 0;
}

static const word128 s_v = {0,0};
static const char s_l[] = "";
void print_val(const word128* value, const char* label)
{
    const word128* v = (value ? value : &s_v);
    const char* l = (label ? label : s_l);

    /* GCC's vector subscripting works for both uint64x2_t and __m128i */
    printf("%s[0]: 0x%" PRIx64 ", %s[1]: 0x%" PRIx64 "\n", l, (*v)[0], l, (*v)[1]);
}

The code for vmull_high_p64 is as follows. It always produces the same result because it always takes the same high words:

printf("****************************************\n");
printf("vmull_p64(a, b)\n");
r = (uint64x2_t)vmull_high_p64((poly64x2_t)a, (poly64x2_t)b);
print_val(&a, "a"); print_val(&b, "b"); print_val(&r, "r");

For completeness, switching the data to:

word128 a,b, r;
a[0] = 0x2222222233333333, a[1] = 0x4444444455555555;
b[0] = 0x6666666677777777, b[1] = 0x8888888899999999;

Produces the following results:

$ ./mul-sse-neon.exe
IS_X86: true
****************************************
clmulepi64(a, b, 0x00)
a[0]: 0x2222222233333333, a[1]: 0x4444444455555555
b[0]: 0x6666666677777777, b[1]: 0x8888888899999999
r[0]: 0xd0d0d0d09090909, r[1]: 0xc0c0c0c08080808
****************************************
clmulepi64(a, b, 0x01)
a[0]: 0x2222222233333333, a[1]: 0x4444444455555555
b[0]: 0x6666666677777777, b[1]: 0x8888888899999999
r[0]: 0x191919191b1b1b1b, r[1]: 0x181818181a1a1a1a
****************************************
clmulepi64(a, b, 0x10)
a[0]: 0x2222222233333333, a[1]: 0x4444444455555555
b[0]: 0x6666666677777777, b[1]: 0x8888888899999999
r[0]: 0x111111111b1b1b1b, r[1]: 0x101010101a1a1a1a
****************************************
clmulepi64(a, b, 0x11)
a[0]: 0x2222222233333333, a[1]: 0x4444444455555555
b[0]: 0x6666666677777777, b[1]: 0x8888888899999999
r[0]: 0x212121212d2d2d2d, r[1]: 0x202020202c2c2c2c

And:

$ ./mul-sse-neon.exe 
IS_ARM: true
****************************************
vmull_p64(a, b, 0x00)
a[0]: 0x2222222233333333, a[1]: 0x4444444455555555
b[0]: 0x6666666677777777, b[1]: 0x8888888899999999
r[0]: 0xd0d0d0d09090909, r[1]: 0xc0c0c0c08080808
****************************************
vmull_p64(a, b, 0x01)
a[0]: 0x2222222233333333, a[1]: 0x4444444455555555
b[0]: 0x6666666677777777, b[1]: 0x8888888899999999
r[0]: 0x191919191b1b1b1b, r[1]: 0x181818181a1a1a1a
****************************************
vmull_p64(a, b, 0x10)
a[0]: 0x2222222233333333, a[1]: 0x4444444455555555
b[0]: 0x6666666677777777, b[1]: 0x8888888899999999
r[0]: 0x111111111b1b1b1b, r[1]: 0x101010101a1a1a1a
****************************************
vmull_p64(a, b, 0x11)
a[0]: 0x2222222233333333, a[1]: 0x4444444455555555
b[0]: 0x6666666677777777, b[1]: 0x8888888899999999
r[0]: 0x212121212d2d2d2d, r[1]: 0x202020202c2c2c2c
answered Mar 14 '23 by jww