Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What is the difference between _mm512_load_epi32 and _mm512_load_si512?

The Intel intrinsics guide states simply that _mm512_load_epi32:

Load[s] 512-bits (composed of 16 packed 32-bit integers) from memory into dst

and that _mm512_load_si512:

Load[s] 512-bits of integer data from memory into dst

What is the difference between these two? The documentation isn't clear.

like image 370
Qix - MONICA WAS MISTREATED Avatar asked Dec 18 '22 19:12

Qix - MONICA WAS MISTREATED


1 Answers

There's no difference, it's just silly redundant naming. Use _mm512_load_si512 for clarity. Thanks, Intel. As usual, it's easier to understand the underlying asm for AVX512, and then you can see what the clumsy intrinsic naming is trying to say. Or at least you can understand how we ended up with this mess of different documentation suggesting _mm512_load_epi32 vs. _mm512_load_si512.

Almost all AVX512 instructions support merge-masking and zero-masking. (e.g. vmovdqa32 can do a masked load like vmovdqa32 zmm0{k1}{z}, [rdi] to zero vector elements where k1 had a zero bit), which is why different element-size versions of things like vector loads and bitwise operations exist. (e.g. vpxord vs. vpxorq).

But these intrinsics are for the no-masking version. The element-size is totally irrelevant. I'm guessing _mm512_load_epi32 exists for consistency with _mm512_mask_load_epi32 (merge-masking) and _mm512_maskz_load_epi32 (zero-masking). See the docs for the vmovdqa32 asm instruction.

e.g. _mm512_maskz_loadu_epi64(0x55, x) zeros the odd elements for free while loading. (At least it's free if the cost of putting 0x55 into a k register can be hoisted out of a loop. And if we haven't defeated the chance for the compiler to fold a load into a memory operand for an ALU instruction.)

When elements are all loaded into the destination unchanged, element boundaries are meaningless. That's why AVX2 and earlier don't have different element-size versions of bitwise booleans like _mm_xor_si128 and loads/stores like _mm_load_si128.


Some compilers don't support the element-width names for unaligned unmasked loads. e.g. current gcc doesn't support _mm512_loadu_epi64 even though it's supported _mm512_load_epi64 since the first gcc version to support AVX512 intrinsics at all. (See error: '_mm512_loadu_epi64' was not declared in this scope)

There are no CPUs where the choice of vmovdqa64 vs. vmovdqa32 matters at all for efficiency, so there's zero point in trying to hint the compiler to use one or the other, regardless of the natural element width of your data.

Only FP vs. integer might matter for loads, and Intel's intrinsics already uses different types (__m512 vs. __m512i) for that.

like image 168
Peter Cordes Avatar answered Apr 13 '23 01:04

Peter Cordes