What is the difference between _mm512_load_epi32 and _mm512_load_si512?

Question

The Intel intrinsics guide states simply that _mm512_load_epi32:

Load[s] 512-bits (composed of 16 packed 32-bit integers) from memory into dst

and that _mm512_load_si512:

Load[s] 512-bits of integer data from memory into dst

What is the difference between these two? The documentation isn't clear.

Peter Cordes · Accepted Answer

There's no difference, it's just silly redundant naming. Use _mm512_load_si512 for clarity. Thanks, Intel. As usual, it's easier to understand the underlying asm for AVX512, and then you can see what the clumsy intrinsic naming is trying to say. Or at least you can understand how we ended up with this mess of different documentation suggesting _mm512_load_epi32 vs. _mm512_load_si512.

Almost all AVX512 instructions support merge-masking and zero-masking. (e.g. vmovdqa32 can do a masked load like vmovdqa32 zmm0{k1}{z}, [rdi] to zero vector elements where k1 had a zero bit), which is why different element-size versions of things like vector loads and bitwise operations exist. (e.g. vpxord vs. vpxorq).

But these intrinsics are for the no-masking version. The element-size is totally irrelevant. I'm guessing _mm512_load_epi32 exists for consistency with _mm512_mask_load_epi32 (merge-masking) and _mm512_maskz_load_epi32 (zero-masking). See the docs for the vmovdqa32 asm instruction.

e.g. _mm512_maskz_loadu_epi64(0x55, x) zeros the odd elements for free while loading. (At least it's free if the cost of putting 0x55 into a k register can be hoisted out of a loop. And if we haven't defeated the chance for the compiler to fold a load into a memory operand for an ALU instruction.)

When elements are all loaded into the destination unchanged, element boundaries are meaningless. That's why AVX2 and earlier don't have different element-size versions of bitwise booleans like _mm_xor_si128 and loads/stores like _mm_load_si128.

Some compilers don't support the element-width names for unaligned unmasked loads. e.g. current gcc doesn't support _mm512_loadu_epi64 even though it's supported _mm512_load_epi64 since the first gcc version to support AVX512 intrinsics at all. (See error: '_mm512_loadu_epi64' was not declared in this scope)

There are no CPUs where the choice of vmovdqa64 vs. vmovdqa32 matters at all for efficiency, so there's zero point in trying to hint the compiler to use one or the other, regardless of the natural element width of your data.

Only FP vs. integer might matter for loads, and Intel's intrinsics already uses different types (__m512 vs. __m512i) for that.

What is the difference between _mm512_load_epi32 and _mm512_load_si512?

Tags:

x86

simd

sse

intrinsics

avx512

Qix - MONICA WAS MISTREATED

1 Answers

Peter Cordes

Recent Activity

Donate For Us

What is the difference between _mm512_load_epi32 and _mm512_load_si512?

Tags:

x86

simd

sse

intrinsics

avx512

Qix - MONICA WAS MISTREATED

1 Answers

Peter Cordes

Related questions

Recent Activity

Donate For Us