I'd like to load a 128-bit register with four non-contiguous 32-bit floats. Specifically, those floats are spaced 128 bits (16 bytes) apart in memory.
So if memory looks like this:
| Float 0 | Float X | Float X | Float X |
| Float 4 | Float X | Float X | Float X |
| Float 8 | Float X | Float X | Float X |
| Float 12 | Float X | Float X | Float X |
I'd like to load a vector like this:
| Float 0 | Float 4 | Float 8 | Float 12 |
Hopefully you're going to use the other data for something, in which case loading everything and doing a transpose is more likely to be useful.
If not, then SIMD at all is only viable if there's quite a bit of work to do once the data is in vectors, because packing it into vectors is expensive.
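If the whole 4x4 block is useful, a minimal intrinsics sketch of the load-and-transpose route could look like this (function name is mine), using the _MM_TRANSPOSE4_PS macro:

#include <xmmintrin.h>

// Load a 4x4 block of floats and transpose it in registers, so that
// cols[0] = { f0, f4, f8, f12 } and the other three columns come for free.
void load_4x4_transposed(const float *f, __m128 cols[4])
{
    __m128 r0 = _mm_loadu_ps(f);        // { f0  f1  f2  f3  }
    __m128 r1 = _mm_loadu_ps(f + 4);    // { f4  f5  f6  f7  }
    __m128 r2 = _mm_loadu_ps(f + 8);    // { f8  f9  f10 f11 }
    __m128 r3 = _mm_loadu_ps(f + 12);   // { f12 f13 f14 f15 }
    _MM_TRANSPOSE4_PS(r0, r1, r2, r3);  // in-place 4x4 transpose
    cols[0] = r0;  cols[1] = r1;  cols[2] = r2;  cols[3] = r3;
}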
movss / insertps as shown in @zx485's answer is the "normal" way, like you'd probably get from a compiler if you used _mm_set_ps(f[12], f[8], f[4], f[0]);
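In C that's just the following (function name is mine); compilers typically lower it to that movss + insertps sequence when SSE4.1 is enabled:

#include <immintrin.h>

__m128 load_stride4(const float *f)
{
    return _mm_set_ps(f[12], f[8], f[4], f[0]);   // element 0 = f[0]
}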
When your stride is exactly 4 floats (16 bytes), with AVX you can cover all four elements with two 256-bit loads and a blend.
(related: What's the fastest stride-3 gather instruction sequence? Or for stride 2, it's more obviously worth doing vector loads and shuffling.)
vmovups  ymm1, [float0]                        ; float0 and float4 in the low element of low/high lanes
vblendps ymm1, ymm1, [float8 - 4], 0b00100010  ; { x x f12 f4 | x x f8 f0 }
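The same two loads + blend with intrinsics might look like this sketch (my naming; 0x22 is the 0b00100010 blend mask from the asm above):

#include <immintrin.h>

__m256 two_loads_and_blend(const float *f)
{
    __m256 a = _mm256_loadu_ps(f);       // { f0..f3  | f4..f7   }
    __m256 b = _mm256_loadu_ps(f + 7);   // { f7..f10 | f11..f14 }
    return _mm256_blend_ps(a, b, 0x22);  // elems 1,5 from b: { f0 f8 x x | f4 f12 x x }
}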
This isn't great because you're likely to cross a cache-line boundary with one of the loads. You could achieve something similar with a vshufps ymm0, ymm1, [float8], 0b??????? for the 2nd load.
This might be good depending on surrounding code, especially if you have AVX2 for vpermps (with a shuffle-control vector constant) or vpermpd (with an immediate) for a lane-crossing shuffle to put the elements you want into the low 128b lane.
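For example, with AVX2 a vpermps could finish off the blend result from above; a sketch (my naming, and the index constant assumes you want the elements in source order in the low lane):

#include <immintrin.h>

// blended = { f0 f8 x x | f4 f12 x x } from the two-load + blend above
__m128 low_four_in_order(__m256 blended)
{
    const __m256i idx = _mm256_setr_epi32(0, 4, 1, 5, 0, 0, 0, 0);
    __m256 perm = _mm256_permutevar8x32_ps(blended, idx);  // { f0 f4 f8 f12 x x x x }
    return _mm256_castps256_ps128(perm);
}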
Without AVX2 for a cross-lane shuffle, you'd need vextractf128 and then shufps. This might take some planning ahead so that the elements land in positions that this shufps can move into the right places.
This all works with intrinsics, of course, but they take a lot more typing.
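For instance, a sketch of the AVX1-only route with vextractf128 + shufps (my naming). Note the result comes out as { f0, f8, f4, f12 }, not in source order; that's the kind of planning ahead mentioned above, so either use that order downstream or spend one more shuffle:

#include <immintrin.h>

__m128 load_stride4_avx1(const float *f)
{
    __m256 v01 = _mm256_loadu_ps(f);       // { f0..f3  | f4..f7   }
    __m256 v23 = _mm256_loadu_ps(f + 8);   // { f8..f11 | f12..f15 }
    // in-lane shuffle: { f0 f0 f8 f8 | f4 f4 f12 f12 }
    __m256 mixed = _mm256_shuffle_ps(v01, v23, _MM_SHUFFLE(0, 0, 0, 0));
    __m128 lo = _mm256_castps256_ps128(mixed);     // { f0 f0 f8  f8  }
    __m128 hi = _mm256_extractf128_ps(mixed, 1);   // { f4 f4 f12 f12 }
    // pick elements 0 and 2 from each half:          { f0 f8 f4  f12 }
    return _mm_shuffle_ps(lo, hi, _MM_SHUFFLE(2, 0, 2, 0));
}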
If you have AVX2 available, you could use the VGATHERDPS instruction to achieve your goal, as explained in this SO answer. In your case you just have to initialize the index vector to 0,4,8,12: the *4 scale in the gather addressing mode turns those element indices into byte offsets 0,16,32,48, i.e. one float from each 16-byte row.
.data
.align 16
ddIndices dd 0,4,8,12 ; element indices; the *4 gather scale turns these into byte offsets 0,16,32,48
dpValues REAL4 ... ; replace 'dpValues' with your value array
.code
lea rsi, dpValues
vmovdqa xmm7, ddIndices
.loop:
vpcmpeqw xmm1, xmm1, xmm1 ; set mask to all ones
vpxor xmm0, xmm0, xmm0 ; break dependency on previous gather
vgatherdps xmm0, [rsi+xmm7*4], xmm1 ; xmm0 = { f0, f4, f8, f12 } relative to rsi
; do something with gather result in xmm0
add rsi, 16
cmp rsi, end_pointer
jb .loop ; do another gather with same indices, base+=16
XMM1 is the condition mask which selects which elements are loaded. The gather also zeroes the mask as it completes, which is why it's regenerated with vpcmpeqw inside the loop.
Be aware that this instruction is not that fast on Haswell, but the implementation is faster on Broadwell, and faster again on Skylake.
Even so, using a gather instruction for small-stride loads is probably only a win with 8-element ymm vectors on Skylake. According to Intel's optimization manual (11.16.4 Considerations for Gather Instructions), Broadwell hardware-gather with 4-element vectors has a best-case throughput of 1.56 cycles per element when the data is hot in L1D cache.
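With intrinsics the same stride-4 gather is a one-liner; a minimal sketch (function name is mine), where the indices count float elements and the scale of 4 turns them into byte offsets 0, 16, 32, 48:

#include <immintrin.h>

__m128 load_stride4_gather(const float *f)
{
    const __m128i idx = _mm_setr_epi32(0, 4, 8, 12);
    return _mm_i32gather_ps(f, idx, 4);   // gathers f[0], f[4], f[8], f[12]
}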
On pre-AVX2 architectures there is no way (known to me) to do this without loading all values separately like this (using SSE4.1 insertps or pinsrd):
lea esi, dpValues
movss    xmm0, [esi]           ; breaks dependency on old value of xmm0
insertps xmm0, [esi+16], 1<<4  ; dst element index in bits 5:4 of the imm8
insertps xmm0, [esi+32], 2<<4
insertps xmm0, [esi+48], 3<<4
For integer data, the last instruction would be pinsrd xmm0, [esi+48], 3.
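The intrinsics equivalent is a short sketch (my naming); compilers can usually fold the _mm_load_ss loads into insertps memory operands:

#include <smmintrin.h>

__m128 load_stride4_sse41(const float *f)
{
    __m128 v = _mm_load_ss(f);                          // { f0, 0, 0, 0 }
    v = _mm_insert_ps(v, _mm_load_ss(f + 4),  1 << 4);  // dst index in imm8 bits 5:4
    v = _mm_insert_ps(v, _mm_load_ss(f + 8),  2 << 4);  // f + 4/8/12 floats = +16/32/48 bytes
    v = _mm_insert_ps(v, _mm_load_ss(f + 12), 3 << 4);
    return v;
}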
Without SSE4.1, shuffle movss results together with unpcklps / unpcklpd.
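A sketch of that fallback (my naming), needing nothing newer than SSE2:

#include <emmintrin.h>

__m128 load_stride4_sse2(const float *f)
{
    __m128 a = _mm_load_ss(f);            // { f0,  0, 0, 0 }
    __m128 b = _mm_load_ss(f + 4);        // { f4,  0, 0, 0 }
    __m128 c = _mm_load_ss(f + 8);        // { f8,  0, 0, 0 }
    __m128 d = _mm_load_ss(f + 12);       // { f12, 0, 0, 0 }
    __m128 ab = _mm_unpacklo_ps(a, b);    // { f0, f4,  0, 0 }
    __m128 cd = _mm_unpacklo_ps(c, d);    // { f8, f12, 0, 0 }
    // unpcklpd merges the low 64-bit halves: { f0, f4, f8, f12 }
    return _mm_castpd_ps(_mm_unpacklo_pd(_mm_castps_pd(ab), _mm_castps_pd(cd)));
}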