 

Loading non-contiguous values with Intel SIMD SSE

I'd like to load a 128-bit register with four non-contiguous 32-bit floats. Specifically, the floats are spaced 128 bits (16 bytes) apart in memory.

So if memory looks like this:

| Float 0  | Float X | Float X | Float X |
| Float 4  | Float X | Float X | Float X |
| Float 8  | Float X | Float X | Float X |
| Float 12 | Float X | Float X | Float X |

I'd like to load a vector like this:

| Float 0  | Float 4 | Float 8 | Float 12 |
asked Mar 17 '16 by PinkPR


2 Answers

Hopefully you're going to use the other data for something, in which case loading everything and doing a transpose is more likely to be useful.

If not, then using SIMD at all is only viable if there's quite a bit of work to do once the data is in vectors, because packing it into vectors is expensive.


movss / insertps as shown in @zx485's answer is the "normal" way, like you'd probably get from a compiler if you used _mm_set_ps(f[12], f[8], f[4], f[0]);
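In C that might look like the following minimal sketch (the names f and load_stride4 are made up here; f needs at least 13 readable floats):

#include <xmmintrin.h>

/* A sketch: element order is low to high, so this produces
 * { f0, f4, f8, f12 } in the xmm register. */
static __m128 load_stride4(const float *f)
{
    return _mm_set_ps(f[12], f[8], f[4], f[0]);
}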


When your stride is exactly 4 floats (16 bytes), AVX lets you span all four wanted floats with two 256-bit loads, and blend.

(related: What's the fastest stride-3 gather instruction sequence? Or for stride 2, it's more obviously worth doing vector loads and shuffling.)

vmovups   ymm1, [float0]                        ; float0 and float4 in the low element of the low/high lanes
vblendps  ymm1, ymm1, [float8 - 4], 0b00100010  ;  { x x f12 f4 | x x f8 f0 }

This isn't great because you're likely to cross a cache-line boundary with one of the loads. You could achieve something similar with a vshufps ymm0, ymm1, [float8], 0b??????? for the 2nd load.

This might be good depending on surrounding code, especially if you have AVX2 for vpermps (with a shuffle-control vector constant) or vpermpd (with an immediate) for a lane-crossing shuffle to put the elements you want into the low 128b lane.

Without AVX2 for a cross-lane shuffle, you'd need vextractf128 and then shufps. This might require some planning ahead so that the earlier blend leaves elements where shufps can move them into the right places.


This all works with intrinsics, of course, but they take a lot more typing.
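For example, a minimal intrinsics sketch of the two-load + blend idea (assumptions: gather_stride4 is a made-up name, and f must point to at least 15 readable floats, because the second load starts at f+7):

#include <immintrin.h>

static __m128 gather_stride4(const float *f)
{
    __m256 a = _mm256_loadu_ps(f);      // { f0 f1 f2 f3 | f4 f5 f6 f7 }  (low element first)
    __m256 b = _mm256_loadu_ps(f + 7);  // { f7 f8 f9 f10 | f11 f12 f13 f14 }
    // imm 0x22 (0b00100010): elements 1 and 5 come from b -> { f0 f8 x x | f4 f12 x x }
    __m256 blended = _mm256_blend_ps(a, b, 0x22);
#ifdef __AVX2__
    // lane-crossing vpermps compacts the wanted elements into the low lane
    __m256i idx = _mm256_setr_epi32(0, 4, 1, 5, 0, 0, 0, 0);
    return _mm256_castps256_ps128(_mm256_permutevar8x32_ps(blended, idx));
#else
    // AVX1: vextractf128 + shufps, plus one more shufps to fix the element order
    __m128 lo = _mm256_castps256_ps128(blended);                 // { f0, f8, x, x }
    __m128 hi = _mm256_extractf128_ps(blended, 1);               // { f4, f12, x, x }
    __m128 m  = _mm_shuffle_ps(lo, hi, _MM_SHUFFLE(1, 0, 1, 0)); // { f0, f8, f4, f12 }
    return _mm_shuffle_ps(m, m, _MM_SHUFFLE(3, 1, 2, 0));        // { f0, f4, f8, f12 }
#endif
}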

answered Nov 01 '22 by Peter Cordes


If you have AVX2 available, you could use the VGATHERDPS instruction to achieve your goal, which was explained in this SO answer. In your case you would just have to initialize the index vector to 0,4,8,12 (the element indices of the wanted floats; the gather addressing mode scales them by 4 bytes each, giving byte offsets 0,16,32,48).

.data
  .align 16
  ddIndices dd 0,4,8,12   ; element indices; *4 bytes = byte offsets 0,16,32,48
  dpValues  REAL4 ...   ; replace 'dpValues' with your value array
.code
  lea        rsi, dpValues
  vmovdqa    xmm7, ddIndices

.loop:
  vpcmpeqw   xmm1, xmm1, xmm1           ; set mask to all-ones (the gather clears it)
  vpxor      xmm0, xmm0, xmm0           ; break dependency on previous gather
  vgatherdps xmm0, [rsi+xmm7*4], xmm1
  ; do something with gather result in xmm0

  add        rsi, 64                 ; advance 16 floats: next gather reads elements 16,20,24,28
  cmp        rsi, end_pointer
  jb      .loop                    ; do another gather with same indices, base += 64

XMM1 is the condition mask that selects which elements are loaded. The gather instruction clears the mask as it completes, which is why the loop re-sets it to all-ones before every gather.

Be aware that this instruction is not that fast on Haswell; the implementation is faster on Broadwell, and faster again on Skylake.

Even so, using a gather instruction for small-stride loads is probably only a win with 8-element ymm vectors on Skylake. According to Intel's optimization manual (11.16.4 Considerations for Gather Instructions), Broadwell hardware-gather with 4-element vectors has a best-case throughput of 1.56 cycles per element when the data is hot in L1D cache.
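With intrinsics, the same 4-element gather might look like this sketch (gather4_stride4 and p are made-up names; the unmasked _mm_i32gather_ps form handles the mask for you):

#include <immintrin.h>

/* A sketch with the AVX2 gather intrinsic; p needs at least 13 readable floats. */
static __m128 gather4_stride4(const float *p)
{
    const __m128i idx = _mm_setr_epi32(0, 4, 8, 12);  // element indices
    return _mm_i32gather_ps(p, idx, 4);               // scale 4: byte offsets 0,16,32,48
}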


On pre-AVX2 architectures there is no way (known to me) to do this other than loading each value separately, like this (using SSE4.1 insertps or pinsrd):

lea      esi, dpValues
movss    xmm0, [esi]          ; breaks dependency on old value of xmm0
insertps xmm0, [esi+4], 1<<4  ; dst element index in bits 5:4 of the imm8
insertps xmm0, [esi+8], 2<<4
insertps xmm0, [esi+12], 3<<4

For integer data, the last instruction would be pinsrd xmm0, [esi+12], 3.
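With intrinsics the same sequence might look like this sketch (load_stride4_sse41 is a made-up name; a compiler can fold each _mm_load_ss into a memory-source insertps):

#include <smmintrin.h>  // SSE4.1

static __m128 load_stride4_sse41(const float *f)
{
    __m128 v = _mm_load_ss(f);                         // movss: { f0, 0, 0, 0 }
    v = _mm_insert_ps(v, _mm_load_ss(f + 4),  1 << 4); // dst element index 1
    v = _mm_insert_ps(v, _mm_load_ss(f + 8),  2 << 4); // dst element index 2
    v = _mm_insert_ps(v, _mm_load_ss(f + 12), 3 << 4); // dst element index 3
    return v;
}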

Without SSE4.1, you can shuffle the results of separate movss loads together with unpcklps / unpcklpd, as in the sketch below.
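A minimal sketch of that (merge4_sse2 is a made-up name; movlhps would work equally well in place of unpcklpd):

#include <emmintrin.h>  // SSE2, for unpcklpd via _mm_unpacklo_pd

static __m128 merge4_sse2(const float *f)
{
    __m128 a = _mm_load_ss(f);         // { f0,  0, 0, 0 }
    __m128 b = _mm_load_ss(f + 4);     // { f4,  0, 0, 0 }
    __m128 c = _mm_load_ss(f + 8);     // { f8,  0, 0, 0 }
    __m128 d = _mm_load_ss(f + 12);    // { f12, 0, 0, 0 }
    __m128 ab = _mm_unpacklo_ps(a, b); // unpcklps: { f0, f4, 0, 0 }
    __m128 cd = _mm_unpacklo_ps(c, d); // unpcklps: { f8, f12, 0, 0 }
    // unpcklpd merges the low 64-bit halves: { f0, f4, f8, f12 }
    return _mm_castpd_ps(_mm_unpacklo_pd(_mm_castps_pd(ab), _mm_castps_pd(cd)));
}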

answered Nov 01 '22 by zx485