SSE2: How To Load Data From Non-Contiguous Memory Locations?

Question

I'm trying to vectorize some extremely performance critical code. At a high level, each loop iteration reads six floats from non-contiguous positions in a small array, then converts these values to double precision and adds them to six different double precision accumulators. These accumulators are the same across iterations, so they can live in registers. Due to the nature of the algorithm, it's not feasible to make the memory access pattern contiguous. The array is small enough to fit in L1 cache, though, so memory latency/bandwidth isn't a bottleneck.

I'm willing to use assembly language or SSE2 intrinsics to parallelize this. I know I need to load two floats at a time into the two lower dwords of an XMM register, convert them to two doubles using cvtps2pd, then add them to two accumulators at a time using addpd.

My question is, how do I get the two floats into the two lower dwords of a single XMM register if they aren't adjacent to each other in memory? Obviously any technique that's so slow that it defeats the purpose of parallelization isn't useful. An answer in either ASM or Intel/GCC intrinsics would be appreciated.

EDIT:

The size of the float array is, strictly speaking, not known at compile time but it's almost always 256, so this can be special cased.
The element of the float array that should be read is determined by loading a value from a byte array. There are six byte arrays, one for each accumulator. The reads from the byte array are sequential, one from each array for each loop iteration, so there shouldn't be many cache misses there.
The access pattern of the float array is for all practical purposes random.

gsg · Accepted Answer

For this specific case, take a look at the unpack-and-interleave instructions in your instruction reference manual. It would be something like

movss xmm0, <addr1>
movss xmm1, <addr2>
unpcklps xmm0, xmm1

Also take a look at shufps, which is handy whenever you have the data you want in the wrong order.

SSE2: How To Load Data From Non-Contiguous Memory Locations?

Tags:

performance

optimization

assembly

simd

sse

dsimcha

1 Answers

gsg

Recent Activity

Donate For Us

SSE2: How To Load Data From Non-Contiguous Memory Locations?

Tags:

performance

optimization

assembly

simd

sse

dsimcha

1 Answers

gsg

Related questions

Recent Activity

Donate For Us