I want to pack four 32-bit floats into xmm0, where each float currently sits in its own 128-bit register. For example, I have four floats, each using only the low 32 bits of its register:
xmm1: 10.2
xmm2: 5.8
xmm3: 9.3
xmm4: 12.7
I want them stored side by side in xmm0 as 10.2, 5.8, 9.3, 12.7.
I would also like to extract each of them separately after doing some math on xmm0 (e.g. mulps ..)
I've tried movlps and movhps, but they only load from memory and cannot take a 128-bit register as the source. I'd rather not go through memory, for performance reasons.
PSLLDQ might help, but is there a better solution for my problem?
Look at compiler output for _mm_set_ps(f3,f2,f1,f0) or for _mm_setr_ps(f0,f1,f2,f3) with your choice of tuning (-mtune) and -march options.
Or look at Agner Fog's optimization guide: he has a chapter on SSE/AVX with a handy table of data-movement instructions by type. It's great for learning which shuffles are available in the highly non-orthogonal SSE/AVX extensions.
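A minimal sketch of what to feed the compiler, assuming a C compiler targeting x86 with SSE enabled. The helper names pack4_setr and pack4_set are made up for illustration. Compiling something like this with optimization and asm output (for example `gcc -O2 -S`) shows which shuffles the compiler picks for your target.

```c
#include <xmmintrin.h>  /* SSE1: __m128, _mm_setr_ps, _mm_set_ps */

/* Hypothetical helper: f0 lands in element 0 (the low 32 bits),
   f3 in element 3 (the high 32 bits). */
static inline __m128 pack4_setr(float f0, float f1, float f2, float f3)
{
    return _mm_setr_ps(f0, f1, f2, f3);   /* arguments in memory order */
}

/* Same result with _mm_set_ps, which takes the high element first. */
static inline __m128 pack4_set(float f0, float f1, float f2, float f3)
{
    return _mm_set_ps(f3, f2, f1, f0);
}
```

Both helpers should produce the same vector; only the argument order of the intrinsic differs.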
As people have pointed out, the standard way is 2x unpcklps to merge pairs into vectors of [00ba] and [00dc], where 0 is a don't-care value, or actually 0.0 if the upper elements of your scalar floats happened to be zero. (My notation follows the Intel convention from diagrams of having the high element at the left, so left shifts move data to the left in this notation, and looking at your data with different element widths doesn't change how you write it.)
Then movlhps to copy the low qword of one xmm register to the high qword of another (merging into the existing value).
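The 2x unpcklps + movlhps sequence can be sketched with intrinsics roughly as follows. The function name merge4_unpck is hypothetical, and it assumes each input holds its float in element 0, with anything at all in the upper elements.

```c
#include <xmmintrin.h>  /* SSE1: _mm_unpacklo_ps, _mm_movelh_ps */

/* Hypothetical helper: a, b, c, d each carry one float in element 0.
   Elements 1..3 of the inputs are don't-care. */
static inline __m128 merge4_unpck(__m128 a, __m128 b, __m128 c, __m128 d)
{
    __m128 ba = _mm_unpacklo_ps(a, b);  /* unpcklps: [ b1 a1 b0 a0 ], low half = [ b a ] */
    __m128 dc = _mm_unpacklo_ps(c, d);  /* unpcklps: [ d1 c1 d0 c0 ], low half = [ d c ] */
    return _mm_movelh_ps(ba, dc);       /* movlhps:  [ d c b a ] */
}
```

The garbage from the upper input elements only ever reaches the high halves of the two intermediate vectors, and movlhps discards those.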
If this wasn't obvious and well-known to you, you should be writing in C with intrinsics and looking at the optimized compiler output to learn the basic techniques. clang has a very good shuffle optimizer that can find better ways to implement the logic of your intrinsics in asm.
Those 3 instructions are all shuffles, and on Intel Sandybridge-family CPUs are limited to 1 per clock throughput (competing for port 5).
If we have SSE4.1 available for blendps (with an immediate blend-control), we might be able to use that as the final step instead of a shuffle. It is not restricted to port 5: on Haswell and later it can run on any of the vector ALU ports.
I think we can use shufps to create vectors of [0c0a] and [d0b0]. The low 2 elements of the shufps output come from the first (source = destination) operand, and the high 2 come from the other source.
If your input vectors were actually zero-extended with definitely no high garbage, you can use SSE1 orps instead of a blend to get [dcba].
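A sketch of the shufps + orps variant, using only SSE1. The name merge4_shuf_or is made up, and the code assumes elements 1..3 of every input are exactly +0.0 (as they are after _mm_set_ss or a movss load from memory). With garbage in the upper elements the OR would produce wrong results.

```c
#include <xmmintrin.h>  /* SSE1: _mm_shuffle_ps, _mm_or_ps, _MM_SHUFFLE */

/* Hypothetical helper. ASSUMPTION: elements 1..3 of a, b, c, d are all
   +0.0 (all bits zero), so OR-ing the two shuffles cannot corrupt data. */
static inline __m128 merge4_shuf_or(__m128 a, __m128 b, __m128 c, __m128 d)
{
    /* shufps: the low two result elements are picked from the first
       operand, the high two from the second operand. */
    __m128 v_0c0a = _mm_shuffle_ps(a, c, _MM_SHUFFLE(1, 0, 1, 0)); /* [ 0 c 0 a ] */
    __m128 v_d0b0 = _mm_shuffle_ps(b, d, _MM_SHUFFLE(0, 1, 0, 1)); /* [ d 0 b 0 ] */
    return _mm_or_ps(v_0c0a, v_d0b0);                              /* [ d c b a ] */
}
```

This uses two shuffles instead of three. If SSE4.1 is available, replacing the final _mm_or_ps with _mm_blend_ps(v_0c0a, v_d0b0, 0xA) should give the same result without needing the zero-extension assumption, since the blend only selects the elements that hold real data.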