I have two pointers to two unaligned 8-byte chunks that I want to load into a single xmm register, using intrinsics if possible and, if possible, without an auxiliary register. pinsrd is not an option. (SSSE3, Core 2)
From the MSVC specs, it looks like you can do the following:
__m128d xx = _mm_setzero_pd(); // start from a zeroed xmm register (reading an uninitialised variable is undefined behaviour)
xx = _mm_loadh_pd(xx, ptra); // load the higher 64 bits from (unaligned) ptra
xx = _mm_loadl_pd(xx, ptrb); // load the lower 64 bits from (unaligned) ptrb
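For reference, here is a minimal self-contained sketch of the same idea. The helper name load_two_qwords and the const void* parameters are assumptions made for illustration, not part of the question. It uses _mm_load_sd for the low half, which zeroes the upper half and so avoids starting from an uninitialised register, followed by _mm_loadh_pd for the high half. Neither instruction requires an aligned address. The casts to const double* are the usual idiom with these intrinsics on x86 compilers.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: packs the 8 bytes at lo_src into bits 0..63 and the
   8 bytes at hi_src into bits 64..127 of one xmm register.
   Neither pointer needs any particular alignment. */
static __m128i load_two_qwords(const void *lo_src, const void *hi_src)
{
    assert(lo_src && hi_src);

    /* movsd from memory: loads 64 bits into the low half, zeroes the high half */
    __m128d v = _mm_load_sd((const double *)lo_src);

    /* movhpd from memory: replaces the high half, keeps the low half */
    v = _mm_loadh_pd(v, (const double *)hi_src);

    /* reinterpret as integer data; this cast generates no instruction */
    return _mm_castpd_si128(v);
}
```

A call such as load_two_qwords(ptrb, ptra) would give the same layout as the snippet above: ptrb in the lower 64 bits and ptra in the higher 64 bits.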
Loading from unaligned storage is (in my experience) much slower than loading from aligned pointers, so you probably wouldn't want to do this type of operation too often if you really want higher performance.
Hope this helps.
Unaligned access is much slower than aligned access (at least pre-Nehalem); you may get better speed by loading the aligned 128-bit words that contain the desired unaligned 64-bit words, then shuffling them to make the result you want.
Assumes:
e.g. (not tested)
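// needs <stdint.h> for uintptr_t; a pointer must be cast to an integer before masking
int aoff = (int)((uintptr_t)ptra & 15); // byte offset of ptra within its aligned 16-byte word
int boff = (int)((uintptr_t)ptrb & 15); // byte offset of ptrb within its aligned 16-byte word
// round each pointer down to a 16-byte boundary and do an aligned load
__m128 va = _mm_load_ps( (const float*)((const char*)ptra - aoff) );
__m128 vb = _mm_load_ps( (const float*)((const char*)ptrb - boff) );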
switch ( (aoff<<4) | boff )
{
case 0: _mm_shuffle_ps(va,vb, ...
The number of cases depends on whether you can assume 64-bit alignment.
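As an illustration of the simplest variant, here is a sketch for the case where both pointers can be assumed to be 64-bit aligned, which leaves only four cases. It is a sketch under stated assumptions, not the complete switch from the snippet above: the helper name load_two_qwords_aligned is hypothetical, it swaps in _mm_shuffle_pd (which selects whole 64-bit halves) in place of _mm_shuffle_ps, and it assumes that reading the rest of each enclosing aligned 16-byte block is safe. The switch is there because the shuffle selector must be a compile-time constant.

```c
#include <assert.h>
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: low half = 8 bytes at pa, high half = 8 bytes at pb.
   Assumes both pointers are 8-byte aligned, so each qword lies entirely
   inside one aligned 16-byte block. */
static __m128i load_two_qwords_aligned(const void *pa, const void *pb)
{
    uintptr_t a = (uintptr_t)pa;
    uintptr_t b = (uintptr_t)pb;
    assert((a & 7) == 0 && (b & 7) == 0);

    /* Round each address down to a 16-byte boundary and do an aligned load. */
    __m128d va = _mm_load_pd((const double *)(a & ~(uintptr_t)15));
    __m128d vb = _mm_load_pd((const double *)(b & ~(uintptr_t)15));

    /* bit 0: which half of va goes to the low lane of the result
       bit 1: which half of vb goes to the high lane of the result */
    int sel = (int)((a >> 3) & 1) | ((int)((b >> 3) & 1) << 1);

    __m128d r;
    switch (sel) {
    case 0:  r = _mm_shuffle_pd(va, vb, 0); break; /* low of va,  low of vb  */
    case 1:  r = _mm_shuffle_pd(va, vb, 1); break; /* high of va, low of vb  */
    case 2:  r = _mm_shuffle_pd(va, vb, 2); break; /* low of va,  high of vb */
    default: r = _mm_shuffle_pd(va, vb, 3); break; /* high of va, high of vb */
    }
    return _mm_castpd_si128(r);
}
```

Without the 64-bit alignment assumption, a chunk can straddle two aligned 16-byte blocks, which this sketch does not handle.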