I've been searching for a while now, but can't seem to find anything useful in the documentation or on SO. This question didn't really help me out, since it makes references to modifying the assembly and I am writing in C.
I have some code making indirect accesses that I want to vectorize.
for (i = 0; i < LENGTH; ++i) {
    foo[bar[i]] *= 2;
}
Since I have the indices I want to double inside bar, I was wondering if there was a way to load those indices of foo into a vector register, apply my math, and store the result back to the same indices.
Something like the following. The load and store instructions I just made up, because I couldn't find anything like them in the AVX or SSE documentation. I think I read somewhere that AVX2 has similar functions, but the processor I'm working with doesn't support AVX2.
for (i = 0; i < LENGTH; i += 8) {
    // For simplicity, I'm leaving out any pointer type casting
    __m256 ymm0 = _mm256_load_indirect(foo, bar + i);
    __m256 ymm1 = _mm256_set1_ps(2.0f); // Set up vector of just 2's
    __m256 ymm2 = _mm256_mul_ps(ymm0, ymm1);
    _mm256_store_indirect(foo, bar + i, ymm2);
}
Are there any instructions in AVX or SSE that will allow me to load a vector register with an array of indices from a different array? Or any "hacky" ways around it if there isn't an explicit function?
Gather-scatter is a type of memory addressing that often arises when addressing vectors in sparse linear algebra operations. It is the vector-equivalent of register indirect addressing, with gather involving indexed reads and scatter indexed writes.
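To make the definition concrete, here is a minimal scalar sketch of the two operations. The function names gather_f32 and scatter_f32 are made up for illustration and are not part of any library; the element type (float) and index type (int) are assumptions taken from the question's use of _mm256_mul_ps.

```c
#include <assert.h>
#include <stddef.h>

/* Gather: indexed reads. out[i] receives src[idx[i]].
   (Hypothetical helper; assumes every idx[i] is a valid index into src.) */
void gather_f32(float *out, const float *src, const int *idx, size_t n)
{
    assert(out != NULL && src != NULL && idx != NULL);
    for (size_t i = 0; i < n; ++i)
        out[i] = src[idx[i]];
}

/* Scatter: indexed writes. dst[idx[i]] receives in[i].
   (Hypothetical helper; if an index repeats, the last write wins.) */
void scatter_f32(float *dst, const float *in, const int *idx, size_t n)
{
    assert(dst != NULL && in != NULL && idx != NULL);
    for (size_t i = 0; i < n; ++i)
        dst[idx[i]] = in[i];
}
```

A hardware gather or scatter instruction does the same thing for a whole vector of indices at once.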
Scatter/Gather: the Basic Idea
Caches are built from rows – you want one piece of data, you get the whole row. If you want to manage your performance tightly, then you try to have as many related variables as possible on the same row so that you get more bang for your caching buck and reduce your cache misses.
(I'm writing an answer to this old question as I think it may help others.)
No. There are no scatter/gather instructions in the SSE and AVX instruction sets.
Scatter/gather instructions are expensive to implement (in terms of complexity and silicon real estate) because the scatter/gather mechanism needs to be deeply intertwined with the cache memory controller. I believe this is the reason this functionality was missing from SSE/AVX.
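As for a "hacky" way on a CPU that has AVX but not AVX2: one possible workaround is to do the indexed reads and writes with ordinary scalar code through a small temporary array, and use the vector unit only for the arithmetic. The sketch below is my own illustration, not something from the AVX documentation. It assumes foo holds floats, bar holds valid int indices, and that no index repeats within a group of 8 (a repeated index would be doubled once here, but twice in the original loop). The function names are made up, and the target attribute and __builtin_cpu_supports are GCC/Clang extensions; compiling with -mavx instead would also work.

```c
#include <assert.h>
#include <stddef.h>
#include <immintrin.h>

/* AVX (not AVX2) version: manual gather, vector multiply, manual scatter.
   Assumption: indices are valid and unique within each group of 8. */
__attribute__((target("avx")))
static void double_indexed_avx(float *foo, const int *bar, size_t length)
{
    const __m256 two = _mm256_set1_ps(2.0f);
    size_t i = 0;

    for (; i + 8 <= length; i += 8) {
        float tmp[8];

        /* "Gather" by hand: eight scalar indexed loads. */
        for (int k = 0; k < 8; ++k)
            tmp[k] = foo[bar[i + k]];

        __m256 v = _mm256_loadu_ps(tmp);
        v = _mm256_mul_ps(v, two);
        _mm256_storeu_ps(tmp, v);

        /* "Scatter" by hand: eight scalar indexed stores. */
        for (int k = 0; k < 8; ++k)
            foo[bar[i + k]] = tmp[k];
    }

    /* Leftover elements when length is not a multiple of 8. */
    for (; i < length; ++i)
        foo[bar[i]] *= 2.0f;
}

/* Hypothetical entry point: uses the AVX path if the CPU supports it,
   otherwise falls back to the plain scalar loop from the question. */
void double_indexed(float *foo, const int *bar, size_t length)
{
    assert(foo != NULL && bar != NULL);
    if (__builtin_cpu_supports("avx")) {
        double_indexed_avx(foo, bar, length);
    } else {
        for (size_t i = 0; i < length; ++i)
            foo[bar[i]] *= 2.0f;
    }
}
```

Since the only vectorized work is a single multiply per element, the scalar loads and stores are likely to dominate, so this is worth benchmarking against the plain loop before adopting it.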
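For newer instruction sets the situation is different. AVX2 has gather instructions (for example VGATHERDPS, available in C as _mm256_i32gather_ps), but still no scatter.
AVX-512 added scatter instructions as well (for example VSCATTERDPS, available in C as _mm512_i32scatter_ps).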
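For readers who do have AVX2, here is a rough sketch of how the loop from the question might look with a real gather. It uses the _mm256_i32gather_ps intrinsic for the indexed loads; because AVX2 has no scatter, the stores still go through a temporary array. The same assumptions apply as in the question: foo is an array of float, bar holds valid int indices, and no index repeats within a group of 8. The function names are made up, and the target attribute and __builtin_cpu_supports are GCC/Clang extensions (compiling with -mavx2 is the alternative).

```c
#include <assert.h>
#include <stddef.h>
#include <immintrin.h>

/* AVX2 version: hardware gather, vector multiply, manual scatter.
   Assumption: indices are valid and unique within each group of 8. */
__attribute__((target("avx2")))
static void double_indexed_avx2(float *foo, const int *bar, size_t length)
{
    const __m256 two = _mm256_set1_ps(2.0f);
    size_t i = 0;

    for (; i + 8 <= length; i += 8) {
        /* Load eight 32-bit indices, then gather foo[idx] for each one.
           The scale of 4 is the size of a float in bytes. */
        __m256i idx = _mm256_loadu_si256((const __m256i *)(bar + i));
        __m256 v = _mm256_i32gather_ps(foo, idx, 4);

        v = _mm256_mul_ps(v, two);

        /* No scatter in AVX2: spill to a buffer and store element by element. */
        float tmp[8];
        _mm256_storeu_ps(tmp, v);
        for (int k = 0; k < 8; ++k)
            foo[bar[i + k]] = tmp[k];
    }

    /* Leftover elements when length is not a multiple of 8. */
    for (; i < length; ++i)
        foo[bar[i]] *= 2.0f;
}

/* Hypothetical entry point with a scalar fallback for CPUs without AVX2. */
void double_indexed_gather(float *foo, const int *bar, size_t length)
{
    assert(foo != NULL && bar != NULL);
    if (__builtin_cpu_supports("avx2")) {
        double_indexed_avx2(foo, bar, length);
    } else {
        for (size_t i = 0; i < length; ++i)
            foo[bar[i]] *= 2.0f;
    }
}
```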
However, it is still a question whether using scatter/gather for such a simple operation would actually pay off.