How to do an indirect load (gather-scatter) in AVX or SSE instructions?

Tags: c, vector, avx, intel, sse

I've been searching for a while now, but can't seem to find anything useful in the documentation or on SO. This question didn't really help me out, since it makes references to modifying the assembly and I am writing in C.

I have some code making indirect accesses that I want to vectorize.

for (i = 0; i < LENGTH; ++i) {
   foo[bar[i]] *= 2;
}

Since bar holds the indices of the elements I want to double, I was wondering whether there is a way to load those elements of foo into a vector register, apply my math, and store the results back to the same indices.

Something like the following. The load and store instructions I just made up because I couldn't find anything like them in the AVX or SSE documentation. I think I read somewhere that AVX2 has similar functions, but the processor I'm working with doesn't support AVX2.

for (i = 0; i < LENGTH; i += 8) {
   // For simplicity, I'm leaving out any pointer type casting
   // _mm256_load_indirect and _mm256_store_indirect are made up
   __m256 ymm0 = _mm256_load_indirect(bar+i);
   __m256 ymm1 = _mm256_set1_ps(2.0f); // Set up vector of just 2's
   __m256 ymm2 = _mm256_mul_ps(ymm0, ymm1);
   _mm256_store_indirect(ymm2, bar+i);
}

Are there any instructions in AVX or SSE that will allow me to load a vector register with an array of indices from a different array? Or any "hacky" ways around it if there isn't an explicit function?

asked May 01 '16 20:05

The Unknown Dev

1 Answer

(I'm writing an answer to this old question because I think it may help others.)

Short answer

No. There are no scatter/gather instructions in the SSE and AVX instruction sets.
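As for a "hacky" way around it on plain AVX, one option is to do the gather and scatter with scalar code and keep only the arithmetic in a vector register. The following is a minimal sketch, not a tuned implementation. It assumes foo is an array of float, bar is an array of int, the indices inside each group of 8 are distinct (duplicates in one group would be doubled once instead of twice), and a GCC or Clang compiler on x86-64 (the target attribute is compiler specific; compiling with -mavx works as well). The function name double_indexed_avx is made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <immintrin.h>

/* Doubles foo[bar[i]] for i in [0, n).
   Assumption: indices within each group of 8 are distinct. */
__attribute__((target("avx")))
void double_indexed_avx(float *foo, const int *bar, size_t n)
{
    assert(foo != NULL && bar != NULL);
    const __m256 two = _mm256_set1_ps(2.0f);
    size_t i = 0;

    for (; i + 8 <= n; i += 8) {
        float tmp[8];

        /* Manual gather: AVX has no instruction for this. */
        for (int k = 0; k < 8; ++k)
            tmp[k] = foo[bar[i + k]];

        __m256 v = _mm256_loadu_ps(tmp);
        v = _mm256_mul_ps(v, two);
        _mm256_storeu_ps(tmp, v);

        /* Manual scatter back to the same indices. */
        for (int k = 0; k < 8; ++k)
            foo[bar[i + k]] = tmp[k];
    }

    /* Scalar tail for the remaining elements. */
    for (; i < n; ++i)
        foo[bar[i]] *= 2.0f;
}
```

For a single multiply the extra trips through tmp will likely cost more than they save; this layout only starts to make sense when much more arithmetic happens between the gather and the scatter.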

Longer answer

Scatter/gather instructions are expensive to implement (in terms of complexity and silicon real estate) because the scatter/gather mechanism needs to be deeply intertwined with the cache memory controller. I believe this is why the functionality is missing from SSE and AVX.

For newer instruction sets the situation is different. AVX2 adds gather instructions:

VGATHERDPD, VGATHERDPS, VGATHERQPD, VGATHERQPS for floating point gather (intrinsics such as _mm256_i32gather_ps)
VPGATHERDD, VPGATHERQD, VPGATHERDQ, VPGATHERQQ for integer gather (intrinsics such as _mm256_i32gather_epi32)
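To illustrate how the AVX2 gather could be applied to the loop from the question, here is a hedged sketch using _mm256_i32gather_ps (VGATHERDPS). AVX2 has no scatter, so the results are still written back lane by lane. It assumes float data, int indices, distinct indices within each group of 8, and GCC or Clang on x86-64. The function names and the runtime check through __builtin_cpu_supports are choices made for this example, so that a CPU without AVX2 falls back to the scalar loop.

```c
#include <assert.h>
#include <stddef.h>
#include <immintrin.h>

/* AVX2 path. Assumption: indices within each group of 8 are distinct. */
__attribute__((target("avx2")))
static void double_indexed_avx2_impl(float *foo, const int *bar, size_t n)
{
    const __m256 two = _mm256_set1_ps(2.0f);
    size_t i = 0;

    for (; i + 8 <= n; i += 8) {
        /* Load 8 indices, then gather foo[idx]; scale 4 = sizeof(float). */
        __m256i idx = _mm256_loadu_si256((const __m256i *)(bar + i));
        __m256 v = _mm256_i32gather_ps(foo, idx, 4);
        v = _mm256_mul_ps(v, two);

        /* No scatter in AVX2: spill to memory and store lane by lane. */
        float tmp[8];
        _mm256_storeu_ps(tmp, v);
        for (int k = 0; k < 8; ++k)
            foo[bar[i + k]] = tmp[k];
    }

    for (; i < n; ++i)
        foo[bar[i]] *= 2.0f;
}

/* Picks the AVX2 path only if the CPU reports support for it. */
void double_indexed_gather(float *foo, const int *bar, size_t n)
{
    assert(foo != NULL && bar != NULL);
    if (__builtin_cpu_supports("avx2")) {
        double_indexed_avx2_impl(foo, bar, n);
    } else {
        for (size_t i = 0; i < n; ++i)
            foo[bar[i]] *= 2.0f;
    }
}
```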

AVX-512 adds scatter instructions:

VSCATTERDPD, VSCATTERDPS, VSCATTERQPD, VSCATTERQPS for floating point scatter (intrinsics such as _mm512_i32scatter_ps)
VPSCATTERDD, VPSCATTERQD, VPSCATTERDQ, VPSCATTERQQ for integer scatter (intrinsics such as _mm512_i32scatter_epi32)
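With AVX-512 both directions are available, so the made-up load/store pair from the question maps fairly directly onto _mm512_i32gather_ps and _mm512_i32scatter_ps. The sketch below is only an illustration under the same assumptions as before: float data, int indices, GCC or Clang on x86-64, and made-up function names. If an index repeats within a group of 16, the element ends up doubled once rather than twice, so the sketch assumes distinct indices per group. Note that the gather intrinsic takes the index vector first, while the scatter intrinsic takes the base pointer first.

```c
#include <assert.h>
#include <stddef.h>
#include <immintrin.h>

/* AVX-512F path. Assumption: indices within each group of 16 are distinct. */
__attribute__((target("avx512f")))
static void double_indexed_avx512_impl(float *foo, const int *bar, size_t n)
{
    const __m512 two = _mm512_set1_ps(2.0f);
    size_t i = 0;

    for (; i + 16 <= n; i += 16) {
        __m512i idx = _mm512_loadu_si512(bar + i);      /* 16 indices  */
        __m512 v = _mm512_i32gather_ps(idx, foo, 4);    /* VGATHERDPS  */
        v = _mm512_mul_ps(v, two);
        _mm512_i32scatter_ps(foo, idx, v, 4);           /* VSCATTERDPS */
    }

    for (; i < n; ++i)
        foo[bar[i]] *= 2.0f;
}

/* Falls back to the scalar loop when the CPU lacks AVX-512F. */
void double_indexed_scatter(float *foo, const int *bar, size_t n)
{
    assert(foo != NULL && bar != NULL);
    if (__builtin_cpu_supports("avx512f")) {
        double_indexed_avx512_impl(foo, bar, n);
    } else {
        for (size_t i = 0; i < n; ++i)
            foo[bar[i]] *= 2.0f;
    }
}
```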

However, it remains an open question whether using scatter/gather for such a simple operation would actually pay off.

answered Sep 19 '22 20:09

Pibben
