Why is vectorization not beneficial in this for loop?

Q: Why is vectorization considered a powerful method for optimizing numerical code?

Vectorization is a type of parallel processing. It enables more computer hardware to be devoted to performing the computation, so the computation is done faster.

Tags:

auto-vectorization

I am trying to vectorize the for loop below. After compiling with the Rpass flag, I get the remark shown after it:

int someOuterVariable = 0;

for (unsigned int i = 7; i != -1; i--)
{
  array[someOuterVariable + i] -= 0.3 * anotherArray[i];
}

Remark:
the cost-model indicates that vectorization is not beneficial
the cost-model indicates that interleaving is not beneficial

I want to understand what this means. Does "interleaving is not beneficial" mean the array indexing is not proper?

asked Jan 12 '21 08:01

1 Answer

It's hard to answer without more details about your types. In general, though, starting a loop incurs some cost, and vectorising adds costs of its own (such as moving data to and from SIMD registers and ensuring the data is properly aligned).

My guess is that the compiler is telling you that the cost of vectorising here is higher than simply running the 8 iterations without it, so it doesn't vectorise.

Try increasing the number of iterations, or help the compiler reason about alignment, for example.
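If you want to check whether the cost model is being too pessimistic, one option is a loop hint. This is only a sketch: the function name subtract_scaled_hinted, its parameters, and the float element type are my assumptions, since the question doesn't show the types. The pragma is clang-specific, other compilers will normally ignore it (possibly with a warning), and clang can still decline if it can't prove the transformation is legal.

```cpp
#include <cstddef>

// Hypothetical helper: the name, the parameters and the float element type
// are assumptions made for illustration only.
void subtract_scaled_hinted(float* dst, const float* src, std::size_t n)
{
    // Clang loop hint: request vectorization and interleaving for this loop
    // instead of leaving the decision entirely to the cost model.
#pragma clang loop vectorize(enable) interleave(enable)
    for (std::size_t i = 0; i < n; ++i)
        dst[i] -= 0.3f * src[i];  // 0.3f keeps the arithmetic in float
}
```

Compare timings with and without the hint before keeping it; for a loop of 8 iterations the cost model may well be right.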

Typically, unless the array's element type has exactly the alignment required for a SIMD vector, accessing an array from an "unknown" offset (what you've called someOuterVariable) prevents the compiler from generating efficient vectorised code.
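To illustrate the kind of help I mean, here is a minimal sketch of the same computation with the alignment and aliasing facts spelled out for the compiler. Everything here is an assumption on my side: the float element type, the name scaled_subtract, and the 32-byte alignment (the caller has to actually guarantee it, otherwise the behaviour is undefined). __restrict and __builtin_assume_aligned are GCC/Clang extensions, not standard C++.

```cpp
#include <cstddef>

// Hypothetical rewrite of the loop from the question, assuming float data.
// The caller must pass 32-byte aligned, non-overlapping buffers.
void scaled_subtract(float* __restrict dst, const float* __restrict src, std::size_t n)
{
    // Tell the compiler about the alignment it cannot deduce on its own.
    float* __restrict d =
        static_cast<float*>(__builtin_assume_aligned(dst, 32));
    const float* __restrict s =
        static_cast<const float*>(__builtin_assume_aligned(src, 32));

    // Simple forward loop; 0.3f avoids a float -> double -> float round trip.
    for (std::size_t i = 0; i < n; ++i)
        d[i] -= 0.3f * s[i];
}
```

A caller would declare the buffers with something like alignas(32) float array[8]; and pass them directly, rather than adding a runtime offset whose alignment the compiler can't see.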

EDIT: About the "interleaving" question, it's hard to guess without knowing your tool. In general, interleaving means mixing 2 streams of computation so that the compute units of the CPU are all busy. For example, if you have 2 ALUs in your CPU, and the program is doing:

c = a + b;
d = e * f;

The compiler can interleave the computations so that the addition and the multiplication happen at the same time (provided you have 2 ALUs available). Typically, this means that the multiplication, which takes a bit longer to compute (for example 6 cycles), is started before the addition (for example 3 cycles). You then get the results of both operations after only 6 cycles, instead of the 9 it would take if the computations were serialized. This is only possible if there are no dependencies between the computations (if d required c, it would not work). A compiler is very cautious about this and, in your example, will not apply this optimization if it can't prove that array and anotherArray don't alias.
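The same idea applied to a loop can be sketched by hand. This is an illustration of the principle, not what clang actually emits: the function sum_interleaved and its float element type are hypothetical. It keeps two accumulators that don't depend on each other, so the two additions in each iteration can be in flight at the same time.

```cpp
#include <cstddef>

// Hypothetical example: sums n floats using two independent dependency
// chains (acc0 and acc1) instead of a single serial one.
float sum_interleaved(const float* data, std::size_t n)
{
    float acc0 = 0.0f;
    float acc1 = 0.0f;
    std::size_t i = 0;

    // Two elements per iteration; neither addition needs the other's result.
    for (; i + 1 < n; i += 2) {
        acc0 += data[i];
        acc1 += data[i + 1];
    }

    // Handle the leftover element when n is odd.
    if (i < n)
        acc0 += data[i];

    return acc0 + acc1;
}
```

Note that this changes the order of the floating-point additions, which is one reason a compiler won't do it on its own for a float reduction unless you allow it to reorder the arithmetic.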

answered Oct 13 '22 00:10

xryl669

Tags:

c++

vectorization

llvm

clang

auto-vectorization

The Doctor
