Do compilers usually emit vector (SIMD) instructions when not explicitly told to do so?

C++17 adds extensions for parallelism to the standard library (e.g. std::sort(std::execution::par_unseq, arr, arr + 1000), which will allow the sort to be done with multiple threads and with vector instructions).

I noticed that Microsoft's experimental implementation mentions that the VC++ compiler lacks support for vectorization here, which surprises me. I thought that modern C++ compilers were able to reason about the vectorizability of loops, but apparently the VC++ compiler/optimizer is unable to generate SIMD code even if explicitly told to do so. This seeming lack of automatic vectorization support contradicts the answers to this 2011 question on Quora, which suggest that compilers will vectorize where possible.

Maybe compilers only vectorize very obvious cases, such as a std::array<int, 4>, and nothing more, in which case C++17's explicit parallelization would be useful.

Hence my question: Do current compilers automatically vectorize my code when not explicitly told to do so? (To make this question more concrete, let's narrow this down to Intel x86 CPUs with SIMD support, and the latest versions of GCC, Clang, MSVC, and ICC.)

As an extension: do compilers for other languages do better automatic vectorization (perhaps due to language design), to the point that the C++ standards committee felt explicit (C++17-style) vectorization was necessary?

347

asked Jun 03 '17 04:06

Bernard

1 Answer

The best compiler for automatically spotting SIMD-style vectorisation (when told it can generate opcodes for the appropriate instruction sets, of course) is, in my experience, the Intel compiler, which can also generate code that dispatches dynamically on the actual CPU if required. It is closely followed by GCC and Clang, with MSVC last of your four.

This is perhaps unsurprising I realise - Intel do have a vested interest in helping developers exploit the latest features they've been adding to their offerings.

I'm working quite closely with Intel, and while they are keen to demonstrate how their compiler can spot auto-vectorisation opportunities, they also very rightly point out that their compiler lets you use #pragma simd constructs to tell the compiler which assumptions can or can't be made (assumptions that are unclear at a purely syntactic level), and hence allow it to vectorise the code further without resorting to intrinsics.

This, I think, points at the issue with hoping that the compiler (for C++ or another language) will do all the vectorisation work... if you have simple vector processing loops (e.g. multiplying all the elements in a vector by a scalar) then yes, you could expect that 3 of the 4 compilers would spot that.
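As an illustration of that "simple loop" case, here is a minimal sketch of the kind of code an auto-vectoriser can usually handle. The function name `scale` is made up for this example, and the flags in the comments are the ones I'd expect to matter; whether a given compiler version actually vectorises it is something to check rather than assume.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical example: multiply every element of a vector by a scalar.
// Iterations are independent and memory access is contiguous, which is
// the shape auto-vectorisers typically look for.
// Assumed optimisation flags: -O3 for GCC, -O2 or higher for Clang,
// /O2 for MSVC.
void scale(std::vector<float>& v, float k) {
    float* data = v.data();
    const std::size_t n = v.size();
    for (std::size_t i = 0; i < n; ++i) {
        data[i] *= k;
    }
}
```

To see whether the loop was vectorised, the compilers can report it themselves: GCC has -fopt-info-vec, Clang has -Rpass=loop-vectorize, and MSVC has /Qvec-report:1.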

But for more complicated code, the vectorisation gains that can be had come not from simple loop unwinding and combining iterations, but from actually using a different or tweaked algorithm, and that's going to be hard, if not impossible, for a compiler to do completely alone. Whereas if you understand how vectorisation might be applied to an algorithm, and you can structure your code to allow the compiler to see the opportunities to do so, perhaps with #pragma simd constructs or OpenMP, then you may get the results you want.
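A rough sketch of what such a hint can look like, using the OpenMP `simd` directive. The function `saxpy` and its parameters are illustrative only. The assumption here is that the code is built with OpenMP SIMD support enabled (for example -fopenmp-simd on GCC or Clang); without it the pragma is simply ignored and the loop still runs correctly as scalar code.

```cpp
#include <cstddef>

// Hypothetical example: y[i] += a * x[i].
// With raw pointers the compiler cannot prove from the syntax alone
// that x and y do not overlap. The pragma states that the iterations
// may safely be executed in SIMD fashion; making sure that is true
// is the caller's responsibility.
void saxpy(float a, const float* x, float* y, std::size_t n) {
    #pragma omp simd
    for (std::size_t i = 0; i < n; ++i) {
        y[i] += a * x[i];
    }
}
```

The point is that the pragma supplies a guarantee, it does not perform the vectorisation itself: the compiler still decides what code to emit, so it is worth inspecting the generated assembly.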

Vectorisation comes when the code has a certain mechanical sympathy for the underlying CPU and memory bus - if you have that then I think the Intel compiler will be your best bet. Without it, changing compilers may make little difference.

Can I recommend Matt Godbolt's Compiler Explorer as a way to actually test this? Put your C++ code in there and look at what different compilers actually generate. Very handy... it doesn't include older versions of MSVC (I think it currently supports VC++ 2017 and later) but will show you what different versions of ICC, GCC, Clang and others can do with your code.

121

answered Nov 11 '22 20:11

Tim

Tags:

vectorization

parallel-processing

c++17

simd

auto-vectorization
