Are there programming languages or language extensions that rely on implicit vectorization?
I need something that makes aggressive assumptions in order to generate good DLP/vectorized code for SSE4.1, AVX, and AVX2 (with or without FMA3/FMA4), in single and double precision, from scalar C code.
For the last 10 years I have had fun relying on Intel's intrinsics to write my HPC kernels with explicit vectorization. At the same time, I have been regularly disappointed by the quality of the DLP code generated by C/C++ compilers (GCC, Clang/LLVM, etc.; in case you ask, I can post specific examples).
One look at the Intrinsics Guide makes it clear that writing HPC kernels "manually" with intrinsics for modern platforms is no longer a sustainable option, unless I have an army of programmers: there are too many versions and combinations (SSE4.1, AVX, AVX2, AVX-512 and its flavors, FMA, single, double, half precision?). It is just not sustainable if my target platforms are, let's say, the most widespread ones since 2012.
I recently tried the Intel Offline Compiler for OpenCL (CPU). I wrote the kernel "a la CUDA" (i.e. scalar code, implicit vectorization), and to my surprise the generated assembly was very well vectorized (Skylake, AVX2 + FMA, SP). The only limitation I encountered was the lack of built-in functions for data reductions / inter-work-item communication that do not go through shared (local) memory; on the CPU these would translate into horizontal adds, or shuffles plus min/max operations.
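"A la CUDA" here simply means writing plain scalar per-work-item code and letting the compiler vectorize across neighbouring work items. A minimal OpenCL C sketch of that style (kernel name and arguments invented purely for illustration, not my actual kernel) would be:

```
// Minimal OpenCL C sketch of the "scalar code, implicit vectorization" style:
// each work item computes one output element; the offline compiler packs
// neighbouring work items into SIMD lanes (e.g. AVX2 + FMA on Skylake).
__kernel void saxpy(const float a,
                    __global const float *x,
                    __global const float *y,
                    __global float *out)
{
    const size_t i = get_global_id(0);   // one scalar element per work item
    out[i] = fma(a, x[i], y[i]);         // becomes a vector FMA across lanes
}
```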
As pointed out by clemens and sschuberth, the offline compiler is not really a solution unless I fully embrace OpenCL, or unless I hack my caller code to comply with the calling convention of the generated assembly, which includes parameters I would not need, such as the ndrange. Fully embracing OpenCL is not an option for me either, since for TLP I rely on OpenMP and Pthreads (and for ILP on the hardware).
First off, it is worth recalling that implicit vectorization and autovectorization are not the same thing. I have lost hope in autovectorization (as mentioned above), but not in implicit vectorization.
One of the answers below asks for some code examples. Here I provide a code example: a kernel implementing a third-order upwind scheme for the convection term of the Navier-Stokes equations on a 3D structured block. It is worth mentioning that this is a trivial example, since no SIMD inter-lane cooperation/communication is required.
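The original listing is not reproduced here; the scalar C sketch below only illustrates the kind of loop nest involved (the x-direction contribution of a third-order upwind-biased convection term). The indexing macro, names, and stencil coefficients are illustrative, not the code the benchmarks below were run on.

```
/* Illustrative sketch only, not the benchmarked kernel.
 * Third-order upwind-biased approximation of u * d(phi)/dx on a
 * 3D structured block (x-direction term only, two ghost layers). */
#define IDX(i, j, k) ((i) + nx * ((j) + ny * (k)))  /* uses nx, ny in scope */

void convection_x(int nx, int ny, int nz, double dx,
                  const double *restrict u,    /* convecting velocity */
                  const double *restrict phi,  /* convected field     */
                  double *restrict conv)       /* accumulated term    */
{
    const double c = 1.0 / (6.0 * dx);
    for (int k = 2; k < nz - 2; ++k)
        for (int j = 2; j < ny - 2; ++j)
            for (int i = 2; i < nx - 2; ++i) {
                const int id = IDX(i, j, k);
                /* upwind-biased stencils for positive / negative velocity */
                const double dphi_p = ( 2.0 * phi[IDX(i + 1, j, k)]
                                      + 3.0 * phi[id]
                                      - 6.0 * phi[IDX(i - 1, j, k)]
                                      +       phi[IDX(i - 2, j, k)]) * c;
                const double dphi_m = (-2.0 * phi[IDX(i - 1, j, k)]
                                      - 3.0 * phi[id]
                                      + 6.0 * phi[IDX(i + 1, j, k)]
                                      -       phi[IDX(i + 2, j, k)]) * c;
                /* pick the upwind side according to the sign of u */
                conv[id] += u[id] >= 0.0 ? u[id] * dphi_p : u[id] * dphi_m;
            }
}
```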
Fortran, the first widely used high-level programming language, has been used in HPC since the 1950s and will likely remain in use for a long time: there are still many active Fortran projects, plenty of libraries around that cannot be given up, and Fortran handles multi-dimensional arrays very comfortably. It is a general-purpose language aimed at scientific computation, known for its high performance on the fastest supercomputers, and it remains widely used for numerical programming because of its speed.
At the present time, the best option is the Intel SPMD Program Compiler (ISPC). ISPC is an open-source compiler whose programming model relies on implicit vectorization (a term borrowed from the Intel OpenCL SDK documentation) to emit vectorized assembly code. ISPC maps the source code to SSE4.1, AVX, AVX2, KNC, and KNL AVX-512 instructions, in both SP and DP, and uses LLVM as its backend.
For CFD kernels it simply delivers unmatched performance. For the portions of code that have to stay scalar, one simply adds the "uniform" keyword to the associated variables. There are built-in functions for inter-lane communication, such as shuffle, broadcast, and reduce_add.
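For illustration only (this is not the benchmarked kernel), a tiny ISPC function using those features might look like the sketch below; the function and its signature are invented for this answer.

```
// Illustrative ISPC sketch (not the benchmarked kernel): a dot product.
// "uniform" marks data that is identical across all program instances
// (SIMD lanes); everything else is "varying", one value per lane.
export uniform float dot(uniform float a[], uniform float b[],
                         uniform int n)
{
    float partial = 0.0f;          // varying: one partial sum per lane
    foreach (i = 0 ... n) {        // iterations are spread over the lanes
        partial += a[i] * b[i];
    }
    return reduce_add(partial);    // built-in cross-lane reduction
}
```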
Why is ISPC so fast compared to the other C/C++ compilers? My guess is that it is because C/C++ compilers must assume that nothing can be vectorized unless there is clear evidence of the opposite, whereas ISPC assumes that every line of code is executed (independently) by all SIMD lanes unless otherwise specified.
I wonder why ISPC is not widely embraced yet. Maybe it is because of its young age, but it has already shown great capabilities (Embree, OSPRay) in the CG/scientific-visualization community. ISPC is a good option for writing HPC kernels, as it seems to nicely bridge the performance-productivity gap.
For the trivial kernel example referenced in the question, the following results were obtained using GCC 4.9.X and ISPC 1.8.2. Performance is reported in terms of FLOPs per cycle.
ICC results are not reported here (in terms of accessibility, is it entirely fair to compare ICC against free and open-source compilers?). Nonetheless, the maximum gain of ICC over GCC in this case was about 4X, which does not compromise the superiority of ISPC.