Are there programming languages or language extensions that rely on implicit vectorization?
I need something that makes aggressive assumptions in order to generate good DLP/vectorized code for SSE4.1, AVX, and AVX2 (with or without FMA3/FMA4), in single and double precision, from scalar C code.
For the last 10 years I have had fun relying on Intel's intrinsics to write my HPC kernels with explicit vectorization. At the same time, I have been regularly disappointed by the quality of the DLP code generated by C/C++ compilers (GCC, Clang/LLVM, etc.; in case you ask, I can post specific examples).
One look at the Intrinsics Guide makes it clear that writing HPC kernels "manually" with intrinsics for modern platforms is no longer a sustainable option, unless I have an army of programmers: there are too many versions and combinations (SSE4.1, AVX, AVX2, AVX-512 and its flavors, FMA, single, double, half precision?). It is just not sustainable if my target platforms are, let's say, the most widespread ones since 2012.
I recently tried the Intel Offline Compiler for OpenCL (CPU). I wrote the kernel "a la CUDA" (i.e. scalar code, implicit vectorization), and to my surprise the generated assembly was very well vectorized (Skylake, AVX2 + FMA, SP). The only limitation I encountered was the lack of built-in functions for data reductions / inter-work-item communication that do not go through shared (local) memory; on the CPU these would translate into horizontal adds, or shuffles plus min/max operations.
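"A la CUDA" here simply means writing plain scalar per-work-item code and letting the compiler vectorize across neighbouring work items. A minimal OpenCL C sketch of that style (kernel name and arguments invented purely for illustration, not my actual kernel) would be:

```
// Minimal OpenCL C sketch of the "scalar code, implicit vectorization" style:
// each work item computes one output element; the offline compiler packs
// neighbouring work items into SIMD lanes (e.g. AVX2 + FMA on Skylake).
__kernel void saxpy(const float a,
                    __global const float *x,
                    __global const float *y,
                    __global float *out)
{
    const size_t i = get_global_id(0);   // one scalar element per work item
    out[i] = fma(a, x[i], y[i]);         // becomes a vector FMA across lanes
}
```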
As pointed out by clemens and sschuberth, the offline compiler is not really a solution unless I fully embrace OpenCL, or unless I hack my caller code to comply with the calling convention of the generated assembly, which includes parameters I would not need, such as the ndrange. Fully embracing OpenCL is not an option for me either, since for TLP I rely on OpenMP and Pthreads (and for ILP on the hardware).
First off, it is worth recalling that implicit vectorization and autovectorization are not the same thing. I have lost hope in autovectorization (as mentioned above), but not in implicit vectorization.
One of the answers below asks for some code examples. Here I provide a code example: a kernel implementing a third-order upwind scheme for the convection term of the Navier-Stokes equations on a 3D structured block. It is worth mentioning that this is a trivial example, since no SIMD inter-lane cooperation/communication is required.
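The original listing is not reproduced here; the scalar C sketch below only illustrates the kind of loop nest involved (the x-direction contribution of a third-order upwind-biased convection term). The indexing macro, names, and stencil coefficients are illustrative, not the code the benchmarks below were run on.

```
/* Illustrative sketch only, not the benchmarked kernel.
 * Third-order upwind-biased approximation of u * d(phi)/dx on a
 * 3D structured block (x-direction term only, two ghost layers). */
#define IDX(i, j, k) ((i) + nx * ((j) + ny * (k)))  /* uses nx, ny in scope */

void convection_x(int nx, int ny, int nz, double dx,
                  const double *restrict u,    /* convecting velocity */
                  const double *restrict phi,  /* convected field     */
                  double *restrict conv)       /* accumulated term    */
{
    const double c = 1.0 / (6.0 * dx);
    for (int k = 2; k < nz - 2; ++k)
        for (int j = 2; j < ny - 2; ++j)
            for (int i = 2; i < nx - 2; ++i) {
                const int id = IDX(i, j, k);
                /* upwind-biased stencils for positive / negative velocity */
                const double dphi_p = ( 2.0 * phi[IDX(i + 1, j, k)]
                                      + 3.0 * phi[id]
                                      - 6.0 * phi[IDX(i - 1, j, k)]
                                      +       phi[IDX(i - 2, j, k)]) * c;
                const double dphi_m = (-2.0 * phi[IDX(i - 1, j, k)]
                                      - 3.0 * phi[id]
                                      + 6.0 * phi[IDX(i + 1, j, k)]
                                      -       phi[IDX(i + 2, j, k)]) * c;
                /* pick the upwind side according to the sign of u */
                conv[id] += u[id] >= 0.0 ? u[id] * dphi_p : u[id] * dphi_m;
            }
}
```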
Fortran, the first widely used high-level programming language, has been used in HPC since the 1950s and will likely remain in use for a long time: there are still many active Fortran projects, plenty of libraries around that cannot be given up, and Fortran handles multi-dimensional arrays very comfortably. It is a general-purpose language aimed at scientific computation, known for its high performance on the fastest supercomputers, and it remains widely used for numerical programming because of its speed.
At the present time, the best option is the Intel SPMD Program Compiler (ISPC). ISPC is an open-source compiler whose programming model relies on implicit vectorization (a term borrowed from the Intel OpenCL SDK documentation) to emit vectorized assembly code. ISPC maps the source code to SSE4.1, AVX, AVX2, KNC, and KNL AVX-512 instructions, in both SP and DP, and uses LLVM as its backend.
For CFD kernels it simply delivers unmatched performance. For the portions of code that have to stay scalar, one simply adds the "uniform" keyword to the associated variables. There are built-in functions for inter-lane communication, such as shuffle, broadcast, and reduce_add.
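For illustration only (this is not the benchmarked kernel), a tiny ISPC function using those features might look like the sketch below; the function and its signature are invented for this answer.

```
// Illustrative ISPC sketch (not the benchmarked kernel): a dot product.
// "uniform" marks data that is identical across all program instances
// (SIMD lanes); everything else is "varying", one value per lane.
export uniform float dot(uniform float a[], uniform float b[],
                         uniform int n)
{
    float partial = 0.0f;          // varying: one partial sum per lane
    foreach (i = 0 ... n) {        // iterations are spread over the lanes
        partial += a[i] * b[i];
    }
    return reduce_add(partial);    // built-in cross-lane reduction
}
```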
Why is ISPC so fast compared to the other C/C++ compilers? My guess is that it is because C/C++ compilers must assume that nothing can be vectorized unless there is clear evidence of the opposite, whereas ISPC assumes that every line of code is executed (independently) by all SIMD lanes unless otherwise specified.
I wonder why ISPC is not widely embraced yet. Maybe it is because of its young age, but it has already shown great capabilities (Embree, OSPRay) in the CG/scientific-visualization community. ISPC is a good option for writing HPC kernels, as it seems to nicely bridge the performance-productivity gap.
For the trivial kernel example referenced in the question, the following results were obtained using GCC 4.9.X and ISPC 1.8.2. Performance is reported in terms of FLOPs per cycle.
ICC results are not reported here (in terms of accessibility, is it entirely fair to compare ICC against free and open-source compilers?). Nonetheless, the maximum gain of ICC over GCC in this case was about 4X, which does not compromise the superiority of ISPC.