 

Does R leverage SIMD when doing vectorized calculations?

Given a dataframe like this in R:

+---+---+
| X | Y |
+---+---+
| 1 | 2 |
| 2 | 4 |
| 4 | 5 |
+---+---+

If a vectorized operation is performed on this dataframe, like so:

data$Z <- data$X * data$Y

Will this leverage the processor's single-instruction-multiple-data (SIMD) capabilities to optimize the performance? This seems like the perfect case for it, but I cannot find anything that confirms my hunch.

asked May 13 '16 by Jochen van Wylick


People also ask

What is vectorization SIMD?

Vectorization is the process of converting an algorithm from operating on a single value at a time to operating on a set of values at one time. Modern CPUs provide direct support for vector operations where a single instruction is applied to multiple data (SIMD).

How does SIMD work?

SIMD is short for Single Instruction/Multiple Data, while the term SIMD operations refers to a computing method that enables processing of multiple data with a single instruction. In contrast, the conventional sequential approach using one instruction to process each individual data is called scalar operations.

Does GCC use SIMD?

The GNU Compiler Collection, gcc, offers multiple ways to perform SIMD calculations.

What are vector intrinsics?

To a programmer, intrinsics look just like regular library functions: you include the relevant header, and you can use the intrinsic. For example, to add four float numbers to another four numbers, use the _mm_add_ps intrinsic in your code.
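For instance, here is a minimal, hedged sketch of that idea in C (assuming an x86-64 machine; _mm_add_ps and the <xmmintrin.h> header belong to SSE, which every x86-64 CPU supports):

/* Add four floats to four floats with one SSE instruction via _mm_add_ps.
   Compile with e.g. "gcc -O2 demo.c" on x86-64 (SSE is enabled by default). */
#include <stdio.h>
#include <xmmintrin.h>   /* SSE intrinsics, including _mm_add_ps */

int main(void) {
    float a[4] = {1, 2, 3, 4}, b[4] = {10, 20, 30, 40}, c[4];

    __m128 va = _mm_loadu_ps(a);      /* load 4 floats into a vector register */
    __m128 vb = _mm_loadu_ps(b);
    __m128 vc = _mm_add_ps(va, vb);   /* one instruction adds all 4 lanes */
    _mm_storeu_ps(c, vc);             /* store the 4 sums back to memory */

    printf("%g %g %g %g\n", c[0], c[1], c[2], c[3]);
    return 0;
}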


2 Answers

I just got a "good answer" badge two years after my initial answer. Thanks for acknowledging the quality of this answer. In return, I have substantially enriched the original content. This answer / article is now intended for any R user who wants to try out SIMD. It has the following sections:

  • Background: what is SIMD?
  • Summary: how can we leverage SIMD for our assembly or compiled code?
  • R side: how can R possibly use SIMD?
  • Performance: is vector code always faster than scalar code?
  • Writing R extensions: write compiled code with OpenMP SIMD

Background: what is SIMD?

Many R programmers may not know SIMD if they don't write assembly or compiled code.

SIMD (single instruction, multiple data) is a data-level parallel processing technology with a very long history. Before personal computers were around, SIMD unambiguously referred to the vector processing done by a vector processor, and was the main route to high-performance computing. When personal computers later came onto the market, they had no features resembling those of a vector processor. However, as the demand for processing multimedia data grew, they began to acquire vector registers, together with sets of vector instructions that use those registers for vector data load, vector arithmetic and vector data store. The capacity of vector registers has kept growing, and the vector instruction sets have become increasingly versatile. Today they can do streaming load / store, strided load / store, scattered load / store, vector element shuffling, vector arithmetic (including fused arithmetic like fused multiply-add), vector logical operations, masking, etc. So they look more and more like the mini vector processors of the old days.

Although SIMD has been with personal computers for nearly two decades, many programmers are unaware of it. Instead, many are familiar with thread-level parallelism like multi-core computing (which can be referred to as MIMD). So if you are new to SIMD, you are highly recommended to watch the YouTube video Utilizing the other 80% of your system's performance: Starting with Vectorization by Ulrich Drepper of Red Hat.

Since vector instruction sets are extensions to the original architecture instruction sets, you have to invest extra effort to use them. If you write assembly code, you can call these instructions straight away. If you write compiled code like C, C++ or Fortran, you have to write inline assembly or use vector intrinsics. A vector intrinsic appears like a function, but it is in fact inline assembly that maps to a vector assembly instruction. These intrinsics (or "functions") are not part of the standard library of a compiled language; they are provided by the architecture / machine. To use them, we need to include the appropriate header files and pass compiler-specific flags.

Let's first define the following for ease of later discussions:

  1. Writing assembly or compiled code that does not use vector instruction sets is called "writing scalar code";
  2. Writing assembly or compiled code that uses vector instruction sets is called "writing vector code".

So these two paths are "write A, get A" and "write B, get B". However, compilers are getting stronger and there is another "write A, get B" path:

  1. They have the power to translate your scalar compiled code into vector assembly code, a compiler optimization called "auto-vectorization".

Some compilers like GCC consider auto-vectorization part of the highest optimization level and enable it with the flag -O3, while more aggressive compilers like ICC (Intel C++ compiler) and clang enable it at -O2. Auto-vectorization can also be controlled directly by specific flags; for example, GCC has -ftree-vectorize. When exploiting auto-vectorization, it is advisable to further hint compilers to tailor the vector assembly code to the machine. For example, for GCC we may pass -march=native and for ICC we use -xHost. This makes sense, because even within the x86-64 architecture family, later microarchitectures come with more vector instruction sets. For example, Sandy Bridge supports vector instruction sets up to AVX, Haswell further supports AVX2 and FMA3, and Skylake server CPUs further support AVX-512. Without -march=native, GCC only generates vector instructions using instruction sets up to SSE2, a much smaller subset common to all x86-64 CPUs.
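As a concrete sketch (the function is made up for illustration), a plain scalar loop like the element-wise product from the question is exactly the kind of code auto-vectorization targets:

/* Scalar C code for an element-wise (Hadamard) product.  Compiled with
   "gcc -O3 -march=native" (or "-O2 -ftree-vectorize -march=native"), GCC will
   typically turn this loop into SSE / AVX vector instructions automatically. */
void hadamard(const double *x, const double *y, double *z, int n)
{
    for (int i = 0; i < n; i++)
        z[i] = x[i] * y[i];
}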


Summary: how can we leverage SIMD for our assembly or compiled code?

There are five ways to implement SIMD:

  1. Writing machine-specific vector assembly code directly. For example, on x86-64 we use SSE / AVX instruction sets, and on ARM architectures we use NEON instruction sets.

    • Pros: Code can be hand-tuned for better performance;
    • Cons: We have to write different versions of assembly code for different machines.
  2. Writing vector compiled code by using machine-specific vector intrinsics and compiling it with compiler-specific flags. For example, on x86-64 we use SSE / AVX intrinsics, and for GCC we set -msse2, -mavx, etc. (or simply -march=native for auto-detection). A variant of this option is to write compiler-specific inline assembly. For example, an introduction to GCC's inline assembly can be found here.

    • Pros: Writing compiled code is easier than writing assembly code, and the code is more readable hence easier to maintain;
    • Cons: We have to write different versions of code for different machines, and adapt Makefile for different compilers.
  3. Writing vector compiled code by using compiler-specific vector extensions. Some compilers have defined their own vector data types. For example, GCC's vector extensions can be found here (a minimal sketch is given after this list);

    • Pros: We don't need to worry about differences across architectures, as the compiler can generate machine-specific assembly;
    • Cons: We have to write different versions of code for different compilers, and adapt the Makefile likewise.
  4. Writing scalar compiled code and using compiler-specific flags for auto-vectorization. Optionally we can insert compiler-specific pragmas into our compiled code to give compilers more hints on, say, data alignment, loop unrolling depth, etc.

    • Pros: Writing scalar compiled code is easier than writing vector compiled code, and it is more readable to a broad audience.
    • Cons: We have to adapt the Makefile for different compilers, and in case we have used pragmas, they also need to be versioned.
  5. Writing scalar compiled code and inserting the OpenMP pragma #pragma omp simd (requires OpenMP 4.0+).

    • Pros: Same as option 4; additionally, we can use a single version of the pragma, as most mainstream compilers support the OpenMP standard;
    • Cons: We have to adapt the Makefile for different compilers, as they may have different flags to enable OpenMP and machine-specific tuning.
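To give a flavour of option 3 (the sketch promised above), here is what GCC's vector extensions look like; the type name v4d is made up, and the code assumes GCC or clang:

/* Option 3 sketch: GCC vector extensions (also accepted by clang).
   "v4d" is an arbitrary name for a vector of four doubles; arithmetic on it is
   lowered to machine-specific vector instructions by the compiler. */
typedef double v4d __attribute__((vector_size(32)));   /* 4 doubles = 32 bytes */

v4d mul4(v4d x, v4d y)
{
    return x * y;   /* element-wise multiply, emitted as vector instructions */
}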

From top to bottom, programmers progressively do less and compilers do increasingly more. Implementing SIMD is interesting, but this article unfortunately does not have room for decent coverage with examples. Instead, I will provide the most informative references I have found.

  • For options 1 and 2 on x86-64, SSE / AVX intrinsics is definitely the best reference card, but not the right place to start learning these instructions. Where to start varies from person to person. I picked up intrinsics / assembly from BLISLab when I tried to write my own high-performance DGEMM (to be introduced later). After digesting the example code over there I started practising, and posted a few questions on StackOverflow or CodeReview when I got stuck.

  • For option 4, a good explanation is given by A guide to auto-vectorization with Intel C++ compilers. Although the manual is for ICC, the principles of how auto-vectorization works apply to GCC as well. The official page for GCC's auto-vectorization is rather outdated, and this presentation is more useful: GCC auto-vectorization.

  • For option 5, there is a very good technical report by Oak Ridge National Laboratory: Effective Vectorization with OpenMP 4.5.

In terms of portability,

  • Options 1 to 3 are not easily portable, because the version of vector code depends on the machine and / or the compiler;

  • Option 4 is much better as we get rid of the machine dependency, but we still have a problem with compiler dependency;

  • Option 5 is very close to portable, as adapting the Makefile is much easier than adapting code.

In terms of performance, it is conventionally believed that option 1 is the best, and that performance degrades as we move downward. However, compilers are getting better, and newer machines have hardware improvements (for example, the performance penalty for unaligned vector loads is smaller). So the outlook for auto-vectorization is very positive. As part of my own DGEMM case study, I found that on an Intel Xeon E5-2650 v2 workstation with a peak performance of 18 GFLOPs per CPU, GCC's auto-vectorization attained 14 ~ 15 GFLOPs, which is rather impressive.


R side: how can R possibly use SIMD?

R can only use SIMD by calling compiled code that exploits SIMD. Compiled code in R has three sources:

  1. R packages with "base" priority, like base, stats, utils, etc, that come with R's source code;
  2. Other packages on CRAN that require compilation;
  3. External scientific libraries like BLAS and LAPACK.

Since the R software itself is portable across architectures, platforms and operating systems, and CRAN policy expects an R package to be equally portable, compiled code in sources 1 and 2 cannot be written in assembly or compiler-dependent compiled code, ruling out options 1 to 3 for SIMD implementation. Auto-vectorization is the only chance left for R to leverage SIMD.

If you have built R with the compiler's auto-vectorization enabled, compiled code from source 1 can exploit SIMD. In fact, although R is written in a portable way, you can tune it for your machine when building it. For example, I would do icc -xHost -O2 with ICC or gcc -march=native -O2 -ftree-vectorize -ffast-math with GCC. Flags are set at R's build time and recorded in RHOME/etc/Makeconf (on Linux). Usually people just do a quick build, so flag configurations are decided automatically. The result can differ depending on your machine and your default compiler. On a Linux machine with GCC, the optimization flag is often automatically set to -O2, hence auto-vectorization is off; on a Mac OS X machine with clang, by contrast, auto-vectorization is on at -O2. So I suggest you check your Makeconf.

Flags in Makeconf are used when you run R CMD SHLIB, which is invoked by R CMD INSTALL or install.packages when installing CRAN R packages that need compilation. By default, if Makeconf says that auto-vectorization is off, compiled code from source 2 cannot leverage SIMD. However, it is possible to override the Makeconf flags by providing a user Makevars file (like ~/.R/Makevars on Linux), so that R CMD SHLIB takes these flags and auto-vectorizes compiled code from source 2.

BLAS and LAPACK are not part of the R project or the CRAN mirror. R simply takes them as they are, and does not even check whether what it loads is a valid BLAS! For example, on Linux if you alias your BLAS library to an arbitrary library foo.so, R will "stupidly" load foo.so on startup and cause you trouble! The loose relationship between R and BLAS makes it easy to link different versions of BLAS to R, so that benchmarking different BLAS libraries in R becomes straightforward (of course you have to restart R after you update the linkage). For Linux users with root privilege, switching between different BLAS libraries is recommended via sudo update-alternatives --config. If you don't have root privilege, this thread on StackOverflow will help you: Without root access, run R with tuned BLAS when it is linked with reference BLAS.

In case you don't know what BLAS is, here is a brief introduction. BLAS originally referred to a coding standard for vector-vector, matrix-vector and matrix-matrix computations in scientific computing. For example, it was recommended that a general matrix-matrix multiplication take the form C <- beta * C + alpha * op(A) %*% op(B), known as DGEMM. Note that this operation is more than just C <- A %*% B, and the motivation of this design was to maximize code reuse. For example, C <- C + A %*% B, C <- 2 * C + t(A) %*% B, etc. can all be computed with DGEMM. A model implementation written in FORTRAN 77 is provided with the standard as a reference, and this model library is commonly known as the reference BLAS library. Such a library is static; it is there to motivate people to tune its performance for specific machines. BLAS optimization is actually a very difficult job. At the end of optimization, everything changes except the user interface: everything inside a BLAS function is changed, except that you still call it in the same way. The various optimized BLAS libraries are known as tuned BLAS libraries, and include ATLAS, OpenBLAS and Intel MKL, for example. All tuned BLAS libraries exploit SIMD as part of their optimization. An optimized BLAS library is remarkably faster than the reference one, and the performance gap only gets wider on newer machines.
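To make the interface concrete, here is a hedged sketch of such a call from C through the CBLAS interface (cblas_dgemm from <cblas.h>, provided by tuned libraries like OpenBLAS, ATLAS and MKL); it computes C <- 2 * C + t(A) %*% B in a single call:

/* Sketch only: assumes a BLAS exposing the CBLAS interface (e.g. link with
   -lopenblas).  DGEMM computes C <- alpha * op(A) %*% op(B) + beta * C, so
   C <- 2 * C + t(A) %*% B is one call with alpha = 1, beta = 2, op(A) = t(A). */
#include <cblas.h>

void update_C(int m, int n, int k,
              const double *A,   /* k x m, column-major; op(A) = t(A) is m x k */
              const double *B,   /* k x n, column-major */
              double *C)         /* m x n, column-major */
{
    cblas_dgemm(CblasColMajor, CblasTrans, CblasNoTrans,
                m, n, k,
                1.0, A, k,       /* alpha = 1, lda = k */
                     B, k,       /*            ldb = k */
                2.0, C, m);      /* beta  = 2, ldc = m */
}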

R relies on BLAS. For example, the matrix-matrix multiply operator "%*%" in R calls DGEMM. The functions crossprod and tcrossprod are also mapped to DGEMM. BLAS lies at the centre of scientific computation; without BLAS, R would largely be broken. This is why linking an optimized BLAS library to R is so strongly advocated. It used to be difficult to check which BLAS library is linked to R (as this can be obscured by an alias), but since R 3.4.0 this is no longer the case: sessionInfo() shows the full paths to the library or executable files providing the BLAS / LAPACK implementations currently in use (not available on Windows).

LAPACK is a more advanced scientific library built on top of BLAS. R relies on LAPACK for various matrix factorizations. For example, qr(, pivot = TRUE), chol, svd and eigen in R are mapped to LAPACK for QR factorization, Cholesky factorization, singular value decomposition and eigendecomposition. Note that all tuned BLAS libraries include a clone of LAPACK, so if R is linked to a tuned BLAS library, sessionInfo() will show both libraries coming from the same path; if R is instead linked to the reference BLAS library, sessionInfo() will show two different paths for BLAS and LAPACK. There have been plenty of questions tagged r regarding the drastic performance difference of matrix multiplication across platforms, like Large performance differences between OS for matrix computation. In fact, if you just look at the output of sessionInfo(), you get an immediate clue that R is linked to a tuned BLAS on the first platform and the reference BLAS on the second.


Performance: is vector code always faster than scalar code?

Vector code looks fast, but it may not be realistically faster than scalar code. Here is a case study: Why is this SIMD multiplication not faster than non-SIMD multiplication?. And what a coincidence: the vector operation examined there is exactly the one the OP took as an example here, the Hadamard (element-wise) product. People often forget that the processing speed of the CPU is not the deciding factor for practical performance. If data cannot be transported from memory to the CPU as fast as the CPU requests it, the CPU will just sit there waiting most of the time. The Hadamard product example falls into exactly this situation: for every multiplication, three values must be moved between memory and the CPU, so the Hadamard product is a memory-bound operation. The processing power of a CPU can only be realized when substantially more arithmetic is done than data movement. The classic matrix-matrix multiplication in BLAS belongs to this case, which explains why the SIMD implementation in a tuned BLAS library is so rewarding.

In light of this, I don't think you need to worry that much if you did not build your R software with compiler auto-vectorization turned on. It is hard to know whether R will really be faster.


Writing R extensions: write compiled code with OpenMP SIMD

If you decide to make contributions to CRAN by writing your own R packages, you can consider using SIMD option 5, OpenMP SIMD, if some section of your compiled code can benefit from it. The reason for not choosing option 4 is that when you write a distributable package, you have no idea which compiler an end user will use. So there is no way you can write compiler-specific code and get it published on CRAN.

As we pointed out earlier in the list of SIMD options, using OpenMP SIMD requires us to adapt the Makefile. In fact, R makes this very easy for you. You never need to write a Makefile alongside an R package; all you need is a Makevars file. When your package is compiled, the compiler flags specified in your package's Makevars and in the RHOME/etc/Makeconf on the user's machine are passed to R CMD SHLIB. Although you don't know which compiler the user will be using, RHOME/etc/Makeconf knows! All you need to do is specify in your package's Makevars that you want OpenMP support (a minimal sketch is given at the end of this section).

The only thing you can't do in your package's Makevars is give hints for machine-specific tuning. You may instead advise your package's users to do the following:

  • If the RHOME/etc/Makeconf on the user's machine already has such a tuning configuration (that is, the user configured these flags when building R), your compiled code should be transformed into tuned vector assembly code and there is nothing further to do;
  • Otherwise, you need to advise users to edit their personal Makevars file (like ~/.R/Makevars on Linux). You should produce a table (maybe in your package vignette or documentation) of which tuning flags should be set for which compilers, say -xHost for ICC and -march=native for GCC.
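Putting this together, here is a minimal sketch of what option 5 might look like in an R package's src/ directory. The function name and interface are made up for illustration; in src/Makevars one would typically request OpenMP with PKG_CFLAGS = $(SHLIB_OPENMP_CFLAGS) and PKG_LIBS = $(SHLIB_OPENMP_CFLAGS), so that R CMD SHLIB adds whatever OpenMP flag the user's compiler needs:

/* Sketch of a .Call-able C function using option 5 (#pragma omp simd).
   C_hadamard is a made-up name; x and y are assumed to be double vectors of
   equal length.  Requires a compiler with OpenMP 4.0+ support. */
#include <R.h>
#include <Rinternals.h>

SEXP C_hadamard(SEXP x, SEXP y)
{
    R_xlen_t n = XLENGTH(x);
    SEXP z = PROTECT(allocVector(REALSXP, n));
    const double *px = REAL(x), *py = REAL(y);
    double *pz = REAL(z);

    /* Hint the compiler to vectorize this loop with SIMD instructions. */
    #pragma omp simd
    for (R_xlen_t i = 0; i < n; i++)
        pz[i] = px[i] * py[i];

    UNPROTECT(1);
    return z;
}

From the R side, after the usual native routine registration, such a function could be invoked with .Call(), which would compute the element-wise product from the original question.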
answered Oct 19 '22 by Zheyuan Li


Well, there is a little-known R distribution from Microsoft (the artist formerly known as Revolution R), which can be found here.

It comes with the Intel MKL library, which utilizes multiple threads as well as vector operations (you have to be running an Intel CPU, though), and it really helps with matrix work, things like SVD, etc.

Unless you're willing to write C/C++ code with SIMD intrinsics using Rcpp or similar interfaces, Microsoft R is your best bet to utilize SIMD

Download and try it

answered Oct 19 '22 by Severin Pappadeux