 

How does Branch Prediction affect performance in R?

Some references:

This is a follow-up to Why is processing a sorted array faster than processing an unsorted array?

The only post in the r tag that I found somewhat related to branch prediction was Why sampling matrix row is very slow?

Explanation of the problem:

I was investigating whether processing a sorted array is faster than processing an unsorted one (the same problem tested in Java and C in the first link), to see whether branch prediction affects R in the same manner.

See the benchmark examples below:

set.seed(128)
# or use a smaller vector, e.g. 1e7 elements
myvec <- rnorm(1e8, 128, 128)  

myvecsorted <- sort(myvec)

mysumU = 0
mysumS = 0

SvU <- microbenchmark::microbenchmark(
  Unsorted = for (i in 1:length(myvec)) {
    
    if (myvec[i] > 128) {
      mysumU = mysumU + myvec[i]
    }
    
  } ,
  Sorted = for (i in 1:length(myvecsorted)) {
    
    if (myvecsorted[i] > 128) {
      mysumS = mysumS + myvecsorted[i]
    }
    
  } ,
  times = 10)

ggplot2::autoplot(SvU)

Question:

  • First, why is the "Sorted" vector not the fastest every time, and why is the speed-up not of the same magnitude as in the Java example?
  • Second, why does the sorted execution time show more variation than the unsorted one?

N.B. My CPU is an i7-6820HQ @ 2.70GHz Skylake, quad-core with hyperthreading.

Update:

To investigate the variation, I ran the microbenchmark with a vector of 100 million elements (n=1e8) and repeated the benchmark 100 times (times=100). Here's the associated plot for that benchmark.
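For reference, the call for that run is the same benchmark as above with only the repetition count changed (SvU100 is just an illustrative name for the result object):

# Same loops as above, repeated 100 times; myvec, myvecsorted and the
# running sums are the objects created in the earlier snippet.
SvU100 <- microbenchmark::microbenchmark(
  Unsorted = for (i in 1:length(myvec))
    if (myvec[i] > 128) mysumU <- mysumU + myvec[i],
  Sorted = for (i in 1:length(myvecsorted))
    if (myvecsorted[i] > 128) mysumS <- mysumS + myvecsorted[i],
  times = 100)

ggplot2::autoplot(SvU100)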

Here's my sessionInfo():

R version 3.6.1 (2019-07-05)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 16299)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    

attached base packages:
[1] compiler  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] rstudioapi_0.10      reprex_0.3.0         cli_1.1.0            pkgconfig_2.0.3      evaluate_0.14        rlang_0.4.0         
[7] Rcpp_1.0.2           microbenchmark_1.4-7 ggplot2_3.2.1 
asked Oct 15 '19 by M--



1 Answer

Interpreter overhead, and just being an interpreter, explains most of the average difference. I don't have an explanation for the higher variance.


R is an interpreted language, not JIT-compiled to machine code like Java, or compiled ahead of time like C. (I don't know much about R internals, just CPUs and performance, so I'm making a lot of assumptions here.)

The code that's running on the actual CPU hardware is the R interpreter, not exactly your R program.

Control dependencies in the R program (like an if()) become data dependencies in the interpreter. The current thing being executed is just data for the interpreter running on a real CPU.

Different operations in the R program become control dependencies in the interpreter. For example, evaluating myvec[i] and then the + operator would probably be done by two different functions in the interpreter, with separate functions again for > and for if() statements.

The classic interpreter loop is based around an indirect branch that dispatches through a table of function pointers. Instead of a taken/not-taken choice, the CPU needs a prediction for one of many recently-used target addresses. I don't know whether R uses a single indirect branch like that, or whether it tries to be fancier, like having the end of each interpreter block dispatch to the next block instead of returning to a main dispatch loop.
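To make the dispatch idea concrete, here is a toy sketch written in R itself. To be clear, this is not how GNU R's evaluator actually works (that's C code); the handler table, run_program, and the op names below are invented purely to illustrate one dispatch site jumping to many targets based on data:

# Toy sketch only: a tiny "interpreter" that dispatches each operation
# through a table of handler functions.
handlers <- list(
  load = function(state, arg) { state$acc  <- state$vec[[arg]]; state },
  gt   = function(state, arg) { state$flag <- state$acc > arg;  state },
  add  = function(state, arg) { if (state$flag) state$sum <- state$sum + state$acc; state }
)

run_program <- function(program, state) {
  for (ins in program) {
    # Single dispatch site, many possible targets: which handler runs
    # next is decided by data (ins$op), not by a fixed taken/not-taken
    # branch, which is what makes interpreter dispatch hard to predict.
    state <- handlers[[ins$op]](state, ins$arg)
  }
  state
}

# Roughly one iteration of the benchmark's loop body: load myvec[i],
# compare it with 128, conditionally add it to the running sum.
program <- list(list(op = "load", arg = 1),
                list(op = "gt",   arg = 128),
                list(op = "add",  arg = NA))

state <- list(vec = c(200, 50, 300), acc = 0, flag = FALSE, sum = 0)
run_program(program, state)$sum   # 200: the first element exceeds 128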

Modern Intel CPUs (like Haswell and later) have IT-TAGE (Indirect TAgged GEometric history length) prediction. The taken/not-taken history of previous branches along the path of execution is used as an index into a table of predictions. This mostly solves the interpreter branch-prediction problem, allowing it to do a surprisingly good job, especially when the interpreted code (the R code in your case) does the same thing repeatedly.

  • Branch Prediction and the Performance of Interpreters - Don't Trust Folklore (2015) - Haswell's ITTAGE is a huge improvement for interpreters, invalidating the previous wisdom that a single indirect branch for interpreter dispatch was a disaster. I don't know what R actually uses; there are dispatch tricks that used to be useful on older predictors.
  • X86 prefetching optimizations: "computed goto" threaded code (has more links).
  • https://comparch.net/2013/06/30/why-tage-is-the-best/
  • https://danluu.com/branch-prediction/ has some links about this at the bottom. It also mentions that AMD has used Perceptron (neural-net-like) predictors in the Bulldozer family and Zen.

Whether the if() is taken does determine which operations run next, so some branching inside the R interpreter is still more or less predictable depending on the data. But of course, as an interpreter, it does much more work at each step than a simple machine-code loop over an array.

So extra branch mispredicts are a much smaller fraction of the total time because of interpreter overhead.
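One rough way to see the size of that overhead from within R is to compare the explicit loop against a vectorized equivalent, where the per-element compare-and-add happens inside compiled code rather than in the interpreter's dispatch loop. This is only a sketch (the expression names are just labels), reusing myvec and myvecsorted from the question:

# Sketch: loop vs. vectorized versions of the same reduction.
# The vectorized forms do the comparison and the sum in compiled code,
# so the R-level per-element dispatch (and its branching) disappears.
microbenchmark::microbenchmark(
  loop_unsorted = {
    s <- 0
    for (i in 1:length(myvec))
      if (myvec[i] > 128) s <- s + myvec[i]
  },
  vectorized_unsorted = sum(myvec[myvec > 128]),
  vectorized_sorted   = sum(myvecsorted[myvecsorted > 128]),
  times = 10)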


Of course, both your tests are with the same interpreter on the same hardware. I don't know what kind of CPU you have.

If it's Intel older than Haswell or AMD older than Zen, you might be getting a lot of mispredicts even with the sorted array, unless the pattern is simple enough for an indirect branch history predictor to lock onto. That would bury the difference in more noise.

Since you do see a pretty clear difference, I'm guessing that the CPU doesn't mispredict too much in the sorted case, so there is room for it to get worse in the unsorted case.

answered Sep 28 '22 by Peter Cordes