Based on what I've read before, vectorization is a form of parallelization known as SIMD. It allows processors to execute the same instruction (such as addition) on an array simultaneously.
However, I got confused when reading The Relationship between Vectorized and Devectorized Code regarding Julia's and R's vectorization performance. The post claims that devectorized Julia code (via loops) is faster than the vectorized code in both Julia and R, because:
This confuses some people who are not familiar with the internals of R. It is therefore worth noting how one improves the speed of R code. The process of performance improvement is quite simple: one starts with devectorized R code, then replaces it with vectorized R code and then finally implements this vectorized R code in devectorized C code. This last step is unfortunately invisible to many R users, who therefore think of vectorization per se as a mechanism for increasing performance. Vectorization per se does not help make code faster. What makes vectorization in R effective is that it provides a mechanism for moving computations into C, where a hidden layer of devectorization can do its magic.
It claims that R turns vectorized code, written in R, into devectorized code in C. If vectorization is faster (as a form of parallelization), why would R devectorize the code and why is that a plus?
"Vectorization" in R, is a vector processing in R's interpreter's view. Take the function cumsum
as an example. On entry, R interpreter sees that a vector x
is passed into this function. However, the work is then passed to C language that R interpreter can not analyze / track. While C is doing work, R is just waiting. By the time that R's interpreter comes back to work, a vector has been processed. So in R's view, it has issued a single instruction but processed a vector. This is an analogy to the concept of SIMD - "single instruction, multiple data".
Not just the cumsum
function that takes a vector and returns a vector is seen as "vectorization" in R, functions like sum
that takes a vector and returns a scalar is also a "vectorization".
Simply put: whenever R calls some compiled code for a loop, it is a "vectorization". If you wonder why this kind of "vectorization" is useful, it is because a loop written by a compiled language is faster than a loop written in an interpreted language. The C loop is translated to machine language that a CPU can understand. However, if a CPU wants to execute an R loop, it needs R's interpreter's help to read it, iteration by iteration. This is like, if you know Chinese (the hardest human language), you can respond to someone speaking Chinese to you faster; otherwise, you need a translator to first translator Chinese to you sentence after sentence in English, then you respond in English, and the translator make it back to Chinese sentence by sentence. The effectiveness of communication is largely reduced.
x <- runif(1e+7)
## R loop
system.time({
sumx <- 0
for (x0 in x) sumx <- sumx + x0
sumx
})
# user system elapsed
# 1.388 0.000 1.347
## C loop
system.time(sum(x))
# user system elapsed
# 0.032 0.000 0.030
Be aware that "vectorization" in R is just an analogy to SIMD but not a real one. A real SIMD uses CPU's vector registers for computations hence is a true parallel computing via data parallelism. R is not a language where you can program CPU registers; you have to write compiled code or assembly code for that purpose.
R's "vectorization" does not care how a loop written in a compiled language is really executed; after all that is beyond R's interpreter's knowledge. Regarding whether these compiled code will be executed with SIMD, read Does R leverage SIMD when doing vectorized calculations?
More on "vectorization" in R
I am not a Julia user, but Bogumił Kamiński has demonstrated an impressive feature of that language: loop fusion. Julia can do this, because, as he points out, "vectorization in Julia is implemented in Julia", not outside the language.
This reveals a downside of R's vectorization: speed often comes at a price of memory usage. I am not saying that Julia won't have this problem (as I don't use it, I don't know), but this is definitely true for R.
Here is an example: Fastest way to compute row-wise dot products between two skinny tall matrices in R. rowSums(A * B)
is a "vectorization" in R, as both "*"
and rowSums
are coded in C language as a loop. However, R can not fuse them into a single C loop to avoid generating the temporary matrix C = A * B
into RAM.
Another example is R's recycling rule or any computations relying on such rule. For example, when you add a scalar a
to a matrix A
by A + a
, what really happens is that a
is first replicated to be a matrix B
that has the same dimension with A
, i.e., B <- matrix(a, nrow(A), ncol(A))
, then an addition between two matrices are calculated: A + B
. Clearly the generation of the temporary matrix B
is undesired, but sorry, you can't do it better unless you write your own C function for A + a
and call it in R. This is described as "such a fusion is possible only if explicitly implemented" in Bogumił Kamiński's answer.
To deal with the memory effects of many temporary results, R has a sophisticated mechanism called "garbage collection". It helps, but memory can still explode if you generate some really big temporary result somewhere in your code. A good example is the function outer
. I have written many answers using this function, but it is particularly memory-unfriendly.
I might have been off-topic in this edit, as I begin to discuss the side effect of "vectorization". Use it with care.
c(crossprod(x, y))
is better than sum(x * y)
.I think it is worth to note that the post you are referring to does not cover all current functionality of vectorization in Julia.
The important thing is that vectorization in Julia is implemented in Julia, as opposed to R, where it is implemented outside of the language. This is explained in detail in this post: https://julialang.org/blog/2017/01/moredots.
The consequence of the fact that Julia can perform fusion of any sequence of broadcasted operations into a single loop. In other languages that provide vectorization such a fusion is possible only if explicitly implemented.
In summary:
EDIT:
Following the comment of 李哲源 here is an example showing that Julia is able to avoid any allocations if you want to increase all elements of a vector x
by 1
:
julia> using BenchmarkTools
julia> x = rand(10^6);
julia> @benchmark ($x .+= 1)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 819.230 μs (0.00% GC)
median time: 890.610 μs (0.00% GC)
mean time: 929.659 μs (0.00% GC)
maximum time: 2.802 ms (0.00% GC)
--------------
samples: 5300
evals/sample: 1
In the code .+=
performs addition in place (adding $
in front of the expression is only needed for benchmarking, in normal code it would be x .+= 1
). And we see that no memory allocation was done.
If we compare this to a possible implementation in R:
> library(microbenchmark)
> x <- runif(10^6)
> microbenchmark(x <- x + 1)
Unit: milliseconds
expr min lq mean median uq max neval
x <- x + 1 2.205764 2.391911 3.999179 2.599051 5.061874 30.91569 100
we can see that it not only saves memory, but also leads to a faster execution of the code.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With