I am in the process of optimizing my code, and I am running into some problems. I know that the greatest speed-ups in R come from vectorizing code instead of using loops. However, I have my data in lists, and I am not sure if I can vectorize my code or not. I have tried using the apply functions (like lapply and vapply), but I read that these functions are just for writing cleaner code and are actually using loops under the hood!
Here are the three biggest bottlenecks in my code, though I do not think anything can be done about the first one.
1) Reading data
I work with batches of 1000 matrices of dimensions 277x349. This is the biggest bottleneck in my script, but I alleviated the problem a little bit by using the doMC package to take advantage of multiple cores with the foreach function. This results in a list containing 1000 277x349 matrices.
For the purposes of the question, say we have a list of 1000 matrices of dimensions 277 x 349
# Fake data
l <- list()
for(i in 1:1000) {
l[[i]] <- matrix(rnorm(277*349), nrow=277, ncol=349)
}
2) Bottleneck #1
I need to make comparisons to some reference matrix (of same dimensions). This leads to comparing the 1000 matrices in my list to my reference matrix to get a vector of 1000 distances. If I know that the matrices are of the same dimensions, can I vectorize this step?
Here is some code:
# The reference matrix
r <- matrix(rnorm(277*349), nrow=277, ncol=349)
# The number of non NA values in matrix. Do not need to worry about this...
K <- 277*349
# Make a function to calculate distances
distance <- function(xi, xj, K, na.rm=TRUE) {
sqrt(sum((xi - xj)^2, na.rm=na.rm)/K)
}
# Get a vector containing all the distances
d <- vapply(l, distance, c(0), xj=r, K=K)
This step is bearably fast using vapply, but it is the third slowest part of the code.
3) Bottleneck #2
I now want to make a weighted average matrix of the J "closest" matrices to my reference matrix. (There is a sorting step, but assume that d[1] < d[2] < ... < d[1000] for simplicity.) I want to get the weighted average matrix for J = 1, 2, ..., 1000.
# Get the weighted matrix
weightedMatrix <- function(listOfData, distances, J) {
  # Calculate weights from the J smallest distances:
  w <- distances[1:J]^(-2) / sum(distances[1:J]^(-2))
  # Get the weighted average matrix
  # *** I use a loop here ***
  x_bar <- matrix(0, nrow=nrow(listOfData[[1]]), ncol=ncol(listOfData[[1]]))
  for(i in 1:J) {
    x_bar <- x_bar + listOfData[[i]] * w[i]
  }
  return(x_bar)
}
# Oh no! Another loop...
res <- list()
for(i in 1:length(l)) {
  res[[i]] <- weightedMatrix(l, d, J=i)
}
I am a little stumped. I do not see an intuitive way to vectorize operations on a list of matrices.
The script that I am writing will be called fairly often, so even a little improvement can add up!
EDIT:
RE: 1) Reading data
I forgot to mention that my data is in a special format, so I have to use a special data-reading function to read the data into R. The files are in netCDF4 format; I use the nc_open function from the ncdf4 package to access the files, and then the ncvar_get function to read the variable of interest. The nice thing is that the data stays on disk until I read it into memory with ncvar_get to do operations on it in R.
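For illustration, reading a single file looks roughly like this (the file name and variable name below are just placeholders, not my real ones):
library(ncdf4)
nc <- nc_open("batch_file_001.nc4")   # placeholder file name
m  <- ncvar_get(nc, "my_variable")    # placeholder variable name; gives one 277x349 matrix
nc_close(nc)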
That being said, although I know the size of my matrices and how many of them I will have, I asked my question with a list of data because the foreach function that enables me to do parallel computing outputs the results of the parallelized loop in a list. I found that with the foreach function, the data reading step was about 3x faster.
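Roughly, the parallel reading step looks like this (again with placeholder names; files is assumed to be the character vector of the 1000 file paths):
library(ncdf4)
library(foreach)
library(doMC)
registerDoMC(12)
# Each iteration reads one file and returns one matrix; foreach collects the results in a list
l <- foreach(f = files) %dopar% {
    nc <- nc_open(f)
    m  <- ncvar_get(nc, "my_variable")    # placeholder variable name
    nc_close(nc)
    m
}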
I imagine that I can arrange the data as a 3D array afterwards, but maybe allocating the 3D array would take more time than it saves? I will have to try it tomorrow.
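If I do try it, I think something like this (just a sketch) would collapse the list into a 277 x 349 x 1000 array:
a <- simplify2array(l)                            # 277 x 349 x 1000 array
# or, equivalently:
a <- array(unlist(l), dim = c(277, 349, 1000))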
EDIT 2:
Here are some of the timings I took of my script.
Original Script:
[1] "Reading data to memory"
user system elapsed
176.063 44.070 26.611
[1] "Calculating Distances"
user system elapsed
2.312 0.000 2.308
[1] "Calculating the best 333 weighted matrices"
user system elapsed
63.697 28.495 9.092
I made the following improvements thus far: (1) pre-allocated the list before reading the data, and (2) improved the weighted matrix calculation, as per Martin Morgan's suggestion.
[1] "Reading data to memory"
user system elapsed
192.448 38.578 27.872
[1] "Calculating Distances"
user system elapsed
2.324 0.000 2.326
[1] "Calculating all 1000 weighted matrices"
user system elapsed
1.376 0.000 1.374
Some notes:
I use 12 cores in my foreach loop to read in the data (registerDoMC(12)). The whole script takes approximately 40s / 36s to run before / after the improvements.
The timing for my Bottleneck #2 has improved by quite a bit. Previously, I had been computing only the top third (i.e. 333) of the weighted matrices, but now the script can just calculate all the weighted matrices in a fraction of the original time.
Thanks for the help, I will try tweaking my code later to see if I can change my script to work with 3D arrays instead of lists. I am going to take some time now to verify the calculations just to be sure they work!
My 'low hanging fruit' (scan; pre-allocate and fill) seem not to be relevant, so...
The operations in the distance calculation sort of look vectorized enough to me. Probably you can squeeze some extra speed out of doing a single distance calculation over all your matrices, but this probably makes the code less understandable.
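One possibility, if the matrices are first stacked into a 3-d array, is to compute all 1000 distances in a single call; this is only a sketch, not something I have profiled:
a  <- simplify2array(l)                                        # 277 x 349 x 1000
d1 <- sqrt(colSums((a - as.vector(r))^2, na.rm = TRUE, dims = 2) / K)
all.equal(d1, d)                                               # should agree with the vapply version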
The weightedMatrix calculation looks like there is room for improvement. Let's calculate
w <- d^(-2) / cumsum(d^(-2))
For a weighted matrix m, I think the relationship between successive matrices is just m' = m * (1 - w[i]) + l[[i]] * w[i], so
res <- vector("list", length(l))
for (i in seq_along(l))
    if (i == 1L) {
        res[[i]] = l[[i]] * w[[i]]
    } else {
        res[[i]] = res[[i - 1]] * (1 - w[[i]]) + l[[i]] * w[[i]]
    }
This changes the calculation of res from quadratic to linear. My thoughts about better-than-linear performance were just a (probably also misguided) hunch; I haven't pursued that.
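As a quick sanity check (assuming the weightedMatrix function and the distance vector d from the question), the incremental result should match the original definition for any J:
all.equal(res[[10]], weightedMatrix(l, d, J = 10))   # should be TRUE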
Returning to pre-allocate and fill and @flodel's comment, we have
f0 <- function(n) {
    ## good: pre-allocate and fill
    l = vector("list", n)
    for (i in seq_along(l))
        l[[i]] = 1
    l
}
f1 <- function(n) {
    ## bad: copy and append
    l = list()
    for (i in seq_len(n))
        l[[i]] = 1
    l
}
which produce the same result
> identical(f0(100), f1(100))
[1] TRUE
but with different performance
> sapply(10^(1:5), function(i) system.time(f0(i))[3])
elapsed elapsed elapsed elapsed elapsed
0.000 0.000 0.002 0.014 0.134
> sapply(10^(1:5), function(i) system.time(f1(i))[3])
elapsed elapsed elapsed elapsed elapsed
0.000 0.001 0.005 0.253 24.520
Even though this does not matter at the scale of the current problem, it seems like one should adopt the better pre-allocate-and-fill strategy to avoid having to guess whether it's relevant or not. Better, use the *apply family, or in this case replicate, to avoid having to think about it
l <- replicate(1000, matrix(rnorm(277*349), nrow=277, ncol=349), simplify=FALSE)