Can I vectorize code when data is in a list?

I am in the process of optimizing my code, and I am running into some problems. I know that the greatest speed-ups in R come from vectorizing code instead of using loops. However, my data is in lists, and I am not sure whether I can vectorize my code or not. I have tried the apply functions (like lapply and vapply), but I read that these functions are mainly for writing cleaner code and actually use loops under the hood!

Here are the three biggest bottlenecks in my code, though I do not think anything can be done about the first one.

1) Reading data

I work with batches of 1000 matrices of dimensions 277x349. This is the biggest bottleneck in my script, but I alleviated the problem a little bit by using the doMC package to take advantage of multiple cores with the foreach function. This results in a list containing 1000 277x349 matrices.
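The reading loop looks roughly like this (the file names and the readMatrix function below are placeholders; the actual reading code is described in the EDIT):

library(doMC)
library(foreach)
registerDoMC(12)

# "files" and "readMatrix" are placeholders -- the real reader uses ncdf4 (see EDIT)
files <- sprintf("batch_%04d.nc", 1:1000)
l <- foreach(f = files) %dopar% {
    readMatrix(f)   # returns one 277x349 matrix
}
# foreach collects the results of the parallel loop in a list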

For the purposes of the question, say we have a list of 1000 matrices of dimensions 277 x 349:

# Fake data
l <- list()
for(i in 1:1000) {
  l[[i]] <- matrix(rnorm(277*349), nrow=277, ncol=349)
}

2) Bottleneck #1

I need to compare each matrix to a reference matrix (of the same dimensions). This means comparing the 1000 matrices in my list to my reference matrix to get a vector of 1000 distances. If I know that the matrices all have the same dimensions, can I vectorize this step?

Here is some code:

# The reference matrix
r <- matrix(rnorm(277*349), nrow=277, ncol=349)
# The number of non-NA values in the matrix. Do not need to worry about this...
K <- 277*349

# Make a function to calculate distances
distance <- function(xi, xj, K, na.rm=TRUE) {
  sqrt(sum((xi - xj)^2, na.rm=na.rm)/K)
}

# Get a vector containing all the distances
d <- vapply(l, distance, c(0), xj=r, K=K)

This step is tolerably fast using vapply, but it is still the third slowest part of the code.

3) Bottleneck #2

I now want to make a weighted average matrix of the J "closest" matrices to my reference matrix. (There is a sorting step, but assume that d[1] < d[2] < ... < d[1000] for simplicity.) I want to get the weighted average matrix for J = 1, 2, ..., 1000.

# Get the weighted matrix
weightedMatrix <- function(listOfData, distances, J) {
  # Inverse-squared-distance weights, normalized to sum to 1
  w <- distances[1:J]^(-2) / sum(distances[1:J]^(-2))

  # Get the weighted average matrix
  # *** I use a loop here ***
  x_bar <- matrix(0, nrow=nrow(listOfData[[1]]), ncol=ncol(listOfData[[1]]))
  for(i in 1:J) {
    x_bar <- x_bar + listOfData[[i]] * w[i]
  }

  return(x_bar)
}

# Oh no! Another loop...
res <- list()
for(i in seq_along(l)) {
  res[[i]] <- weightedMatrix(l, d, J=i)
}

I am a little stumped. I do not see an intuitive way to vectorize operations on a list of matrices.

The script that I am writing will be called fairly often, so even a little improvement can add up!


EDIT:

RE: 1) Reading data

I forgot to mention that my data is in a special format, so I have to use a special function to read it into R. The files are in netCDF4 format; I use the nc_open function from the ncdf4 package to access the files, and then the ncvar_get function to read the variable of interest. The nice thing is that the files can be opened without loading everything, and I then read only the data I need into memory with ncvar_get to operate on it in R.
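For illustration, the read step looks roughly like this (the file and variable names here are placeholders, not my actual ones):

library(ncdf4)

# Placeholder file and variable names
nc <- nc_open("some_file.nc")                 # open the netCDF file; data stays on disk
x  <- ncvar_get(nc, "variable_of_interest")   # read the variable of interest into memory
nc_close(nc)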

That being said, although I know the size of my matrices and how many of them I will have, I asked my question with a list of data because the foreach function that enables me to do the parallel computing returns the results of the parallelized loop in a list. I found that with foreach, the data reading step was about 3x faster.

I imagine that I can arrange the data as a 3D array afterwards, but maybe allocating the 3D array will take more time than it saves? I will have to try it tomorrow.
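Something along these lines is what I have in mind (an untested sketch):

# Stack the list of matrices into a 277 x 349 x 1000 array
a <- simplify2array(l)

# Distances to the reference matrix r in one shot, summing over each slice
dist3d <- sqrt(apply((a - as.vector(r))^2, 3, sum, na.rm=TRUE) / K)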


EDIT 2:

Here are some of the timings I took of my script.

Original Script:

[1] "Reading data to memory"
user  system elapsed 
176.063  44.070  26.611 

[1] "Calculating Distances"
user  system elapsed 
2.312   0.000   2.308 

[1] "Calculating the best 333 weighted matrices"
user  system elapsed 
63.697  28.495   9.092 

I have made the following improvements so far: (1) pre-allocated the list before reading the data, and (2) improved the weighted matrix calculations, as per Martin Morgan's suggestion.

[1] "Reading data to memory"
user  system elapsed 
192.448  38.578  27.872 

[1] "Calculating Distances"
user  system elapsed 
2.324   0.000   2.326 

[1] "Calculating all 1000 weighted matrices"
user  system elapsed 
1.376   0.000   1.374 

Some notes:

I use 12 cores in my foreach loop to read in the data (registerDoMC(12)). The whole script takes approximately 40s / 36s to run before / after the improvements.

The timing for my Bottleneck #2 has improved by quite a bit. Previously, I had been computing only the top third (i.e. 333) of the weighted matrices, but now the script can just calculate all the weighted matrices in a fraction of the original time.

Thanks for the help. I will try tweaking my code later to see if I can change my script to work with 3D arrays instead of lists. I am going to take some time now to verify the calculations, just to be sure they work!

asked Jun 12 '13 by ialm

1 Answer

My 'low-hanging fruit' (scan; pre-allocate and fill) seem not to be relevant, so...

The operations in the distance calculation look vectorized enough to me. You could probably squeeze some extra speed out of doing a single distance calculation over all your matrices, but that would probably make the code less understandable.
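For instance, something along these lines (just a sketch, not benchmarked on your data):

## Flatten each matrix into a column of one big matrix, then one pass over the columns
M <- vapply(l, as.vector, numeric(length(r)))      # (277*349) x 1000
d1 <- sqrt(colSums((M - as.vector(r))^2, na.rm=TRUE) / K)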

The weightedMatrix calculation looks like it has room for improvement. Let's calculate

w <- d^(-2) / cumsum(d^(-2))

For a weighted matrix m I think the relationship between successive matrices is just m' = m * (1 - w[i]) + l[[i]] * w[i] (the previous average is rescaled by the ratio of the old to the new cumulative weight, which is 1 - w[i], and the new matrix enters with weight w[i]), so

res <- vector("list", length(l))
for (i in seq_along(l)) {
    if (i == 1L) {
        res[[i]] = l[[i]] * w[i]
    } else {
        res[[i]] = res[[i - 1]] * (1 - w[i]) + l[[i]] * w[i]
    }
}

This changes the calculation of res from quadratic to linear. My thoughts about better than linear performance were just a (probably also misguided) hunch; I haven't pursued that.
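As a quick sanity check (on small made-up data, not yours) that the incremental update matches the direct weighted average:

set.seed(1)
ll <- replicate(5, matrix(rnorm(4), 2, 2), simplify=FALSE)
dd <- sort(runif(5))
ww <- dd^(-2) / cumsum(dd^(-2))

inc <- vector("list", length(ll))
for (i in seq_along(ll)) {
    if (i == 1L) {
        inc[[i]] = ll[[i]] * ww[i]
    } else {
        inc[[i]] = inc[[i - 1]] * (1 - ww[i]) + ll[[i]] * ww[i]
    }
}

## direct computation of the J-term weighted average
direct <- function(J) {
    w <- dd[1:J]^(-2) / sum(dd[1:J]^(-2))
    Reduce(`+`, Map(`*`, ll[1:J], w))
}

all.equal(inc[[5]], direct(5))   ## TRUE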

Returning to pre-allocate and fill and @flodel's comment, we have

f0 <- function(n) {
    ## good: pre-allocate and fill
    l = vector("list", n)
    for (i in seq_along(l))
        l[[i]] = 1
    l
}

f1 <- function(n) {
    ## bad: copy and append
    l = list()
    for (i in seq_len(n))
        l[[i]] = 1
    l
}

which produce the same result

> identical(f0(100), f1(100))
[1] TRUE

but with different performance

> sapply(10^(1:5), function(i) system.time(f0(i))[3])
elapsed elapsed elapsed elapsed elapsed 
  0.000   0.000   0.002   0.014   0.134 
> sapply(10^(1:5), function(i) system.time(f1(i))[3])
elapsed elapsed elapsed elapsed elapsed 
  0.000   0.001   0.005   0.253  24.520 

Even though it does not matter at the scale of the current problem, it seems like one should adopt the better pre-allocate-and-fill strategy to avoid having to guess whether it's relevant or not. Better still, use the *apply or, in this case, replicate family to avoid having to think about it

l <- replicate(1000, matrix(rnorm(277*349), nrow=277, ncol=349), simplify=FALSE)
answered by Martin Morgan