Why is R row extraction slower on large sparse Matrix than when dividing it into smaller pieces and then extracting?

I am working with a 19089 x 9432 sparse Matrix of class "dgCMatrix" (let's call it M), and I have to extract each row to perform some calculations on it. I do this with a loop, so at each iteration I do something like currentrow <- M[i,] and then apply the body of the loop to it. The calculations are very time consuming, so I want to optimize them as much as possible, and I realized that if I first divide my matrix into small pieces (M[1:100,], M[101:200,], etc.) and loop over each of those smaller matrices (calling currentrow <- current_smallM[i,] at each iteration), the loop is much faster.

Here is a code example that I ran to reproduce this:

library(Matrix)

N = 10000
M = 5000

# Creation of the large matrix (of class dgCMatrix)
largeMatrix <- Matrix(rnorm(N*M,mean=0,sd=1), byrow = TRUE, nrow = N, sparse = TRUE)


# The timing includes creating the smaller matrix, and then extracting each of its rows into a variable
start.time = Sys.time()
smallMatrix = largeMatrix[100:200,]
for (i in 1:100){
    test <- smallMatrix[i,]
}
end.time = Sys.time()
print(end.time - start.time) # 0.47 secs


# Same row extractions, but working directly on the large matrix
start.time = Sys.time()
for (i in 100:200){
    test <- largeMatrix[i,]
}
end.time = Sys.time()
print(end.time - start.time) # 18.44 secs

You can see that the time difference is really huge... so I am really wondering:

  • Why is this happening?
  • Is there a more efficient way to store my data than dividing my matrix into smaller pieces?

Interestingly, I tested the same code with a dense matrix object (using largeMatrix <- matrix(rnorm(N*M, mean=0, sd=1), N, M)), and the results were totally different: 0.06 secs for the divided matrix and 0.04 secs for the large matrix. So I am really wondering what is different with the sparse representation.
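For reference, here is a sketch of the dense-matrix version of the timing test I ran (same N, M and loops as above; the exact timings will of course vary by machine):

N = 10000
M = 5000

# Dense (base R) matrix instead of a sparse Matrix
largeDense <- matrix(rnorm(N*M, mean = 0, sd = 1), nrow = N, ncol = M)

start.time = Sys.time()
smallDense = largeDense[100:200,]
for (i in 1:100){
    test <- smallDense[i,]
}
end.time = Sys.time()
print(end.time - start.time) # ~0.06 secs on my machine

start.time = Sys.time()
for (i in 100:200){
    test <- largeDense[i,]
}
end.time = Sys.time()
print(end.time - start.time) # ~0.04 secs on my machine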

Note: I found a quite similar question here, but it was for a different language, and (I think) its solution does not apply here because it was due to an implicit type conversion, whereas here I am just extracting a row.

Thank you for your help!

Asked by Alicia Rakoto on Jul 12 '21


1 Answer

dgCMatrix is a compressed sparse column (CSC) format. It has a column pointer array (the p slot) with one entry per column plus one, and a row index array (the i slot) that identifies the row of each non-zero value, with one entry per non-zero value in the matrix.
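To make that concrete, here is a small sketch showing those slots on a toy dgCMatrix (the matrix itself is just an example made up for illustration):

library(Matrix)

# A 3 x 4 general sparse matrix, stored in compressed sparse column (CSC) form
m <- Matrix(c(0, 2, 0, 0,
              1, 0, 0, 3,
              0, 0, 0, 0), nrow = 3, byrow = TRUE, sparse = TRUE)

class(m) # "dgCMatrix"
m@p      # column pointers: 0 1 2 2 3 (one entry per column, plus one)
m@i      # 0-based row index of each non-zero value: 1 0 1
m@x      # the non-zero values, stored column by column: 1 2 3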

What this means is that every time you want to slice it by row, it needs to traverse the entire row index array and look for entries that fall within the rows you want. Taking smallMatrix = largeMatrix[100:200,] up front produces a smaller matrix whose index array is much smaller and can therefore be traversed much more quickly.
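You can see this directly on the objects from the question's code: the number of index entries that a row extraction has to scan through is far smaller for the slice (a sketch assuming largeMatrix and smallMatrix as defined above):

length(largeMatrix@i) # one entry per non-zero of the full 10000 x 5000 matrix
length(smallMatrix@i) # roughly 1/100th of that for the 101-row slice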

Your real problem is that you're trying to get rows out of a data structure that gives you very inefficient slicing for rows and very efficient slicing for columns.
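If you really do need row-wise access, one option worth sketching (assuming the largeMatrix from the question) is to pay a one-off conversion cost and then extract columns, or to switch to a row-oriented sparse class:

# Transpose once, then take columns -- column extraction is cheap in CSC format
tM <- t(largeMatrix)
for (i in 100:200){
    test <- tM[,i]
}

# Or coerce to the row-compressed representation (dgRMatrix);
# note that the Matrix package provides fewer methods for this class
rM <- as(largeMatrix, "RsparseMatrix")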

Answered by CJR on Nov 15 '22