I am working with a 19089 x 9432 sparse Matrix of class "dgCMatrix" (let's call it M), and I have to extract each row to perform some calculations on it. I do this with a loop, so at each iteration I do something like currentrow <- M[i,] and then apply the body of the loop to it. The calculations are very time consuming, so I want to optimize as much as possible, and I realized that if I first divide my matrix into small pieces (M[1:100,], M[101:200,], etc.) and then loop over each of those smaller matrices (calling currentrow <- current_smallM[i,] at each iteration), the loop is much faster.
Here is a code example that I ran to reproduce this:
library(Matrix)
N = 10000
M = 5000
# Creation of the large matrix (of class dgCMatrix)
largeMatrix <- Matrix(rnorm(N*M,mean=0,sd=1), byrow = TRUE, nrow = N, sparse = TRUE)
# We include the time needed to create the smaller matrix, then measure the time to assign each of its rows to a variable
start.time = Sys.time()
smallMatrix = largeMatrix[100:200,]
for (i in 1:100){
  test <- smallMatrix[i,]
}
end.time = Sys.time()
print(end.time - start.time) # 0.47 secs
# Same row assignments, but working directly on the large matrix
start.time = Sys.time()
for (i in 100:200){
  test <- largeMatrix[i,]
}
end.time = Sys.time()
print(end.time - start.time) # 18.44 secs
You can see that the time difference is really huge... So I am really wondering: why is extracting a row so much slower on the large sparse matrix than on a smaller piece of it?
Interestingly, I tested the same code with a base matrix object (using largeMatrix <- matrix(rnorm(N*M, mean=0, sd=1), N, M)), and the results were totally different: 0.06 secs for the divided matrix and 0.04 secs for the large matrix, so I am really wondering what is different about the sparse representation.
Note: I found a quite similar question here, but it was for a different language, and (I think) the solution does not apply here because it was due to an implicit type conversion, whereas here I am just extracting a row.
Thank you for your help!
dgCMatrix is a compressed sparse column (CSC) format. It stores a column-pointer array (the @p slot, with ncol + 1 entries marking where each column starts), a row-index array (the @i slot, with one entry per non-zero value giving its row), and the non-zero values themselves (the @x slot).
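You can see those three vectors directly on a tiny example (just an illustration, using the slot names defined by the Matrix package):
library(Matrix)
# A tiny 3 x 4 matrix to inspect the CSC layout
m <- Matrix(c(0, 2, 0, 0,
              1, 0, 0, 3,
              0, 0, 0, 4), nrow = 3, byrow = TRUE, sparse = TRUE)
m@p  # 0 1 2 2 4 : where each column starts in @i / @x (length ncol + 1)
m@i  # 1 0 1 2   : 0-based row index of each non-zero value, stored column by column
m@x  # 1 2 3 4   : the non-zero values themselves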
What this means is that every time you slice it by row, it has to traverse the entire row-index array (@i) looking for entries that fall in the rows you asked for. When you first do smallMatrix = largeMatrix[100:200,], you pay that full scan once to build a much smaller matrix, whose @i array is much shorter and can therefore be traversed far more quickly in the per-row loop.
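For comparison, taking a column only has to look at the stretch of @i / @x between two consecutive @p pointers, so it stays fast even on the full matrix. A quick sketch you can run on the largeMatrix from your example (timings will of course vary by machine):
# Row access: each iteration scans the whole @i vector
system.time(for (i in 100:200) test <- largeMatrix[i, ])
# Column access: each iteration jumps straight to the column via @p
system.time(for (j in 100:200) test <- largeMatrix[, j])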
Your real problem is that you're trying to get rows out of a data structure that gives you very inefficient slicing for rows and very efficient slicing for columns.
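If you really do need row-wise access, one workaround (just a sketch, not the only option) is to pay the conversion cost once up front: either transpose the matrix and take columns of the transpose, or convert it to the row-compressed dgRMatrix class that the Matrix package also provides.
# Option 1: transpose once, then take columns (fast in a column-compressed format)
largeMatrixT <- t(largeMatrix)
for (i in 100:200) {
  currentrow <- largeMatrixT[, i]   # same values as largeMatrix[i, ]
}
# Option 2: convert once to a row-compressed (CSR) representation
largeMatrixR <- as(largeMatrix, "RsparseMatrix")   # class dgRMatrix
for (i in 100:200) {
  currentrow <- largeMatrixR[i, ]
}
Note that support for RsparseMatrix is more limited than for CsparseMatrix in the Matrix package, so depending on your version some operations may fall back to the column-compressed form; the transpose trick is usually the safer bet.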