I have a 2D matrix mat with 500 rows × 335 columns, and a data.frame dat with 120425 rows. The data.frame dat has two columns, I and J, which are integers that index the row and column of mat. I would like to add the values from mat to the rows of dat.
Here is my conceptual fail:
> dat$matval <- mat[dat$I, dat$J]
Error: cannot allocate vector of length 1617278737
(I am using R 2.13.1 on Win32). Digging a bit deeper, I see that I'm misusing matrix indexing, as it appears that I'm only getting a sub-matrix of mat, and not a single-dimension array of values as I expected, i.e.:
> str(mat[dat$I[1:100], dat$J[1:100]])
 int [1:100, 1:100] 20 1 1 1 20 1 1 1 1 1 ...
I was expecting something like int [1:100] 20 1 1 1 20 1 1 1 1 1 ... instead. What is the correct way to index a 2D matrix using pairs of row, column indices to get the values?
Almost. Needs to be offered to "[" as a two column matrix:
dat$matval <- mat[ cbind(dat$I, dat$J) ] # should do it.
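To make the difference concrete, here is a small sketch on made-up data (the 3 × 4 matrix and the three index pairs are illustrative only, not the data from the question). Two separate index vectors select a sub-matrix, while a two-column matrix selects one element per row of indices:

```r
# Toy data, assumed for illustration only
mat <- matrix(1:12, nrow = 3)                 # 3 x 4, filled column-wise
dat <- data.frame(I = c(1L, 3L, 2L),          # row indices
                  J = c(2L, 4L, 1L))          # column indices

# Two index vectors: every combination of rows and columns (a 3 x 3 sub-matrix)
sub <- mat[dat$I, dat$J]

# One two-column matrix: one value per (row, column) pair
vals <- mat[cbind(dat$I, dat$J)]

dim(sub)    # 3 3
vals        # 4 12 2
```

With 120425 index pairs, the first form asks for a 120425 × 120425 result, which is why the allocation in the question fails.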
There is a caveat: although this kind of indexing also works on data frames, a data frame is first coerced to a matrix, and if any of its columns are non-numeric, the entire matrix becomes the "lowest common denominator" class (typically character).
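A minimal sketch of that coercion, using a hypothetical data frame df with one numeric and one character column (not part of the original question):

```r
# Hypothetical mixed-type data frame, for illustration only
df <- data.frame(a = 1:3, b = c("x", "y", "z"), stringsAsFactors = FALSE)

# Pick elements (1, 1) and (2, 1), both from the numeric column a
idx <- cbind(c(1L, 2L), c(1L, 1L))
res <- df[idx]

# The data frame was converted to a character matrix before indexing,
# so the numeric values come back as strings
is.character(res)   # TRUE
```

If numeric results are needed, one option is to index only the numeric columns, or to convert them to a matrix yourself first.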
Using a matrix to index as DWin suggests is of course much cleaner, but for some strange reason doing it manually using 1-D indices is actually slightly faster:
# Huge sample data
mat <- matrix(sin(1:1e7), ncol=1000)
dat <- data.frame(I=sample.int(nrow(mat), 1e7, replace=TRUE), J=sample.int(ncol(mat), 1e7, replace=TRUE))
system.time( x <- mat[cbind(dat$I, dat$J)] ) # 0.51 seconds
system.time( mat[dat$I + (dat$J-1L)*nrow(mat)] ) # 0.44 seconds
The dat$I + (dat$J-1L)*nrow(mat) part turns the 2-D indices into 1-D ones, relying on the fact that R stores matrices column by column. The 1L is the way to specify an integer instead of a double value. This avoids some coercions.
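As a quick sanity check, the following sketch (again on small made-up data) confirms that the 1-D formula picks the same elements as the two-column matrix index:

```r
# Toy data, assumed for illustration only
mat <- matrix(1:12, nrow = 3)                 # 3 x 4, stored column by column
dat <- data.frame(I = c(1L, 3L, 2L),
                  J = c(2L, 4L, 1L))

# Element (i, j) sits at position i + (j - 1) * nrow(mat) in column-major order
lin <- dat$I + (dat$J - 1L) * nrow(mat)

by_linear <- mat[lin]
by_matrix <- mat[cbind(dat$I, dat$J)]

identical(by_linear, by_matrix)   # TRUE
is.integer(lin)                   # TRUE, thanks to 1L
```

Note that the 1-D form does no bounds checking per dimension: an out-of-range column index can silently land on a different element or yield NA, so the matrix form is the safer default.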
...I also tried gsk3's apply-based solution. It's almost 500x slower though:
system.time( apply( dat, 1, function(x,mat) mat[ x[1], x[2] ], mat=mat ) ) # 212