I would like to aggregate the rows of a matrix by adding the values in rows that have the same rowname. My current approach is as follows:
> M
a b c d
1 1 1 2 0
1 2 3 4 2
2 3 0 1 2
3 4 2 5 2
> index <- as.numeric(rownames(M))
> M <- cbind(M,index)
> Dfmat <- data.frame(M)
> Dfmat <- aggregate(. ~ index, data = Dfmat, sum)
> M <- as.matrix(Dfmat)
> rownames(M) <- M[,"index"]
> M <- subset(M, select= -index)
> M
a b c d
1 3 4 6 2
2 3 0 1 2
3 4 2 5 2
The problem of this appraoch is that i need to apply it to a number of very large matrices (up to 1.000 rows and 30.000 columns). In these cases the computation time is very high (Same problem when using ddply). Is there a more eficcient to come up with the solution? Does it help that the original input matrices are DocumentTermMatrix from the tm package? As far as I know they are stored in a sparse matrix format.
To aggregate matrix columns by row names, we can use colSums with sapply and transpose the output. For example, if we have a matrix called M then the aggregate matrix columns by row names can be done using t(sapply(by(M,rownames(M),colSums),identity)).
Naming Rows and Columns of a Matrix in R Programming – rownames() and colnames() Function. rownames() function in R Language is used to set the names to rows of a matrix.
A data frame's rows can be accessed using rownames() method in the R programming language. We can specify the new row names using a vector of numerical or strings and assign it back to the rownames() method. The data frame is then modified reflecting the new row names.
Method 1: using colnames() method colnames() method in R is used to rename and replace the column names of the data frame in R. The columns of the data frame can be renamed by specifying the new column names as a vector. The new name replaces the corresponding old name of the column in the data frame.
A basic aggregate function takes multiple input values and returns a single value. Aggregation results in a table with fewer rows (as many rows as aggregates)... ... and these rows do not represent the same objects as the original table, with the latter representing groups of rows based on a partitioning attribute.
Because the aggregate function returns a value for each aggregate, the resulting table will have as many rows as there are aggregates. The nature of the objects represented by a row prior to the aggregation (here: an apple) will not be the same for the row following aggregation (here: a set of apples).
Data Manipulation in R In R, you can use the aggregate function to compute summary statistics for subsets of the data. This function is very similar to the tapply function, but you can also input a formula or a time series object and in addition, the output is of class data.frame.
The nature of the objects represented by a row prior to the aggregation (here: an apple) will not be the same for the row following aggregation (here: a set of apples). We’ll be looking ahead to the day when you are confronted with a Big Data problem, involving a huge data set that needs to be processed.
Here's a solution using by
and colSums
, but requires some fiddling due to the default output of by
.
M <- matrix(1:9,3)
rownames(M) <- c(1,1,2)
t(sapply(by(M,rownames(M),colSums),identity))
V1 V2 V3
1 3 9 15
2 3 6 9
There is now an aggregate function in Matrix.utils
. This can accomplish what you want with a single line of code and is about 10x faster than the combineByRow
solution and 100x faster than the by
solution:
N <- 10000
m <- matrix( runif(N*100), nrow=N)
rownames(m) <- sample(1:(N/2),N,replace=T)
> microbenchmark(a<-t(sapply(by(m,rownames(m),colSums),identity)),b<-combineByRow(m),c<-aggregate.Matrix(m,row.names(m)),times = 10)
Unit: milliseconds
expr min lq mean median uq max neval
a <- t(sapply(by(m, rownames(m), colSums), identity)) 6000.26552 6173.70391 6660.19820 6419.07778 7093.25002 7723.61642 10
b <- combineByRow(m) 634.96542 689.54724 759.87833 732.37424 866.22673 923.15491 10
c <- aggregate.Matrix(m, row.names(m)) 42.26674 44.60195 53.62292 48.59943 67.40071 70.40842 10
> identical(as.vector(a),as.vector(c))
[1] TRUE
EDIT: Frank is right, rowsum is somewhat faster than any of these solutions. You would want to consider using another one of these other functions only if you were using a Matrix
, especially a sparse one, or if you were performing an aggregation besides sum
.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With