Aggregate rows in a large matrix by rowname

Tags: r, aggregate

I would like to aggregate the rows of a matrix by adding the values in rows that have the same rowname. My current approach is as follows:

> M
  a b c d
1 1 1 2 0
1 2 3 4 2
2 3 0 1 2
3 4 2 5 2
> index <- as.numeric(rownames(M))
> M <- cbind(M,index)
> Dfmat <- data.frame(M)
> Dfmat <- aggregate(. ~ index, data = Dfmat, sum)
> M <- as.matrix(Dfmat)
> rownames(M) <- M[,"index"]
> M <- subset(M, select= -index)
> M
   a b c d
 1 3 4 6 2
 2 3 0 1 2
 3 4 2 5 2

The problem with this approach is that I need to apply it to a number of very large matrices (up to 1,000 rows and 30,000 columns). In these cases the computation time is very high (the same problem occurs when using ddply). Is there a more efficient way to arrive at the solution? Does it help that the original input matrices are DocumentTermMatrix objects from the tm package? As far as I know, they are stored in a sparse matrix format.

Asked by Christian, Nov 15 '11 16:11



2 Answers

Here's a solution using by and colSums. It requires some fiddling because of the default output format of by.

M <- matrix(1:9,3)
rownames(M) <- c(1,1,2)
t(sapply(by(M,rownames(M),colSums),identity))
  V1 V2 V3
1  3  9 15
2  3  6  9
Answered by James, Oct 22 '22 13:10


There is now an aggregate function, aggregate.Matrix, in the Matrix.utils package. It does what you want in a single line of code and, by median time in the benchmark below, is roughly 15x faster than the combineByRow solution and more than 100x faster than the by solution:

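library(Matrix.utils)    # provides aggregate.Matrix
library(microbenchmark)  # provides microbenchmark
# combineByRow is defined in another answer to this question (not shown here)

N <- 10000

m <- matrix(runif(N*100), nrow=N)
rownames(m) <- sample(1:(N/2), N, replace=TRUE)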

> microbenchmark(a<-t(sapply(by(m,rownames(m),colSums),identity)),b<-combineByRow(m),c<-aggregate.Matrix(m,row.names(m)),times = 10)
Unit: milliseconds
                                                  expr        min         lq       mean     median         uq        max neval
 a <- t(sapply(by(m, rownames(m), colSums), identity)) 6000.26552 6173.70391 6660.19820 6419.07778 7093.25002 7723.61642    10
                                  b <- combineByRow(m)  634.96542  689.54724  759.87833  732.37424  866.22673  923.15491    10
                c <- aggregate.Matrix(m, row.names(m))   42.26674   44.60195   53.62292   48.59943   67.40071   70.40842    10

> identical(as.vector(a),as.vector(c))
[1] TRUE
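
If the data are already sparse, as the question suggests for DocumentTermMatrix objects, the same grouped sum can be sketched with the Matrix package alone as a product with a group-indicator matrix. This is only a minimal sketch on a small stand-in matrix: the names M, groups, G and agg are illustrative, and converting a DocumentTermMatrix to a Matrix-package sparse matrix is assumed to have happened already and is not shown.

```r
library(Matrix)

# Small sparse stand-in for a document-term matrix (illustrative data,
# same values as the example matrix in the question).
M <- Matrix(c(1, 1, 2, 0,
              2, 3, 4, 2,
              3, 0, 1, 2,
              4, 2, 5, 2),
            nrow = 4, byrow = TRUE, sparse = TRUE,
            dimnames = list(c("1", "1", "2", "3"), c("a", "b", "c", "d")))

groups <- factor(rownames(M))

# Indicator matrix G: one row per group, one column per original row,
# with G[g, r] = 1 when row r of M belongs to group g.
G <- sparseMatrix(i = as.integer(groups),
                  j = seq_along(groups),
                  x = 1,
                  dims = c(nlevels(groups), length(groups)),
                  dimnames = list(levels(groups), NULL))

# The product sums the rows of M within each group; the result stays sparse.
agg <- G %*% M
agg
```

Because both factors are sparse, the zero entries are never materialised, which is the main reason to prefer this over converting to a data.frame first.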

EDIT: Frank is right, rowsum is somewhat faster than any of these solutions. You would want to consider one of the other functions only if you were using a Matrix, especially a sparse one, or if you were performing an aggregation other than sum.
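
For reference, a minimal sketch of the base R rowsum call applied to the example matrix from the question. The matrix is rebuilt by hand here, and the name res is only illustrative.

```r
# Example matrix from the question, with duplicated row name "1".
M <- matrix(c(1, 1, 2, 0,
              2, 3, 4, 2,
              3, 0, 1, 2,
              4, 2, 5, 2),
            nrow = 4, byrow = TRUE,
            dimnames = list(c("1", "1", "2", "3"), c("a", "b", "c", "d")))

# rowsum adds up the rows that share the same value of `group`.
res <- rowsum(M, group = rownames(M))
res
#   a b c d
# 1 3 4 6 2
# 2 3 0 1 2
# 3 4 2 5 2
```

No conversion to a data.frame is needed, and the result is again a matrix with one row per distinct row name.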

Answered by Craig, Oct 22 '22 11:10