Summarizing multiple columns with data.table

Tags:

r

data.table

I'm trying to use data.table to speed up processing of a large data.frame (300k x 60) made of several smaller merged data.frames. I'm new to data.table. The code so far is as follows:

    library(data.table)
    a = data.table(index=1:5,a=rnorm(5,10),b=rnorm(5,10),z=rnorm(5,10))
    b = data.table(index=6:10,a=rnorm(5,10),b=rnorm(5,10),c=rnorm(5,10),d=rnorm(5,10))
    dt = merge(a,b,by=intersect(names(a),names(b)),all=T)
    dt$category = sample(letters[1:3],10,replace=T)

I wondered whether there is a more efficient way than the following to summarize the data:

    summ = dt[i=T,
              j=list(a=sum(a,na.rm=T), b=sum(b,na.rm=T), c=sum(c,na.rm=T),
                     d=sum(d,na.rm=T), z=sum(z,na.rm=T)),
              by=category]

I don't really want to type all 50 column calculations by hand, and an eval(paste(...)) seems clunky somehow.

I had a look at the example below, but it seems a bit complicated for my needs. Thanks.

how to summarize a data.table across multiple columns

Tahnoon Pasha asked May 13 '13 01:05




1 Answer

You can use a simple lapply statement with .SD:

    dt[, lapply(.SD, sum, na.rm=TRUE), by=category ]

       category index        a        b        z         c        d
    1:        c    19 51.13289 48.49994 42.50884  9.535588 11.53253
    2:        b     9 17.34860 20.35022 10.32514 11.764105 10.53127
    3:        a    27 25.91616 31.12624  0.00000 29.197343 31.71285

If you only want to summarize over certain columns, you can add the .SDcols argument:

    #  note that .SDcols also allows reordering of the columns
    dt[, lapply(.SD, sum, na.rm=TRUE), by=category, .SDcols=c("a", "c", "z") ]

       category        a         c        z
    1:        c 51.13289  9.535588 42.50884
    2:        b 17.34860 11.764105 10.32514
    3:        a 25.91616 29.197343  0.00000
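With 50 or so columns, the .SDcols vector can itself be built programmatically rather than typed out. A minimal sketch, assuming the columns to summarize are all the numeric ones except index; the names num_cols and summ and the seed are illustrative choices, not part of the original question:

```r
library(data.table)

set.seed(1)  # reproducible toy data, mirroring the question's setup
a  <- data.table(index = 1:5,  a = rnorm(5, 10), b = rnorm(5, 10), z = rnorm(5, 10))
b  <- data.table(index = 6:10, a = rnorm(5, 10), b = rnorm(5, 10),
                 c = rnorm(5, 10), d = rnorm(5, 10))
dt <- merge(a, b, by = intersect(names(a), names(b)), all = TRUE)
dt[, category := sample(letters[1:3], 10, replace = TRUE)]

# Assumption: summarize every numeric column except the id column `index`
num_cols <- setdiff(names(dt)[sapply(dt, is.numeric)], "index")

# Sum the selected columns within each category
summ <- dt[, lapply(.SD, sum, na.rm = TRUE), by = category, .SDcols = num_cols]
print(summ)
```

Building the name vector with base R keeps the call independent of which selection helpers a given data.table version accepts in .SDcols.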

This, of course, is not limited to sum; you can use any function with lapply, including anonymous functions (i.e., it's a regular lapply statement).
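As a sketch of the anonymous-function case, the call below computes a per-category mean that ignores missing values. The toy table and the names dt_demo and means are assumptions made purely for illustration:

```r
library(data.table)

# Small hand-made table so the result is easy to check by eye
dt_demo <- data.table(
  category = c("x", "x", "y", "y"),
  a        = c(1, 3, 10, NA),
  b        = c(2, NA, 4, 6)
)

# Anonymous function inside lapply: mean of the non-missing values per group
means <- dt_demo[, lapply(.SD, function(col) mean(col, na.rm = TRUE)), by = category]
print(means)
```

Here category x yields a = 2 and b = 2, and category y yields a = 10 and b = 5.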

Lastly, there is no need to use i=T and j= <..>. Personally, I think that makes the code less readable, but it is just a style preference.


Documentation

See ?.SD, ?data.table and its .SDcols argument, and the vignette Using .SD for Data Analysis.

Also have a look at data.table FAQ 2.1.

Ricardo Saporta answered Sep 21 '22 12:09
