I'm trying to use data.table to speed up processing of a large data.frame (300k x 60) made of several smaller merged data.frames. I'm new to data.table. The code so far is as follows:
    library(data.table)
    a = data.table(index=1:5,a=rnorm(5,10),b=rnorm(5,10),z=rnorm(5,10))
    b = data.table(index=6:10,a=rnorm(5,10),b=rnorm(5,10),c=rnorm(5,10),d=rnorm(5,10))
    dt = merge(a,b,by=intersect(names(a),names(b)),all=T)
    dt$category = sample(letters[1:3],10,replace=T)
and I wondered if there was a more efficient way than the following to summarize the data.
    summ = dt[i=T,j=list(a=sum(a,na.rm=T),b=sum(b,na.rm=T),c=sum(c,na.rm=T),
                         d=sum(d,na.rm=T),z=sum(z,na.rm=T)),by=category]
I don't really want to type all 50 column calculations by hand, and an eval(paste(...)) seems clunky somehow.
I had a look at the example below, but it seems a bit complicated for my needs. Thanks.
how to summarize a data.table across multiple columns
You can use a simple lapply statement with .SD:
    dt[, lapply(.SD, sum, na.rm=TRUE), by=category ]

       category index        a        b        z         c        d
    1:        c    19 51.13289 48.49994 42.50884  9.535588 11.53253
    2:        b     9 17.34860 20.35022 10.32514 11.764105 10.53127
    3:        a    27 25.91616 31.12624  0.00000 29.197343 31.71285
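As a sanity check, the lapply(.SD, ...) form can be compared against the hand-typed call from the question. This is only a sketch: the seed and the names dt1, dt2, by_hand, via_sd and same are my own additions for illustration, and the two input tables are renamed so they do not share names with the columns a and b.

```r
library(data.table)

set.seed(42)  # assumption: fixed seed, only so the comparison is reproducible
dt1 <- data.table(index = 1:5, a = rnorm(5, 10), b = rnorm(5, 10), z = rnorm(5, 10))
dt2 <- data.table(index = 6:10, a = rnorm(5, 10), b = rnorm(5, 10),
                  c = rnorm(5, 10), d = rnorm(5, 10))
dt <- merge(dt1, dt2, by = intersect(names(dt1), names(dt2)), all = TRUE)
dt[, category := sample(letters[1:3], 10, replace = TRUE)]

# Hand-typed version, one sum() per column, as in the question
by_hand <- dt[, list(a = sum(a, na.rm = TRUE), b = sum(b, na.rm = TRUE),
                     c = sum(c, na.rm = TRUE), d = sum(d, na.rm = TRUE),
                     z = sum(z, na.rm = TRUE)), by = category]

# .SD version: sum() is applied to every non-grouping column
via_sd <- dt[, lapply(.SD, sum, na.rm = TRUE), by = category]

# Compare the two results column by column
same <- sapply(names(by_hand),
               function(nm) isTRUE(all.equal(by_hand[[nm]], via_sd[[nm]])))
print(same)
```

The .SD result also carries a summed index column, because .SD contains every column that is not in by.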
If you only want to summarize over certain columns, you can add the .SDcols argument:
    # note that .SDcols also allows reordering of the columns
    dt[, lapply(.SD, sum, na.rm=TRUE), by=category, .SDcols=c("a", "c", "z") ]

       category        a         c        z
    1:        c 51.13289  9.535588 42.50884
    2:        b 17.34860 11.764105 10.32514
    3:        a 25.91616 29.197343  0.00000
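With 50 or so columns, the .SDcols vector itself can be built programmatically rather than typed out. The sketch below uses a small made-up table (its values and the note column are hypothetical, not from the question) and picks every numeric column except index; num_cols, sum_cols and res are illustrative names.

```r
library(data.table)

# Small made-up table standing in for the real 60-column data
dt <- data.table(
  index    = 1:6,
  category = c("a", "b", "a", "c", "b", "a"),
  a        = c(1, 2, 3, 4, 5, 6),
  b        = c(10, NA, 30, 40, 50, 60),
  note     = c("x", "y", "x", "y", "x", "y")  # non-numeric, should be skipped
)

# All numeric columns, minus the ones that should not be summed
num_cols <- names(dt)[sapply(dt, is.numeric)]
sum_cols <- setdiff(num_cols, "index")

res <- dt[, lapply(.SD, sum, na.rm = TRUE), by = category, .SDcols = sum_cols]
print(res)
# category "a": a = 10, b = 100; category "b": a = 7, b = 50; category "c": a = 4, b = 40
```

Building the name vector first keeps the selection logic visible and lets you inspect sum_cols before running the summary.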
This, of course, is not limited to sum, and you can use any function with lapply, including anonymous functions (i.e., it's a regular lapply statement).
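For instance, an anonymous function can count the non-missing values in each column per group, which helps when reading a sum of 0 that comes from an all-NA group (sum with na.rm=TRUE returns 0 for such a group). The table below is made up for illustration, and counts is a hypothetical name.

```r
library(data.table)

# Made-up data: group "a" has no non-missing values in column z
dt <- data.table(
  category = c("a", "a", "b", "b", "b"),
  a        = c(1.5, NA, 2.0, 3.0, NA),
  z        = c(NA, NA, 4.0, NA, 6.0)
)

# Anonymous function: number of non-missing values per column and group
counts <- dt[, lapply(.SD, function(x) sum(!is.na(x))), by = category]
print(counts)
# category "a": a = 1, z = 0; category "b": a = 2, z = 2
```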
Lastly, there is no need to use i=T and j= <..>. Personally, I think that makes the code less readable, but it is just a style preference.
See ?.SD, ?data.table and its .SDcols argument, and the vignette Using .SD for Data Analysis.
Also have a look at data.table FAQ 2.1.