Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Summarize based on two grouping variables in R using data.table

Tags:

r

data.table

I'm trying to use data.table in R to summarize the following data table:

SiteNo Var1 Var2 Var3 ... Var18 Group
1      0.1 0.3  1         0.3     1
2      0.3 0.1  0.9       0.2     1
etc.

There are 668,944 observations, 43 sites, 3 groups, and 19 variables. I'd like to get the results of a function (e.g., mean) which summarizes each column/variable by both site and group. So there should be 43 sites x 3 groups x # of summary stats (e.g., mean). I've used the following code:

e.dt<-data.table(e)
setkey(e.dt, Group) # set key to group number

# get mean for each column/variable
e.dt.mean<-e.dt[,lapply(.SD,mean), by="SiteNo"]

Using the above, I get 43 sites, but not the 3 groups I was after. I could split the original data table into the three groups, but was wondering if there was a way of summarizing using two variables (SiteNo and Group) using data.table.

I'm still RTM on data.table, but so far I haven't found the answer to the above.

like image 538
Prophet60091 Avatar asked Dec 17 '12 18:12

Prophet60091


1 Answers

Try setting your key to both "Group" and "SiteNo":

From the example under ?key:

keycols <- c("SiteNo", "Group")
setkeyv(e.dt, keycols)

Then, use by as:

e.dt[, lapply(.SD,mean), by = key(e.dt)]

Alternatively, you can use:

e.dt[, lapply(.SD,mean), by = "SiteNo,Group"]

or

e.dt[, lapply(.SD, mean), by = list(SiteNo, Group)]
like image 118
A5C1D2H2I1M1N2O1R2T1 Avatar answered Sep 25 '22 13:09

A5C1D2H2I1M1N2O1R2T1