I want to use data.table
to achieve a very simple task for a large dataset.
Calculate mean of val1 and val2 for each ID.
For details, please refer to the attached fake data.
library(data.table)
DT <- data.table(ID = paste0("ID",rep(1:5,each=2)),
level= rep(c("CTRL","CTRL","ID1","ID2","ID3"),2),
val1 = 1:10,
val2 = rnorm(10))
Here I want to calculate for each ID, the mean of val1 and val2.
Also notice that in each ID, there're different levels. But for each unique ID, I just want one mean incorporating the different levels, val1, and val2.
--- ID | Mean ---
-- ID1 | ...
-- ID2 | ...
-- ID3 | ...
I tried the following code, but it doesn't work.
topagents <- DT[, mean = mean(list(val1,val2)),
by = ID]
but it doesn't work.
I know how to do it in reshape2
, first melt
and then dcast
.
But the original dataset is relatively large with 20M rows and 12 fields, it takes quite a long time to do the calculation.
So I prefer to use data.table
or dplyr
.
data. table is the most efficient when filtering rows. dplyr is far more efficient when summarizing by group while data. table was the least efficient.
Each dplyr verb must do some work to convert dplyr syntax to data. table syntax. This takes time proportional to the complexity of the input code, not the input data, so should be a negligible overhead for large datasets.
In my benchmarking project, Base R sorts a dataset much faster than dplyr or data.
dplyr is a package for making tabular data manipulation easier. tidyr enables you to swiftly convert between different data formats.
Encapsulate the calls to mean
in the list, rather than taking the mean of a list, which you can't do:
DT[, j=list(val1=mean(val1), val2=mean(val2)), by=ID]
ID val1 val2
1: ID1 1.5 0.1389794
2: ID2 3.5 0.3392179
3: ID3 5.5 -0.6336174
4: ID4 7.5 0.9941148
5: ID5 9.5 0.1324782
To get a single value, the mean of the val1
and val2
values, combine these and pass to mean
:
DT[, j=list(mean=mean(c(val1,val2))), by=ID]
ID mean
1: ID1 0.8194897
2: ID2 1.9196090
3: ID3 2.4331913
4: ID4 4.2470574
5: ID5 4.8162391
Using a list for the single element of j
here is an easy way to name the resulting column.
topagents <- DT[, mean(c(val1,val2)), by = ID]
mean can only take a vector, it doesn't understand a list.
Your question said "Calculate mean of val1 and val2 for each ID." But based on Mathew's answer maybe you wanted "Calculate means(plural) of val1 and val2 for each ID."?
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With