Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Calculate between columns in data.table or dplyr?

I want to use data.table to achieve a very simple task for a large dataset.

Calculate mean of val1 and val2 for each ID.

For details, please refer to the attached fake data.

library(data.table)
DT <- data.table(ID = paste0("ID",rep(1:5,each=2)),
      level= rep(c("CTRL","CTRL","ID1","ID2","ID3"),2),
      val1 = 1:10, 
      val2 = rnorm(10))

Here I want to calculate for each ID, the mean of val1 and val2.

Also notice that in each ID, there're different levels. But for each unique ID, I just want one mean incorporating the different levels, val1, and val2.

--- ID | Mean ---

-- ID1 | ...

-- ID2 | ...

-- ID3 | ...

I tried the following code, but it doesn't work.

topagents <- DT[, mean = mean(list(val1,val2)), 
                    by = ID]

but it doesn't work. I know how to do it in reshape2, first melt and then dcast.

But the original dataset is relatively large with 20M rows and 12 fields, it takes quite a long time to do the calculation.

So I prefer to use data.table or dplyr.

like image 375
Bigchao Avatar asked Jan 15 '14 04:01

Bigchao


People also ask

Is data table better than dplyr?

data. table is the most efficient when filtering rows. dplyr is far more efficient when summarizing by group while data. table was the least efficient.

Can you use dplyr on data table?

Each dplyr verb must do some work to convert dplyr syntax to data. table syntax. This takes time proportional to the complexity of the input code, not the input data, so should be a negligible overhead for large datasets.

Is dplyr faster than base R?

In my benchmarking project, Base R sorts a dataset much faster than dplyr or data.

What is the difference between Tidyr and dplyr?

dplyr is a package for making tabular data manipulation easier. tidyr enables you to swiftly convert between different data formats.


2 Answers

Encapsulate the calls to mean in the list, rather than taking the mean of a list, which you can't do:

DT[, j=list(val1=mean(val1), val2=mean(val2)), by=ID]
    ID val1       val2
1: ID1  1.5  0.1389794
2: ID2  3.5  0.3392179
3: ID3  5.5 -0.6336174
4: ID4  7.5  0.9941148
5: ID5  9.5  0.1324782

To get a single value, the mean of the val1 and val2 values, combine these and pass to mean:

DT[, j=list(mean=mean(c(val1,val2))), by=ID]
    ID      mean
1: ID1 0.8194897
2: ID2 1.9196090
3: ID3 2.4331913
4: ID4 4.2470574
5: ID5 4.8162391

Using a list for the single element of j here is an easy way to name the resulting column.

like image 62
Matthew Lundberg Avatar answered Oct 11 '22 12:10

Matthew Lundberg


topagents <- DT[, mean(c(val1,val2)), by = ID]

mean can only take a vector, it doesn't understand a list.

Your question said "Calculate mean of val1 and val2 for each ID." But based on Mathew's answer maybe you wanted "Calculate means(plural) of val1 and val2 for each ID."?

like image 38
JeremyS Avatar answered Oct 11 '22 14:10

JeremyS