data.table efficient alternative to grouped assignment as DT[ ,x:=f(y),by=z]?

Tags: r, data.table

I am looking for the best alternative to the not yet implemented (to my knowledge) assignment by reference in a data.table by groups. Using the data.table example,

DT = data.table(x=rep(c("a","b","c"),each=3), y=c(1,3,6), v=1:9)
     x y v
[1,] a 1 1
[2,] a 3 2
[3,] a 6 3
[4,] b 1 4
[5,] b 3 5
[6,] b 6 6
[7,] c 1 7
[8,] c 3 8
[9,] c 6 9

I want to add a new column z containing f(y,v), computed within groups of x (let's take f(y,v)=mean(y)+v). Note that I do not want to merely print or store the result of this computation, as in

DT[,mean(y)+v,by=x]
      x        V1
 [1,] a  4.333333
 [2,] a  5.333333
 [3,] a  6.333333
 [4,] b  7.333333
 [5,] b  8.333333
 [6,] b  9.333333
 [7,] c 10.333333
 [8,] c 11.333333
 [9,] c 12.333333

but rather I want to add the result to DT:

     x y v        V1
[1,] a 1 1  4.333333
[2,] a 3 2  5.333333
[3,] a 6 3  6.333333
[4,] b 1 4  7.333333
[5,] b 3 5  8.333333
[6,] b 6 6  9.333333
[7,] c 1 7 10.333333
[8,] c 3 8 11.333333
[9,] c 6 9 12.333333

My data.table is 262 MB in size, so

DT <- DT[,transform(.SD,mean(y)+v),by=x]

is not an option, since I cannot fit DT in memory twice (which the copy operation implies, I think). In fact, I have never seen that operation finish.

What alternatives do I have (until data.table comes with DT[,z:=mean(y)+v,by=x])?

I just read about DT[newDT]. What's wrong here?

newDT <- DT[,mean(y)+v,by=x]
      x        V1
 [1,] a  4.333333
 [2,] a  5.333333
 [3,] a  6.333333
 [4,] b  7.333333
 [5,] b  8.333333
 [6,] b  9.333333
 [7,] c 10.333333
 [8,] c 11.333333
 [9,] c 12.333333

(which is doable memory-wise). Then:

setkey(DT,x)
setkey(newDT,x)
DT[newDT]
x y v        V1
a 1 1  4.333333
a 3 2  4.333333
a 6 3  4.333333
a 1 1  5.333333
a 3 2  5.333333
a 6 3  5.333333
a 1 1  6.333333
a 3 2  6.333333
a 6 3  6.333333
b 1 4  7.333333
b 3 5  7.333333
b 6 6  7.333333
b 1 4  8.333333
b 3 5  8.333333
b 6 6  8.333333
b 1 4  9.333333
b 3 5  9.333333
b 6 6  9.333333
c 1 7 10.333333
c 3 8 10.333333
c 6 9 10.333333
c 1 7 11.333333
c 3 8 11.333333
c 6 9 11.333333
c 1 7 12.333333
c 3 8 12.333333
c 6 9 12.333333

but that's not what I want. What's the mistake here?

Florian Oswald asked May 24 '12 00:05


2 Answers

DT[, xm := ave(y, x, FUN=mean) + v]
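For reference, here is a minimal runnable sketch of this answer on the question's sample data. It also compares the result with the grouped `:=` form the question asks for, on the assumption that the installed data.table version supports `:=` together with `by` (versions released after this question do, as far as I know). The column name `z` is only illustrative.

```r
library(data.table)

# Sample data from the question; y is written out in full to avoid relying on recycling.
DT <- data.table(x = rep(c("a", "b", "c"), each = 3),
                 y = rep(c(1, 3, 6), times = 3),
                 v = 1:9)

# ave() returns a vector as long as y that holds each row's group mean,
# so the new column is added by reference without an explicit by=.
DT[, xm := ave(y, x, FUN = mean) + v]

# Assumption: this data.table version accepts := combined with by=.
DT[, z := mean(y) + v, by = x]

print(DT)
```

If both forms are available, the two new columns should be identical, and neither call reassigns DT.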
IRTFM answered Nov 09 '22 03:11


I would do the following:

DT[, list(fvy = mean(y)), by="x"][DT][, fvy := fvy + v]

So basically, I split it up into two parts: first, I compute the mean of y per group and join that to DT; then I add v to the mean of y. Memory-wise I'm not sure if this really helps, but there is a good chance the author will have a look and let us know ;-)
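The two steps can be sketched as below. This is an adaptation rather than the exact one-liner above: it names the intermediate table `means` (a hypothetical name) and passes `on = "x"` explicitly, on the assumption that the installed data.table version supports the `on=` argument, so no key is needed on either table.

```r
library(data.table)

DT <- data.table(x = rep(c("a", "b", "c"), each = 3),
                 y = rep(c(1, 3, 6), times = 3),
                 v = 1:9)

# Step 1: one row per group with the group mean of y.
means <- DT[, list(fvy = mean(y)), by = "x"]

# Step 2: look up each row of DT in means; every x value occurs only once
# in means, so the result has exactly one row per row of DT.
res <- means[DT, on = "x"]

# Step 3: add v to the group mean, by reference on the joined table.
res[, fvy := fvy + v]

print(res)
```

Note that `res` is a new table, so this still holds a second table of the same row count in memory alongside DT.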

Regarding your question why it's not working: you end up with two data.tables that you want to join, DT and newDT. Both contain every key value three times. So when you join them on x alone, every combination of matching rows appears in the result: 3 x 3 = 9 rows each for a, b, and c, or 27 rows in total.
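The row count can be checked with a small sketch. Recent data.table versions, as far as I know, stop a join of this kind with an error unless `allow.cartesian = TRUE` is passed, so the sketch sets it; the use of `on=` instead of keys and the explicit column name `V1` are likewise assumptions made for illustration.

```r
library(data.table)

DT <- data.table(x = rep(c("a", "b", "c"), each = 3),
                 y = rep(c(1, 3, 6), times = 3),
                 v = 1:9)

newDT <- DT[, list(V1 = mean(y) + v), by = x]

# Each of the 9 rows of newDT matches the 3 rows of DT that share its x value.
joined <- DT[newDT, on = "x", allow.cartesian = TRUE]

nrow(joined)            # 27
joined[, .N, by = x]    # 9 rows per group
```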

So to do it your way, which is quite similar to mine, you need a second key column so that each row matches exactly one row:

newDT <- DT[,list(fvy=mean(y)+v, v),by=x]
setkey(newDT, x, v)
setkey(DT, x, v)
DT[newDT]
      x v y       fvy
 [1,] a 1 1  4.333333
 [2,] a 2 3  5.333333
 [3,] a 3 6  6.333333
 [4,] b 4 1  7.333333
 [5,] b 5 3  8.333333
 [6,] b 6 6  9.333333
 [7,] c 7 1 10.333333
 [8,] c 8 3 11.333333
 [9,] c 9 6 12.333333
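A possible variant of the same two-column join, sketched under the assumption that the installed data.table version supports `on=` and `:=` inside a join (an "update join"): instead of building a joined copy, it writes the looked-up values into DT by reference. The `i.` prefix refers to columns of the table passed in `i` (here `newDT`).

```r
library(data.table)

DT <- data.table(x = rep(c("a", "b", "c"), each = 3),
                 y = rep(c(1, 3, 6), times = 3),
                 v = 1:9)

newDT <- DT[, list(fvy = mean(y) + v, v), by = x]

# Match on both x and v so every row of newDT finds exactly one row of DT,
# then add the matched value to DT in place.
DT[newDT, on = c("x", "v"), fvy := i.fvy]

print(DT)
```

Because no keys are set here, the row order of DT is left unchanged.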
Christoph_J answered Nov 09 '22 03:11