Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Add marginals to data table?

Tags:

r

data.table

What is the right way to add marginal sums to a data table?

What I do right now:

> (a <- data.table(x=c(1,2,1,2,2,3,3),y=c(10,10,20,20,30,30,40),z=1:7,key=c("x")))
   x  y z
1: 1 10 1
2: 1 20 3
3: 2 10 2
4: 2 20 4
5: 2 30 5
6: 3 30 6
7: 3 40 7
> (a <- a[a[,sum(z),by=x]])
   x  y z V1
1: 1 10 1  4
2: 1 20 3  4
3: 2 10 2 11
4: 2 20 4 11
5: 2 30 5 11
6: 3 30 6 13
7: 3 40 7 13
> setnames(a,"V1","x.z")
> setkeyv(a,"y")
> (a <- a[a[,sum(z),by=y]])
    y x z x.z V1
1: 10 1 1   4  3
2: 10 2 2  11  3
3: 20 1 3   4  7
4: 20 2 4  11  7
5: 30 2 5  11 11
6: 30 3 6  13 11
7: 40 3 7  13  7
> setnames(a,"V1","y.z")

I am pretty sure this is not The Right Way.

What is?

like image 473
sds Avatar asked Jan 06 '14 21:01

sds


1 Answers

One alternative is this one:

> a[,Sum:=sum(z), by="x"]
> a
   x  y z Sum
1: 1 10 1   4
2: 1 20 3   4
3: 2 10 2  11
4: 2 20 4  11
5: 2 30 5  11
6: 3 30 6  13
7: 3 40 7  13

Edit: Some more explanation on := usage:

The := operator enables add/update by reference. With this, you can:

  • add new columns or update existing columns by reference

    DT[, x2 := x+1] # add one new column
    DT[, `:=`(x2 = x+1, y2 = y+1)] # adding more than 1 col
    DT[, x := x+1] # modify existing column
    
  • add or update certain rows of new or existing columns by reference

    DT[x == 1L, y := NA] # modify 'y' just where expression in 'i' matches
    DT[x == 1L, `:=`(y = NA, z=NA)] # same but for multiple columns
    DT[x == 1L, newcol := 5L] # matched rows for 'newcol' will be 5, all other 'NA'
    
  • add or update cols while grouping, by reference - by default, the computed result is recycled within each group.

    DT[, zsum := sum(z), by=x]
    

Here, sum(z) returns 1 value for each group in x. The result is then recycled for length of that group and is added/updated by reference to zsum.

  • add or update during a by-without-by operation. That is, when you perform a data.table join and you want to add/update column while joining:

    X <- data.table(x=rep(1:3, each=2), y=1:6, key="x")
    Y <- data.table(x=1:3, y=c(3L, 1L, 2L), key="x")
    X[Y, y.gt := y > i.y] 
    
  • Finally, you can also remove columns by reference (i.e. instantly even it's a 20GB table) :

    DT[, x := NULL] # just 1 column
    DT[, c("x","y") := NULL]   # 1 or more columns
    
    toRemove = c("x","y")
    DT[, (toRemove) := NULL]   # wrap with brackets to lookup variable
    

Hope this helps clarify the usage on :=. Also check out ?set. It is similar to :=, but with the limitation that it can not be combined with joins. This allows for it to be faster inside a for loop (due to reduced overhead from not calling [.data.table) for all operations it is capable of than :=.

It can be quite handy, especially, in some scenarios. See this post for a nice usage.

like image 169
Jilber Urbina Avatar answered Sep 18 '22 12:09

Jilber Urbina