Add marginals to data table?

Question

What is the right way to add marginal sums to a data table?

What I do right now:

> (a <- data.table(x=c(1,2,1,2,2,3,3),y=c(10,10,20,20,30,30,40),z=1:7,key=c("x")))
   x  y z
1: 1 10 1
2: 1 20 3
3: 2 10 2
4: 2 20 4
5: 2 30 5
6: 3 30 6
7: 3 40 7
> (a <- a[a[,sum(z),by=x]])
   x  y z V1
1: 1 10 1  4
2: 1 20 3  4
3: 2 10 2 11
4: 2 20 4 11
5: 2 30 5 11
6: 3 30 6 13
7: 3 40 7 13
> setnames(a,"V1","x.z")
> setkeyv(a,"y")
> (a <- a[a[,sum(z),by=y]])
    y x z x.z V1
1: 10 1 1   4  3
2: 10 2 2  11  3
3: 20 1 3   4  7
4: 20 2 4  11  7
5: 30 2 5  11 11
6: 30 3 6  13 11
7: 40 3 7  13  7
> setnames(a,"V1","y.z")

I am pretty sure this is not The Right Way.

What is?

Jilber Urbina · Accepted Answer

One alternative is this one:

> a[,Sum:=sum(z), by="x"]
> a
   x  y z Sum
1: 1 10 1   4
2: 1 20 3   4
3: 2 10 2  11
4: 2 20 4  11
5: 2 30 5  11
6: 3 30 6  13
7: 3 40 7  13

Edit: Some more explanation on `:=` usage:

The := operator enables add/update by reference. With this, you can:

add new columns or update existing columns by reference

DT[, x2 := x+1] # add one new column
DT[, `:=`(x2 = x+1, y2 = y+1)] # adding more than 1 col
DT[, x := x+1] # modify existing column

add or update certain rows of new or existing columns by reference

DT[x == 1L, y := NA] # modify 'y' just where expression in 'i' matches
DT[x == 1L, `:=`(y = NA, z=NA)] # same but for multiple columns
DT[x == 1L, newcol := 5L] # matched rows for 'newcol' will be 5, all other 'NA'

add or update cols while grouping, by reference - by default, the computed result is recycled within each group.
```
DT[, zsum := sum(z), by=x]
```

Here, sum(z) returns 1 value for each group in x. The result is then recycled for length of that group and is added/updated by reference to zsum.

add or update during a by-without-by operation. That is, when you perform a data.table join and you want to add/update column while joining:
```
X <- data.table(x=rep(1:3, each=2), y=1:6, key="x")
Y <- data.table(x=1:3, y=c(3L, 1L, 2L), key="x")
X[Y, y.gt := y > i.y] 
```

Finally, you can also remove columns by reference (i.e. instantly even it's a 20GB table) :

DT[, x := NULL] # just 1 column
DT[, c("x","y") := NULL]   # 1 or more columns

toRemove = c("x","y")
DT[, (toRemove) := NULL]   # wrap with brackets to lookup variable

Hope this helps clarify the usage on :=. Also check out ?set. It is similar to :=, but with the limitation that it can not be combined with joins. This allows for it to be faster inside a for loop (due to reduced overhead from not calling [.data.table) for all operations it is capable of than :=.

It can be quite handy, especially, in some scenarios. See this post for a nice usage.

Add marginals to data table?

Tags:

r

data.table

sds

1 Answers

Edit: Some more explanation on `:=` usage:

Jilber Urbina

Recent Activity

Donate For Us

Add marginals to data table?

Tags:

r

data.table

sds

1 Answers

Edit: Some more explanation on := usage:

Jilber Urbina

Related questions

Recent Activity

Donate For Us

Edit: Some more explanation on `:=` usage: