What is the right way to add marginal sums to a data table?
What I do right now:
> (a <- data.table(x=c(1,2,1,2,2,3,3),y=c(10,10,20,20,30,30,40),z=1:7,key=c("x")))
x y z
1: 1 10 1
2: 1 20 3
3: 2 10 2
4: 2 20 4
5: 2 30 5
6: 3 30 6
7: 3 40 7
> (a <- a[a[,sum(z),by=x]])
x y z V1
1: 1 10 1 4
2: 1 20 3 4
3: 2 10 2 11
4: 2 20 4 11
5: 2 30 5 11
6: 3 30 6 13
7: 3 40 7 13
> setnames(a,"V1","x.z")
> setkeyv(a,"y")
> (a <- a[a[,sum(z),by=y]])
y x z x.z V1
1: 10 1 1 4 3
2: 10 2 2 11 3
3: 20 1 3 4 7
4: 20 2 4 11 7
5: 30 2 5 11 11
6: 30 3 6 13 11
7: 40 3 7 13 7
> setnames(a,"V1","y.z")
I am pretty sure this is not The Right Way.
What is?
One alternative is this one:
> a[,Sum:=sum(z), by="x"]
> a
x y z Sum
1: 1 10 1 4
2: 1 20 3 4
3: 2 10 2 11
4: 2 20 4 11
5: 2 30 5 11
6: 3 30 6 13
7: 3 40 7 13
:=
usage:The :=
operator enables add/update by reference. With this, you can:
add new columns or update existing columns by reference
DT[, x2 := x+1] # add one new column
DT[, `:=`(x2 = x+1, y2 = y+1)] # adding more than 1 col
DT[, x := x+1] # modify existing column
add or update certain rows of new or existing columns by reference
DT[x == 1L, y := NA] # modify 'y' just where expression in 'i' matches
DT[x == 1L, `:=`(y = NA, z=NA)] # same but for multiple columns
DT[x == 1L, newcol := 5L] # matched rows for 'newcol' will be 5, all other 'NA'
add or update cols while grouping, by reference - by default, the computed result is recycled within each group.
DT[, zsum := sum(z), by=x]
Here,
sum(z)
returns 1 value for each group inx
. The result is then recycled for length of that group and is added/updated by reference tozsum
.
add or update during a by-without-by operation. That is, when you perform a data.table
join and you want to add/update column while joining:
X <- data.table(x=rep(1:3, each=2), y=1:6, key="x")
Y <- data.table(x=1:3, y=c(3L, 1L, 2L), key="x")
X[Y, y.gt := y > i.y]
Finally, you can also remove columns by reference (i.e. instantly even it's a 20GB table) :
DT[, x := NULL] # just 1 column
DT[, c("x","y") := NULL] # 1 or more columns
toRemove = c("x","y")
DT[, (toRemove) := NULL] # wrap with brackets to lookup variable
Hope this helps clarify the usage on :=
. Also check out ?set
. It is similar to :=
, but with the limitation that it can not be combined with joins. This allows for it to be faster inside a for
loop (due to reduced overhead from not calling [.data.table
) for all operations it is capable of than :=
.
It can be quite handy, especially, in some scenarios. See this post for a nice usage.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With