I am looking for the best alternative to the not yet implemented (to my knowledge) assignment by reference in a data.table by groups. Using the data.table example,
DT = data.table(x=rep(c("a","b","c"),each=3), y=c(1,3,6), v=1:9)
x y v
[1,] a 1 1
[2,] a 3 2
[3,] a 6 3
[4,] b 1 4
[5,] b 3 5
[6,] b 6 6
[7,] c 1 7
[8,] c 3 8
[9,] c 6 9
I want to add a new column z, containing f(y,v) grouped by values of x (let's take f(y,v)=mean(y)+v). Note that I do not want to print or store the result of this computation as in
DT[,mean(y)+v,by=x]
x V1
[1,] a 4.333333
[2,] a 5.333333
[3,] a 6.333333
[4,] b 7.333333
[5,] b 8.333333
[6,] b 9.333333
[7,] c 10.333333
[8,] c 11.333333
[9,] c 12.333333
but rather I want to add the result to DT:
x y v V1
[1,] a 1 1 4.333333
[2,] a 3 2 5.333333
[3,] a 6 3 6.333333
[4,] b 1 4 7.333333
[5,] b 3 5 8.333333
[6,] b 6 6 9.333333
[7,] c 1 7 10.333333
[8,] c 3 8 11.333333
[9,] c 6 9 12.333333
My data.table is 262 MB, so
DT <- DT[,transform(.SD,mean(y)+v),by=x]
is not an option, since I cannot fit DT twice in memory (which is implied by the copy operation, I think). In fact, I've never seen that operation finish.
What alternatives do I have (until data.table comes with DT[,z:=mean(y)+v,by=x])?
I just read about DT[newDT]. What's wrong here?
newDT <- DT[,mean(y)+v,by=x]
x V1
[1,] a 4.333333
[2,] a 5.333333
[3,] a 6.333333
[4,] b 7.333333
[5,] b 8.333333
[6,] b 9.333333
[7,] c 10.333333
[8,] c 11.333333
[9,] c 12.333333
(which is doable memory-wise). Then:
setkey(DT,x)
setkey(newDT,x)
DT[newDT]
x y v V1
a 1 1 4.333333
a 3 2 4.333333
a 6 3 4.333333
a 1 1 5.333333
a 3 2 5.333333
a 6 3 5.333333
a 1 1 6.333333
a 3 2 6.333333
a 6 3 6.333333
b 1 4 7.333333
b 3 5 7.333333
b 6 6 7.333333
b 1 4 8.333333
b 3 5 8.333333
b 6 6 8.333333
b 1 4 9.333333
b 3 5 9.333333
b 6 6 9.333333
c 1 7 10.333333
c 3 8 10.333333
c 6 9 10.333333
c 1 7 11.333333
c 3 8 11.333333
c 6 9 11.333333
c 1 7 12.333333
c 3 8 12.333333
c 6 9 12.333333
but that's not what I want. What's the mistake here?
DT[, xm := ave(y, x, FUN=mean) + v]
I would do the following:
DT[, list(fvy = mean(y)), by="x"][DT][, fvy := fvy + v]
So basically, I split it up into two parts: first, I compute the mean of y and add that to DT, then I add v to the mean of y. Memory-wise I'm not sure if this really helps, but there is a good chance the author will have a look and let us know ;-)
Regarding your question why it's not working: basically, you end up with two data.tables that you want to merge, DT and newDT. Both data.tables contain every key value three times. So when you merge them on x alone, every combination of matching rows ends up in the result (3 x 3 = 9 rows per key value), and that's why you get a data.table with nine rows each for a, b, and c.
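The row multiplication described above can be checked directly. This is a minimal sketch, assuming a data.table release that supports the `on` and `allow.cartesian` arguments of `[.data.table`; the name `joined` is only used for illustration, and `y` is written out with `rep()` rather than relying on recycling.

```r
library(data.table)

# Same example data as in the question
DT <- data.table(x = rep(c("a", "b", "c"), each = 3),
                 y = rep(c(1, 3, 6), 3),
                 v = 1:9)

# 9 rows; every value of x appears three times
newDT <- DT[, mean(y) + v, by = x]

# Joining on x alone pairs each row of newDT with all three rows of DT
# that share its x value, giving 3 * 3 = 9 rows per group.
# allow.cartesian = TRUE is assumed to be needed here because the result
# is larger than nrow(DT) + nrow(newDT).
joined <- DT[newDT, on = "x", allow.cartesian = TRUE]

nrow(joined)            # 27
joined[, .N, by = x]    # 9 rows for each of a, b, c
```

With only `x` as the join column there is nothing that ties a particular row of `newDT` to a particular row of `DT`, which is why a second key column is needed.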
So to do it your way, which is quite similar to mine, you need a second key so that each row of newDT matches exactly one row of DT:
newDT <- DT[,list(fvy=mean(y)+v, v),by=x]
setkey(newDT, x, v)
setkey(DT, x, v)
DT[newDT]
x v y fvy
[1,] a 1 1 4.333333
[2,] a 2 3 5.333333
[3,] a 3 6 6.333333
[4,] b 4 1 7.333333
[5,] b 5 3 8.333333
[6,] b 6 6 9.333333
[7,] c 7 1 10.333333
[8,] c 8 3 11.333333
[9,] c 9 6 12.333333
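For readers on a data.table release where `:=` can be combined with `by` (the syntax the question asks about), the column can be added by reference without any join or copy of DT. This is a sketch under that assumption, not something that worked at the time of the question; the column name `z` is taken from the question.

```r
library(data.table)

# Same example data as in the question
DT <- data.table(x = rep(c("a", "b", "c"), each = 3),
                 y = rep(c(1, 3, 6), 3),
                 v = 1:9)

# Compute mean(y) within each group of x, add v row by row,
# and assign the result to a new column z in place.
DT[, z := mean(y) + v, by = x]

DT$z   # 4.333333 5.333333 ... 12.333333
```

Because the assignment happens by reference, no second copy of DT should be needed, which is the memory concern raised in the question.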