data.table efficient alternative to grouped assignment as DT[ ,x:=f(y),by=z]?

Tags: r, data.table

I am looking for the best alternative to grouped assignment by reference in a data.table, which (to my knowledge) is not yet implemented. Using the data.table example,

DT = data.table(x=rep(c("a","b","c"),each=3), y=c(1,3,6), v=1:9)
     x y v
[1,] a 1 1
[2,] a 3 2
[3,] a 6 3
[4,] b 1 4
[5,] b 3 5
[6,] b 6 6
[7,] c 1 7
[8,] c 3 8
[9,] c 6 9

I want to add a new column z, containing f(y,v) grouped by values of x (let's take f(y,v)=mean(y)+v). Note that I do not want to just print or separately store the result of this computation, as in

DT[,mean(y)+v,by=x]
      x        V1
 [1,] a  4.333333
 [2,] a  5.333333
 [3,] a  6.333333
 [4,] b  7.333333
 [5,] b  8.333333
 [6,] b  9.333333
 [7,] c 10.333333
 [8,] c 11.333333
 [9,] c 12.333333

but rather I want to add the result to DT:

     x y v        V1
[1,] a 1 1  4.333333
[2,] a 3 2  5.333333
[3,] a 6 3  6.333333
[4,] b 1 4  7.333333
[5,] b 3 5  8.333333
[6,] b 6 6  9.333333
[7,] c 1 7 10.333333
[8,] c 3 8 11.333333
[9,] c 6 9 12.333333

My data.table is 262 MB, so

DT <- DT[,transform(.SD,mean(y)+v),by=x]

is not an option, since I cannot fit DT twice in memory (which the copy operation requires, I think). In fact, I have never seen that operation finish.

What alternatives do I have (until data.table comes with DT[,z:=mean(y)+v,by=x])?

I just read about DT[newDT]. What's wrong here?

newDT <- DT[,mean(y)+v,by=x]
      x        V1
 [1,] a  4.333333
 [2,] a  5.333333
 [3,] a  6.333333
 [4,] b  7.333333
 [5,] b  8.333333
 [6,] b  9.333333
 [7,] c 10.333333
 [8,] c 11.333333
 [9,] c 12.333333

(which is doable memory-wise). Then:

setkey(DT,x)
setkey(newDT,x)
DT[newDT]
x y v        V1
a 1 1  4.333333
a 3 2  4.333333
a 6 3  4.333333
a 1 1  5.333333
a 3 2  5.333333
a 6 3  5.333333
a 1 1  6.333333
a 3 2  6.333333
a 6 3  6.333333
b 1 4  7.333333
b 3 5  7.333333
b 6 6  7.333333
b 1 4  8.333333
b 3 5  8.333333
b 6 6  8.333333
b 1 4  9.333333
b 3 5  9.333333
b 6 6  9.333333
c 1 7 10.333333
c 3 8 10.333333
c 6 9 10.333333
c 1 7 11.333333
c 3 8 11.333333
c 6 9 11.333333
c 1 7 12.333333
c 3 8 12.333333
c 6 9 12.333333

but that's not what I want. What's the mistake here?

asked May 24 '12 00:05

Florian Oswald

2 Answers

DT[, xm := ave(y, x, FUN=mean) + v]
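For comparison, here is a minimal sketch of the one-liner above next to the grouped assignment the question asks for. It assumes a data.table release that supports `:=` together with `by` (the question predates that feature); the column name `z` comes from the question, and `y` is written out with `rep()` rather than relying on recycling.

```r
library(data.table)

# Same toy table as in the question, with y spelled out explicitly
DT <- data.table(x = rep(c("a", "b", "c"), each = 3),
                 y = rep(c(1, 3, 6), 3),
                 v = 1:9)

# ave() returns the group mean of y repeated on every row of its group,
# so the whole expression is an ordinary (ungrouped) assignment by reference
DT[, xm := ave(y, x, FUN = mean) + v]

# Grouped assignment by reference, assuming := with by= is available
DT[, z := mean(y) + v, by = x]

# Both columns should hold the same values
all.equal(DT$xm, DT$z)
```

Neither form rebuilds DT; each only adds one column to the existing table.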

answered Nov 09 '22 03:11

IRTFM

I would do the following:

DT[, list(fvy = mean(y)), by="x"][DT][, fvy := fvy + v]

So basically, I split it up into two parts: first, I compute the mean of y per group and join that onto DT, then I add v to the mean of y. Memory-wise I'm not sure if this really helps, but there is a good chance the data.table author will have a look and let us know ;-)
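The same two-step idea can also be sketched as an update join, assuming a data.table version that supports the `on=` argument (newer than the one this thread was written against). The names `means` and `my` are made up for illustration; the small summary table has one row per value of x, and the new column is written into DT by reference.

```r
library(data.table)

DT <- data.table(x = rep(c("a", "b", "c"), each = 3),
                 y = rep(c(1, 3, 6), 3),
                 v = 1:9)

# Step 1: one row per group, holding the group mean of y
means <- DT[, .(my = mean(y)), by = x]

# Step 2: join the summary onto DT and assign by reference;
# i.my refers to the column coming from `means`, v to DT's own column
DT[means, on = "x", z := i.my + v]

DT
```

Because `means` has unique values of x, every row of DT matches exactly one summary row and DT keeps its nine rows.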

Regarding your question why it's not working: basically, you end up with two data.tables that you want to merge, DT and newDT. Each key value appears three times in both data.tables, so when you merge them the result contains every combination: 3 × 3 = 9 rows for each of a, b, and c.
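The row multiplication described above can be checked on the toy data. This is a sketch under the assumption of a recent data.table, which refuses such a join unless `allow.cartesian = TRUE` is passed; the name `joined` is only for illustration.

```r
library(data.table)

DT <- data.table(x = rep(c("a", "b", "c"), each = 3),
                 y = rep(c(1, 3, 6), 3),
                 v = 1:9)
newDT <- DT[, mean(y) + v, by = x]

setkey(DT, x)
setkey(newDT, x)

# Each key value occurs three times in both tables
DT[, .N, by = x]
newDT[, .N, by = x]

# Every newDT row matches all three DT rows with the same x
joined <- DT[newDT, allow.cartesian = TRUE]
nrow(joined)  # 27 = 3 groups * 3 * 3
```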

So to do it your way, which is quite similar to mine, you need a second key column so that each row matches exactly one row:

newDT <- DT[,list(fvy=mean(y)+v, v),by=x]
setkey(newDT, x, v)
setkey(DT, x, v)
DT[newDT]
      x v y       fvy
 [1,] a 1 1  4.333333
 [2,] a 2 3  5.333333
 [3,] a 3 6  6.333333
 [4,] b 4 1  7.333333
 [5,] b 5 3  8.333333
 [6,] b 6 6  9.333333
 [7,] c 7 1 10.333333
 [8,] c 8 3 11.333333
 [9,] c 9 6 12.333333

answered Nov 09 '22 03:11

Christoph_J
