Is there a simple way to achieve something similar to `x[,c:=mean(a), by=b]$c`?

> x <- data.table(a=1:10, b=rep(1:2, 5))
> x
     a b
 1:  1 1
 2:  2 2
 3:  3 1
 4:  4 2
 5:  5 1
 6:  6 2
 7:  7 1
 8:  8 2
 9:  9 1
10: 10 2
> x[,c:=mean(a), by=b]
> y <- x$c
> y
 [1] 5 6 5 6 5 6 5 6 5 6

Ultimately, I am interested in y as a vector, and I don't want to add c to the data.table. Is there an easier way to get y from the original x?

The problem arose when I tried to apply different weights to different groups in a histogram.

# here the weight would be the same for every colour, but I want it to differ by group
geom_freqpoly(aes(colour=group, weight=mean(y)), binwidth=1)
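For concreteness, here is a minimal sketch of how per-group weights could be fed to ggplot2. The data frame `df` and its columns `y` and `group` are illustrative names, not from the original code; `ave()` recycles each group's mean back to one value per row, which is exactly the vector the question is after:

```r
library(ggplot2)

# hypothetical example data; `df`, `y`, and `group` are illustrative names
df <- data.frame(y = 1:10, group = rep(c("a", "b"), 5))

# ave() returns the group mean repeated for every row of that group,
# so each observation carries its own group's weight
df$w <- ave(df$y, df$group, FUN = mean)

ggplot(df, aes(y, colour = group, weight = w)) +
  geom_freqpoly(binwidth = 1)
```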
asked Sep 30 '13 by colinfang

3 Answers

> with(x, ave(a, b, FUN=mean) )
 [1] 5 6 5 6 5 6 5 6 5 6

Just to let the data.table experts know: I am aware that this may not scale well to multi-million-record datasets, and I appreciate the other posts on this topic. I've been using data.table to good effect in my larger analyses. I posted only because of the expressed desire for simplicity and for not modifying the data argument.

answered Nov 01 '22 by IRTFM


You can daisy-chain the `[` operator:

x[, c := mean(a), by=b][, c]
# [1] 5 6 5 6 5 6 5 6 5 6

The result of `[.data.table` is itself a data.table, so you can simply append another `[` call right after it.


I just noticed the comments about not wanting to modify x. Note that you then need to recycle the vector of group means c back to the original length; R normally handles this for you. If you want to do it manually, use:

 x[, list(c=mean(a)), by=b][, rep(c, length(x$a)/length(c))]
 # [1] 5 6 5 6 5 6 5 6 5 6

As for the motivation for not modifying x: there is almost negligible overhead in assigning a column and then dropping it later with `x[, c := NULL]`, so temporarily modifying the DT may well be the way to go.
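On the question's own data, the add-then-drop round trip looks like this (a sketch using nothing beyond the code already shown above):

```r
library(data.table)
x <- data.table(a = 1:10, b = rep(1:2, 5))

x[, c := mean(a), by = b]   # add the temporary column of group means
y <- x$c                    # extract the vector you actually want
x[, c := NULL]              # drop the column; x has its original two columns again

y
# [1] 5 6 5 6 5 6 5 6 5 6
```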


As per @Frank's request, here is a simple benchmark: with 100 elements, `by` is faster, but its advantage diminishes quickly as the data grow.

# The call used for benchmarking is as follows: 
library(microbenchmark)
microbenchmark(B = as.vector(by(x$a,x$b,mean)[as.character(x$b)]), 
               D = x[, list(c=mean(a)), by=b][, rep(c, length(x$a)/length(c))]
               )



# medium sized x
N <- 1e4
x <- {set.seed(1); data.table(a=1:(N), b=sample(5, N, TRUE), key="b")}

Unit: milliseconds
 expr      min       lq   median       uq       max neval
    B 6.150740 6.284466 6.403332 7.790877 10.339314   100
    D 1.268631 1.337959 1.441184 1.525279  2.963625   100
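For completeness, the `ave()` approach from the accepted answer can be slotted into the same benchmark as a third contender (a sketch; no timings are claimed here, so run it yourself to compare):

```r
library(data.table)
library(microbenchmark)

N <- 1e4
x <- {set.seed(1); data.table(a = 1:N, b = sample(5, N, TRUE), key = "b")}

microbenchmark(
  B = as.vector(by(x$a, x$b, mean)[as.character(x$b)]),
  D = x[, list(c = mean(a)), by = b][, rep(c, length(x$a)/length(c))],
  A = with(x, ave(a, b, FUN = mean))   # the ave() variant
)
```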
answered Nov 01 '22 by Ricardo Saporta


Here's another way of doing it without modifying the original data.table, though IMO that's an entirely artificial and unnecessary constraint, i.e. you have the best solution already.

x[, list(.I, mean(a)), by = b][order(.I), V2]
#[1] 5 6 5 6 5 6 5 6 5 6

# or for faster ordering
setkey(x[, list(.I, mean(a)), by = b], .I)$V2
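The `order(.I)` step is what makes this work when the groups are not neatly interleaved; a small sketch on shuffled groups, with the j-columns named explicitly for clarity:

```r
library(data.table)
x <- data.table(a = 1:10, b = c(1, 1, 2, 2, 1, 2, 1, 2, 1, 2))

# .I records each row's original position, so ordering by it
# restores the original row order after grouping
v <- x[, list(i = .I, m = mean(a)), by = b][order(i), m]
v
# [1] 4.8 4.8 6.2 6.2 4.8 6.2 4.8 6.2 4.8 6.2
```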
answered Nov 01 '22 by eddi