> x <- data.table(a=1:10, b=rep(1:2, 5))
> x
a b
1: 1 1
2: 2 2
3: 3 1
4: 4 2
5: 5 1
6: 6 2
7: 7 1
8: 8 2
9: 9 1
10: 10 2
> x[,c:=mean(a), by=b]
> y <- x$c
> y
[1] 5 6 5 6 5 6 5 6 5 6
Ultimately, I am interested in y as a vector, and I don't want to add c to the data.table. Is there an easier way to get y from the original x?
The problem arose when I tried to apply different weights to different groups in a histogram.
# Here the weight is the same for every colour, but I want it to differ by group.
geom_freqpoly(aes(colour=group, weight=mean(y)), binwidth=1)
> with(x, ave(a, b, FUN=mean) )
[1] 5 6 5 6 5 6 5 6 5 6
Just to let the data.table experts know, I am aware that this may not scale well to multi-million record datasets and I am appreciative of the other posts on this topic. I've been using data.table to good effect on my larger analyses. It was only because of an expressed desire for simplicity and non-modification of the data argument that I posted.
You can daisy-chain the "[" operator:
x[, c := mean(a), by=b][, c]
# [1] 5 6 5 6 5 6 5 6 5 6
The result from "[.data.table" is itself a data.table, so you can just add another one right after it.
I just noticed the comments about not wanting to modify x. Notice that somehow you need to recycle the vector c. R normally handles this for you. If you want to do it manually, use:
x[, list(c=mean(a)), by=b][, rep(c, length(x$a)/length(c))]
# [1] 5 6 5 6 5 6 5 6 5 6
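Note that repeating the group means with rep() lines up with the rows only when the groups are the same size and recur in the same repeating pattern, as they do in the toy x. A minimal sketch of a variant that should not depend on that pattern: aggregate first, then join the per-group means back onto x. The names means and grp_mean are made up for illustration and are not from the original post.

```r
library(data.table)

# Same toy data as in the question.
x <- data.table(a = 1:10, b = rep(1:2, 5))

# One row per group, holding that group's mean of a.
means <- x[, .(grp_mean = mean(a)), by = b]

# Join the group means back onto x. The result has one row per row of x,
# in x's original row order, and j picks out the mean column as a vector.
y <- means[x, on = "b", grp_mean]
y

# x itself still has only its original columns.
names(x)
```

Because the lookup is done by the value of b rather than by position, this should also work when the groups have unequal sizes or appear in arbitrary order.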
As for the motivation for not modifying x, notice that there is almost negligible overhead in assigning a column and then dropping it later with x[, c := NULL], so perhaps temporarily modifying the DT is the way to go.
As per @Frank's request, here is a simple benchmark. With 100 elements, by is faster, but its advantage diminishes quickly as the data grows.
# The call used for benchmarking is as follows:
library(microbenchmark)
microbenchmark(B = as.vector(by(x$a,x$b,mean)[as.character(x$b)]),
D = x[, list(c=mean(a)), by=b][, rep(c, length(x$a)/length(c))]
)
# medium sized x
N <- 1e4
x <- {set.seed(1); data.table(a=1:(N), b=sample(5, N, TRUE), key="b")}
Unit: milliseconds
expr min lq median uq max neval
B 6.150740 6.284466 6.403332 7.790877 10.339314 100
D 1.268631 1.337959 1.441184 1.525279 2.963625 100
Here's another way of doing it without modifying the original data.table, but imo that's an entirely artificial and unnecessary constraint, i.e. you have the best solution already.
x[, list(.I, mean(a)), by = b][order(.I), V2]
#[1] 5 6 5 6 5 6 5 6 5 6
# or for faster ordering
setkey(x[, list(.I, mean(a)), by = b], .I)$V2
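The lines above rely on the names data.table assigns automatically to unnamed j columns, which I would not assume to be identical across versions. A sketch of the same idea with explicit column names (row and grp_mean are hypothetical names chosen for this example):

```r
library(data.table)

# Same toy data as in the question.
x <- data.table(a = 1:10, b = rep(1:2, 5))

# Within each group, .I holds the row numbers of that group in x;
# the length-1 group mean is recycled alongside them.
res <- x[, .(row = .I, grp_mean = mean(a)), by = b]

# Sorting by the stored row number restores x's original row order.
y <- res[order(row), grp_mean]
y
```

Naming the columns makes the second step independent of whatever default names the first step would otherwise produce, and x is left unmodified.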