Is there a simple way to achieve something similar to `x[,c:=mean(a), by=b]$c`?

Question

> x <- data.table(a=1:10, b=rep(1:2, 5))
> x
     a b
 1:  1 1
 2:  2 2
 3:  3 1
 4:  4 2
 5:  5 1
 6:  6 2
 7:  7 1
 8:  8 2
 9:  9 1
10: 10 2
> x[,c:=mean(a), by=b]
> y <- x$c
> y
 [1] 5 6 5 6 5 6 5 6 5 6

Ultimately, I am interested in y as a vector, and I don't want to add c to the data.table. Is there an easier way to get y from the original x?

The problem arose when I tried to apply different weights to different groups in a histogram.

# Here the weight would be the same for every colour, but I want it to differ by group.
geom_freqpoly(aes(colour=group, weight=mean(y)), binwidth=1)

IRTFM · Accepted Answer

> with(x, ave(a, b, FUN=mean) )
 [1] 5 6 5 6 5 6 5 6 5 6

Just to let the data.table experts know, I am aware that this may not scale well to multi-million record datasets and I am appreciative of the other posts on this topic. I've been using data.table to good effect on my larger analyses. It was only because of an expressed desire for simplicity and non-modification of the data argument that I posted.
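For readers who would rather stay inside data.table syntax, the same ave() call can also be evaluated in j. This is only a minimal sketch on the example data from the question; it returns a plain vector and does not add a column to x.

```r
library(data.table)

x <- data.table(a = 1:10, b = rep(1:2, 5))

# ave() returns one value per row, so the result is already aligned with x.
y <- x[, ave(a, b, FUN = mean)]
y

# x still has only its original columns.
names(x)
```

Because nothing is assigned with :=, x is left untouched, which is the constraint the question asked for.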

Ricardo Saporta · Answer

You can daisy-chain the "[" operator:

x[, c := mean(a), by=b][, c]
# [1] 5 6 5 6 5 6 5 6 5 6

The result from "[.data.table" is itself a data.table, so you can just add another one right after it.

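I just noticed the comments about not wanting to modify x. Notice that you somehow need to recycle the vector c. R normally handles this for you. If you want to do it manually (this relies on the groups alternating regularly and being of equal size, as they do in the example), use: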

 x[, list(c=mean(a)), by=b][, rep(c, length(x$a)/length(c))]
 # [1] 5 6 5 6 5 6 5 6 5 6

As for the motivation for not modifying x, note that there is almost negligible overhead in assigning a column and then dropping it later with x[, c := NULL], so temporarily modifying the DT may be the way to go.
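The manual rep() recycling lines up with x only when the groups alternate regularly and have equal sizes. As a hedged sketch of two alternatives that should hold for any group order, the code below either joins the per-group means back onto x or does the assignment on a copy. The column name grp_mean and the randomly sampled b are illustrative assumptions, not part of the original answer.

```r
library(data.table)

set.seed(1)
# Irregular group order, unlike the alternating 1, 2, 1, 2 of the question.
x <- data.table(a = 1:10, b = sample(3, 10, TRUE))

# One row per group, then join back onto x; the result follows the row order of x.
agg <- x[, list(grp_mean = mean(a)), by = b]
y_join <- agg[x, on = "b", grp_mean]

# Alternative: assign on a deep copy so that x itself is never modified.
y_copy <- copy(x)[, grp_mean := mean(a), by = b]$grp_mean

# Both approaches give the same per-row group means.
stopifnot(isTRUE(all.equal(y_join, y_copy)))
```

The copy() variant duplicates the whole table, so on large data the join, or temporarily adding and dropping the column, is likely the cheaper choice.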

As per @Frank's request, here is a simple benchmark. With 100 elements, by() is faster, but its speed advantage diminishes quickly as the data grows.

# The call used for benchmarking is as follows: 
library(microbenchmark)
microbenchmark(B = as.vector(by(x$a,x$b,mean)[as.character(x$b)]), 
               D = x[, list(c=mean(a)), by=b][, rep(c, length(x$a)/length(c))]
               )



# medium sized x
N <- 1e4
x <- {set.seed(1); data.table(a=1:(N), b=sample(5, N, TRUE), key="b")}

Unit: milliseconds
 expr      min       lq   median       uq       max neval
    B 6.150740 6.284466 6.403332 7.790877 10.339314   100
    D 1.268631 1.337959 1.441184 1.525279  2.963625   100

eddi · Answer

Here's another way of doing it without modifying the original data.table, but imo that's an entirely artificial and unnecessary constraint, i.e. you have the best solution already.

x[, list(.I, mean(a)), by = b][order(.I), V2]
#[1] 5 6 5 6 5 6 5 6 5 6

# or for faster ordering
setkey(x[, list(.I, mean(a)), by = b], .I)$V2
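The same idea may be easier to follow with explicitly named columns, so that the second step does not depend on auto-generated names such as V2. The sketch below is an illustrative variant, not the original answer: the names row and grp_mean are assumptions, and b is sampled at random to show that the row order of x is restored.

```r
library(data.table)

set.seed(1)
x <- data.table(a = 1:10, b = sample(3, 10, TRUE))

# Within each group, .I holds the row numbers of x that belong to that group.
# mean(a) is recycled to the group's length, giving one output row per row of x.
by_group <- x[, list(row = .I, grp_mean = mean(a)), by = b]

# Sorting on the stored row numbers puts the values back in the order of x.
y <- by_group[order(row), grp_mean]
y
```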

Tags: r, data.table, ggplot2

colinfang
