Given a sample data frame:
dt <- data.table(value=1:10,start=c(1,4,5,8,6,3,2,1,9,4),finish=c(3,7,8,9,10,10,4,10,10,8))
I want to add a new column which may be named as mean_column. i'th row of this colum should have the value
mean( value[ seq( from = start[i], to=finish[i] ) ] )
The real data I'm working on has 20 million row, so I need to find a fast way to do this calculation.
Edit: the value column in the data.table doesn't need to be an ordered sequence as in the example. Each value in this column may take any positive number.
Here's an approach that uses non-equi joins from data.table.
dt <- data.table(value=c(10,1:9),start=c(1,4,5,8,6,3,2,1,9,4),finish=c(3,7,8,9,10,10,4,10,10,8))
dt[, id := .I]
dt[dt,
on = .(id >= start,
id <= finish),
.(i.id, i.value, mean_col = mean(x.value)),
by = .EACHI,
allow.cartesian = T]
id id i.id i.value mean_col
<int> <int> <int> <num> <num>
1: 1 3 1 10 4.333333
2: 4 7 2 1 4.500000
3: 5 8 3 2 5.500000
4: 8 9 4 3 7.500000
5: 6 10 5 4 7.000000
6: 3 10 6 5 5.500000
7: 2 4 7 6 2.000000
8: 1 10 8 7 5.500000
9: 9 10 9 8 8.500000
10: 4 8 10 9 5.000000
Trying on 2,000,000 rows this takes 4 seconds on my computer and provides same answer as @jay.sf.
n = 2e6
dt <- data.table(value = sample(1000L, n, TRUE), start = sample(n, n, TRUE))
dt[, finish := start + sample(30, n, TRUE)]
dt[finish > n, finish := n]
system.time({
dt[, id := .I]
dt[dt,
on = .(id >= start,
id <= finish),
.(i.id, i.value, mean_col = mean(x.value)),
by = .EACHI,
allow.cartesian = T]
})
## user system elapsed
## 3.78 0.01 3.69
#jay.sf base approach
system.time({
FUNV3 <- Vectorize(function(x, y) x:y)
dt$mean.column2 <- with(dt, sapply(FUNV3(start, finish), function(x) mean(value[x])))
})
## user system elapsed
## 24.45 0.04 24.72
all.equal(dt$mean.column2, dt[dt,
on = .(id >= start,
id <= finish),
.(i.id, i.value, mean_col = mean(x.value)),
by = .EACHI,
allow.cartesian = T]$mean_col)
##[1] TRUE
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With