How I can take mean for different subsets of a specific column in a data.table?

Question

Given a sample data frame:

dt <- data.table(value=1:10,start=c(1,4,5,8,6,3,2,1,9,4),finish=c(3,7,8,9,10,10,4,10,10,8))

I want to add a new column which may be named as mean_column. i'th row of this colum should have the value

mean( value[ seq( from = start[i], to=finish[i] ) ] )

The real data I'm working on has 20 million row, so I need to find a fast way to do this calculation.

Edit: the value column in the data.table doesn't need to be an ordered sequence as in the example. Each value in this column may take any positive number.

Cole · Accepted Answer

Here's an approach that uses non-equi joins from data.table.

dt <- data.table(value=c(10,1:9),start=c(1,4,5,8,6,3,2,1,9,4),finish=c(3,7,8,9,10,10,4,10,10,8))
dt[, id := .I]

dt[dt,
   on = .(id >= start,
          id <= finish),
   .(i.id, i.value, mean_col = mean(x.value)),
   by = .EACHI,
   allow.cartesian = T]

       id    id  i.id i.value mean_col
    <int> <int> <int>   <num>    <num>
 1:     1     3     1      10 4.333333
 2:     4     7     2       1 4.500000
 3:     5     8     3       2 5.500000
 4:     8     9     4       3 7.500000
 5:     6    10     5       4 7.000000
 6:     3    10     6       5 5.500000
 7:     2     4     7       6 2.000000
 8:     1    10     8       7 5.500000
 9:     9    10     9       8 8.500000
10:     4     8    10       9 5.000000

Trying on 2,000,000 rows this takes 4 seconds on my computer and provides same answer as @jay.sf.

n = 2e6
dt <- data.table(value = sample(1000L, n, TRUE), start = sample(n, n, TRUE))
dt[, finish := start + sample(30, n, TRUE)]
dt[finish > n, finish := n]

system.time({
dt[, id := .I]
  dt[dt,
     on = .(id >= start,
            id <= finish),
     .(i.id, i.value, mean_col = mean(x.value)),
     by = .EACHI,
     allow.cartesian = T]
})

##    user  system elapsed 
##   3.78    0.01    3.69 

#jay.sf base approach
system.time({
  FUNV3 <- Vectorize(function(x, y) x:y)
dt$mean.column2 <- with(dt, sapply(FUNV3(start, finish), function(x) mean(value[x])))
})

##   user  system elapsed 
##  24.45    0.04   24.72 

all.equal(dt$mean.column2,   dt[dt,
                                on = .(id >= start,
                                       id <= finish),
                                .(i.id, i.value, mean_col = mean(x.value)),
                                by = .EACHI,
                                allow.cartesian = T]$mean_col)

##[1] TRUE

How I can take mean for different subsets of a specific column in a data.table?

Tags:

dataframe

r

data.table

dplyr

serdaryurek

1 Answers

Cole

Recent Activity

Donate For Us

How I can take mean for different subsets of a specific column in a data.table?

Tags:

dataframe

r

data.table

dplyr

serdaryurek

1 Answers

Cole

Related questions

Recent Activity

Donate For Us