Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How I can take mean for different subsets of a specific column in a data.table?

Given a sample data frame:

dt <- data.table(value=1:10,start=c(1,4,5,8,6,3,2,1,9,4),finish=c(3,7,8,9,10,10,4,10,10,8))

I want to add a new column which may be named as mean_column. i'th row of this colum should have the value

mean( value[ seq( from = start[i], to=finish[i] ) ] )

The real data I'm working on has 20 million row, so I need to find a fast way to do this calculation.

Edit: the value column in the data.table doesn't need to be an ordered sequence as in the example. Each value in this column may take any positive number.

like image 587
serdaryurek Avatar asked Dec 03 '22 17:12

serdaryurek


1 Answers

Here's an approach that uses non-equi joins from data.table.

dt <- data.table(value=c(10,1:9),start=c(1,4,5,8,6,3,2,1,9,4),finish=c(3,7,8,9,10,10,4,10,10,8))
dt[, id := .I]

dt[dt,
   on = .(id >= start,
          id <= finish),
   .(i.id, i.value, mean_col = mean(x.value)),
   by = .EACHI,
   allow.cartesian = T]

       id    id  i.id i.value mean_col
    <int> <int> <int>   <num>    <num>
 1:     1     3     1      10 4.333333
 2:     4     7     2       1 4.500000
 3:     5     8     3       2 5.500000
 4:     8     9     4       3 7.500000
 5:     6    10     5       4 7.000000
 6:     3    10     6       5 5.500000
 7:     2     4     7       6 2.000000
 8:     1    10     8       7 5.500000
 9:     9    10     9       8 8.500000
10:     4     8    10       9 5.000000

Trying on 2,000,000 rows this takes 4 seconds on my computer and provides same answer as @jay.sf.

n = 2e6
dt <- data.table(value = sample(1000L, n, TRUE), start = sample(n, n, TRUE))
dt[, finish := start + sample(30, n, TRUE)]
dt[finish > n, finish := n]

system.time({
dt[, id := .I]
  dt[dt,
     on = .(id >= start,
            id <= finish),
     .(i.id, i.value, mean_col = mean(x.value)),
     by = .EACHI,
     allow.cartesian = T]
})

##    user  system elapsed 
##   3.78    0.01    3.69 

#jay.sf base approach
system.time({
  FUNV3 <- Vectorize(function(x, y) x:y)
dt$mean.column2 <- with(dt, sapply(FUNV3(start, finish), function(x) mean(value[x])))
})

##   user  system elapsed 
##  24.45    0.04   24.72 

all.equal(dt$mean.column2,   dt[dt,
                                on = .(id >= start,
                                       id <= finish),
                                .(i.id, i.value, mean_col = mean(x.value)),
                                by = .EACHI,
                                allow.cartesian = T]$mean_col)

##[1] TRUE
like image 185
Cole Avatar answered Dec 11 '22 17:12

Cole