Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

filter rows by a function over values of each row, data.table

Tags:

r

data.table

Switch from data.frame syntax to data.table syntax is still not smooth for me. I thought the following thing should be trivial, but no. What I am doing wrong here:

> DT = data.table(x=rep(c("a","b","c"),each=3), y=c(1,3,6), v=1:9)
> DT
   x y v
1: a 1 1
2: a 3 2
3: a 6 3
4: b 1 4
5: b 3 5
6: b 6 6
7: c 1 7
8: c 3 8
9: c 6 9

I want something like this:

cols = c("y", "v") # a vector of column names or indexes
DT[rowSums(cols) > 5] # Take only rows where
# values at colums y and v satisfy a condition. 'rowSums' here is just an
# example it can be any function that return TRUE or FALSE when applied 
# to values of the row. 

This work, but what if I want to provide dynamic column names? and my tables have many columns?

>DT[eval( quote(y + v > 5))] #and the following command gives the same result
> DT[y + v > 5]
   x y v
1: a 6 3
2: b 3 5
3: b 6 6
4: c 1 7
5: c 3 8
6: c 6 9
> DT[lapply(.SD, sum) > 5, .SDcols = 2:3] # Want the same result as above
Empty data.table (0 rows) of 3 cols: x,y,v
> DT[lapply(.SD, sum) > 5, ,.SDcols = 2:3]
Empty data.table (0 rows) of 3 cols: x,y,v
> DT[lapply(.SD, sum) > 5, , .SDcols = c("y", "v")]
Empty data.table (0 rows) of 3 cols: x,y,v

Update after the answers Since it turns out there are many ways of doing this thing, I want to see which want is the best performer. Following is the simulated timing code:

nr = 1e7
DT = data.table(x=sample(c("a","b","c"),nr, replace= T),
                y=sample(2:5, nr, replace = T), v=sample(1:9, nr, T))
threshold = 5
cols = c("y", "v")
col.ids = 2:3
filter.methods = 'DT[DT[, rowSums(.SD[, cols, with = F]) > threshold]]
DT[DT[, rowSums(.SD[, col.ids, with = F]) > threshold]]
DT[DT[, rowSums(.SD) > threshold, .SDcols = cols]]
DT[DT[, rowSums(.SD) > threshold, .SDcols = c("y", "v")]]
DT[DT[, rowSums(.SD) > threshold, .SDcols = col.ids]]
DT[ ,.SD[rowSums(.SD[, col.ids, with = F]) > threshold]]
DT[ ,.SD[rowSums(.SD[, cols, with = F]) > threshold]]
DT[, .SD[rowSums(.SD) > threshold], .SDcols = cols, by = x]
DT[, .SD[rowSums(.SD) > threshold], .SDcols = col.ids, by = x]
DT[, .SD[rowSums(.SD) > threshold], .SDcols = c("y", "v"), by = x]
DT[Reduce(`+`,eval(cols))>threshold]
DT[Reduce(`+`, mget(cols)) > threshold]
'
fm <- strsplit(filter.methods, "\n")
fm <- unlist(fm)
timing = data.frame()
rn = NULL
for (e in sample(fm, length(fm))) { 
  # Seen some weird pattern with first item in 'fm', so scramble it
  rn <- c(rn, e)
  if (e == "DT[Reduce(`+`,eval(cols))>threshold]") {
    cols = quote(list(y, v))
  } else {
    cols = c("y", "v")
  }
  tm <- system.time(eval(parse(text = e)))
  timing <- rbind(timing, 
                  data.frame(
                    as.list(tm[c("user.self", "sys.self", "elapsed")])
                    )
                  )
}
rownames(timing) <- rn
timing[order(timing$elapsed),]

### OUTPUT ####
#                                                                     user.self sys.self elapsed
# DT[Reduce(`+`,eval(cols))>threshold]                                   0.416    0.168   0.581
# DT[Reduce(`+`, mget(cols)) > threshold]                                0.412    0.172   0.582
# DT[DT[, rowSums(.SD) > threshold, .SDcols = cols]]                     0.572    0.316   0.889
# DT[DT[, rowSums(.SD) > threshold, .SDcols = col.ids]]                  0.568    0.320   0.889
# DT[DT[, rowSums(.SD) > threshold, .SDcols = c("y", "v")]]              0.576    0.316   0.890
# DT[ ,.SD[rowSums(.SD[, col.ids, with = F]) > threshold]]               0.648    0.404   1.052
# DT[DT[, rowSums(.SD[, cols, with = F]) > threshold]]                   0.688    0.368   1.052
# DT[DT[, rowSums(.SD[, col.ids, with = F]) > threshold]]                0.612    0.440   1.053
# DT[ ,.SD[rowSums(.SD[, cols, with = F]) > threshold]]                  0.692    0.368   1.058
# DT[, .SD[rowSums(.SD) > threshold], .SDcols = c("y", "v"), by = x]     0.800    0.448   1.248
# DT[, .SD[rowSums(.SD) > threshold], .SDcols = col.ids, by = x]         0.836    0.412   1.248
# DT[, .SD[rowSums(.SD) > threshold], .SDcols = cols, by = x]            0.836    0.416   1.249

So the Champion in term of speed is:

DT[Reduce(`+`,eval(cols))>threshold]
DT[Reduce(`+`, mget(cols)) > threshold]

I prefer the one my mget. And I think the reason that other are slower because they call rowSums, whereas Reduce only helps to form the expression. Sincere thanks to all that have given the answers. It's hard to decide for me to pick the 'accept' answer. Reduce-based is very specific to this sum operation, while rowSums-based is an example of using arbitrary function.

like image 243
biocyberman Avatar asked Aug 12 '14 11:08

biocyberman


1 Answers

cols = c("y", "v")

Try

DT[DT[, rowSums(.SD[, ..cols]) > 5]]

Or

DT[DT[, rowSums(.SD[, 2:3]) > 5]]

Or

DT[DT[, rowSums(.SD) > 5, .SDcols = cols]]

Or

DT[DT[, rowSums(.SD) > 5, .SDcols = c("y", "v")]]

Or

DT[DT[, rowSums(.SD) > 5, .SDcols = 2:3]]

Or

DT[ ,.SD[rowSums(.SD[, 2:3]) > 5]]

Or

DT[ ,.SD[rowSums(.SD[, ..cols]) > 5]]

Or

DT[, .SD[rowSums(.SD) > 5], .SDcols = cols, by = x]

Or

DT[, .SD[rowSums(.SD) > 5], .SDcols = 2:3, by = x]

Or

DT[, .SD[rowSums(.SD) > 5], .SDcols = c("y", "v"), by = x]

Each will result in

#    x y v
# 1: a 6 3
# 2: b 3 5
# 3: b 6 6
# 4: c 1 7
# 5: c 3 8
# 6: c 6 9

Some explanations:

  1. .SD is also a data.table object which can operate within DT scope; ..cols looks in the calling frames above the [] where it's used for the cols object to select columns. Thus, this line DT[ ,rowSums(.SD[, ..cols]) > 5] will return a logical vector indicating in which cases DT has y + v > 5. So we will add another DT in order to select this indices within DT

  2. When you use .SDcols, it will narrow .SD only to these columns. Thus if you only do something like DT[, .SD[rowSums(.SD) > 5], .SDcols = 2:3], you will lose the x column, thus the by = x was added.

  3. Another option when using .SDcols is to return a logical vector and then imbed it in another DT

like image 57
David Arenburg Avatar answered Sep 22 '22 11:09

David Arenburg