
Subset rows in data.table if all specified columns match a criterion

Tags: r, data.table

I have a data.table 'a' and a vector of column names 'cols':

a <- data.table(n = c("case1", "case2", "case3"), x = c(0,2,5), y = c(1,1,4), z = c(1,1,0))
cols <- c("x", "y", "z")
a
#        n x y z
# 1: case1 0 1 1
# 2: case2 2 1 1
# 3: case3 5 4 0

I want to select the rows of a where all values in the columns whose names are stored in cols are above 0.

Desired result:

#        n x y z
# 2: case2 2 1 1

I used apply in combination with all(), but I think there is a much faster way to do this with data.table. My original data is of course much larger, and cols contains up to 80 column names. Thanks for your help!


Benchmarks

Thank you for your answers! All of them work but obviously with different performance. Please check the comments of the accepted answer for a benchmark. The fastest way to do this is, indeed:

a[ a[, do.call(pmin, .SD) > 0, .SDcols = cols] ]

I also replicated the benchmarks for the different solutions using the rbenchmark package on my original dataset with slightly different parameters (880,000 rows, 64 columns of which 62 are selected, threshold 10 instead of 0) and can confirm the speed ranking of the different solutions (10 replications each):

z[z[, !Reduce(`+`, lapply(.SD, `<`, 11)),.SDcols = col.names]]: 3.32 sec

z[apply(z[, col.names, with = FALSE], 1, function(x) all(x > 10))]: 37.41 sec

z[ z[, do.call(pmin, .SD) > 10, .SDcols = col.names] ]: 2.03 sec

z[rowSums(z[, lapply(.SD, `<`, 11), .SDcols = col.names]) == 0]: 4.84 sec
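
Before comparing timings it is worth confirming that the four expressions select the same rows. The following is only a sketch on made-up data: the table size, the random values and the names r1 to r4 are assumptions for illustration, not the original dataset. Note that `< 11` and `> 10` are interchangeable only because the values here are integers.

```r
library(data.table)

# Synthetic integer data (an assumption; the original dataset is not available)
set.seed(1)
n_rows <- 1e4
n_cols <- 10
z <- as.data.table(matrix(sample(0:50, n_rows * n_cols, replace = TRUE),
                          nrow = n_rows))
col.names <- names(z)

# Logical row filters produced by the four approaches from the benchmark
r1 <- z[, !Reduce(`+`, lapply(.SD, `<`, 11)), .SDcols = col.names]
r2 <- apply(z[, col.names, with = FALSE], 1, function(x) all(x > 10))
r3 <- z[, do.call(pmin, .SD) > 10, .SDcols = col.names]
r4 <- rowSums(z[, lapply(.SD, `<`, 11), .SDcols = col.names]) == 0

# All four should flag exactly the same rows
stopifnot(all(r1 == r2), all(r1 == r3), all(r1 == r4))

# Timing a single approach with base R; results depend on machine and data
system.time(z[r3])
```

For repeated measurements, each expression can be wrapped in rbenchmark::benchmark() as was done for the numbers above.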

swolf asked Sep 30 '16 10:09


1 Answer

We can use Reduce with .SDcols. Specify the columns of interest in .SDcols, loop through the Subset of Data.table (.SD) and check whether each value is less than or equal to 0, add up the resulting logical vectors with Reduce to count the non-positive values in each row, and negate (!) that count to get a logical vector that is TRUE only for rows with no non-positive values. Use that vector to subset the rows of 'a':

a[a[, !Reduce(`+`, lapply(.SD, `<=`, 0)),.SDcols = cols]]
#       n x y z
#1: case2 2 1 1
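
To make the one-liner easier to follow, here is the same logic unrolled into separate steps on the example data. The intermediate names flags, n_bad and keep are introduced only for illustration.

```r
library(data.table)

a <- data.table(n = c("case1", "case2", "case3"),
                x = c(0, 2, 5), y = c(1, 1, 4), z = c(1, 1, 0))
cols <- c("x", "y", "z")

# Step 1: one logical vector per column, TRUE where the value is <= 0
flags <- lapply(a[, cols, with = FALSE], `<=`, 0)

# Step 2: adding the logical vectors counts the non-positive values per row
n_bad <- Reduce(`+`, flags)
n_bad
# [1] 1 0 1

# Step 3: ! turns a count of 0 into TRUE and any other count into FALSE
keep <- !n_bad

a[keep]
#        n x y z
# 1: case2 2 1 1
```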

Or as @Frank mentioned in the comments, pmin can be used as well

a[a[, do.call(pmin, .SD), .SDcols = cols]>0]
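
The pmin version works because the row-wise minimum of the selected columns is above the threshold exactly when every selected value is. As a rough sketch, it could be wrapped in a small helper; the function name rows_all_above and its threshold argument are hypothetical and not part of data.table.

```r
library(data.table)

# Hypothetical helper: keep the rows of dt where every column in cols
# is strictly greater than threshold
rows_all_above <- function(dt, cols, threshold = 0) {
  # pmin() across the selected columns gives the row-wise minimum
  keep <- dt[, do.call(pmin, .SD) > threshold, .SDcols = cols]
  dt[keep]
}

a <- data.table(n = c("case1", "case2", "case3"),
                x = c(0, 2, 5), y = c(1, 1, 4), z = c(1, 1, 0))
cols <- c("x", "y", "z")

rows_all_above(a, cols)
#        n x y z
# 1: case2 2 1 1
```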
akrun answered Oct 24 '22 08:10