Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Efficient way to subset data.table based on value in any of selected columns [duplicate]

Tags:

r

data.table

Let's say I've got a data.table with 6 columns

library(data.table)
set.seed(123)
dt <- data.table( id = 1:100,
                  p1 = sample(1:10, 100, replace = TRUE ),
                  p2 = sample(1:10, 100, replace = TRUE ),
                  p3 = sample(1:10, 100, replace = TRUE ),
                  p4 = sample(1:10, 100, replace = TRUE ),
                  p5 = sample(1:10, 100, replace = TRUE ) )

Now, I want to subset this data.table, on the p1 - pn columns (here: p1-p5). I want to keep all rows where any of the p-columns contains the value of 10.

For this small sample data.table, this can be done manually with

test1 <- dt[ p1 == 10 | p2 == 10 | p3 == 10 | p4 == 10 | p5 == 10, ]

But my production data contains dozens of p-columns, so typing them all out manually would be a pain...

My current solution is to first create a vector with the column-names i need:

cols <- grep( "^p", names( dt ), value = TRUE )

...and then do the subsetting using apply:

test2 <- dt[ apply( dt[, ..cols ], 1, function(r) any( r == 10 ) ), ]

Check:

identical(test1, test2)
# TRUE

my actual question

The above solution (using apply) is fast enough for me.. But I'm not sure it is the optimal solution. I'm pretty new to data.table (compared to some others here on SO), and this is (probably?) not the most efficient/effective/elegant way to achieve the subset I want.

I'm here to learn, so has anyone got a more elegant/better/faster approach to my subsetting question?

update

The question had been marked duplicate... But I'll still post my answers here:

I found the answer from @Marcus to be the best (=readable) code, and the answer from @akrun to be the fastest.

benchmarking

data.table with 1,000,000 rows and 50 columns of interest (i.e. p-columns)

#create sample data
set.seed( 123 )
n   <- 1000000
k   <- 100
dat <- sample( 1:100, n * k, replace = TRUE )
DT  <- as.data.table( matrix( data = dat, nrow = n, ncol = k ) )
setnames( DT, names( DT ), c( paste0( "p", 1:50 ), paste( "r", 1:50 ) ) )

#vector with columns starting with "p"
cols <- grep( "^p", names( DT ), value = TRUE )

apply_method   <- DT[ apply( DT[, ..cols ], 1, function(x) any( x == 10 ) ), ]
reduce_method  <- DT[ DT[, Reduce(`|`, lapply(.SD, `==`, 10)), .SDcols = cols]]
rowsums_method <- DT[ rowSums( DT[ , ..cols ] == 10, na.rm = TRUE ) >= 1 ]

identical(  apply_method, rowsums_method )

microbenchmark::microbenchmark(
  apply   = DT[ apply( DT[ , ..cols ], 1, function(x) any( x == 10 ) ), ],
  reduce  = DT[ DT[, Reduce( `|`, lapply( .SD, `==`, 10 ) ), .SDcols = cols ] ],
  rowSums = DT[ rowSums( DT[ , ..cols ] == 10, na.rm = TRUE ) >= 1, ],
  times = 10
)

#    expr       min        lq      mean    median        uq       max neval
#   apply 3352.0640 3441.7760 3665.5004 3662.7666 3760.7553 4325.9125    10
#  reduce  408.6349  437.6806  552.8850  572.2012  657.6072  710.7699    10
# rowSums  619.2594  663.7325  784.2389  850.0963  868.2096  892.7469    10
like image 744
Wimpel Avatar asked Feb 28 '19 18:02

Wimpel


People also ask

What is setkey R?

Description. setkey sorts a data. table and marks it as sorted with an attribute sorted . The sorted columns are the key. The key can be any number of columns.


1 Answers

One option is to specify the 'cols' of interest in .SDcols, loop through the Subset of Data.table (.SD), generate a list of logical vectors, Reduce it to single logical vector with (|) and use that to subset the rows

i1 <- dt[, Reduce(`|`, lapply(.SD, `==`, 10)), .SDcols = cols]
test2 <- dt[i1]
identical(test1, test2)
#[1] TRUE
like image 185
akrun Avatar answered Sep 28 '22 08:09

akrun