Let's say I've got a data.table with 6 columns:
library(data.table)
set.seed(123)
dt <- data.table( id = 1:100,
p1 = sample(1:10, 100, replace = TRUE ),
p2 = sample(1:10, 100, replace = TRUE ),
p3 = sample(1:10, 100, replace = TRUE ),
p4 = sample(1:10, 100, replace = TRUE ),
p5 = sample(1:10, 100, replace = TRUE ) )
Now, I want to subset this data.table on the p1 - pn columns (here: p1-p5). I want to keep all rows where any of the p-columns contains the value 10.
For this small sample data.table, this can be done manually with
test1 <- dt[ p1 == 10 | p2 == 10 | p3 == 10 | p4 == 10 | p5 == 10, ]
But my production data contains dozens of p-columns, so typing them all out manually would be a pain...
My current solution is to first create a vector with the column names I need:
cols <- grep( "^p", names( dt ), value = TRUE )
...and then do the subsetting using apply:
test2 <- dt[ apply( dt[, ..cols ], 1, function(r) any( r == 10 ) ), ]
Check:
identical(test1, test2)
# TRUE
My actual question
The above solution (using apply) is fast enough for me, but I'm not sure it is the optimal solution. I'm pretty new to data.table (compared to some others here on SO), and this is probably not the most efficient/elegant way to achieve the subset I want.
I'm here to learn, so has anyone got a more elegant/better/faster approach to my subsetting question?
The question has been marked as a duplicate... but I'll still post my findings here:
I found the answer from @Marcus to be the most readable code, and the answer from @akrun to be the fastest.
Benchmarking
data.table with 1,000,000 rows and 100 columns, 50 of which are of interest (i.e. the p-columns)
#create sample data
set.seed( 123 )
n <- 1000000
k <- 100
dat <- sample( 1:100, n * k, replace = TRUE )
DT <- as.data.table( matrix( data = dat, nrow = n, ncol = k ) )
setnames( DT, names( DT ), c( paste0( "p", 1:50 ), paste0( "r", 1:50 ) ) )
#vector with columns starting with "p"
cols <- grep( "^p", names( DT ), value = TRUE )
apply_method <- DT[ apply( DT[, ..cols ], 1, function(x) any( x == 10 ) ), ]
reduce_method <- DT[ DT[, Reduce(`|`, lapply(.SD, `==`, 10)), .SDcols = cols]]
rowsums_method <- DT[ rowSums( DT[ , ..cols ] == 10, na.rm = TRUE ) >= 1 ]
identical( apply_method, reduce_method )
identical( apply_method, rowsums_method )
microbenchmark::microbenchmark(
apply = DT[ apply( DT[ , ..cols ], 1, function(x) any( x == 10 ) ), ],
reduce = DT[ DT[, Reduce( `|`, lapply( .SD, `==`, 10 ) ), .SDcols = cols ] ],
rowSums = DT[ rowSums( DT[ , ..cols ] == 10, na.rm = TRUE ) >= 1, ],
times = 10
)
# expr min lq mean median uq max neval
# apply 3352.0640 3441.7760 3665.5004 3662.7666 3760.7553 4325.9125 10
# reduce 408.6349 437.6806 552.8850 572.2012 657.6072 710.7699 10
# rowSums 619.2594 663.7325 784.2389 850.0963 868.2096 892.7469 10
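The gap in the timings is mostly because apply coerces the selected columns to a matrix and then calls the anonymous function once per row, while the Reduce and rowSums approaches operate on whole columns at a time with vectorised comparisons. A minimal sketch of the difference on a toy table (column names here are illustrative only):

```r
library(data.table)

dt_small <- data.table(p1 = c(1, 10, 3), p2 = c(4, 5, 10))

# row-wise: one R function call per row, after coercion to a matrix
row_wise <- apply(dt_small, 1, function(r) any(r == 10))

# column-wise: one vectorised comparison per column, combined with `|`
col_wise <- Reduce(`|`, lapply(dt_small, `==`, 10))

identical(row_wise, col_wise)  # both are c(FALSE, TRUE, TRUE)
```

For a million rows, the row-wise version pays the price of a million function calls; the column-wise version pays for only as many vectorised operations as there are columns.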
One option is to specify the 'cols' of interest in .SDcols, loop through the Subset of Data.table (.SD), generate a list of logical vectors, Reduce it to a single logical vector with `|`, and use that to subset the rows:
i1 <- dt[, Reduce(`|`, lapply(.SD, `==`, 10)), .SDcols = cols]
test2 <- dt[i1]
identical(test1, test2)
#[1] TRUE
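If this comes up often, the Reduce idiom can be wrapped in a small helper. This is only a convenience sketch: the function name subset_any and its arguments are my own, not part of data.table. (Recent data.table versions also accept patterns("^p") directly in .SDcols, which avoids the grep step.)

```r
library(data.table)

# Hypothetical helper: keep rows where any column whose name matches
# `pattern` equals `value`
subset_any <- function(DT, pattern, value) {
  cols <- grep(pattern, names(DT), value = TRUE)
  i <- DT[, Reduce(`|`, lapply(.SD, `==`, value)), .SDcols = cols]
  DT[i]
}

set.seed(123)
dt <- data.table(id = 1:100,
                 p1 = sample(1:10, 100, replace = TRUE),
                 p2 = sample(1:10, 100, replace = TRUE))

out <- subset_any(dt, "^p", 10)
all(out$p1 == 10 | out$p2 == 10)  # every kept row has a 10 somewhere
```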