Efficient way to subset data.table based on value in any of selected columns [duplicate]

The above solution (using apply) is fast enough for me, but I'm not sure it is optimal. I'm fairly new to data.table (compared to some others here on SO), and this is probably not the most efficient or elegant way to get the subset I want.

I'm here to learn, so does anyone have a more elegant or faster approach to my subsetting question?

update

The question has been marked as a duplicate, but I'll still post my findings here:

I found the answer from @Marcus to be the most readable code, and the answer from @akrun to be the fastest.

benchmarking

data.table with 1,000,000 rows and 50 columns of interest (i.e. p-columns)

#create sample data
library( data.table )
set.seed( 123 )
n   <- 1000000
k   <- 100
dat <- sample( 1:100, n * k, replace = TRUE )
DT  <- as.data.table( matrix( data = dat, nrow = n, ncol = k ) )
#name the first 50 columns p1..p50 and the last 50 columns r1..r50
setnames( DT, names( DT ), c( paste0( "p", 1:50 ), paste0( "r", 1:50 ) ) )

#vector with columns starting with "p"
cols <- grep( "^p", names( DT ), value = TRUE )

apply_method   <- DT[ apply( DT[, ..cols ], 1, function(x) any( x == 10 ) ), ]
reduce_method  <- DT[ DT[, Reduce(`|`, lapply(.SD, `==`, 10)), .SDcols = cols]]
rowsums_method <- DT[ rowSums( DT[ , ..cols ] == 10, na.rm = TRUE ) >= 1 ]

identical(  apply_method, rowsums_method )

microbenchmark::microbenchmark(
  apply   = DT[ apply( DT[ , ..cols ], 1, function(x) any( x == 10 ) ), ],
  reduce  = DT[ DT[, Reduce( `|`, lapply( .SD, `==`, 10 ) ), .SDcols = cols ] ],
  rowSums = DT[ rowSums( DT[ , ..cols ] == 10, na.rm = TRUE ) >= 1, ],
  times = 10
)

#    expr       min        lq      mean    median        uq       max neval
#   apply 3352.0640 3441.7760 3665.5004 3662.7666 3760.7553 4325.9125    10
#  reduce  408.6349  437.6806  552.8850  572.2012  657.6072  710.7699    10
# rowSums  619.2594  663.7325  784.2389  850.0963  868.2096  892.7469    10
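
The benchmark only compares timings, and the identical() call above checks just two of the three methods. As a rough sanity check, the sketch below compares the row-selection vectors of all three approaches on a much smaller table. The table `small`, its size, and the `idx_*` names are made up for illustration and are not part of the original benchmark.

```r
library(data.table)

# Small hypothetical table: 200 rows, 3 "p" columns and 3 "r" columns
set.seed(1)
small <- as.data.table(matrix(sample(1:20, 200 * 6, replace = TRUE),
                              nrow = 200, ncol = 6))
setnames(small, c(paste0("p", 1:3), paste0("r", 1:3)))
cols <- grep("^p", names(small), value = TRUE)

# Logical row index produced by each method
idx_apply   <- apply(small[, ..cols], 1, function(x) any(x == 10))
idx_reduce  <- small[, Reduce(`|`, lapply(.SD, `==`, 10)), .SDcols = cols]
idx_rowsums <- rowSums(small[, ..cols] == 10, na.rm = TRUE) >= 1

# All three should flag exactly the same rows
stopifnot(
  identical(unname(idx_apply), idx_reduce),
  identical(unname(idx_rowsums), idx_reduce)
)
```

Comparing the logical vectors rather than the subsetted tables keeps the check independent of any attribute differences between the resulting data.tables.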

asked Feb 28 '19 18:02

Wimpel

1 Answer

One option is to specify the columns of interest in .SDcols, loop over the Subset of Data.table (.SD) to generate a list of logical vectors, Reduce it to a single logical vector with |, and use that vector to subset the rows.

i1 <- dt[, Reduce(`|`, lapply(.SD, `==`, 10)), .SDcols = cols]
test2 <- dt[i1]
identical(test1, test2)
#[1] TRUE
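
To make the steps concrete, here is a minimal sketch on a toy table. The table `toy` and its values are hypothetical and chosen only so the result is easy to follow by eye; the intermediate object `hits` is added purely for illustration.

```r
library(data.table)

# Toy data (hypothetical): two "p" columns and one other column
toy  <- data.table(p1 = c(10, 1, 2), p2 = c(3, 10, 4), r1 = c(10, 10, 10))
cols <- grep("^p", names(toy), value = TRUE)

# Step 1: one logical vector per selected column
hits <- toy[, lapply(.SD, `==`, 10), .SDcols = cols]

# Step 2: combine the per-column vectors element-wise with OR
i1 <- toy[, Reduce(`|`, lapply(.SD, `==`, 10)), .SDcols = cols]
# i1 is TRUE TRUE FALSE: row 1 matches in p1, row 2 matches in p2

# Step 3: use the logical vector to subset the rows
res <- toy[i1]
```

Note that r1 equals 10 in every row, but it is not listed in .SDcols, so it does not affect which rows are kept.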

answered Sep 28 '22 08:09

akrun
