Let's say I have a data.table and I want to select all the rows where the variable x has a value of b. That is easy
library(data.table)
DT <- data.table(x = rep(c("a","b","c"), each = 3), y = c(1,3,6), v = 1:9)
setkey(DT, x)   # set a 1-column key
DT["b"]
By the way, it appears that one has to set a key: if the key is not set to x, then this does not work. And what would happen if I set two columns as keys?
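(To make that last question concrete, I assume a two-column key would look something like the following; this is an untested sketch reusing my toy DT from above.)

setkey(DT, x, y)    # set a 2-column key instead
DT[J("b", 3)]       # presumably matches rows where x == "b" AND y == 3
DT["b"]             # I believe a single value still joins on the first key column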
Anyway, moving along, let's say that I want to select all the rows where the variable x is a or b.
DT["b"|"a"]
does not work
But the following works
DT[x=="a"|x=="b"]
But that uses vector scanning, à la data.frame; it does not use binary search. I guess for smaller data sets it will not matter.
Is that what I should do or am I ignorant of data.table syntax?
And one more thing. Are there any examples of more complex Boolean multi-variable selection (or subset) procedures with data.table?
I know I could always revert to using the subset() function since a data.table will behave as a data.frame if it must.
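To show the sort of thing I mean, here is a made-up example of a multi-variable selection on my toy DT (both forms are just illustrations; the cut-off v > 5 is arbitrary):

# combine conditions on two columns, data.frame style (vector scan)
DT[x %in% c("a", "b") & v > 5]

# the same selection via base R's subset(), treating DT as a data.frame
subset(DT, x %in% c("a", "b") & v > 5)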
Here is a way that only crossed my mind after I asked the question. It works, but I do not know how it performs in benchmarks; I am not currently at a computer with R installed (I guess I should use a cloud instance). Anyway, I like the syntax:
DT[c("a","b")]
Using the %in% operator seems to give a factor of 2 performance bump. Consider:
library(data.table)
library(rbenchmark)

DT <- data.table(x = sample(letters, 1e6, TRUE), y = rnorm(1e6), v = runif(1e6))
setkey(DT, x)   # set a 1-column key
DT["b"]

f1 <- function() DT[x %in% letters[1:2]]
f2 <- function() DT[x == "a" | x == "b"]

> benchmark(f1(), f2())
  test replications elapsed relative user.self sys.self user.child sys.child
1 f1()          100    8.40 1.000000      7.58     0.81         NA        NA
2 f2()          100   17.11 2.036905     15.54     1.56         NA        NA

> all.equal(f1(), f2())
[1] TRUE
EDIT: Adding Farrel's option
Note, this is on a different computer, but the relative bumps are the same.
f3 <- function() DT[c("a", "b")]

  test replications elapsed  relative user.self sys.self user.child sys.child
1 f1()          100  11.281  7.121843     9.745    1.323          0         0
2 f2()          100  23.106 14.587121    20.824    2.224          0         0
3 f3()          100   1.584  1.000000     1.042    0.541          0         0