
Subsetting with grep in data.table - unpredictable

Tags:

r

data.table

Why is grep causing problems in the data.table calls below?

library(data.table)

set.seed(45)
dt <- data.table(
    col1 = sample(letters[1:2], 10, replace=TRUE),
    col2 = sample(letters[1:5], 10, replace=TRUE),
    col3 = runif(10, 1, 5))

Subsetting like this works:

dt[col1=="b" & col2=="b",] # Works
    col1 col2     col3
1:    b    b    1.5166

But this either throws a warning and returns wrong data, or returns wrong data with no warning at all:

dt[grep("b", col1) & col2=="b",] # does not
# with seed = 42
> Warning message: In grep("b", col1) & col2 == "b" :   longer object
> length is not a multiple of shorter object length
# with seed = 45
   col1 col2     col3
1:    b    b 1.516600
2:    a    b 3.342007
3:    a    b 1.865772

I can avoid this confusion by chaining the subsets:

dt[grep("b", col1),][col2=="b",]

But that is not very elegant.

P.S. I guess the problem is different from the one here

Andreas asked Feb 08 '23 13:02



1 Answer

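grep returns an integer vector holding the positions of the matches, so its length can be anywhere from 0 to the length of the original vector, depending on how many matches there are. When you combine that with the logical vector col2=="b" using &, the positions are coerced to logical (every nonzero position becomes TRUE) and the shorter vector is recycled. That is why you get the wrong rows, plus a warning whenever the longer length is not a multiple of the shorter one. grepl, by contrast, returns a logical vector that is always the same length as the original vector; if there are no matches, it is simply all FALSE. So the code below should work fine.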

dt[grepl("b", col1) & col2=="b"]
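
To see the difference without depending on the random data, here is a minimal base R sketch. The vectors col1 and col2 are hypothetical toy values (not the data from the question), chosen so that the result can be checked by hand; the names idx, mask, wrong and right are only for illustration.

```r
# Hypothetical toy vectors, not the question's data.
col1 <- c("a", "b", "a", "b")
col2 <- c("b", "b", "a", "a")

idx  <- grep("b", col1)    # integer positions of the matches: 2 4
mask <- grepl("b", col1)   # one logical per element: FALSE TRUE FALSE TRUE

# `&` coerces the positions 2 and 4 to TRUE and recycles them to length 4,
# so the grep() condition ends up TRUE for every row.
wrong <- idx & (col2 == "b")    # TRUE TRUE FALSE FALSE
right <- mask & (col2 == "b")   # FALSE TRUE FALSE FALSE

which(wrong)   # rows 1 and 2: row 1 has col1 == "a" and should not match
which(right)   # row 2 only
```

No warning appears here because 4 is a multiple of 2, which matches the silent wrong result you saw with seed 45. If you would rather keep grep, one option is to combine positions with positions, e.g. intersect(grep("b", col1), which(col2 == "b")).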
akrun answered Feb 16 '23 03:02
