grep behavior is odd for NA or "" entries

Tags:

r

I am fairly new to R. I am working with a vector that has empty entries, and I noticed that grep behaves counter-intuitively on my data. I'll work through an example, as I am not 100% sure how to explain it. Say I have two vectors:

A<-c("","","","","","","a")
B<-c(NA,NA,NA,NA,NA,NA,"a")

A is how the data was stored originally, and B is how R reads it. To my understanding, vec[grep("", vec, invert=TRUE)] searches vec for all empty cells, takes their indices, and (because of invert=TRUE) returns only the non-empty entries. However, when I run this for vec=A and vec=B I get:

vec = A:

> A[grep("",A, invert=FALSE)]
[1] ""  ""  ""  ""  ""  ""  "a"
> A[grep("",A, invert=TRUE)]
character(0)

vec = B:

> B[grep("",B, invert=FALSE)]
[1] "a"
> B[grep("",B, invert=TRUE)]
[1] NA NA NA NA NA NA

Since I thought my data was being read as in case B, I was stumped by this result. I realize this could simply be a variable-type issue, but I was hoping someone could shed more light on what is going on.

Quick edit: case A makes sense to me: since grep can't find "" because the variable types are off, it returns everything. Inverted, it returns character(0) as the default for "nothing". I'm still confused by case B.

asked Dec 21 '22 13:12 by stites


1 Answer

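Note that grep performs a regular expression search, not a whole-string comparison.

The regex "" you passed is empty, so grep asks whether each string contains "", not whether the string is exactly "". Every non-NA string contains the empty string, so in case A every element matches and invert=TRUE leaves nothing. An NA is never reported as a match, so in case B only "a" matches and invert=TRUE returns the six NA entries.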

For example,

grepl("a","bananas")

returns TRUE because "a" is in "bananas".
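
As a quick check against the vectors from the question, the sketch below applies the empty pattern to A and B. It only uses grep and grepl from base R; the expected values in the comments follow from the output shown in the question.

```r
A <- c("", "", "", "", "", "", "a")
B <- c(NA, NA, NA, NA, NA, NA, "a")

# The empty pattern is found in every non-NA string, including "" itself
all(grepl("", A))            # TRUE

# NA elements are never reported as matches
grepl("", B)                 # FALSE for the six NAs, TRUE for "a"
grep("", B)                  # 7

# invert = TRUE returns the indices that did not match, which includes the NAs
grep("", B, invert = TRUE)   # 1 2 3 4 5 6
```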

If you want to match the entire string against "", you can use '^' and '$' in your regex ('^' means start of string, '$' means end of string):

grepl("^$", "") # returns TRUE
grepl("^$", "a") # returns FALSE
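
Applied to the question's vectors, the anchored pattern would behave roughly as sketched below. One caveat worth seeing: invert=TRUE still keeps the NA entries of B, because an NA does not match "^$" either.

```r
A <- c("", "", "", "", "", "", "a")
B <- c(NA, NA, NA, NA, NA, NA, "a")

# Indices of the strings that are exactly empty
grep("^$", A)                     # 1 2 3 4 5 6

# Dropping the exactly-empty strings leaves only "a"
A[grep("^$", A, invert = TRUE)]   # "a"

# NA is not an empty string, so nothing in B matches and invert keeps all of B
grep("^$", B)                     # integer(0)
B[grep("^$", B, invert = TRUE)]   # six NAs followed by "a"
```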

However, you're probably better off not using a regex at all if all you want is to drop the empty cells:

A[A != ""] # returns "a"
B[!is.na(B)] # returns "a"
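
If a vector might hold both "" and NA (an assumption; each vector in the question contains only one of the two), the tests can be combined. The vector x below is a hypothetical example, not data from the question.

```r
# Hypothetical vector mixing an empty string, an NA, and real values
x <- c("", NA, "a", "b")

# Comparing NA with "" yields NA, and an NA index yields an NA element
x[x != ""]                # NA "a" "b"

# Checking is.na first removes both kinds of empty entry
x[!is.na(x) & x != ""]    # "a" "b"
```

Putting !is.na(x) first works because FALSE & NA evaluates to FALSE, so the NA positions are dropped rather than propagated.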

answered Jan 21 '23 06:01 by mathematical.coffee