
When subsetting rows with a factor with equal (==), NA's are also included. It doesn't happen with %in%. Is it normal?

Tags: r, na, equals, subset

Suppose I have a factor A with 3 levels A1, A2, A3 and with NA's. Each appears in 10 cases, so there is a total of 40 cases. If I do

subset1 <- df[df$A=="A1",]  
dim(subset1)  # 20, i.e., 10 for A1 and 10 for NA's
summary(subset1$A) # both A1 and NA have non-zero counts
subset2 <- df[df$A %in% c("A1"),] 
dim(subset2)  # 10, as expected
summary(subset2$A) # only A1 has non-zero count

And it is the same whether the class of the variable used for subsetting is factor or integer. Is it just how equal (and >, <) works? So should I just stick to %in% for factors and always include !is.na when using equal? Thanks!
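For anyone who wants to reproduce this, here is a minimal sketch of the setup described above. The original data is not shown, so the data frame `df` and its `id` column are assumptions made up for illustration.

```r
# Hypothetical data: 10 cases each of A1, A2, A3 and NA (40 rows in total)
df <- data.frame(
  id = 1:40,
  A  = factor(rep(c("A1", "A2", "A3", NA), each = 10))
)

# == yields NA where A is missing, and an NA row index returns an all-NA row
subset1 <- df[df$A == "A1", ]
nrow(subset1)          # 20: 10 rows of A1 plus 10 all-NA rows
sum(is.na(subset1$A))  # 10

# %in% yields FALSE where A is missing, so those rows are dropped
subset2 <- df[df$A %in% "A1", ]
nrow(subset2)          # 10
```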

asked Jun 04 '14 14:06 by user3707392


1 Answer

Yes, this is normal. Both == and %in% return logical vectors, but they treat NA differently because of how "%in%" is defined...

# Data...
x <- c("A",NA,"A")

# When NA is encountered NA is returned
# Philosophically correct - who knows if the
# missing value at NA is equal to "A"?!
x=="A"
#[1] TRUE   NA TRUE
x[x=="A"]
#[1] "A" NA  "A"

# When NA is encountered by %in%, FALSE is returned, rather than NA
x %in% "A"
#[1]  TRUE FALSE  TRUE
x[ x %in% "A" ]
#[1] "A" "A"

This is because (from the docs)...

%in% is a thin wrapper around match, and is currently defined as

"%in%" <- function(x, table) match(x, table, nomatch = 0) > 0
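To see where the FALSE comes from, it can help to look at the intermediate match() results. This is a small sketch using the same x as in the example above:

```r
x <- c("A", NA, "A")

# The default nomatch is NA_integer_, so the NA element stays NA
match(x, "A")
#[1]  1 NA  1

# With nomatch = 0, non-matches become 0 instead of NA
match(x, "A", nomatch = 0)
#[1] 1 0 1

# Comparing with 0 turns the positions into TRUE/FALSE with no NA left
match(x, "A", nomatch = 0) > 0
#[1]  TRUE FALSE  TRUE
```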

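If we redefine it using match's default nomatch = NA_integer_, you will see that it treats NA the same way as == does in this example. (Unlike ==, this version would also return NA rather than FALSE for any non-missing value that fails to match.)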

"%in2%" <- function(x,table) match(x, table, nomatch = NA_integer_) > 0
x %in2% "A"
#[1] TRUE   NA TRUE
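On the practical part of the question: ==, < and > return NA for missing values whatever the vector's type, so unwanted NA rows have to be dropped explicitly. Below is a sketch of three common ways to do that. The data frame `df` and its columns are made up for illustration.

```r
# Hypothetical data with NA in both a factor and an integer column
df <- data.frame(
  A = factor(c("A1", NA, "A2", "A1")),
  n = c(1L, 2L, NA, 4L)
)

# 1. Guard the comparison with !is.na(); FALSE & NA is FALSE
r1 <- df[!is.na(df$A) & df$A == "A1", ]

# 2. which() keeps only the positions that are TRUE, dropping NA
r2 <- df[which(df$A == "A1"), ]

# 3. subset() treats NA in the condition as FALSE
r3 <- subset(df, A == "A1")

# The same propagation happens with > on integers
gt <- df$n > 1            # FALSE TRUE NA TRUE
r4 <- df[which(gt), ]     # rows 2 and 4 only
```

All three approaches return only the two A1 rows here. Which one to use is mostly a matter of taste, though subset() is generally recommended for interactive use rather than inside functions.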
answered Nov 19 '22 05:11 by Simon O'Hanlon