
NA matches NA, but is not equal to NA. Why?

Tags: r

In the R Language Definition, NA values are briefly described, a portion of which says

... In particular, FALSE & NA is FALSE, TRUE | NA is TRUE. NA is not equal to any other value or to itself; testing for NA is done using is.na. However, an NA value will match another NA value in match.

Regarding the statement "NA is not equal to any other value or to itself",


Updated: The question, revised again, is

What is the reasoning, if any, behind NA matching NA in match, and nowhere else in the language?

It doesn't make sense to me that a missing value, unknown by anyone (or it would not be missing), would match another missing value of the same type. Since I posted this, I came across something in example(match) that provides some reasoning. Coercing NA to character with paste() changes it into the ordinary string "NA", which is no longer a missing value; I can even erase it completely with gsub() if I like.

match(NA, NA)
# [1] 1
match(NA, NA_real_)
# [1] 1
match(NA_character_, NA_real_)
# [1] 1
match(paste(NA), NA)
# [1] NA
gsub("NA", "", NA)
# [1] NA
gsub("NA", "", paste(NA))
# [1] ""
is.na(NA)
# [1] TRUE
is.na(paste(NA))
# [1] FALSE

Apologies for stirring the pot, but some of the documentation is unclear about this. It might boil down to the R parser/deparser and the fact that you can turn anything into a text character object in R.


Original Post:

Now referring to "However, an NA value will match another NA value in match."

If NA is not equal to itself, why is it matched with itself in match, and also in identical? Is this done on purpose?

NA == NA  ## expecting TRUE
# [1] NA
NA != NA
# [1] NA
x <- NA
x == x
# [1] NA
match(NA, NA)
# [1] 1
identical(NA, NA)
# [1] TRUE
all.equal(NA, NA)
# [1] TRUE
Rich Scriven asked Aug 03 '14 02:08


2 Answers

It's a matter of convention. There are good reasons for the way == works. NA is a special value in R that is supposed to represent data that is missing and should be treated differently from the rest of data. There are innumerable very subtle bugs that could come up if we started comparing missing values as if they were known or as if two missing values were equal to each other.

Think of NA as meaning "I don't know what's there". The correct answer to 3 > NA is obviously NA because we don't know if the missing value is larger than 3 or not. Well, it's the same for NA == NA. They are both missing values but the true values could be quite different, so the correct answer is "I don't know."
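A minimal sketch of the "I don't know" behaviour described above, using only base R. The vector `x` and the result names are made up for illustration.

```r
x <- c(1, NA, 3)  # hypothetical data with one missing value

# Comparisons against an unknown value are themselves unknown.
cmp_gt <- 3 > NA          # NA
cmp_eq <- NA == NA        # NA

# == propagates NA element by element, so it cannot be used to find NAs;
# is.na() is the intended test.
eq_na      <- x == NA     # NA NA NA
is_missing <- is.na(x)    # FALSE TRUE FALSE

# Logical operators give a definite answer only when the unknown
# operand cannot change the result.
and_res <- FALSE & NA     # FALSE
or_res  <- TRUE | NA      # TRUE
```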

R doesn't know what you are doing in your analysis, so instead of potentially introducing bugs that would later end up being published and embarrassing you, it doesn't allow comparison operators to think NA is a value.

match() was written with a more specific purpose in mind: finding the indexes of matching values. If you ask the question "Should I match 3 with NA", a reasonable answer is "no." Different (and very useful) convention, and justified because R pretty much knows what you are trying to do when you invoke match(). Now, should we match NA with NA for this purpose? It could be argued.

Come to think of it, I suppose it is a little odd that the authors of match() chose to allow NA to match itself by default. You can imagine cases where you might use match() to find NA rows in table along with other values, but it's dangerous. You just have to be a bit more careful about knowing whether you have any NA values in x, and only permit them to match if that is really what you want. You can change this behavior by specifying incomparables=NA when calling match().
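As a small illustration of the incomparables argument, the sketch below compares the default behaviour with incomparables = NA. The vectors `x` and `lookup` are hypothetical.

```r
x      <- c(1, NA, 3)   # hypothetical values to look up
lookup <- c(NA, 3)      # hypothetical table containing an NA

# Default: the NA in x matches the NA at position 1 of lookup.
default_idx <- match(x, lookup)                        # NA 1 2

# With incomparables = NA, NA is never reported as a match.
strict_idx  <- match(x, lookup, incomparables = NA)    # NA NA 2
```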

farnsy answered Oct 22 '22 16:10


To add to @farnsy's great answer, and to elaborate on the difference between == and match:

The key thing to consider is how these two functions (== and match) are used.

x == y
translation:  Is the value on the left the same value as the one on the right

match(x, table)
translation:  Is the value on the left found in the table on the right; 
              if so, return the index of the FIRST TIME that x appears in table
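The two "translations" above can be seen side by side in a short sketch; the character vectors are invented for the example.

```r
# == compares position by position and returns one logical per element.
elementwise <- c("a", "b", "c") == c("a", "x", "c")   # TRUE FALSE TRUE

# match() looks each value of x up in table and returns the index of its
# first occurrence, or NA when the value is not found.
first_idx <- match(c("b", "z"), c("a", "b", "b"))     # 2 NA
```

Note that "b" appears twice in the table, but only the first position (2) is returned.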

A common use case I often encounter is working with a set of IDs. In particular, when dealing with two different datasets that have been joined, I might be left with several NAs in one of my ID columns.

However, not all NAs represent the same real-life object, so treating them as matching one another can pair up unrelated records.
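A hedged sketch of the ID scenario: %in% is built on match(), so a missing ID on one side is reported as present whenever the other side also contains an NA. The ID vectors here are hypothetical, and guarding with is.na() is just one possible way to avoid the problem.

```r
left_ids  <- c(101, NA, 103, NA)   # hypothetical IDs from one dataset
right_ids <- c(103, NA, 104)       # hypothetical IDs from another

# Naive membership test: both missing IDs count as "found".
naive_hit <- left_ids %in% right_ids                        # FALSE TRUE TRUE TRUE

# Guarded test: a missing ID is never treated as a known match.
safe_hit <- !is.na(left_ids) & left_ids %in% right_ids      # FALSE FALSE TRUE FALSE
```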

Ricardo Saporta answered Oct 22 '22 16:10