Consider the following code. When you don't explicitly test for NA in your condition, that code will fail at some later date when your data changes.
> # A toy example
> a <- as.data.frame(cbind(col1=c(1,2,3,4), col2=c(2,NA,2,3), col3=c(1,2,3,4), col4=c(4,3,2,1)))
> a
  col1 col2 col3 col4
1    1    2    1    4
2    2   NA    2    3
3    3    2    3    2
4    4    3    4    1
>
> # Bummer, there's an NA in my condition
> a$col2==2
[1]  TRUE    NA  TRUE FALSE
>
> # Why is this a good thing to do?
> # It NA'd the whole row, and kept it
> a[a$col2==2,]
   col1 col2 col3 col4
1     1    2    1    4
NA   NA   NA   NA   NA
3     3    2    3    2
>
> # Yes, this is the right way to do it
> a[!is.na(a$col2) & a$col2==2,]
  col1 col2 col3 col4
1    1    2    1    4
3    3    2    3    2
>
> # Subset seems designed to avoid this problem
> subset(a, col2 == 2)
  col1 col2 col3 col4
1    1    2    1    4
3    3    2    3    2
Can someone explain why the behavior you get without the is.na
check would ever be good or useful?
I definitely agree that this isn't intuitive (I made that point before on SO). In defense of R, I think that knowing when you have a missing value is useful (i.e. this is not a bug). The == operator is explicitly designed to surface NA and NaN values rather than silently drop them. See ?"==" for more information. It states:
Missing values ('NA') and 'NaN' values are regarded as non-comparable even to themselves, so comparisons involving them will always result in 'NA'.
In other words, a missing value isn't comparable using a binary operator (because it's unknown).
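For example, at the console every comparison that touches a missing value comes back NA, because the answer is genuinely unknown:

NA == NA       # [1] NA -- can't say two unknowns are equal
NA > 1         # [1] NA
c(2, NA) == 2  # [1] TRUE NA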
Beyond is.na(), you could also do:
which(a$col2 == 2)  # tests explicitly for TRUE
Or:
a$col2 %in% 2  # only checks for 2
%in% is defined using the match() function:
"%in%" <- function(x, table) match(x, table, nomatch = 0) > 0
This is also covered in "The R Inferno".
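For example, applied to the toy data frame above, both alternatives drop the NA row without an explicit is.na() test:

a[which(a$col2 == 2), ]  # which() keeps only the indices where the test is TRUE
#   col1 col2 col3 col4
# 1    1    2    1    4
# 3    3    2    3    2

a$col2 %in% 2            # match() uses nomatch = 0, so the NA becomes FALSE
# [1]  TRUE FALSE  TRUE FALSE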
Checking for NA values in your data is crucial in R, because many important operators don't handle them the way you might expect. Beyond ==, this is also true for things like &, |, <, sum(), and so on. I am always thinking "what would happen if there were an NA here?" when I'm writing R code. Requiring an R user to be careful with missing values is "by design".
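For instance, NA propagates through arithmetic and comparisons just as it does through ==, and most summary functions provide an na.rm argument for the cases where you do want to skip missing values:

sum(c(1, 2, NA))               # [1] NA
sum(c(1, 2, NA), na.rm = TRUE) # [1] 3
c(1, 2, NA) < 2                # [1]  TRUE FALSE    NA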
NA is a logical constant, and you can get unexpected subsetting if you don't think about what might be returned (e.g. NA | TRUE evaluates to TRUE). These truth tables from ?Logic may provide a useful illustration:
outer(x, x, "&") ## AND table
# <NA> FALSE TRUE
#<NA> NA FALSE NA
#FALSE FALSE FALSE FALSE
#TRUE NA FALSE TRUE
outer(x, x, "|") ## OR table
# <NA> FALSE TRUE
#<NA> NA NA TRUE
#FALSE NA FALSE TRUE
#TRUE TRUE TRUE TRUE
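The AND table also explains why the !is.na() idiom from the question works: for the NA row, !is.na(a$col2) is FALSE, and FALSE & NA is FALSE (not NA), so that row is cleanly excluded instead of coming back as all-NA:

!is.na(a$col2) & a$col2 == 2
# [1]  TRUE FALSE  TRUE FALSE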