Consider the following code. When you don't explicitly test for NA in your condition, that code will fail at some later date when your data changes.
> # A toy example
> a <- as.data.frame(cbind(col1=c(1,2,3,4), col2=c(2,NA,2,3), col3=c(1,2,3,4), col4=c(4,3,2,1)))
> a
  col1 col2 col3 col4
1    1    2    1    4
2    2   NA    2    3
3    3    2    3    2
4    4    3    4    1
>
> # Bummer, there's an NA in my condition
> a$col2==2
[1]  TRUE    NA  TRUE FALSE
>
> # Why is this a good thing to do?
> # It NA'd the whole row, and kept it
> a[a$col2==2,]
   col1 col2 col3 col4
1     1    2    1    4
NA   NA   NA   NA   NA
3     3    2    3    2
>
> # Yes, this is the right way to do it
> a[!is.na(a$col2) & a$col2==2,]
  col1 col2 col3 col4
1    1    2    1    4
3    3    2    3    2
>
> # Subset seems designed to avoid this problem
> subset(a, col2 == 2)
  col1 col2 col3 col4
1    1    2    1    4
3    3    2    3    2
Can someone explain why the behavior you get without the is.na
check would ever be good or useful?
I definitely agree that this isn't intuitive (I made that point before on SO). In defense of R, I think that knowing when you have a missing value is useful (i.e. this is not a bug). The == operator is explicitly designed to surface NA and NaN values rather than silently drop them. See ?"==" for more information. It states:
Missing values ('NA') and 'NaN' values are regarded as non-comparable even to themselves, so comparisons involving them will always result in 'NA'.
In other words, a missing value isn't comparable using a binary operator (because it's unknown).
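For example, at the console every comparison that touches a missing value comes back NA, because the answer is genuinely unknown:

NA == NA       # [1] NA -- can't say two unknowns are equal
NA > 1         # [1] NA
c(2, NA) == 2  # [1] TRUE NA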
Beyond is.na(), you could also do:
which(a$col2 == 2)  # tests explicitly for TRUE
Or:
a$col2 %in% 2  # only checks for 2
%in% is defined using the match() function:
"%in%" <- function(x, table) match(x, table, nomatch = 0) > 0
This is also covered in "The R Inferno".
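For example, applied to the toy data frame above, both alternatives drop the NA row without an explicit is.na() test:

a[which(a$col2 == 2), ]  # which() keeps only the indices where the test is TRUE
#   col1 col2 col3 col4
# 1    1    2    1    4
# 3    3    2    3    2

a$col2 %in% 2            # match() uses nomatch = 0, so the NA becomes FALSE
# [1]  TRUE FALSE  TRUE FALSE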
Checking for NA values in your data is crucial in R, because many important operators don't handle them the way you might expect. Beyond ==, this is also true for things like &, |, <, sum(), and so on. I am always thinking "what would happen if there were an NA here?" when I'm writing R code. Requiring an R user to be careful with missing values is "by design".
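For instance, NA propagates through arithmetic and comparisons just as it does through ==, and most summary functions provide an na.rm argument for the cases where you do want to skip missing values:

sum(c(1, 2, NA))               # [1] NA
sum(c(1, 2, NA), na.rm = TRUE) # [1] 3
c(1, 2, NA) < 2                # [1]  TRUE FALSE    NA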
NA is a logical constant, and you can get unexpected subsetting if you don't think about what might be returned (e.g. NA | TRUE evaluates to TRUE). These truth tables from ?Logic may provide a useful illustration:
outer(x, x, "&") ## AND table
# <NA> FALSE TRUE
#<NA> NA FALSE NA
#FALSE FALSE FALSE FALSE
#TRUE NA FALSE TRUE
outer(x, x, "|") ## OR table
# <NA> FALSE TRUE
#<NA> NA NA TRUE
#FALSE NA FALSE TRUE
#TRUE TRUE TRUE TRUE
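The AND table also explains why the !is.na() idiom from the question works: for the NA row, !is.na(a$col2) is FALSE, and FALSE & NA is FALSE (not NA), so that row is cleanly excluded instead of coming back as all-NA:

!is.na(a$col2) & a$col2 == 2
# [1]  TRUE FALSE  TRUE FALSE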