I have a dataset df and I would like to remove all rows for which variable y has the value a. Variable y also contains some NAs:
df <- data.frame(x=1:3, y=c('a', NA, 'c'))
I can achieve this using R's indexing syntax like this:
df[df$y!='a',]
x y
2 <NA>
3 c
Note this returns both the NA and the value c - which is what I want.
However, when I try the same thing using subset or dplyr::filter, the NA gets stripped out:
subset(df, y!='a')
x y
3 c
dplyr::filter(df, y!='a')
x y
3 c
Why do subset and dplyr::filter work like this? It seems illogical to me - an NA is not the same as a, so why strip out the NA when I specify I want all rows except those where variable y equals a?
And is there some way to change the behaviour of these functions, other than explicitly asking for NAs to get returned, i.e.
subset(df, y!='a' | is.na(y))
Thanks
Your example of the "expected" behavior doesn't actually return what you display in your question. I get:
> df[df$y != 'a',]
x y
NA NA <NA>
3 3 c
This is arguably more wrong than what subset and dplyr::filter return. Remember that in R, NA really is intended to mean "unknown", so df$y != 'a' returns:
> df$y != 'a'
[1] FALSE NA TRUE
So R is being told you definitely don't want the first row, you do want the last row, but whether you want the second row is literally "unknown". As a result, it includes a row of all NAs.
Many people dislike this behavior, but it is what it is.
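To see the same thing without the data frame getting in the way, here is a minimal sketch on a plain character vector. The names y, keep, picked and dropped are only for illustration and are not part of the question.

```r
# Minimal sketch on a plain vector; all names here are illustrative.
y <- c("a", NA, "c")

keep <- y != "a"      # FALSE NA TRUE: the comparison with NA is itself NA
picked <- y[keep]     # the NA index yields an NA element, followed by "c"

# which() returns only the positions that are TRUE, so the NA is dropped
dropped <- y[which(keep)]   # just "c"
```

With a data frame the NA index produces a whole row of NAs instead of a single NA element, which is the output shown above.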
subset and dplyr::filter make a different default choice which is to simply drop the NA rows, which arguably is accurate-ish.
But really, the lesson here is that if your data has NAs, that just means you need to code defensively around that at all points, either by using conditions like is.na(df$y) | df$y != 'a', or as mentioned in the other answer by using %in% which is based on match.
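As a rough sketch of those two defensive options, using the df from the question (res_isna and res_in are illustrative names):

```r
# Same data as in the question
df <- data.frame(x = 1:3, y = c("a", NA, "c"))

# Option 1: explicitly keep rows where y is missing
res_isna <- df[is.na(df$y) | df$y != "a", ]

# Option 2: %in% never returns NA (a missing y simply does not match "a"),
# so negating it keeps the NA row without an explicit is.na() check
res_in <- df[!(df$y %in% "a"), ]
```

Both conditions evaluate to FALSE TRUE TRUE, so both results keep rows 2 and 3 and no all-NA row appears. The same conditions should also work inside subset() or dplyr::filter(), since neither condition ever evaluates to NA.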
From ?base::Extract:
When extracting, a numerical, logical or character NA index picks an unknown element and so returns NA
From ?base::subset:
missing values are taken as false [...] For ordinary vectors, the result is simply x[subset & !is.na(subset)]
From ?dplyr::filter:
Unlike base subsetting with [, rows where the condition evaluates to NA are dropped
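The rule quoted from ?base::subset can be checked directly on an ordinary vector. This is only a small sketch; y, cond, manual and via_subset are illustrative names.

```r
# Sketch: for an ordinary vector, subset() matches the documented expression
y <- c("a", NA, "c")
cond <- y != "a"                      # FALSE NA TRUE

manual <- y[cond & !is.na(cond)]      # NA in the condition is treated as FALSE
via_subset <- subset(y, cond)         # same result through subset()
```

Both manual and via_subset contain only "c", which is the "missing values are taken as false" behaviour described in the help page.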