
How to filter data.frame by a factor that includes NA as level

If you filter a data.frame by an ordinary (non-NA) level of a factor, everything works without issues, even when the factor also includes NA as a level.

set.seed(123)
df=data.frame(a = factor(as.character(c(1, 1, 2, 2, 3, NA,3,NA)),exclude=NULL),
           b= runif(8))
#str(df)
df[df$a==3,]
#      a         b
#    5 3 0.9404673
#    7 3 0.5281055

The issue appears when you need to filter by the NA level. None of the following work:

df[df$a==NA,]
df[df$a=="NA",]
df[is.na(df$a),]
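
To see why none of these work, it can help to look at the logical vector each comparison produces. This is only a minimal sketch that rebuilds the `df` from above; the comments describe the behaviour I would expect from a recent R version.

```r
set.seed(123)
df <- data.frame(a = factor(as.character(c(1, 1, 2, 2, 3, NA, 3, NA)), exclude = NULL),
                 b = runif(8))

# Comparing anything with NA yields NA, so no element is ever TRUE
df$a == NA

# "NA" is a string; the NA level is a missing label, not the text "NA",
# so this is never TRUE either
df$a == "NA"

# is.na() looks at the integer codes, and the NA level has an ordinary code
is.na(df$a)

# The codes themselves: the NA level is simply the last level here
as.integer(df$a)
```

So the rows are not missing in the usual sense, which is why the usual `is.na()` test does not find them.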

The only way I found is to convert the factor to numeric and compare it to the number of levels (the NA level is the last one, so its code is 4 here).

df[as.numeric(df$a)==4,]
#     a         b
#6 <NA> 0.0455565
#8 <NA> 0.8924190

Is there any other more intuitive/elegant way to get the same result?

asked Sep 25 '17 18:09 by Robert


2 Answers

Check whether the level corresponding to each element of df$a is NA:

df[is.na(levels(df$a)[df$a]),]
#     a         b
#6 <NA> 0.0455565
#8 <NA> 0.8924190
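
The reason this works: indexing `levels(df$a)` by the factor uses the factor's integer codes, so each element is mapped to its level label, and the label of the NA level is itself `NA`. Here is a small sketch of that, rebuilding `df` from the question. As far as I know `as.character()` on a factor performs the same lookup, so it should give an equivalent filter; the variable names are just for illustration.

```r
set.seed(123)
df <- data.frame(a = factor(as.character(c(1, 1, 2, 2, 3, NA, 3, NA)), exclude = NULL),
                 b = runif(8))

# Map every element to its level label; elements with the NA level become NA
labels <- levels(df$a)[df$a]
labels

# Filter on the labels rather than on the factor itself
res1 <- df[is.na(labels), ]

# Assumed-equivalent spelling using as.character()
res2 <- df[is.na(as.character(df$a)), ]

res1
```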

As Frank pointed out, this also includes observations where the value of df$a itself, not just its level, is NA. I guess the original poster wanted to include these cases. If not, one can do something like

x <- factor(c("A","B", NA), levels=c("A", NA), exclude = NULL)
i <- which(is.na(levels(x)[x]))
i[!is.na(x[i])]

This gives 3: only the element whose level is NA, leaving out the element with an unknown level (B), which is a true missing value.

answered Oct 27 '22 14:10 by Ott Toomet


In case you also have true missing values (that don't belong to the factor's levels)...

DF = data.frame(
  x = factor(c("A", "B", NA), levels=c("A", NA), exclude=NULL),
  v = 1:3
)

Row 3's x has level NA, while row 2 is a true missing value.
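
If the distinction is not obvious, a quick way to check it (just a sketch, rebuilding the same `DF`) is to look at the integer codes behind the factor:

```r
DF <- data.frame(
  x = factor(c("A", "B", NA), levels = c("A", NA), exclude = NULL),
  v = 1:3
)

levels(DF$x)      # the two levels: "A" and NA
as.integer(DF$x)  # row 2 has no code at all; row 3 has the code of the NA level
is.na(DF$x)       # TRUE only for the true missing value in row 2
```

"B" is not among the levels, so row 2 gets no code and is missing in the usual sense, while row 3 points at a real level that happens to be labelled NA.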

To get just row 3, you could do a join with data.table...

library(data.table)
setDT(DF)

merge(DF, data.table(x = factor(NA_character_, exclude=NULL)))
# or
DF[.(factor(NA_character_, exclude=NULL)), on=.(x), nomatch=0]    

#     x v
# 1: NA 3

Or somewhat more awkwardly in dplyr:

dplyr::right_join(DF, 
  data.frame(x = factor(NA_character_, levels=levels(DF$x), exclude=NULL)))

# Joining, by = "x"
#      x v
# 1 <NA> 3

I could find no way to get there in base R, except the crazy...

wv = which(is.na(levels(DF$x)))
DF[ !is.na(DF$x) & as.integer(DF$x) == wv, ]

#      x v
# 3 <NA> 3
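
If you need this more than once, the same base logic could be wrapped in a small helper. This is only a sketch: `is_na_level` is a hypothetical name of my own, not a base R function, and it assumes the input is a factor.

```r
# Hypothetical helper (not part of base R): TRUE where an element's level is NA,
# FALSE for ordinary levels and for true missing values.
is_na_level <- function(f) {
  stopifnot(is.factor(f))
  na_pos <- which(is.na(levels(f)))  # position of the NA level, if there is one
  if (length(na_pos) == 0L) {
    return(rep(FALSE, length(f)))    # no NA level, so nothing can match
  }
  codes <- as.integer(f)
  !is.na(codes) & codes == na_pos
}

DF <- data.frame(
  x = factor(c("A", "B", NA), levels = c("A", NA), exclude = NULL),
  v = 1:3
)

# Keeps only the row whose level is NA (row 3), not the true missing value (row 2)
DF[is_na_level(DF$x), ]
```

The early return is there so that a factor without an NA level yields all FALSE instead of a zero-length comparison.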

answered Oct 27 '22 14:10 by Frank