
Viewing all column names with any NA in R

Tags: r, sapply

I need to get the names of the columns that have at least one NA.

df<-data.frame(a=1:3,b=c(NA,8,6), c=c('t',NA,7))

I need to get "b, c".

I found this code:

sapply(df, function(x) any(is.na(x)))

But I need only the variables that have any NA.

I tried this:

sapply(df, function(x) colnames(df[,any(is.na(x))]))

But I get all the column names.

GabyLP asked Sep 28 '14 13:09


3 Answers

Try the data.table version:

library(data.table)
setDT(df)
names(df)[df[, sapply(.SD, function(x) any(is.na(x)))]]
# [1] "b" "c"

Microbenchmarking using @akrun's code:

library(microbenchmark)

set.seed(49)
# 10,000 rows x 5,000 columns, values sampled from NA and 1:200
df1 <- as.data.frame(matrix(sample(c(NA, 1:200), 1e4 * 5000, replace = TRUE), ncol = 5000))
setDT(df1)

# f1: sapply over the columns
f1 <- function() {
  contains_any_na <- sapply(df1, function(x) any(is.na(x)))
  names(df1)[contains_any_na]
}
# f2: transpose, then complete.cases
f2 <- function() { colnames(df1)[!complete.cases(t(df1))] }
# f3: column sums of is.na
f3 <- function() { names(df1)[!!colSums(is.na(df1))] }
# f4: data.table with .SD
f4 <- function() { names(df1)[df1[, sapply(.SD, function(x) any(is.na(x)))]] }

microbenchmark(f1(), f2(), f3(), f4(), unit = "relative")
# Unit: relative
#  expr       min        lq    median       uq      max neval
#  f1()  1.000000  1.000000  1.000000 1.000000 1.000000   100
#  f2() 10.459124 10.928821 10.955986 9.858967 7.069066   100
#  f3()  3.323144  3.805183  4.159624 3.775549 2.797329   100
#  f4() 10.108998 10.242207 10.121022 9.117067 6.576976   100

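@agstudy: The data.table solution (f4) is similar in speed to colnames(df1)[!complete.cases(t(df1))] (f2). In this benchmark both are roughly 10 times slower than the plain sapply version (f1).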
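The f3 variant in the benchmark relies on an idiom that may not be obvious: colSums(is.na(df)) counts the NAs per column, and !! turns those counts into a logical vector. A minimal sketch on the small data frame from the question:

```r
# Example data frame from the question.
df <- data.frame(a = 1:3, b = c(NA, 8, 6), c = c('t', NA, 7))

# is.na() on a data frame gives a logical matrix of the same shape;
# colSums() then counts the NAs in each column.
na_counts <- colSums(is.na(df))
na_counts
# a b c
# 0 1 1

# Double negation converts counts to logical: 0 becomes FALSE, anything else TRUE.
names(df)[!!na_counts]
# [1] "b" "c"

# Equivalent, and arguably easier to read:
names(df)[na_counts > 0]
# [1] "b" "c"
```

The counts are also useful on their own when you want to know how many values are missing, not just whether any are.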

rnso answered Sep 22 '22 17:09


Another acrobatic solution (just for fun):

colnames(df)[!complete.cases(t(df))]
# [1] "b" "c"

The idea: finding the columns of A that have at least one NA is equivalent to finding the rows of t(A) that have at least one NA. complete.cases returns TRUE for the rows without any missing value (and is very efficient, since it is just a call to a C function), so negating its result selects the rows that do contain one.
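To make the transposition trick easier to follow, here is the same one-liner broken into steps. The intermediate names tdf and row_is_complete are only illustrative.

```r
# Example data frame from the question.
df <- data.frame(a = 1:3, b = c(NA, 8, 6), c = c('t', NA, 7))

# t() converts the data frame to a matrix with one row per column of df.
# Because the columns have mixed types, the matrix is character; NAs stay NA.
tdf <- t(df)

# complete.cases() returns TRUE for each row that has no NA.
row_is_complete <- complete.cases(tdf)

# Negate to pick the rows of t(df), i.e. the columns of df, that contain an NA.
colnames(df)[!row_is_complete]
# [1] "b" "c"
```

Note that t() has to coerce and copy the whole table into a matrix first, which is extra work on large data.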

agstudy answered Sep 22 '22 17:09


You were very close. Your first try yields a logical vector, which you can use to index the names of df:

contains_any_na = sapply(df, function(x) any(is.na(x)))
names(df)[contains_any_na]
# [1] "b" "c"

Update Jan 14, 2017: As of R version 3.1.0, anyNA() can be used as an alternative to any(is.na(.)), and the above code can be simplified to

names(df)[sapply(df, anyNA)]
# [1] "b" "c"
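If you want to reuse this, one option is to wrap it in a small function. The sketch below uses vapply() instead of sapply() so that the result is always a logical vector, even for a data frame with no columns. The name cols_with_na is hypothetical, not part of base R.

```r
# Hypothetical helper: names of the columns that contain at least one NA.
cols_with_na <- function(data) {
  # vapply() enforces exactly one logical value per column.
  has_na <- vapply(data, anyNA, logical(1))
  names(data)[has_na]
}

df <- data.frame(a = 1:3, b = c(NA, 8, 6), c = c('t', NA, 7))
cols_with_na(df)
# [1] "b" "c"

# With zero columns the result is empty; sapply() would return an
# empty list here, which cannot be used as an index.
cols_with_na(data.frame())
```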
Paul Hiemstra answered Sep 24 '22 17:09