Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Show columns with NAs in a data.frame

Tags:

I'd like to show the names of columns in a large dataframe that contain missing values. Basically, I want the equivalent of complete.cases(df) but for columns, not rows. Some of the columns are non-numeric, so something like

names(df[is.na(colMeans(df))])

returns "Error in colMeans(df) : 'x' must be numeric." So, my current solution is to transpose the dataframe and run complete.cases, but I'm guessing there's some variant of apply (or something in plyr) that's much more efficient.

nacols <- function(df) {
  names(df[,!complete.cases(t(df))])
} 

w <- c("hello","goodbye","stuff")
x <- c(1,2,3)
y <- c(1,NA,0)
z <- c(1,0, NA)
tmp <- data.frame(w,x,y,z)

nacols(tmp)
[1] "y" "z"

Can someone show me a more efficient function to identify columns that have NAs?

like image 657
Moira Avatar asked May 13 '12 18:05

Moira


People also ask

How do I find NAS in a column in R?

In R, the easiest way to find columns that contain missing values is by combining the power of the functions is.na() and colSums(). First, you check and count the number of NA's per column. Then, you use a function such as names() or colnames() to return the names of the columns with at least one missing value.

How do I show Na in R?

In R the missing values are coded by the symbol NA . To identify missings in your dataset the function is is.na() . When you import dataset from other statistical applications the missing values might be coded with a number, for example 99 . In order to let R know that is a missing value you need to recode it.

How do you check which columns have NULL values in pandas?

In order to check null values in Pandas DataFrame, we use isnull() function this function return dataframe of Boolean values which are True for NaN values. Code #1: Python.

How do you find null values in R?

You can use the is. null function in R to test whether a data object is NULL.


2 Answers

This is the fastest way that I know of:

unlist(lapply(df, function(x) any(is.na(x))))

EDIT:

I guess everyone else wrote it out complete so here it is complete:

nacols <- function(df) {
    colnames(df)[unlist(lapply(df, function(x) any(is.na(x))))]
}

And if you microbenchmark the 4 solutions on a WIN 7 machine:

Unit: microseconds
    expr     min      lq  median      uq        max
1 ANDRIE  85.380  91.911 106.375 116.639    863.124
2 MANOEL  87.712  93.778 105.908 118.971   8426.886
3  MOIRA 764.215 798.273 817.402 876.188 143039.632
4  TYLER  51.321  57.853  62.518  72.316   1365.136

And here's a visual of that: enter image description here

Edit At the time I wrote this anyNA did not exist or I was unaware of it. This may speed things up moreso...per the help manual for ?anyNA:

The generic function anyNA implements any(is.na(x)) in a possibly faster way (especially for atomic vectors).

nacols <- function(df) {
    colnames(df)[unlist(lapply(df, function(x) anyNA(x)))]
}
like image 186
Tyler Rinker Avatar answered Nov 18 '22 15:11

Tyler Rinker


Here is one way:

colnames(tmp)[colSums(is.na(tmp)) > 0]

Hope it helps,

Manoel

like image 45
Manoel Galdino Avatar answered Nov 18 '22 13:11

Manoel Galdino