I'd like to show the names of columns in a large dataframe that contain missing values. Basically, I want the equivalent of complete.cases(df) but for columns, not rows. Some of the columns are non-numeric, so something like
names(df[is.na(colMeans(df))])
returns "Error in colMeans(df) : 'x' must be numeric." So, my current solution is to transpose the dataframe and run complete.cases, but I'm guessing there's some variant of apply (or something in plyr) that's much more efficient.
nacols <- function(df) {
names(df[,!complete.cases(t(df))])
}
w <- c("hello","goodbye","stuff")
x <- c(1,2,3)
y <- c(1,NA,0)
z <- c(1,0, NA)
tmp <- data.frame(w,x,y,z)
nacols(tmp)
[1] "y" "z"
Can someone show me a more efficient function to identify columns that have NAs?
In R, the easiest way to find columns that contain missing values is by combining the power of the functions is.na() and colSums(). First, you check and count the number of NA's per column. Then, you use a function such as names() or colnames() to return the names of the columns with at least one missing value.
In R the missing values are coded by the symbol NA . To identify missings in your dataset the function is is.na() . When you import dataset from other statistical applications the missing values might be coded with a number, for example 99 . In order to let R know that is a missing value you need to recode it.
In order to check null values in Pandas DataFrame, we use isnull() function this function return dataframe of Boolean values which are True for NaN values. Code #1: Python.
You can use the is. null function in R to test whether a data object is NULL.
This is the fastest way that I know of:
unlist(lapply(df, function(x) any(is.na(x))))
EDIT:
I guess everyone else wrote it out complete so here it is complete:
nacols <- function(df) {
colnames(df)[unlist(lapply(df, function(x) any(is.na(x))))]
}
And if you microbenchmark the 4 solutions on a WIN 7 machine:
Unit: microseconds
expr min lq median uq max
1 ANDRIE 85.380 91.911 106.375 116.639 863.124
2 MANOEL 87.712 93.778 105.908 118.971 8426.886
3 MOIRA 764.215 798.273 817.402 876.188 143039.632
4 TYLER 51.321 57.853 62.518 72.316 1365.136
And here's a visual of that:
Edit At the time I wrote this anyNA
did not exist or I was unaware of it. This may speed things up moreso...per the help manual for ?anyNA
:
The generic function
anyNA
implementsany(is.na(x))
in a possibly faster way (especially for atomic vectors).
nacols <- function(df) {
colnames(df)[unlist(lapply(df, function(x) anyNA(x)))]
}
Here is one way:
colnames(tmp)[colSums(is.na(tmp)) > 0]
Hope it helps,
Manoel
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With