
Deleting columns from a data.frame where NA is more than 15% of the column length [duplicate]

Tags: dataframe, r

I have a data.frame with 10 columns, all of the same length. I want to drop any column in which more than 15% of the values are NA.

Do I first need to make a function for calculating the percentage of NA for each column and then make another data.frame where I apply the function? What's the best way to do this?

user1577962 asked Dec 27 '22 20:12


1 Answer

First, it's always good to share some sample data. It doesn't need to be your actual data--something made up is fine.

set.seed(1)
x <- rnorm(1000)
x[sample(1000, 150)] <- NA
mydf <- data.frame(matrix(x, ncol = 10))

Second, you can use built-in functions to get what you need. Here, is.na(mydf) checks each element and returns a logical matrix of TRUE and FALSE values. Since TRUE and FALSE are treated as 1 and 0 in arithmetic, colMeans gives the proportion of TRUE (that is, NA) values in each column. That proportion can then be compared against your threshold; in this case, which columns have more than 15% NA values?

colMeans(is.na(mydf)) > .15
#    X1    X2    X3    X4    X5    X6    X7    X8    X9   X10 
#  TRUE  TRUE FALSE FALSE FALSE  TRUE FALSE  TRUE  TRUE FALSE

As we can see, we should drop X1, X2, X6, X8, and X9. Again, taking advantage of logical vectors, here's how:

# drop = FALSE keeps the result a data.frame even if only one column survives
final <- mydf[, colMeans(is.na(mydf)) <= .15, drop = FALSE]
dim(final)
# [1] 100   5
names(final)
# [1] "X3"  "X4"  "X5"  "X7"  "X10"
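
To address the question of whether a separate function is needed: it isn't, but the one-liner above can be wrapped in one if you want to reuse it with different thresholds. The following is only a sketch; the name drop_na_columns, its max_na argument, and the toy data are made up for illustration and are not part of the original answer.

```r
# Hypothetical helper: keep only the columns whose share of NA values
# is at most max_na (a proportion between 0 and 1).
drop_na_columns <- function(df, max_na = 0.15) {
  na_share <- colMeans(is.na(df))          # proportion of NA per column
  df[, na_share <= max_na, drop = FALSE]   # drop = FALSE always returns a data.frame
}

# Made-up data: column a is 25% NA, b is 50% NA, c has no NA
toy <- data.frame(
  a = c(1, NA, 3, 4),
  b = c(NA, NA, 3, 4),
  c = c(1, 2, 3, 4)
)

kept <- drop_na_columns(toy, max_na = 0.25)
names(kept)
# [1] "a" "c"
```

Because the comparison is <=, a column sitting exactly at the threshold (like a here) is kept, which matches "more than 15%" being the condition for removal.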
A5C1D2H2I1M1N2O1R2T1 answered Jan 25 '23 23:01