I have a data.frame with 10 columns, all of the same length. I want to drop any column in which more than 15% of the values are NA.
Do I first need to make a function for calculating the percentage of NA for each column and then make another data.frame where I apply the function? What's the best way to do this?
First, it's always good to share some sample data. It doesn't need to be your actual data--something made up is fine.
set.seed(1)
x <- rnorm(1000)
x[sample(1000, 150)] <- NA
mydf <- data.frame(matrix(x, ncol = 10))
Second, you can just use built-in functions to get what you need. Here, is.na(mydf) does a logical check and returns a logical matrix of TRUE and FALSE. Since TRUE and FALSE are treated as 1 and 0 in arithmetic, colMeans gives the proportion of TRUE (that is, NA) values in each column. That, in turn, can be checked against your threshold; in this case, which columns have more than 15% NA values?
colMeans(is.na(mydf)) > .15
# X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
# TRUE TRUE FALSE FALSE FALSE TRUE FALSE TRUE TRUE FALSE
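To make the intermediate steps easier to check by hand, here is the same idea on a tiny made-up data frame. The name toy and its values are invented for illustration only and are not part of the original example.

```r
# Tiny made-up data frame: 4 rows, with 1, 2, and 0 missing values per column
toy <- data.frame(a = c(1, NA, 3, 4),
                  b = c(NA, NA, 3, 4),
                  c = c(1, 2, 3, 4))

is.na(toy)            # logical matrix, TRUE wherever a value is missing
colSums(is.na(toy))   # count of NA per column:      a = 1,    b = 2,   c = 0
colMeans(is.na(toy))  # proportion of NA per column: a = 0.25, b = 0.5, c = 0
```

Because colMeans divides by the number of rows, the result is already a proportion and can be compared directly with a threshold such as .15.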
For mydf, the TRUE entries mark the columns to drop: X1, X2, X6, X8, and X9. Again, taking advantage of logical vectors, here's how:
final <- mydf[, colMeans(is.na(mydf)) <= .15]
dim(final)
# [1] 100 5
names(final)
# [1] "X3" "X4" "X5" "X7" "X10"
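If you need this in more than one place, the filter can be wrapped in a small function. The following is only a sketch: drop_na_cols and its max_na argument are hypothetical names, not something base R provides, and the toy data is made up. It also passes drop = FALSE, because subsetting a data.frame down to a single column with [, ...] otherwise returns a plain vector.

```r
# Hypothetical helper (not a base R function): keep only the columns whose
# proportion of NA values is at most max_na.
drop_na_cols <- function(df, max_na = 0.15) {
  keep <- colMeans(is.na(df)) <= max_na
  # drop = FALSE keeps the result a data.frame even if one column survives
  df[, keep, drop = FALSE]
}

# Made-up data: column a is 25% NA, b is 0% NA, c is 75% NA
toy <- data.frame(a = c(1, NA, 3, 4),
                  b = c(1, 2, 3, 4),
                  c = c(NA, NA, NA, 4))

names(drop_na_cols(toy))                 # "b"
names(drop_na_cols(toy, max_na = 0.25))  # "a" "b"
```

Making the threshold an argument means you can reuse the same function when the cutoff changes.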