I have a data.frame of 10 different columns (the length of each column is the same). I want to eliminate any column in which more than 15% of the values are NA.
Do I first need to write a function that calculates the percentage of NA values for each column and then make another data.frame where I apply that function? What's the best way to do this?
First, it's always good to share some sample data. It doesn't need to be your actual data--something made up is fine.
set.seed(1)
x <- rnorm(1000)
x[sample(1000, 150)] <- NA
mydf <- data.frame(matrix(x, ncol = 10))
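If you want to confirm what that made-up data looks like before filtering, a quick check of its dimensions and NA rate is enough (this is just an optional verification step, not required for the solution):
dim(mydf)             # 100 rows, 10 columns
mean(is.na(mydf))     # 0.15 -- 150 NAs out of 1000 values overall
colSums(is.na(mydf))  # NA count per column (exact counts vary by column)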
Second, you can just use built-in functions to get what you need. Here, is.na(mydf) does a logical check and returns a matrix of TRUE and FALSE values. Since TRUE and FALSE equate to 1 and 0, we can use colMeans to get the proportion of TRUE (that is, NA) values in each column. That, in turn, can be checked against your cutoff, in this case: which columns have more than 15% NA values?
colMeans(is.na(mydf)) > .15
# X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
# TRUE TRUE FALSE FALSE FALSE TRUE FALSE TRUE TRUE FALSE
As we can see, we should drop X1, X2, X6, X8, and X9. Again, taking advantage of logical vectors, here's how:
final <- mydf[, colMeans(is.na(mydf)) <= .15]
dim(final)
# [1] 100   5
names(final)
# [1] "X3"  "X4"  "X5"  "X7"  "X10"