I have a data frame in which some columns have missing values. Is there a way (using dplyr) to efficiently calculate the percentage of each column that is missing, i.e. NA? Sort of like a colSums equivalent, so I don't have to calculate the percentage missing for each column individually.
To find the percentage of missing values in each column of an R data frame, we can apply colMeans() to the result of is.na(). is.na() marks each cell as TRUE or FALSE, so the column means are the proportions of missing values in each column. Multiplying the output by 100 then gives the percentage.
To calculate percentages from count data, we need to divide each count by the total count for its sample, and then multiply by 100. This can also be done using the function decostand from the vegan package with method = "total".
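The divide-by-total step can be sketched in base R, without vegan. The matrix `counts` and its sample and category names are made up for illustration, and the sketch assumes samples are stored in rows.

```r
# Hypothetical count matrix: rows are samples, columns are categories.
counts <- matrix(c( 2, 3,  5,
                   10, 0, 10),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("s1", "s2"), c("A", "B", "C")))

# Divide every count by its row (sample) total, then scale to percent.
# rowSums(counts) has one entry per row, so it is recycled down the columns.
pct <- 100 * counts / rowSums(counts)
pct
```

Each row of `pct` should sum to 100, which is a quick sanity check on the result.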
To count missing values per row, you can combine the is.na() function with the rowSums() function. is.na() flags each missing cell as TRUE, and rowSums(), as the name suggests, sums the values of all elements in a row. Since TRUEs are equal to 1 and FALSEs are equal to 0, summing the number of TRUEs is the same as counting the number of NAs.
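A minimal sketch of the per-row count, using a small made-up data frame `df` (the name and values are illustrative only):

```r
# Hypothetical data frame with NAs scattered across rows.
df <- data.frame(a = c(1, NA, NA, 4),
                 b = c(NA, 2, NA, 4))

# is.na(df) is a logical matrix; rowSums() treats TRUE as 1 and FALSE as 0.
na_per_row <- rowSums(is.na(df))
na_per_row  # counts per row: 1 1 2 0
```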
R automatically converts logical vectors to integer vectors when using arithmetic functions. In the process TRUE gets turned to 1 and FALSE gets converted to 0. Thus, sum(is.na(x)) gives you the total number of missing values in x.
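As a quick illustration of that coercion, here is a short sketch on a made-up vector `v` (a hypothetical name, chosen to avoid clashing with any other x):

```r
# Hypothetical numeric vector with three missing values out of five.
v <- c(3, NA, 7, NA, NA)

is.na(v)        # logical vector marking the missing positions
sum(is.na(v))   # TRUE counts as 1, so this is the number of NAs: 3
mean(is.na(v))  # proportion of NAs: 0.6
```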
First, I created some test data for you:
a <- c(1, NA, NA, 4)
b <- c(NA, 2, 3, 4)
x <- data.frame(a, b)
x
# a b
# 1 1 NA
# 2 NA 2
# 3 NA 3
# 4 4 4
Then you can use colMeans(is.na(x)):
colMeans(is.na(x))
# a b
# 0.50 0.25
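The values above are proportions. To get percentages, multiply by 100. A minimal sketch that rebuilds the same test data so it runs on its own (`pct_missing` is just an illustrative name):

```r
# Same test data as above, rebuilt so the snippet is self-contained.
x <- data.frame(a = c(1, NA, NA, 4),
                b = c(NA, 2, 3, 4))

# Proportion of NAs per column, scaled to percent.
pct_missing <- 100 * colMeans(is.na(x))
pct_missing
#  a  b
# 50 25
```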
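With dplyr, we can use summarise() with across() (summarise_each() and funs() are deprecated in current dplyr releases):
library(dplyr)
x %>%
  summarise(across(everything(), ~ 100 * mean(is.na(.x))))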