Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Conditionally selecting columns in dplyr where certain proportion of values is NA

Data

I'm working with a data set resembling the data.frame generated below:

set.seed(1)
dta <- data.frame(observation = 1:20,
                  valueA = runif(n = 20),
                  valueB = runif(n = 20),
                  valueC = runif(n = 20),
                  valueD = runif(n = 20))
dta[2:5,3] <- NA
dta[2:10,4] <- NA
dta[7:20,5] <- NA

The columns have NA values with the last column having more than 60% of observations NAs.

> sapply(dta, function(x) {table(is.na(x))})
$observation

FALSE 
   20 

$valueA

FALSE 
   20 

$valueB

FALSE  TRUE 
   16     4 

$valueC

FALSE  TRUE 
   11     9 

$valueD

FALSE  TRUE 
    6    14 

Problem

I would like to be able to remove this column in dplyr pipe line somehow passing it to the select argument.

Attempts

This can be easily done in base. For example to select columns with less than 50% NAs I can do:

dta[, colSums(is.na(dta)) < nrow(dta) / 2]

which produces:

> head(dta[, colSums(is.na(dta)) < nrow(dta) / 2], 2)
  observation    valueA    valueB    valueC
1           1 0.2655087 0.9347052 0.8209463
2           2 0.3721239        NA        NA

Task

I'm interested in achieving the same flexibility in dplyr pipe line:

Vectorize(require)(package = c("dplyr",         # Data manipulation
                               "magrittr"),     # Reverse pipe

char = TRUE)

dta %<>%
  # Some transformations I'm doing on the data
  mutate_each(funs(as.numeric)) %>% 
  # I want my select to take place here
like image 600
Konrad Avatar asked Jan 18 '16 10:01

Konrad


2 Answers

Update for 2020 perhaps, now that dplyr reached 1.0.0, which incorporates where():

dta %>% select(where(function(x) sum(is.na(x)) / length(x) < 0.5))
like image 115
MS Berends Avatar answered Sep 30 '22 10:09

MS Berends


I think this does the job:

dta %>% select_if(~mean(is.na(.)) < 0.5) %>% head() 


 observation    valueA    valueB    valueC
  1           0.2655087 0.9347052 0.8209463
  2           0.3721239        NA        NA
  3           0.5728534        NA        NA
  4           0.9082078        NA        NA
  5           0.2016819        NA        NA
  6           0.8983897 0.3861141        NA

`

like image 24
toscanouser Avatar answered Sep 30 '22 10:09

toscanouser