
Using filter() with across() to keep all rows of a data frame that include a missing value for any variable


Sometimes I want to view all rows in a data frame that will be dropped if I drop all rows that have a missing value for any variable. In this case, I'm specifically interested in how to do this with dplyr 1.0's across() function used inside of the filter() verb.

Here is an example data frame:

    library(dplyr)

    df <- tribble(
      ~id, ~x, ~y,
      1, 1, 0,
      2, 1, 1,
      3, NA, 1,
      4, 0, 0,
      5, 1, NA
    )

Code for keeping rows that DO NOT include any missing values is provided on the tidyverse website. Specifically, I can use:

    df %>%
      filter(
        across(
          .cols = everything(),
          .fns = ~ !is.na(.x)
        )
      )

Which returns:

    # A tibble: 3 x 3
         id     x     y
      <dbl> <dbl> <dbl>
    1     1     1     0
    2     2     1     1
    3     4     0     0

However, I can't figure out how to return the opposite -- rows with a missing value in any variable. The result I'm looking for is:

    # A tibble: 2 x 3
         id     x     y
      <dbl> <dbl> <dbl>
    1     3    NA     1
    2     5     1    NA

My first thought was just to remove the !:

    df %>%
      filter(
        across(
          .cols = everything(),
          .fns = ~ is.na(.x)
        )
      )

But, that returns zero rows.
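A likely reason for the empty result: when across() is used inside filter(), the per-column results are combined with AND, so a row is kept only if the condition holds for every column. The sketch below writes both combinations out by hand for the example data; the names all_missing and any_missing are only illustrative.

```r
library(dplyr)

df <- tribble(
  ~id, ~x, ~y,
  1, 1, 0,
  2, 1, 1,
  3, NA, 1,
  4, 0, 0,
  5, 1, NA
)

# AND over columns: a row must be missing in id, x, and y at once.
# This mirrors what the across() call above asks for.
all_missing <- df %>%
  filter(is.na(id) & is.na(x) & is.na(y))

# OR over columns: a row is kept if at least one column is missing.
# This is the combination the question is actually after.
any_missing <- df %>%
  filter(is.na(id) | is.na(x) | is.na(y))

nrow(all_missing)
nrow(any_missing)
```

No row in df is missing in all three columns, so the AND version is empty, while the OR version keeps the rows with id 3 and 5.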

Of course, I can get the answer I want with this code if I know all variables that have a missing value ahead of time:

    df %>%
      filter(is.na(x) | is.na(y))

But, I'm looking for a solution that doesn't require me to know which variables have a missing value ahead of time. Additionally, I'm aware of how to do this with the filter_all() function:

    df %>%
      filter_all(any_vars(is.na(.)))

But, the filter_all() function has been superseded by the use of across() in an existing verb. See https://dplyr.tidyverse.org/articles/colwise.html

Other unsuccessful attempts I've made are:

    df %>%
      filter(
        across(
          .cols = everything(),
          .fns = ~any_vars(is.na(.x))
        )
      )

    df %>%
      filter(
        across(
          .cols = everything(),
          .fns = ~!!any_vars(is.na(.x))
        )
      )

    df %>%
      filter(
        across(
          .cols = everything(),
          .fns = ~!!any_vars(is.na(.))
        )
      )

    df %>%
      filter(
        across(
          .cols = everything(),
          .fns = ~any(is.na(.x))
        )
      )

    df %>%
      filter(
        across(
          .cols = everything(),
          .fns = ~any(is.na(.))
        )
      )

Asked by Brad Cannell on Jun 02 '20 at 21:06



2 Answers

It's now possible with dplyr 1.0.4. The new if_any() replaces across() for the filtering use-case.

    library(dplyr)

    df <- tribble(~ id, ~ x, ~ y,
                  1, 1, 0,
                  2, 1, 1,
                  3, NA, 1,
                  4, 0, 0,
                  5, 1, NA)

    df %>%
      filter(if_any(everything(), is.na))
    #> # A tibble: 2 x 3
    #>      id     x     y
    #>   <dbl> <dbl> <dbl>
    #> 1     3    NA     1
    #> 2     5     1    NA

Created on 2021-02-10 by the reprex package (v0.3.0)

See here for more details: https://www.tidyverse.org/blog/2021/02/dplyr-1-0-4-if-any/
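As a small extension of the answer above, the following sketch shows if_any() next to its counterpart if_all(), assuming dplyr 1.0.4 or later. The variable names with_na, complete, and complete_alt are only illustrative.

```r
library(dplyr)

df <- tribble(
  ~id, ~x, ~y,
  1, 1, 0,
  2, 1, 1,
  3, NA, 1,
  4, 0, 0,
  5, 1, NA
)

# Rows with a missing value in at least one column
with_na <- df %>%
  filter(if_any(everything(), is.na))

# Rows with no missing values: negate if_any() ...
complete <- df %>%
  filter(!if_any(everything(), is.na))

# ... or require every column to be non-missing with if_all()
complete_alt <- df %>%
  filter(if_all(everything(), ~ !is.na(.x)))
```

The last two calls should select the same rows, which makes if_all() a natural replacement for the across() call from the question that drops incomplete rows.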

Answered by Emman on Sep 18 '22 at 15:09


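We can use reduce() from purrr. Here across() returns a data frame of logical columns, and reduce(`|`) combines those columns element-wise, so a row is kept when any of its values is missing: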

    library(dplyr)
    library(purrr)

    df %>%
      filter(across(everything(), is.na) %>% reduce(`|`))
    # A tibble: 2 x 3
    #     id     x     y
    #  <dbl> <dbl> <dbl>
    #1     3    NA     1
    #2     5     1    NA

Answered by akrun on Sep 18 '22 at 15:09
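To make the reduction step easier to follow, here is a minimal base R sketch of the same idea. It is not the answer's code: it swaps in base Reduce() and lapply() for purrr::reduce() and across(), and uses a plain data.frame with the same values as the example.

```r
# Same values as the example data, as a base R data frame
df <- data.frame(
  id = 1:5,
  x  = c(1, 1, NA, 0, 1),
  y  = c(0, 1, 1, 0, NA)
)

# One logical vector per column: TRUE where that column is missing
na_flags <- lapply(df, is.na)

# Fold the per-column vectors together with element-wise OR,
# giving one TRUE/FALSE per row
has_na <- Reduce(`|`, na_flags)

# Keep only the rows flagged as having a missing value
rows_with_na <- df[has_na, ]
```

For this data the flag vector should match !complete.cases(df), which is another base R way to find the rows that contain a missing value.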