Sometimes I want to view all rows in a data frame that will be dropped if I drop all rows that have a missing value for any variable. In this case, I'm specifically interested in how to do this with dplyr 1.0's across() function used inside of the filter() verb.
Here is an example data frame:
df <- tribble(
  ~id, ~x, ~y,
  1, 1, 0,
  2, 1, 1,
  3, NA, 1,
  4, 0, 0,
  5, 1, NA
)
Code for keeping rows that DO NOT include any missing values is provided on the tidyverse website. Specifically, I can use:
df %>%
  filter(
    across(
      .cols = everything(),
      .fns = ~ !is.na(.x)
    )
  )
Which returns:
# A tibble: 3 x 3
     id     x     y
  <dbl> <dbl> <dbl>
1     1     1     0
2     2     1     1
3     4     0     0
However, I can't figure out how to return the opposite -- rows with a missing value in any variable. The result I'm looking for is:
# A tibble: 2 x 3
     id     x     y
  <dbl> <dbl> <dbl>
1     3    NA     1
2     5     1    NA
My first thought was just to remove the !:
df %>%
  filter(
    across(
      .cols = everything(),
      .fns = ~ is.na(.x)
    )
  )
But, that returns zero rows.
Of course, I can get the answer I want with this code if I know all variables that have a missing value ahead of time:
df %>% filter(is.na(x) | is.na(y))
But, I'm looking for a solution that doesn't require me to know which variables have a missing value ahead of time. Additionally, I'm aware of how to do this with the filter_all() function:
df %>% filter_all(any_vars(is.na(.)))
But, the filter_all() function has been superseded by the use of across() in an existing verb. See https://dplyr.tidyverse.org/articles/colwise.html
Other unsuccessful attempts I've made are:
df %>%
  filter(across(.cols = everything(), .fns = ~any_vars(is.na(.x))))

df %>%
  filter(across(.cols = everything(), .fns = ~!!any_vars(is.na(.x))))

df %>%
  filter(across(.cols = everything(), .fns = ~!!any_vars(is.na(.))))

df %>%
  filter(across(.cols = everything(), .fns = ~any(is.na(.x))))

df %>%
  filter(across(.cols = everything(), .fns = ~any(is.na(.))))
It's now possible with dplyr 1.0.4. The new if_any() replaces across() for the filtering use-case.
library(dplyr)

df <- tribble(
  ~id, ~x, ~y,
  1, 1, 0,
  2, 1, 1,
  3, NA, 1,
  4, 0, 0,
  5, 1, NA
)

df %>%
  filter(if_any(everything(), is.na))
#> # A tibble: 2 x 3
#>      id     x     y
#>   <dbl> <dbl> <dbl>
#> 1     3    NA     1
#> 2     5     1    NA
Created on 2021-02-10 by the reprex package (v0.3.0)
See here for more details: https://www.tidyverse.org/blog/2021/02/dplyr-1-0-4-if-any/
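As a rough sketch of how the two helpers relate (assuming dplyr 1.0.4 or later, and assuming that across() inside filter() combines the per-column results with AND, as if_all() does), the following compares if_all() and if_any() on the same data. The object names complete_rows, incomplete_rows and all_missing_rows are only illustrative. If that assumption holds, it would also explain why simply removing the ! in the question returned zero rows: no row is missing in every column.

```r
library(dplyr)

df <- tribble(
  ~id, ~x, ~y,
  1, 1, 0,
  2, 1, 1,
  3, NA, 1,
  4, 0, 0,
  5, 1, NA
)

# Rows where every column is non-missing (the complete rows)
complete_rows <- df %>%
  filter(if_all(everything(), ~ !is.na(.x)))

# Rows where at least one column is missing
incomplete_rows <- df %>%
  filter(if_any(everything(), is.na))

# Rows where every column is missing; this data has none
all_missing_rows <- df %>%
  filter(if_all(everything(), is.na))

nrow(complete_rows)
nrow(incomplete_rows)
nrow(all_missing_rows)
```

The first two results split the five rows of df between them, which is a quick way to check that nothing was lost or double-counted.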
We can use reduce() from purrr to combine the per-column results with |:
library(dplyr)
library(purrr)

df %>%
  filter(across(everything(), is.na) %>% reduce(`|`))
# A tibble: 2 x 3
#      id     x     y
#   <dbl> <dbl> <dbl>
# 1     3    NA     1
# 2     5     1    NA
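To see what the reduce step is doing, here is a minimal sketch that reproduces it outside of filter(). It uses purrr::map() in place of across() (an assumption made only so the intermediate values can be inspected), and na_flags and any_na are illustrative names.

```r
library(purrr)
library(tibble)

df <- tribble(
  ~id, ~x, ~y,
  1, 1, 0,
  2, 1, 1,
  3, NA, 1,
  4, 0, 0,
  5, 1, NA
)

# One logical vector per column: TRUE where that column is NA
na_flags <- map(df, is.na)

# Fold the vectors together element-wise with `|`, giving one
# TRUE/FALSE per row: TRUE if any column is NA in that row
any_na <- reduce(na_flags, `|`)

any_na

# Keep only the flagged rows
df[any_na, ]
```

The same fold with `&` instead of `|` would flag only rows that are missing in every column.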