Execute dplyr operation only if column exists

Question

Drawing on the discussion on conditional dplyr evaluation, I would like to conditionally execute a step in a pipeline depending on whether the referenced column exists in the passed data frame.

Example

The results generated by 1) and 2) should be identical.

Existing column

# 1)
mtcars %>% 
  filter(am == 1) %>%
  filter(cyl == 4)

# 2)
mtcars %>%
  filter(am == 1) %>%
  {
    if("cyl" %in% names(.)) filter(cyl == 4) else .
  }

Unavailable column

# 1)
mtcars %>% 
  filter(am == 1)

# 2)    
mtcars %>%
  filter(am == 1) %>%
  {
    if("absent_column" %in% names(.)) filter(absent_column == 4) else .
  }

Problem

For the existing column, the object that filter() receives inside the braces is not the initial data frame. The original code returns the error message:

Error in filter(cyl == 4) : object 'cyl' not found

I have tried alternative syntax (with no luck):

mtcars %>%
  filter(am == 1) %>%
  {
    if("cyl" %in% names(.)) filter(.$cyl == 4) else .
  }

Error in UseMethod("filter_") : 
  no applicable method for 'filter_' applied to an object of class "logical"

Follow-up

I wanted to expand this question to account for evaluation on the right-hand side of the == in the filter call. For instance, the syntax below attempts to filter on the first available value:

mtcars %>%
  filter({
    if ("does_not_ex" %in% names(.))
      does_not_ex
    else
      NULL
  } == {
    if ("does_not_ex" %in% names(.))
      unique(.[['does_not_ex']])
    else
      NULL
  })

As expected, the call fails with an error message:

Error in filter_impl(.data, quo) : Result must have length 32, not 0

When applied to existing column:

mtcars %>%
  filter({
    if ("mpg" %in% names(.))
      mpg
    else
      NULL
  } == {
    if ("mpg" %in% names(.))
      unique(.[['mpg']])
    else
      NULL
  })

It works with a warning message:

  mpg cyl disp  hp drat   wt  qsec vs am gear carb
1  21   6  160 110  3.9 2.62 16.46  0  1    4    4

Warning message: In { : longer object length is not a multiple of shorter object length

Follow-up question

Is there a neat way of extending the existing syntax in order to get conditional evaluation on the right-hand side of the filter call, ideally staying within the dplyr workflow?

Eumenedies · Accepted Answer

Because of the way the scopes here work, you cannot access the dataframe from within your if statement. Fortunately, you don't need to.

Try:

mtcars %>%
  filter(am == 1) %>%
  filter({if("cyl" %in% names(.)) cyl else NULL} == 4)

Here you can use the '.' object within the conditional so you can check if the column exists and, if it exists, you can return the column to the filter function.

EDIT: as per docendo discimus' comment on the question, you can access the dataframe but not implicitly - i.e. you have to specifically reference it with .
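One caveat, offered as a sketch rather than as part of the original answer: when the column is absent, the condition becomes NULL == 4, which is a zero-length logical, and the follow-up in the question shows filter() rejecting a zero-length result. A variant that falls back to a single TRUE (so every row is kept) might look like this, assuming the magrittr pipe so that `.` is available:

```r
library(dplyr)

# Column present: behaves like filter(cyl == 4).
present <- mtcars %>%
  filter(am == 1) %>%
  filter(if ("cyl" %in% names(.)) cyl == 4 else TRUE)

# Column absent: the condition is a single TRUE, so no rows are dropped.
absent <- mtcars %>%
  filter(am == 1) %>%
  filter(if ("absent_column" %in% names(.)) absent_column == 4 else TRUE)
```

Because the untaken branch of if is never evaluated, absent_column is not looked up when the column is missing. This relies on the `.` placeholder of %>% and would not carry over to the native |> pipe, which has no such placeholder.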

s_pike · Answer

With across() in dplyr >= 1.0.0 you can now use any_of() when filtering. Compare the original, with all columns present:

mtcars %>% 
  filter(am == 1) %>% 
  filter(cyl == 4)

With cyl removed, it throws an error:

mtcars %>% 
  select(!cyl) %>% 
  filter(am == 1) %>% 
  filter(cyl == 4)

Using any_of (note you have to write "cyl" and not cyl):

mtcars %>% 
  select(!cyl) %>% 
  filter(am == 1) %>% 
  filter(across(any_of("cyl"), ~.x == 4))
#N.B. this is equivalent to just filtering by `am == 1`.

Felipe Gerard · Answer

I know I'm late to the party, but here's an answer somewhat more in line with what you were originally thinking:

mtcars %>%
  filter(am == 1) %>%
  {
    if("cyl" %in% names(.)) filter(., cyl == 4) else .
  }

Basically, you were missing the . in filter. Note this is because the pipeline doesn't add . to filter(expr) since it is in an expression surrounded by {}.
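If the same check is needed in several pipelines, the braces can be folded into a small wrapper. The following is only a sketch: filter_if_present is a hypothetical helper name, not a dplyr function, and it takes the column name as a string, looking it up through dplyr's .data pronoun.

```r
library(dplyr)

# Hypothetical helper: keep rows where `col == value`, but only when the
# column `col` (given as a string) exists; otherwise return `df` unchanged.
filter_if_present <- function(df, col, value) {
  if (!col %in% names(df)) {
    return(df)
  }
  filter(df, .data[[col]] == value)
}

with_cyl <- mtcars %>%
  filter(am == 1) %>%
  filter_if_present("cyl", 4)

without_col <- mtcars %>%
  filter(am == 1) %>%
  filter_if_present("absent_column", 4)
```

Since the data frame is an explicit argument, the helper does not depend on `.` and reads the same with either pipe.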

biocyberman · Answer

Avoid this trap:

On a busy day, one might write something like the following:

library(dplyr)
df <- data.frame(A = 1:3, B = letters[1:3], stringsAsFactors = FALSE)
df %>% mutate( C = ifelse("D" %in% colnames(.), D, B)) 
# Notice the values in column "C". No error is thrown, but the logic and the result are wrong
  A B C
1 1 a a
2 2 b a
3 3 c a

Why? Because "D" %in% colnames(.) returns a single TRUE or FALSE, and ifelse() returns a result of the same length as its test. It therefore yields only the first element of B, and that single value is then recycled across the whole column!

Correct way:

df %>% mutate( C = if("D" %in% colnames(.)) D else B)
  A B C
1 1 a a
2 2 b b
3 3 c c
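The difference can be reproduced in base R without any data frame. This is a minimal sketch with made-up variable names: ifelse() shapes its result after the test, while if/else returns the chosen branch whole.

```r
b <- c("a", "b", "c")

# Length-one test: ifelse() keeps only the first element of `b`.
scalar_result <- ifelse(FALSE, NA, b)

# if/else returns the selected branch unchanged, all three elements.
full_result <- if (FALSE) NA else b
```

Inside mutate(), the length-one scalar_result is what gets recycled to fill every row.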

Tags: function, dataframe, r, dplyr, lazy-evaluation

Asked by: Konrad