
Unexpected Behavior with dplyr::filter()

thanks for your time.

This is probably an obvious issue I'm overlooking, but I came across some unexpected behavior this morning using dplyr::filter().

Using filter() seems to work, except when the column name and the object name are equivalent. See the below example for details.

I'm expecting data to only return the rows where data$year matches year or data$month matches month, but it's returning all values instead.

I've done this same operation many times before, so I'm not sure why it's occurring this time.

When renaming month to month_by_a_different_name, everything works as expected. Any ideas? Thanks for your time.

library(tidyverse)

# Example data
data <-
  tibble(
    year = c(2019, 2018, 2017),
    month = c("January", "February", "March"),
    value = c(1, 2, 3)
  )


# -----------------------------------------------

# Values to filter by
year <- 2019
month <- "February"

# Assigning year and month to a different object name
year_by_a_different_name <- year
month_by_a_different_name <- month


# -----------------------------------------------

# Filtering using year and month doesn't work
data %>%
  dplyr::filter(year == year)        # Doesn't work

data %>%
  dplyr::filter(month == month)      # Doesn't work


# -----------------------------------------------

# Filtering using different names works
data %>%
  filter(year == year_by_a_different_name)       # Works

data %>% 
  filter(month == month_by_a_different_name)     # Works


# -----------------------------------------------

# Using str_detect() also doesn't work for month
data %>% 
  dplyr::filter(str_detect(month, month))


# -----------------------------------------------

# Works with base R
data[data$month == month, ]
data[data$year == year, ]


# -----------------------------------------------

# Objects are of same class
class(data$year) == class(year)      # TRUE
class(data$month) == class(month)    # TRUE
Ethan Wicker asked Mar 04 '23 04:03


1 Answer

TLDR: use filter(year == !!year)

This is caused by dplyr's non-standard evaluation (NSE): it's ambiguous whether you're referring to data$year or your external variable year. dplyr quotes the expression you pass to filter() (as a so-called 'quosure') and evaluates it with the columns of the pipe input taking precedence over variables in your environment. So year resolves to the column on both sides of ==, the comparison is TRUE for every row, and all rows come back. This quoting trick is what allows you to refer to names defined in the scope of the pipe input (i.e. data frame columns) in the tidyverse family of packages, and it makes your life much easier by (i) sparing you from typing quotation marks everywhere and (ii) allowing RStudio to give you autocomplete suggestions.
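
To make the masking concrete, here is a minimal sketch that reuses the data from the question. The name masked is only for illustration. Inside filter(), year is looked up among the columns first, so the comparison is the column against itself:

```r
library(dplyr)

# Same example data as in the question
data <- tibble(
  year  = c(2019, 2018, 2017),
  month = c("January", "February", "March"),
  value = c(1, 2, 3)
)
year <- 2019

# Both `year`s resolve to the column data$year, so every row passes.
masked <- data %>% filter(year == year)
nrow(masked)                  # 3

# The base R equivalent of what filter() actually evaluated:
all(data$year == data$year)   # TRUE
```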

However, in your case here, year on the RHS is meant to refer to something outside of the input data.frame, even though the name is also used there. In that case, the !! ("bangbang") operator tells NSE that your variable should not be quoted, but instead evaluated as is.
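
As a sketch of the fix, again with the data from the question: the first call uses !! as described above. The second uses the .data and .env pronouns, which are not mentioned in the answer above but are described in the same dplyr programming vignette as a way to spell out which year or month you mean. The names by_year and by_month are only for illustration, and I'm assuming a reasonably recent dplyr.

```r
library(dplyr)

# Same example data as in the question
data <- tibble(
  year  = c(2019, 2018, 2017),
  month = c("January", "February", "March"),
  value = c(1, 2, 3)
)
year  <- 2019
month <- "February"

# `!!` evaluates `year` in your environment first,
# so filter() effectively sees `year == 2019`.
by_year <- data %>% filter(year == !!year)

# Alternative: say explicitly which side is the column (.data)
# and which side is the external variable (.env).
by_month <- data %>% filter(.data$month == .env$month)

by_year$value    # 1
by_month$value   # 2
```

Renaming the external variable, as you already found, works too and is often the most readable option.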

You can find more information here: https://dplyr.tidyverse.org/articles/programming.html, especially the section on "Different Expressions". From the vignette above:

In dplyr (and in tidyeval in general) you use !! to say that you want to unquote an input so that it’s evaluated, not quoted. This gives us a function that actually does what we want.

hdkrgr answered Mar 11 '23 22:03
