Thanks for your time.
This is probably an obvious issue I'm overlooking, but I came across some unexpected behavior this morning using dplyr::filter().
Using filter() seems to work, except when the column name and the object name are identical. See the example below for details.
I'm expecting data to only return the rows where data$year matches year or data$month matches month, but it's returning all values instead.
I've done this same operation many times before, so I'm not sure why it's occurring this time.
When renaming month to month_by_a_different_name, everything works as expected. Any ideas? Thanks for your time.
library(tidyverse)
# Example data
data <-
  tibble(
    year = c(2019, 2018, 2017),
    month = c("January", "February", "March"),
    value = c(1, 2, 3)
  )
# -----------------------------------------------
# Values to filter by
year <- 2019
month <- "February"
# Assigning year and month to a different object name
year_by_a_different_name <- year
month_by_a_different_name <- month
# -----------------------------------------------
# Filtering using year and month doesn't work
data %>%
  dplyr::filter(year == year) # Doesn't work
data %>%
  dplyr::filter(month == month) # Doesn't work
# -----------------------------------------------
# Filtering using different names works
data %>%
  filter(year == year_by_a_different_name) # Works
data %>%
  filter(month == month_by_a_different_name) # Works
# -----------------------------------------------
# Using str_detect() also doesn't work for month
data %>%
  dplyr::filter(str_detect(month, month))
# -----------------------------------------------
# Works with base R
data[data$month == month, ]
data[data$year == year, ]
# -----------------------------------------------
# Objects are of same class
class(data$year) == class(year) # TRUE
class(data$month) == class(month) # TRUE
TLDR: use filter(year == !!year)
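This is caused by dplyr's non-standard evaluation (NSE): inside filter(), a bare name is looked up among the columns of the data frame first, so both sides of year == year resolve to data$year rather than to your external variable year. In your data, comparing the column with itself is TRUE for every row, so nothing is filtered out.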
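To make the masking visible, here is a minimal sketch using the example data from the question. It assumes only that dplyr is installed; the object names masked and explicit are made up for illustration.

```r
library(dplyr)

# Example data from the question
data <- tibble(
  year  = c(2019, 2018, 2017),
  month = c("January", "February", "March"),
  value = c(1, 2, 3)
)

# External variable with the same name as a column
year <- 2019

# Inside filter(), `year` on both sides is found in the data frame first,
# so the condition compares the column with itself.
masked <- data %>% filter(year == year)

# The same comparison written out in base R
explicit <- data[data$year == data$year, ]

nrow(masked)   # same number of rows as `data`
nrow(explicit) # same number of rows as `data`
```

Both results keep every row, which matches the behavior described in the question.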
NSE uses so-called 'quosures' to infer that when you write year on the LHS, you are referring to the column of the pipe input. This quoting trick is what allows you to refer to names defined in the scope of the pipe input (i.e. data frame columns) in the tidyverse family of packages, and it makes your life much easier by (i) saving you from typing quotation marks everywhere and (ii) allowing RStudio to give you autocomplete suggestions.
However, in your case here, year on the RHS is meant to refer to something outside of the input data.frame, even though the name is also used there. In that case, the !! ("bangbang") operator tells NSE that your variable should not be quoted, but instead evaluated as is.
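As a sketch of the fix applied to the question's data: the first call uses !! as described above. The second uses the .data and .env pronouns, which are not mentioned in this answer; they are an alternative that, as far as I know, is available inside dplyr verbs in dplyr 1.0.0 and later. The result names by_year_bang and by_month_env are made up for illustration.

```r
library(dplyr)

# Example data from the question
data <- tibble(
  year  = c(2019, 2018, 2017),
  month = c("January", "February", "March"),
  value = c(1, 2, 3)
)

# External variables that share their names with columns
year  <- 2019
month <- "February"

# Option 1: unquote the external variable with !!
# `!!year` is replaced by its value (2019) before the filter is evaluated.
by_year_bang <- data %>% filter(year == !!year)

# Option 2 (assumption: dplyr >= 1.0.0): be explicit on both sides.
# `.data$month` is the column, `.env$month` is the external variable.
by_month_env <- data %>% filter(.data$month == .env$month)

by_year_bang  # only the 2019 row
by_month_env  # only the February row
```

Renaming the external variable, as you already found, works for the same reason: a name that is not a column falls through to the calling environment.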
You can find more information here: https://dplyr.tidyverse.org/articles/programming.html, especially the section on "Different Expressions". From the vignette above:
In dplyr (and in tidyeval in general) you use !! to say that you want to unquote an input so that it’s evaluated, not quoted. This gives us a function that actually does what we want.