One of the quirks of subsetting a data frame is that you have to repeatedly type the name of that data frame when mentioning columns. For example, the data frame cars is mentioned 3 times here:
cars[cars$speed == 4 & cars$dist < 10, ]
##   speed dist
## 1     4    2
The data.table package solves this.
library(data.table)
dt_cars <- as.data.table(cars)
dt_cars[speed == 4 & dist < 10]
As does dplyr.
library(dplyr)
cars %>% filter(speed == 4, dist < 10)
I'd like to know if a solution exists for standard-issue data.frames (that is, not resorting to data.table or dplyr).
I think I'm looking for something like
cars[MAGIC(speed == 4 & dist < 10), ]
or
MAGIC(cars[speed == 4 & dist < 10, ])
where MAGIC is to be determined.
I tried the following, but it gave me an error.
library(rlang)
cars[locally(speed == 4 & dist < 10), ]
# Error in locally(speed == 4 & dist < 10) : object 'speed' not found
1) subset This only requires that cars be mentioned once. No packages are used.
subset(cars, speed == 4 & dist < 10)
##   speed dist
## 1     4    2
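One behavioural difference from bracket indexing is worth keeping in mind: subset() drops rows where the condition evaluates to NA, whereas [ keeps them as all-NA rows. A minimal sketch, using a made-up data frame df that is not part of the question:

```r
# Hypothetical data frame with a missing value in the filter column.
df <- data.frame(x = c(1, NA, 3), y = c("a", "b", "c"))

# Bracket indexing turns the NA comparison into an all-NA row.
bracket_rows <- df[df$x > 2, ]

# subset() treats NA in the condition as FALSE and drops that row.
subset_rows <- subset(df, x > 2)

nrow(bracket_rows)
nrow(subset_rows)
```

If the filter column can contain missing values, the two approaches may therefore return a different number of rows.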
2) sqldf This uses a package, but not dplyr or data.table, which were the only two packages excluded by the question:
library(sqldf)
sqldf("select * from cars where speed = 4 and dist < 10")
##   speed dist
## 1     4    2
3) assignment Not sure if this counts, but you could assign cars to some other variable name such as . and then use that. In that case cars would only be mentioned once. This uses no packages.
. <- cars
.[.$speed == 4 & .$dist < 10, ]
##   speed dist
## 1     4    2
or
. <- cars
with(., .[speed == 4 & dist < 10, ])
##   speed dist
## 1     4    2
With respect to these two solutions you might want to check out this article on the Bizarro Pipe: http://www.win-vector.com/blog/2017/01/using-the-bizarro-pipe-to-debug-magrittr-pipelines-in-r/
4) magrittr This could also be expressed in magrittr, a package that was not excluded by the question. Note that we are using the magrittr %$% operator, which exposes the columns of the left-hand side to the expression on the right:
library(magrittr)
cars %$% .[speed == 4 & dist < 10, ]
##   speed dist
## 1     4    2
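The MAGIC(cars[...]) form from the question can also be approximated in base R. The following is only a sketch: the helper is hypothetical (it is named MAGIC after the placeholder in the question), it assumes its argument is a single indexing call, and it is meant for direct, interactive calls rather than for use inside other functions.

```r
# Hypothetical helper: evaluates an indexing call such as cars[cond, ]
# with the columns of the indexed data frame in scope.
MAGIC <- function(expr) {
  cl <- substitute(expr)           # the unevaluated call, e.g. cars[speed == 4, ]
  caller <- parent.frame()         # where MAGIC() was called from
  data <- eval(cl[[2]], caller)    # the object being indexed (here: cars)
  eval(cl, data, caller)           # look up names in the columns first, then the caller
}

result <- MAGIC(cars[speed == 4 & dist < 10, ])
result
```

Names are resolved in the data frame's columns before the calling environment, so a column that happens to share the data frame's name would shadow it.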
subset is the base function which solves this problem. However, like all base R functions which use non-standard evaluation, subset does not perform fully hygienic code expansion. As a result, subset() can evaluate the wrong variable when used within non-global scopes (such as in lapply loops).
As an example, here we define the variable var in two places, first in the global scope with value 40, then in a local scope with value 30. local() is used here for simplicity; the code would behave equivalently inside a function. Intuitively, we would expect subset to use the value 30 in the evaluation. However, upon executing the following code we see that the value 40 is used instead (so no rows are returned).
var <- 40
local({
  var <- 30
  dfs <- list(mtcars, mtcars)
  lapply(dfs, subset, mpg > var)
})
#> [[1]]
#> [1] mpg cyl disp hp drat wt qsec vs am gear carb
#> <0 rows> (or 0-length row.names)
#>
#> [[2]]
#> [1] mpg cyl disp hp drat wt qsec vs am gear carb
#> <0 rows> (or 0-length row.names)
This happens because the parent.frame() used in subset() is the environment within the body of lapply() rather than the local block. That environment is enclosed by the base namespace, which in turn inherits from the global environment, so the variable var is found there with value 40.
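For comparison, a base-R-only way to sidestep the problem in this particular example is to wrap the call in an anonymous function defined inside the local block. This is a sketch of a workaround, not a general fix: the calling frame of subset() is then the anonymous function's frame, whose enclosing environment is the local block, so var resolves to 30.

```r
var <- 40
res <- local({
  var <- 30
  dfs <- list(mtcars, mtcars)
  # subset() is now called from a function created in this block,
  # so the lookup of var goes through the block before the global scope.
  lapply(dfs, function(d) subset(d, mpg > var))
})
nrow(res[[1]])
```

The cost is the extra wrapper at every call site, which is what the approach described next avoids.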
Hygienic variable expansion via quasiquotation (as implemented in the rlang package) solves this problem. We can define a variant of subset using tidy evaluation that works properly in all contexts. The code is derived from and largely identical to that of base::subset.data.frame().
subset2 <- function (x, subset, select, drop = FALSE, ...) {
  r <- if (missing(subset))
    rep_len(TRUE, nrow(x))
  else {
    r <- rlang::eval_tidy(rlang::enquo(subset), x)
    if (!is.logical(r))
      stop("'subset' must be logical")
    r & !is.na(r)
  }
  vars <- if (missing(select))
    TRUE
  else {
    nl <- as.list(seq_along(x))
    names(nl) <- names(x)
    rlang::eval_tidy(rlang::enquo(select), nl)
  }
  x[r, vars, drop = drop]
}
In ordinary use, this version of subset behaves identically to base::subset.data.frame().
subset2(mtcars, gear > 4, disp:wt)
#>                 disp  hp drat    wt
#> Porsche 914-2  120.3  91 4.43 2.140
#> Lotus Europa    95.1 113 3.77 1.513
#> Ford Pantera L 351.0 264 4.22 3.170
#> Ferrari Dino   145.0 175 3.62 2.770
#> Maserati Bora  301.0 335 3.54 3.570
However, subset2() does not suffer the scoping issues of subset. In our previous example the value 30 is used for var, as we would expect from lexical scoping rules.
local({
  var <- 30
  dfs <- list(mtcars, mtcars)
  lapply(dfs, subset2, mpg > var)
})
#> [[1]]
#>                 mpg cyl disp  hp drat    wt  qsec vs am gear carb
#> Fiat 128       32.4   4 78.7  66 4.08 2.200 19.47  1  1    4    1
#> Honda Civic    30.4   4 75.7  52 4.93 1.615 18.52  1  1    4    2
#> Toyota Corolla 33.9   4 71.1  65 4.22 1.835 19.90  1  1    4    1
#> Lotus Europa   30.4   4 95.1 113 3.77 1.513 16.90  1  1    5    2
#>
#> [[2]]
#>                 mpg cyl disp  hp drat    wt  qsec vs am gear carb
#> Fiat 128       32.4   4 78.7  66 4.08 2.200 19.47  1  1    4    1
#> Honda Civic    30.4   4 75.7  52 4.93 1.615 18.52  1  1    4    2
#> Toyota Corolla 33.9   4 71.1  65 4.22 1.835 19.90  1  1    4    1
#> Lotus Europa   30.4   4 95.1 113 3.77 1.513 16.90  1  1    5    2
This allows non-standard evaluation to be used robustly in all contexts, not just in top-level contexts as with previous approaches.
This makes functions which use non-standard evaluation much more useful. Previously, while they were nice to have for interactive use, you needed to use more verbose standard evaluation functions when writing functions and packages. Now the same function can be used in all contexts without needing to modify the code!
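As an illustration of writing functions on top of such a tool, the sketch below forwards a condition through a wrapper using rlang's {{ }} (embrace) operator. The names filter_rows and above are hypothetical, and filter_rows is a shortened, row-only stand-in for the tidy evaluation approach rather than the full function from this answer.

```r
# Row-only stand-in for the tidy evaluation approach, shortened for illustration.
filter_rows <- function(x, cond) {
  r <- rlang::eval_tidy(rlang::enquo(cond), x)
  x[r & !is.na(r), , drop = FALSE]
}

# Hypothetical wrapper: {{ }} forwards the caller's expression together
# with the environment it was written in.
above <- function(df, cond) filter_rows(df, {{ cond }})

res <- local({
  cutoff <- 30
  above(mtcars, mpg > cutoff)   # cutoff is found in this local block
})
rownames(res)
```

The same wrapper written with base subset(df, cond) would fail, because subset() would try to evaluate the symbol cond itself rather than the expression the caller supplied.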
For more details on non-standard evaluation please see Lionel Henry's Tidy evaluation (hygienic fexprs) presentation, the rlang vignette on tidy evaluation and the programming with dplyr vignette.