
Non-standard subsetting of data.frames

Tags: r, evaluation

One of the quirks of subsetting a data frame is that you have to repeatedly type the name of that data frame when mentioning columns. For example, the data frame cars is mentioned 3 times here:

cars[cars$speed == 4 & cars$dist < 10, ]
##   speed dist
## 1     4    2

The data.table package solves this.

library(data.table)
dt_cars <- as.data.table(cars)
dt_cars[speed == 4 & dist < 10]

As does dplyr.

library(dplyr)
cars %>% filter(speed == 4, dist < 10)

I'd like to know if a solution exists for standard-issue data.frames (that is, not resorting to data.table or dplyr).

I think I'm looking for something like

cars[MAGIC(speed == 4 & dist < 10), ]

or

MAGIC(cars[speed == 4 & dist < 10, ])

where MAGIC is to be determined.

I tried the following, but it gave me an error.

library(rlang)
cars[locally(speed == 4 & dist < 10), ]
# Error in locally(speed == 4 & dist < 10) : object 'speed' not found

Richie Cotton asked Dec 08 '22 16:12



2 Answers

1) subset This only requires that cars be mentioned once. No packages are used.

subset(cars, speed == 4 & dist < 10)
##   speed dist
## 1     4    2

2) sqldf This uses a package, but not dplyr or data.table, which were the only two packages excluded by the question:

library(sqldf)

sqldf("select * from cars where speed = 4 and dist < 10")
##   speed dist
## 1     4    2

3) assignment Not sure if this counts but you could assign cars to some other variable name such as . and then use that. In that case cars would only be mentioned once. This uses no packages.

. <- cars
.[.$speed == 4 & .$dist < 10, ]
##   speed dist
## 1     4    2

or

. <- cars
with(., .[speed == 4 & dist < 10, ])
##   speed dist
## 1     4    2

With respect to these two solutions you might want to check out this article on the Bizarro Pipe: http://www.win-vector.com/blog/2017/01/using-the-bizarro-pipe-to-debug-magrittr-pipelines-in-r/

4) magrittr This could also be expressed in magrittr and that package was not excluded by the question. Note we are using the magrittr %$% operator:

library(magrittr)

cars %$% .[speed == 4 & dist < 10, ]
##   speed dist
## 1     4    2
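
One difference between these options is worth a quick check: they do not all treat missing values the same way. The sketch below uses a small made-up data frame (the name df and its values are hypothetical, not from the question) to compare bracket indexing with subset() when the condition evaluates to NA.

```r
# Toy data with a missing speed value (hypothetical, not from the question)
df <- data.frame(speed = c(4, NA, 7), dist = c(2, 5, 4))

# Bracket indexing returns an all-NA row wherever the condition is NA
by_bracket <- df[df$speed == 4 & df$dist < 10, ]

# subset() treats an NA condition as FALSE and drops that row
by_subset <- subset(df, speed == 4 & dist < 10)

nrow(by_bracket)  # 2
nrow(by_subset)   # 1
```

The with() and %$% variants index with `[` as well, so they should behave like the bracket form here.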

G. Grothendieck answered Dec 29 '22 10:12



subset() is the base function that solves this problem. However, like all base R functions that use non-standard evaluation, subset() does not perform fully hygienic code expansion. As a result, it can look up a variable in the wrong environment when it is called indirectly from a non-global scope (for example, when it is passed to lapply()).

As an example, here we define the variable var in two places: first in the global scope with value 40, then in a local scope with value 30. local() is used here for simplicity; the behaviour would be the same inside a function. Intuitively, we would expect subset() to use the value 30 in the evaluation. However, upon executing the following code we see that the value 40 is used instead, so no rows are returned.

var <- 40

local({
  var <- 30
  dfs <- list(mtcars, mtcars)
  lapply(dfs, subset, mpg > var)
})

#> [[1]]
#>  [1] mpg  cyl  disp hp   drat wt   qsec vs   am   gear carb
#> <0 rows> (or 0-length row.names)
#> 
#> [[2]]
#>  [1] mpg  cyl  disp hp   drat wt   qsec vs   am   gear carb
#> <0 rows> (or 0-length row.names)

This happens because the parent.frame() used in subset() is the environment within the body of lapply() rather than the local block. Variable lookup from there continues through the base namespace to the global environment without ever visiting the local block, so var is found in the global environment with value 40.
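
A base-R way to sidestep this, shown here as a sketch rather than a general fix, is to wrap the call in an anonymous function written inside the local block. subset() is then called directly from that function, so parent.frame() is a frame whose enclosure is the local block.

```r
var <- 40

res <- local({
  var <- 30
  dfs <- list(mtcars, mtcars)
  # subset() is now called from the anonymous function, whose enclosing
  # environment is the local block, so var resolves to 30
  lapply(dfs, function(d) subset(d, mpg > var))
})

nrow(res[[1]])  # 4
```

This only helps because the wrapper is defined where var lives, and it has to be repeated at every call site.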

Hygienic variable expansion via quosures and quasiquotation (as implemented in the rlang package) solves this problem. We can define a variant of subset() using tidy evaluation that works properly in all contexts. The code is derived from, and largely identical to, that of base::subset.data.frame().

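subset2 <- function (x, subset, select, drop = FALSE, ...) {
  # Rows: keep everything if no condition was supplied
  r <- if (missing(subset))
    rep_len(TRUE, nrow(x))
  else {
    # enquo() captures the expression together with its environment;
    # eval_tidy() evaluates it with the columns of x in scope
    r <- rlang::eval_tidy(rlang::enquo(subset), x)
    if (!is.logical(r))
      stop("'subset' must be logical")
    # Treat NA as FALSE, as base::subset.data.frame() does
    r & !is.na(r)
  }
  # Columns: map names to positions so ranges like disp:wt work
  vars <- if (missing(select))
    TRUE
  else {
    nl <- as.list(seq_along(x))
    names(nl) <- names(x)
    rlang::eval_tidy(rlang::enquo(select), nl)
  }
  x[r, vars, drop = drop]
}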
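
To see what makes the difference, it may help to look at what enquo() actually captures. The following is a minimal sketch, assuming rlang is installed; capture() is a hypothetical helper introduced only for this illustration.

```r
library(rlang)

# Hypothetical helper: return the quosure for its argument
capture <- function(cond) enquo(cond)

q <- local({
  var <- 30
  capture(mpg > var)
})

quo_get_expr(q)                 # the captured expression, mpg > var
env_get(quo_get_env(q), "var")  # 30, taken from the local block
sum(eval_tidy(q, mtcars))       # 4 rows satisfy the condition
```

The quosure carries the local block's environment with it, so it no longer matters which function ends up evaluating the expression.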

For direct calls such as the following, subset2() behaves identically to base::subset.data.frame().

subset2(mtcars, gear > 4, disp:wt)
#>                 disp  hp drat    wt
#> Porsche 914-2  120.3  91 4.43 2.140
#> Lotus Europa    95.1 113 3.77 1.513
#> Ford Pantera L 351.0 264 4.22 3.170
#> Ferrari Dino   145.0 175 3.62 2.770
#> Maserati Bora  301.0 335 3.54 3.570

However, subset2() does not suffer from the scoping issues of subset(). In our previous example, the value 30 is used for var, as we would expect from lexical scoping rules.

local({
  var <- 30
  dfs <- list(mtcars, mtcars)
  lapply(dfs, subset2, mpg > var)
})

#> [[1]]
#>                 mpg cyl disp  hp drat    wt  qsec vs am gear carb
#> Fiat 128       32.4   4 78.7  66 4.08 2.200 19.47  1  1    4    1
#> Honda Civic    30.4   4 75.7  52 4.93 1.615 18.52  1  1    4    2
#> Toyota Corolla 33.9   4 71.1  65 4.22 1.835 19.90  1  1    4    1
#> Lotus Europa   30.4   4 95.1 113 3.77 1.513 16.90  1  1    5    2
#> 
#> [[2]]
#>                 mpg cyl disp  hp drat    wt  qsec vs am gear carb
#> Fiat 128       32.4   4 78.7  66 4.08 2.200 19.47  1  1    4    1
#> Honda Civic    30.4   4 75.7  52 4.93 1.615 18.52  1  1    4    2
#> Toyota Corolla 33.9   4 71.1  65 4.22 1.835 19.90  1  1    4    1
#> Lotus Europa   30.4   4 95.1 113 3.77 1.513 16.90  1  1    5    2

This allows non-standard evaluation to be used robustly in all contexts, not just in top-level contexts as with previous approaches.

This makes functions that use non-standard evaluation much more useful. Previously, such functions were nice to have for interactive use, but you needed more verbose standard-evaluation functions when writing functions and packages. Now the same function can be used in all contexts without needing to modify the code!
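
As a rough illustration of using the same pattern inside other functions, here is a sketch assuming rlang 0.4.0 or later (for the `{{ }}` operator). Both filter_rows() and fast_cars() are hypothetical names made up for this example.

```r
library(rlang)

# Hypothetical minimal filter using the same enquo()/eval_tidy() pattern
filter_rows <- function(df, cond) {
  keep <- eval_tidy(enquo(cond), df)
  df[keep & !is.na(keep), , drop = FALSE]
}

# Hypothetical wrapper: forwards the caller's condition with {{ }} and
# adds a second condition that refers to its own argument min_hp
fast_cars <- function(df, cond, min_hp) {
  filter_rows(filter_rows(df, {{ cond }}), hp > min_hp)
}

limit <- 30
out <- fast_cars(mtcars, mpg > limit, min_hp = 100)
rownames(out)  # "Lotus Europa"
```

Each condition is evaluated where it was written: limit is found in the caller's environment and min_hp in the wrapper's.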

For more details on non-standard evaluation please see Lionel Henry's Tidy evaluation (hygienic fexprs) presentation, the rlang vignette on tidy evaluation and the programming with dplyr vignette.

Jim answered Dec 29 '22 09:12
