
R object not found if defined within a function when using data.table dplyr

Note: The described behaviour has been fixed in the dev version of dplyr. You can install it using devtools::install_github("hadley/dplyr").

Please see this minimal example; I am using dplyr v0.3.0.2 and data.table v1.9.4

library(dplyr)
library(data.table)
f <- function(x, y, bad) { 
  z <- data.table(x,y, key = "x")    
  z2 <- z %>% group_by(x) %>% summarise(sum.bad = sum(y == bad))
  z2
}

f(rnorm(100), rnorm(100) < 0, bad = FALSE) 

When I run the above I get

Error in `[.data.table`(dt, , list(sum.bad = sum(y == bad)), by = vars) : 
  object 'bad' not found

However, bad is clearly defined and in scope.

If I just run this outside of a function, it works:

  x <- rnorm(100)
  y <- rnorm(100) <0
  bad <- FALSE
  z <- data.table(x,y, key = "x")

  z2 <- z %>% group_by(x) %>% summarise(sum.bad = sum(y == bad))
  z2

What is the issue here? Is it a bug in either data.table or dplyr?

xiaodai asked Jan 05 '15 05:01



1 Answer

This seems to be a problem with how dplyr sets up the environment for the data.table call. The problem appears in the dplyr:::summarise_.grouped_dt function, which currently looks like this:

function (.data, ..., .dots) 
{
    dots <- lazyeval::all_dots(.dots, ..., all_named = TRUE)
    for (i in seq_along(dots)) {
        if (identical(dots[[i]]$expr, quote(n()))) {
            dots[[i]]$expr <- quote(.N)
        }
    }
    list_call <- lazyeval::make_call(quote(list), dots)
    call <- substitute(dt[, list_call, by = vars], list(list_call = list_call$expr))
    env <- dt_env(.data, parent.frame())
    out <- eval(call, env)
    grouped_dt(out, drop_last(groups(.data)), copy = FALSE)
}
<environment: namespace:dplyr>

and if we debug that function and look at the trace when it's called, we see

where 1: summarise_.grouped_dt(.data, .dots = lazyeval::lazy_dots(...))
where 2: summarise_(.data, .dots = lazyeval::lazy_dots(...))
where 3: summarise(., sum.bad = sum(y == bad))
where 4: function_list[[k]](value)
where 5: withVisible(function_list[[k]](value))
where 6: freduce(value, `_function_list`)
where 7: `_fseq`(`_lhs`)
where 8: eval(expr, envir, enclos)
where 9: eval(quote(`_fseq`(`_lhs`)), env, env)
where 10: withVisible(eval(quote(`_fseq`(`_lhs`)), env, env))
where 11 at #3: z %>% group_by(x) %>% summarise(sum.bad = sum(y == bad))
where 12: f(rnorm(100), rnorm(100) < 0, bad = FALSE)
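The trace shows several frames sitting between f() and the function that finally builds the data.table call. As a minimal base-R sketch of why that matters (low_level, wrapper and f_demo are hypothetical stand-ins, not dplyr code), here is how an extra wrapper changes what parent.frame() returns:

```r
# Hypothetical stand-ins; none of this is dplyr code.
low_level <- function(depth = 1) {
  # Frame `depth` calls up the stack from low_level()
  env <- parent.frame(depth)
  # Is `bad` defined in exactly that frame (ignoring enclosing environments)?
  exists("bad", envir = env, inherits = FALSE)
}

# An intermediate function, playing the role of an extra hop in the call chain
wrapper <- function(depth = 1) low_level(depth)

f_demo <- function(bad) {
  c(direct         = low_level(),   # caller is f_demo, which has `bad`
    wrapped        = wrapper(),     # caller is wrapper, which has no `bad`
    wrapped_depth2 = wrapper(2))    # two frames up reaches f_demo again
}

res <- f_demo(bad = FALSE)
print(res)
```

Called directly, low_level() sees the frame of f_demo. Called through wrapper(), it sees the frame of wrapper() unless it looks two frames up. This would also be consistent with the top-level version in the question working: there bad lives in the global environment, which is reachable from any of these frames.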

So the important line is the

env <- dt_env(.data, parent.frame())

one. Here it's setting up the environment chain that determines where the variables in the call are looked up. It just uses parent.frame(), which points to where the function was called from, but since you actually jump through a few hoops to get to this function from your summarize call inside f(), this doesn't seem to be the right parent frame. If, instead, you run

env <- dt_env(.data, parent.frame(2))

in debug mode, that seems to actually get at the correct parent frame. So I think the problem is the jump from summarize() to summarize_(), because this

ff <- function(x, y, bad) { 
  z <- data.table(x,y, key = "x")    
  z2 <- z %>% group_by(x) %>% summarise_(.dots=list(sum.bad = quote(sum(y == bad))))
  z2
}

ff(rnorm(100), rnorm(100) < 0, bad = FALSE) 

seems to work. So it's really dplyr that needs to set up the correct environment. The tricky part is that the right frame appears to be different depending on whether you call summarize or summarize_ directly. Perhaps summarise() could change the environment when it calls summarise_ to have the same parent.frame via eval(). But I'd probably file this as a bug report and let Hadley decide how to fix it. Something like

summarise <- function(.data, ...) {
  call <- match.call()
  call <- as.call(c(as.list(call)[1:2], list(.dots=as.list(call)[-(1:2)])))
  call[[1]] <- quote(summarise_)
  eval(call, envir=parent.frame())
}

would be a "traditional" way to do it. Not sure if the lazyeval package has nicer ways to do this or not.
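To see the forwarding idea in isolation, here is a small base-R sketch, assuming a hypothetical verb_() that only reports whether its caller's frame contains bad (verb_plain, verb_forward and g are also made up for illustration and are not dplyr functions):

```r
# Hypothetical stand-in for an underscore verb: it inspects its caller's frame.
verb_ <- function() {
  env <- parent.frame()
  exists("bad", envir = env, inherits = FALSE)
}

# Plain wrapper: verb_() is called from the wrapper's own frame.
verb_plain <- function() verb_()

# Forwarding wrapper: rewrite the matched call so it targets verb_(),
# then evaluate it in the frame this wrapper was called from.
verb_forward <- function() {
  call <- match.call()
  call[[1]] <- quote(verb_)
  eval(call, envir = parent.frame())
}

g <- function(bad) {
  c(plain = verb_plain(), forwarded = verb_forward())
}

res2 <- g(bad = FALSE)
print(res2)
```

With the plain wrapper, verb_() sees the wrapper's frame, which has no bad. With the forwarding wrapper, the call is evaluated in g()'s frame, so verb_() behaves as if g() had called it directly.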

Tested with data.table_1.9.2 and dplyr_0.3.0.2

MrFlick answered Oct 22 '22 13:10