
R: Using dplyr inside a function. exception in eval(expr, envir, enclos): unknown column

Tags:

function

r

dplyr

I have created a function in R, based on the kind help of @Jim M.

When I run the function I get the error: Error: unknown column 'rawdata'. In the debugger I see the message: Rcpp::exception in eval(expr, envir, enclos): unknown column 'rawdata'

However, when I look at the environment window I can see the two variables I passed to the function, and both contain data: rawdata is a factor with 7 levels and refdata is a factor with 28 levels.

function (refdata, rawdata)
{
  wordlist <- expand.grid(rawdata = rawdata, refdata = refdata,
                          stringsAsFactors = FALSE)
  wordlist %>%
    group_by(rawdata) %>%
    mutate(match_score = jarowinkler(rawdata, refdata)) %>%
    summarise(match = match_score[which.max(match_score)],
              matched_to = ref[which.max(match_score)])
}
John Smith asked Mar 18 '15 10:03



1 Answer

This is the problem with functions that use NSE (non-standard evaluation). Such functions are very convenient in interactive work but cause many problems in development, i.e. when you try to use them inside other functions. Because the expressions are captured rather than evaluated directly, R cannot find the objects in the environments it searches. I suggest you read here, preferably the chapter on scoping issues, for more info.

First of all, you need to know that ALL the standard dplyr functions use NSE. Let's look at an example that approximates your problem:
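To make "captured rather than evaluated" concrete, here is a minimal base R sketch. The names show_expr and wrapper are hypothetical and used only for illustration; dplyr's internals are more involved, but the symptom is the same: the inner function sees the name of the wrapper's argument, not the name you typed.

```r
# show_expr() uses NSE: it captures the expression the caller typed
# instead of evaluating it (hypothetical helper, not part of dplyr).
show_expr <- function(x) deparse(substitute(x))

# Called directly, it sees the name we typed.
direct <- show_expr(col1)

# Called through another function, it only sees the wrapper's argument name.
wrapper <- function(column) show_expr(column)
wrapped <- wrapper(col1)

print(direct)   # "col1"
print(wrapped)  # "column"
```

Note that col1 never has to exist here, because it is never evaluated.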

Data:

df <- data.frame(col1 = rep(c('a','b'), each=5), col2 = runif(10))


> df
   col1       col2
1     a 0.03366446
2     a 0.46698763
3     a 0.34114682
4     a 0.92125387
5     a 0.94511394
6     b 0.67241460
7     b 0.38168131
8     b 0.91107090
9     b 0.15342089
10    b 0.60751868

Let's see how NSE makes this simple example break.

First of all, the simple interactive case works:

df %>% group_by(col1) %>% summarise(count = n())

Source: local data frame [2 x 2]

  col1 count
1    a     5
2    b     5

Let's see what happens if I put it in a function:

lets_group <- function(column) {
  df %>% group_by(column) %>% summarise(count = n())
}

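> lets_group(col1)
Error: index out of bounds 

This is not the same error as yours, but it is also caused by NSE: inside the function, group_by looks for a column literally named column rather than using the value that was passed in. Exactly the same line of code worked outside the function.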

Fortunately, there is a solution to your problem: standard evaluation (SE). Hadley also provided a standard-evaluation version of every dplyr verb. They have the same names as the normal functions with an underscore (_) appended, e.g. group_by_ and summarise_.

Now look at how this works:

# notice the formula operator (~) in the call to summarise_
lets_group2 <- function(column) {
  df %>% group_by_(column) %>% summarise_(count = ~n())
}

This yields the following result:

# also notice the quotes around col1
> lets_group2('col1')
Source: local data frame [2 x 2]

  col1 count
1    a     5
2    b     5

I cannot test your problem, but using SE instead of NSE should give you the results you want. For more info you can also read here.
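One caveat for readers on newer installations: the underscore verbs were deprecated starting with dplyr 0.7.0 in favour of tidy evaluation. As a minimal sketch, assuming a recent dplyr together with rlang 0.4.0 or later (which introduced the {{ }} operator), the same grouping function could be written as follows. The name lets_group3 and the extra data argument are my own additions for illustration.

```r
library(dplyr)

df <- data.frame(col1 = rep(c("a", "b"), each = 5), col2 = runif(10))

# {{ column }} forwards the bare column name supplied by the caller
# to group_by(), so the function can be called like the interactive code.
lets_group3 <- function(data, column) {
  data %>%
    group_by({{ column }}) %>%
    summarise(count = n())
}

res <- lets_group3(df, col1)
print(res)
```

Here the column is passed unquoted, as in lets_group3(df, col1), which is the opposite of the quoted 'col1' needed by the underscore version.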

LyzandeR answered Oct 26 '22 08:10
