Call custom function with if statement in the summarize function in dplyr

Tags: r, dplyr

I need to call a custom function to do some calculations. The function contains an if statement that checks the input values, but my code doesn't return the values I expect.

First, I create a test data.frame:

library(dplyr)
df <- expand.grid(x = 2:4, y = 2:4, z = 2:4)
df$value <- df$x
df <- df%>% tbl_df %>% group_by(x, y)

test_fun1 just returns the sum of all values:

test_fun1 <- function(value)
{
    return(sum(value))
}
df %>% summarize(t  = test_fun1(value))

test_fun1 returns the results I expect:

Source: local data frame [9 x 3]
Groups: x

  x y  t
1 2 2  6
2 2 3  6
3 2 4  6
4 3 2  9
5 3 3  9
6 3 4  9
7 4 2 12
8 4 3 12
9 4 4 12

Then I add an if statement to check whether all values equal 2:

test_fun2 <- function(value)
{
    if (all(value == 2))
    {
        return (NA)
    }
    return(sum(value))
}
df  %>% summarize(t  = test_fun2(value))

But test_fun2 returns TRUE for the groups whose values are greater than 2:

Source: local data frame [9 x 3]
Groups: x

  x y    t
1 2 2   NA
2 2 3   NA
3 2 4   NA
4 3 2 TRUE
5 3 3 TRUE
6 3 4 TRUE
7 4 2 TRUE
8 4 3 TRUE
9 4 4 TRUE

The results are as expected with test_fun3, which tests for a different value (3):

test_fun3 <- function(value)
{
    if (all(value != 3))
    {
        return(sum(value))
    }
    return (NA)

}
df  %>% summarize(t  = test_fun3(value))

I get similar results when testing for 4 or 5:

Source: local data frame [9 x 3]
Groups: x

  x y  t
1 2 2  6
2 2 3  6
3 2 4  6
4 3 2 NA
5 3 3 NA
6 3 4 NA
7 4 2 12
8 4 3 12
9 4 4 12

In my real data, I got FALSE for groups that should not be NA, but I cannot create a reproducible example of that here.

Any ideas about this problem? Thanks for any suggestions.

sessionInfo()
R version 3.1.0 (2014-04-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_Australia.1252  LC_CTYPE=English_Australia.1252   
[3] LC_MONETARY=English_Australia.1252 LC_NUMERIC=C                      
[5] LC_TIME=English_Australia.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] dplyr_0.2

loaded via a namespace (and not attached):
[1] assertthat_0.1.0.99 magrittr_1.0.1      parallel_3.1.0     
[4] Rcpp_0.11.1         tools_3.1.0        
asked Aug 21 '14 03:08 by Bangyou

1 Answer

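The problem is that summarize (in this version of dplyr) determines the class of the result column from the first group's value and applies that class to all other groups. The class of NA is (unfortunately, in your case) logical, so the sums for the later groups are coerced to logical and every non-zero sum shows up as TRUE. test_fun3 works only because its first group returns an integer. For more details, see https://github.com/hadley/dplyr/issues/299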
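The coercion can be reproduced in base R alone. The following is a minimal sketch, not a view into dplyr's internals: it only shows what happens when integer sums are forced into a logical column, and how base R's own c() behaves differently.

```r
# Base R only; illustrates the coercion, not dplyr's actual implementation.
class(NA)                 # "logical"
class(sum(3L, 3L, 3L))    # "integer"

# Forcing integer sums into a logical column: any non-zero value becomes TRUE.
sums <- c(9L, 12L)
as_logical <- as.logical(sums)
as_logical                # TRUE TRUE

# Base R's c() goes the other way and promotes the plain NA to integer.
combined <- c(NA, sums)
class(combined)           # "integer"
```

This is why the groups with x equal to 3 or 4 print TRUE rather than 9 or 12.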

I would suggest working around this by returning a typed NA, such as NA_integer_. See also ?NA

test_fun2 <- function(value) {
  if (all(value == 2)) {
    return (NA_integer_)
  }
  return(sum(value))
}

df  %>% summarize(t  = test_fun2(value))

Source: local data frame [9 x 3]
Groups: x

  x y  t
1 2 2 NA
2 2 3 NA
3 2 4 NA
4 3 2  9
5 3 3  9
6 3 4  9
7 4 2 12
8 4 3 12
9 4 4 12
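NA_integer_ fits here because value is an integer column, so sum(value) is an integer too. If the column might be double instead, one possible variation is to pick the NA type from the input. This is only a sketch: typed_na and test_fun2_typed are hypothetical names introduced for illustration, and the helper assumes the input is numeric (integer or double).

```r
# Hypothetical helper (not part of dplyr or base R): returns an NA whose type
# matches what sum(value) would return for integer or double input.
typed_na <- function(value) {
  if (is.integer(value)) NA_integer_ else NA_real_
}

# Same logic as test_fun2, but the NA branch always matches the sum's type.
test_fun2_typed <- function(value) {
  if (all(value == 2)) {
    return(typed_na(value))
  }
  sum(value)
}

test_fun2_typed(c(2L, 2L, 2L))   # NA, of type integer
test_fun2_typed(c(3L, 3L, 3L))   # 9
test_fun2_typed(c(2, 2, 2))      # NA, of type double
```

It can be used in summarize in the same way as test_fun2 above.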
answered Oct 14 '22 05:10 by Beasterfield