Custom sum function in dplyr returns inconsistent results

Tags: r, dplyr

I've made a custom sum function that ignores NAs unless all are NA. When I use it in dplyr it returns odd results and I don't know why.

require(dplyr)

dta <- data.frame(year=2007:2013, rrconf=c(79, NaN ,474,2792,1686,3313,3456), enrolled=c(NaN,NaN,458,1222,1155,1906,2184))

sum0 <- function(x, ...){
  # remove NAs unless all are NA
  if(is.na(mean(x, na.rm=TRUE))) return(NA)
  else(sum(x, ..., na.rm=TRUE))
} 

dta %>%
  group_by(year) %>%
  summarize(rrconf=sum0(rrconf), enrolled=sum0(enrolled))

gives me

Source: local data frame [7 x 3]

  year rrconf enrolled
1 2007     79       NA
2 2008     NA       NA
3 2009    474     TRUE
4 2010   2792     TRUE
5 2011   1686     TRUE
6 2012   3313     TRUE
7 2013   3456     TRUE

In this case it is only summing over one value per group, but in my bigger application it might sum over multiple values. Wrapping my sum0 function in as.integer() seems to fix it, but I couldn't tell you why.

Is this the correct way to work around this problem? Is there something obvious I'm missing?

> sessionInfo()
R version 3.1.0 (2014-04-10)
Platform: i386-w64-mingw32/i386 (32-bit)

locale:
[1] LC_COLLATE=English_United Kingdom.1252  LC_CTYPE=English_United Kingdom.1252   
[3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C                           
[5] LC_TIME=English_United Kingdom.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] dplyr_0.2

loaded via a namespace (and not attached):
[1] assertthat_0.1 magrittr_1.0.1 parallel_3.1.0 Rcpp_0.11.2    tools_3.1.0 
Tom asked Oct 14 '14 01:10



1 Answer

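The issue seems to be that dplyr determines each output column's type from the first result returned. For enrolled, the first group (2007) returns a plain NA, which is logical by default, so the whole column is treated as logical and the later sums are coerced to TRUE. For rrconf, the first group returns 79, so the column stays numeric. If you return NA_real_ (or NA_integer_) instead, so that the missing value has the same type as the sums, then you will be sorted: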

##Just to show what NA normally does first:
class(NA)
#[1] "logical"

sum0 <- function(x, ...){
  # remove NAs unless all are NA
  if(is.na(mean(x, na.rm=TRUE))) return(NA_real_)
  else(sum(x, ..., na.rm=TRUE))
} 

dta %>%
  group_by(year) %>%
  summarize(rrconf=sum0(rrconf), enrolled=sum0(enrolled))

#Source: local data frame [7 x 3]
# 
#  year rrconf enrolled
#1 2007     79       NA
#2 2008     NA       NA
#3 2009    474      458
#4 2010   2792     1222
#5 2011   1686     1155
#6 2012   3313     1906
#7 2013   3456     2184
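
To see why the original output contained TRUE, and why wrapping in as.integer() also helped, here is a small base R sketch. The function name sum0_dbl is hypothetical and not part of the original code; swapping all(is.na(x)) in for the mean() check is my own substitution, and it assumes numeric input.

```r
# Plain NA is logical; the typed variants carry a numeric type.
class(NA)
class(NA_real_)
class(NA_integer_)

# Coercing a nonzero number to logical yields TRUE, which matches the
# TRUE values in the question's `enrolled` column.
as.logical(458)

# as.integer() converts a logical NA to NA_integer_, which is presumably
# why wrapping sum0() in as.integer() made every group return one type.
class(as.integer(NA))

# Hypothetical type-stable variant (illustrative name): always returns a double.
sum0_dbl <- function(x, ...) {
  # is.na() is TRUE for both NA and NaN, so an all-missing group returns NA_real_
  if (all(is.na(x))) return(NA_real_)
  # as.numeric() keeps the result a double even when x is an integer vector
  as.numeric(sum(x, ..., na.rm = TRUE))
}

sum0_dbl(c(NaN, 458))
sum0_dbl(NaN)
```

Because every branch returns a double, the first group's result can no longer steer the column towards logical, whichever group comes first.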
thelatemail answered Oct 15 '22 02:10
