Using dplyr summarise_each() with is.na()

Tags: r, dplyr
I'm trying to wrap some dplyr magic inside a function to produce a data.frame that I then print with xtable.

The ultimate aim is to have a dplyr version of this working. Reading around, I came across the very useful summarise_each() function, which I can use to process all columns after grouping with regroup() (since this is within a function).

The problem I've encountered (so far) is with calling is.na() from within summarise_each(): summarise_each(funs(is.na)) fails with Error: expecting a single value.

I'm purposefully not posting my function just yet but a minimal example follows (NB - This uses group_by() whilst in my function I replace this with regroup())...

library(dplyr)
library(magrittr)
t <- data.frame(grp = rbinom(10, 1, 0.5),
                a = as.factor(round(rnorm(10))),
                b = rnorm(10),
                c = rnorm(10))
t %>%
  group_by(grp) %>%  ## This is replaced with regroup() in my function
  summarise_each(funs(is.na))
# Error: expecting a single value

Running this fails, and it's the call to is.na() that is the problem, since if I instead work out the number of observations in each group (required to derive the proportion missing) it works...

t %>%
  group_by(grp) %>%  ## This is replaced with regroup() in my function
  summarise_each(funs(length))
# Source: local data frame [2 x 4]
#
#   grp a b c
# 1   0 8 8 8
# 2   1 2 2 2

The real problem, though, is that I do not need just is.na() within each column but sum(is.na()), as per the linked example, so what I would really like is...

t %>%
  group_by(grp) %>%  ## This is replaced with regroup() in my function
  summarise_each(funs(propmiss = sum(is.na) / length))

But the problem is that sum(is.na) doesn't work as I expect it to (likely because my expectation is wrong!)...

t %>%
  group_by(grp) %>%  ## This is replaced with regroup() in my function
  summarise_each(funs(nmiss = sum(is.na)))
# Error in sum(.Primitive("is.na")) : invalid 'type' (builtin) of argument

I tried calling is.na() explicitly with the brackets but that too returns an error...

t %>%
  group_by(grp) %>%  ## This is replaced with regroup() in my function
  summarise_each(funs(nmiss = sum(is.na())))
# Error in is.na() : 0 arguments passed to 'is.na' which requires 1

Any advice or pointers to documentation would be very gratefully received.

Thanks,

slackline

asked Sep 24 '14 13:09 by slackline


1 Answer

Here's a possibility, tested on a small data set with some NA:

df <- data.frame(a = rep(1:2, each = 3),
                 b = c(1, 1, NA, 1, NA, NA),
                 c = c(1, 1, 1, NA, NA, NA))

df
#   a  b  c
# 1 1  1  1
# 2 1  1  1
# 3 1 NA  1
# 4 2  1 NA
# 5 2 NA NA
# 6 2 NA NA


df %>% 
  group_by(a) %>%
  summarise_each(funs(sum(is.na(.)) / length(.)))
#   a         b c
# 1 1 0.3333333 0
# 2 2 0.6666667 1
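
As a cross-check that does not depend on dplyr, the same per-group proportions can be computed in base R. This is only a sketch and not part of the original answer: it rebuilds the df from above and uses mean(is.na(x)), which equals sum(is.na(x)) / length(x) because TRUE counts as 1. The helper name prop_missing and the result name res are made up for illustration.

```r
# Same toy data as in the answer above
df <- data.frame(a = rep(1:2, each = 3),
                 b = c(1, 1, NA, 1, NA, NA),
                 c = c(1, 1, 1, NA, NA, NA))

# Proportion of missing values in a vector (hypothetical helper name)
prop_missing <- function(x) mean(is.na(x))

# Apply the helper to columns b and c within each level of a.
# The data-frame method of aggregate() is used rather than the formula
# method, because the formula method drops rows containing NA by default.
res <- aggregate(df[c("b", "c")], by = list(a = df$a), FUN = prop_missing)
print(res)
```

If the two approaches agree on your real data, that is a reasonable sign the dplyr pipeline is doing what you intend.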

And because you asked for pointers to documentation: the . refers to each piece of the data (here, each column within each group) and is used in some of the Examples in ?summarize_each. It is described in the Arguments section of ?funs as a "dummy parameter" and is used in the Examples there. The . is also briefly described in the Arguments section of ?do: "... You can use . to refer to the current group".
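
To see why the attempts in the question failed and what the . placeholder provides, here is a small base R sketch (my own illustration, not taken from the dplyr documentation). It assumes that funs(sum(is.na(.)) / length(.)) behaves roughly like a one-argument function whose argument is named ., applied to each column within each group. The names x, attempt, prop_missing and by_group are hypothetical.

```r
x <- c(1, NA, NA)

# is.na(x) gives one logical per element, so on its own it is not a
# single summary value for a group
is.na(x)

# sum(is.na) passes the function object itself to sum(), which fails;
# tryCatch() captures the error message instead of stopping the script
attempt <- tryCatch(sum(is.na), error = function(e) conditionMessage(e))

# Roughly what the `.` placeholder stands for: the argument of a
# one-argument function (`.` is a legal variable name in R)
prop_missing <- function(.) sum(is.na(.)) / length(.)
prop_missing(x)

# Applied per group in base R, using column b of the answer's data
b <- c(1, 1, NA, 1, NA, NA)
a <- rep(1:2, each = 3)
by_group <- sapply(split(b, a), prop_missing)
print(by_group)
```

The point is that is.na always needs something to act on: written bare it is just a function object, written as is.na() it has no argument, and only is.na(.) tells it which data to inspect.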

answered Sep 22 '22 02:09 by Henrik