Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How do I summarise only part of a table?

Tags:

I have two related use-cases in which I need to summarise just parts of a table, specified in a way similar to filter.

In a nutshell, I want something like this:

iris %>%
    use_only(Species == 'setosa') %>%
    summarise_each(funs(sum), -Species) %>%
    mutate(Species = 'setosa_sum') %>%
    use_all()

To yield this:

Source: local data frame [101 x 5]

   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
1         250.3       171.4         73.1        12.3 setosa_sum
2           7.0         3.2          4.7         1.4 versicolor
3           6.4         3.2          4.5         1.5 versicolor
4           6.9         3.1          4.9         1.5 versicolor
5           5.5         2.3          4.0         1.3 versicolor
…

So instead of grouping by the value of a column, I use a filtering criterion to operate on a view of the table, without actually losing the rest of the table (unlike filter).

How do I smartly implement use_only/use_all? Even better, is this functionality already contained in dplyr and how do I use it?

It’s of course quite easy to generate the result above, but I need to do something similar for many different cases, with complex and variable criteria for filtering.

like image 606
Konrad Rudolph Avatar asked Apr 21 '15 22:04

Konrad Rudolph


3 Answers

I implemented this with the approach of having use_only save the rest of the table into a global option dplyr_use_only_rest, and having use_all bind it back together.

use_only <- function(.data, ...) {
    if (!is.null(.data$.index)) {
        stop("data cannot already have .index column, would be overwritten")
    }
    filt <- .data %>%
        mutate(.index = row_number()) %>%
        filter(...)

    rest <- .data %>% slice(-filt$.index)
    options(dplyr_use_only_rest = rest)
    select(filt, -.index)
}

use_all <- function(.data, ...) {
    rest <- getOption("dplyr_use_only_rest")
    if (is.null(rest)) {
        stop("called use_all() without earlier use_only()")
    }
    options(dplyr_use_only_rest = NULL)
    bind_rows(.data, rest)
}

I recognize setting global options is less than ideal design for functional programming, but I don't think there's another way to ensure that the remainder of the data frame passes through any intermediate functions untouched. Adding an extra attribute to the object wouldn't survive functions such as do or summarize.

At this point,

iris %>%
    use_only(Species == 'setosa') %>%
    summarise_each(funs(sum), -Species) %>%
    mutate(Species = 'setosa_sum') %>%
    use_all()

returns, as desired:

   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
1         250.3       171.4         73.1        12.3 setosa_sum
2           7.0         3.2          4.7         1.4 versicolor
3           6.4         3.2          4.5         1.5 versicolor
4           6.9         3.1          4.9         1.5 versicolor
5           5.5         2.3          4.0         1.3 versicolor
...

Any intermediate steps could be used in place of summarize_each and mutate (do, filter, etc) and they would happen only to the specified rows. You could even add or remove columns (the remainder would be filled in with NAs).

like image 124
David Robinson Avatar answered Sep 24 '22 12:09

David Robinson


I think your approach of searching for a function to satisfy that particular syntax is too restrictive. This is what I would do using data.table (I'm not sure if dplyr allows for variable rows like this yet, I know it's been an FR for a while):

library(data.table)
dt = as.data.table(iris)

dt[, if (Species == 'setosa') lapply(.SD, sum) else .SD, by = Species]
#        Species Sepal.Length Sepal.Width Petal.Length Petal.Width
#  1:     setosa        250.3       171.4         73.1        12.3
#  2: versicolor          7.0         3.2          4.7         1.4
#  3: versicolor          6.4         3.2          4.5         1.5
#  4: versicolor          6.9         3.1          4.9         1.5
#  5: versicolor          5.5         2.3          4.0         1.3
# ---                                                             

You can also add [Species == 'setosa', Species := 'setosa_sum'] at the end to modify the name in place. It should be straightforward to extend to multiple criteria/whatever function.

like image 14
eddi Avatar answered Sep 23 '22 12:09

eddi


You can create a new column to group by:

iris %>%
  mutate( group1 = ifelse(Species == "setosa", "", row_number()))  %>%
  group_by( group1, Species ) %>%
  summarise_each(funs(sum), -Species, -group1) %>%
  ungroup() %>%
  select(-group1)

Update - as more general solution

library(lazyeval)

use_only_ <- function(x, condition, ...) {
  condition <- as.lazy(condition, parent.frame())
  mutate_(x, .group = condition) %>% 
    group_by_(".group", ...)
}

use_only <- function(x, condition, ...) {
  use_only_(x, lazy(condition), ...)
}

use_all <- function(x) {
  ungroup(x) %>%
    select(- .group)
}

Use use_only with any condition in the context of data frame and calling environment. In this case:

iris %>%
  use_only( ifelse(Species == "setosa", "", row_number()), "Species") %>%
  summarise_each(funs(sum), -Species, -.group) %>%
  use_all()

The use_only_ can be used with formula or string. For example:

condition <- ~ifelse(Species == "setosa", "", row_number())

or

condition <- "ifelse(Species == 'setosa' , "", row_number())"

And call:

iris %>%
  use_only_(condition, "Species") %>%
  summarise_each(funs(sum), -Species, -.group) %>%
  use_all()

When mutate-ing between the use_only and use_all calls you must take care to change only values inside marked group.

like image 5
bergant Avatar answered Sep 26 '22 12:09

bergant