I have two related use-cases in which I need to summarise just parts of a table, specified in a way similar to filter
.
In a nutshell, I want something like this:
iris %>%
use_only(Species == 'setosa') %>%
summarise_each(funs(sum), -Species) %>%
mutate(Species = 'setosa_sum') %>%
use_all()
To yield this:
Source: local data frame [101 x 5]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 250.3 171.4 73.1 12.3 setosa_sum
2 7.0 3.2 4.7 1.4 versicolor
3 6.4 3.2 4.5 1.5 versicolor
4 6.9 3.1 4.9 1.5 versicolor
5 5.5 2.3 4.0 1.3 versicolor
…
So instead of grouping by the value of a column, I use a filtering criterion to operate on a view of the table, without actually losing the rest of the table (unlike filter).
How do I smartly implement use_only
/use_all
? Even better, is this functionality already contained in dplyr
and how do I use it?
It’s of course quite easy to generate the result above, but I need to do something similar for many different cases, with complex and variable criteria for filtering.
I implemented this with the approach of having use_only
save the rest of the table into a global option dplyr_use_only_rest
, and having use_all
bind it back together.
use_only <- function(.data, ...) {
if (!is.null(.data$.index)) {
stop("data cannot already have .index column, would be overwritten")
}
filt <- .data %>%
mutate(.index = row_number()) %>%
filter(...)
rest <- .data %>% slice(-filt$.index)
options(dplyr_use_only_rest = rest)
select(filt, -.index)
}
use_all <- function(.data, ...) {
rest <- getOption("dplyr_use_only_rest")
if (is.null(rest)) {
stop("called use_all() without earlier use_only()")
}
options(dplyr_use_only_rest = NULL)
bind_rows(.data, rest)
}
I recognize setting global options is less than ideal design for functional programming, but I don't think there's another way to ensure that the remainder of the data frame passes through any intermediate functions untouched. Adding an extra attribute to the object wouldn't survive functions such as do
or summarize
.
At this point,
iris %>%
use_only(Species == 'setosa') %>%
summarise_each(funs(sum), -Species) %>%
mutate(Species = 'setosa_sum') %>%
use_all()
returns, as desired:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 250.3 171.4 73.1 12.3 setosa_sum
2 7.0 3.2 4.7 1.4 versicolor
3 6.4 3.2 4.5 1.5 versicolor
4 6.9 3.1 4.9 1.5 versicolor
5 5.5 2.3 4.0 1.3 versicolor
...
Any intermediate steps could be used in place of summarize_each
and mutate
(do
, filter
, etc) and they would happen only to the specified rows. You could even add or remove columns (the remainder would be filled in with NA
s).
I think your approach of searching for a function to satisfy that particular syntax is too restrictive. This is what I would do using data.table
(I'm not sure if dplyr
allows for variable rows like this yet, I know it's been an FR for a while):
library(data.table)
dt = as.data.table(iris)
dt[, if (Species == 'setosa') lapply(.SD, sum) else .SD, by = Species]
# Species Sepal.Length Sepal.Width Petal.Length Petal.Width
# 1: setosa 250.3 171.4 73.1 12.3
# 2: versicolor 7.0 3.2 4.7 1.4
# 3: versicolor 6.4 3.2 4.5 1.5
# 4: versicolor 6.9 3.1 4.9 1.5
# 5: versicolor 5.5 2.3 4.0 1.3
# ---
You can also add [Species == 'setosa', Species := 'setosa_sum']
at the end to modify the name in place. It should be straightforward to extend to multiple criteria/whatever function.
You can create a new column to group by:
iris %>%
mutate( group1 = ifelse(Species == "setosa", "", row_number())) %>%
group_by( group1, Species ) %>%
summarise_each(funs(sum), -Species, -group1) %>%
ungroup() %>%
select(-group1)
library(lazyeval)
use_only_ <- function(x, condition, ...) {
condition <- as.lazy(condition, parent.frame())
mutate_(x, .group = condition) %>%
group_by_(".group", ...)
}
use_only <- function(x, condition, ...) {
use_only_(x, lazy(condition), ...)
}
use_all <- function(x) {
ungroup(x) %>%
select(- .group)
}
Use use_only
with any condition in the context of data frame and calling environment. In this case:
iris %>%
use_only( ifelse(Species == "setosa", "", row_number()), "Species") %>%
summarise_each(funs(sum), -Species, -.group) %>%
use_all()
The use_only_
can be used with formula or string. For example:
condition <- ~ifelse(Species == "setosa", "", row_number())
or
condition <- "ifelse(Species == 'setosa' , "", row_number())"
And call:
iris %>%
use_only_(condition, "Species") %>%
summarise_each(funs(sum), -Species, -.group) %>%
use_all()
When mutate-ing between the use_only
and use_all
calls you must take care to change only values inside marked group.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With