In R, I want to summarize my data after grouping it by the runs of a variable x (that is, each group corresponds to a subset of the data where consecutive x values are the same). For instance, consider the following data frame, where I want to compute the average y value within each run of x:
(dat <- data.frame(x=c(1, 1, 1, 2, 2, 1, 2), y=1:7))
# x y
# 1 1 1
# 2 1 2
# 3 1 3
# 4 2 4
# 5 2 5
# 6 1 6
# 7 2 7
In this example, the x variable has runs of length 3, then 2, then 1, and finally 1, taking values 1, 2, 1, and 2 in those four runs. The corresponding means of y in those groups are 2, 4.5, 6, and 7.
It is easy to carry out this grouped operation in base R using tapply, passing dat$y as the data, using rle to compute the run number from dat$x, and passing the desired summary function:
tapply(dat$y, with(rle(dat$x), rep(seq_along(lengths), lengths)), mean)
# 1 2 3 4
# 2.0 4.5 6.0 7.0
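To make the one-liner above easier to follow, here is a minimal sketch that splits it into its intermediate steps. It uses only base R and the same dat as in the question; the variable names r and run are chosen here for illustration.

```r
# Same data as in the question
dat <- data.frame(x = c(1, 1, 1, 2, 2, 1, 2), y = 1:7)

# rle() returns the length and the value of each run of x
r <- rle(dat$x)
r$lengths   # 3 2 1 1
r$values    # 1 2 1 2

# Repeat each run's index by that run's length to get one run number per row
run <- rep(seq_along(r$lengths), r$lengths)
run         # 1 1 1 2 2 3 4

# Group y by run number and average within each group
means <- tapply(dat$y, run, mean)
means
```

The run vector is the only part that depends on rle; any other summary function can be substituted for mean in the last step.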
I figured I would be able to pretty directly carry over this logic to dplyr, but my attempts so far have all ended in errors:
library(dplyr)
# First attempt
dat %>%
group_by(with(rle(x), rep(seq_along(lengths), lengths))) %>%
summarize(mean(y))
# Error: cannot coerce type 'closure' to vector of type 'integer'
# Attempt 2 -- maybe "with" is the problem?
dat %>%
group_by(rep(seq_along(rle(x)$lengths), rle(x)$lengths)) %>%
summarize(mean(y))
# Error: invalid subscript type 'closure'
For completeness, I could reimplement the rle run id myself using cumsum, head, and tail to get around this, but it makes the grouping code tougher to read and involves a bit of reinventing the wheel:
dat %>%
group_by(run=cumsum(c(1, head(x, -1) != tail(x, -1)))) %>%
summarize(mean(y))
# run mean(y)
# (dbl) (dbl)
# 1 1 2.0
# 2 2 4.5
# 3 3 6.0
# 4 4 7.0
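One way to keep the grouping call readable might be to move the rle logic into a small named function. The sketch below defines a hypothetical helper, run_id (not part of base R or dplyr), and checks it against the cumsum version using base R only. It should also be usable as group_by(run = run_id(x)), though whether that avoids the error in the dplyr version used above is not verified here.

```r
# Same data as in the question
dat <- data.frame(x = c(1, 1, 1, 2, 2, 1, 2), y = 1:7)

# Hypothetical helper (an assumption for illustration): number the runs of v
run_id <- function(v) {
  r <- rle(v)
  rep(seq_along(r$lengths), r$lengths)
}

# The cumsum-based run id from the question, for comparison
manual <- cumsum(c(1, head(dat$x, -1) != tail(dat$x, -1)))

run_id(dat$x)                  # 1 1 1 2 2 3 4
all(run_id(dat$x) == manual)   # TRUE

# Base R grouped mean using the helper
res <- aggregate(y ~ run, data = transform(dat, run = run_id(x)), FUN = mean)
res
```

Because the helper receives a plain vector, the rle call no longer depends on how the calling function evaluates names such as lengths.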
What is causing my rle-based grouping code to fail in dplyr, and is there any solution that enables me to keep using rle when grouping by run id?
One option seems to be the use of {} as in:
dat %>%
group_by(yy = {yy = rle(x); rep(seq_along(yy$lengths), yy$lengths)}) %>%
summarize(mean(y))
#Source: local data frame [4 x 2]
#
# yy mean(y)
# (int) (dbl)
#1 1 2.0
#2 2 4.5
#3 3 6.0
#4 4 7.0
It would be nice if future dplyr versions also had an equivalent of data.table's rleid function.
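For reference, here is a short sketch of what rleid does, assuming the data.table package is installed (rleid is an exported data.table function). The column names run and mean_y are chosen here for illustration.

```r
library(data.table)

# Same data as in the question
dat <- data.frame(x = c(1, 1, 1, 2, 2, 1, 2), y = 1:7)

# rleid() returns one run number per element
ids <- rleid(dat$x)
ids   # 1 1 1 2 2 3 4

# Grouped mean in data.table syntax, grouping by the run id of x
dt <- as.data.table(dat)
res <- dt[, .(mean_y = mean(y)), by = .(run = rleid(x))]
res
```

Since rleid takes and returns plain vectors, it can also be called on a column outside of data.table syntax. Newer dplyr releases (1.1.0 and later) provide consecutive_id for the same purpose.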
I noticed that this problem occurs when using a data.frame or tbl_df input, but not when using a tbl_dt or data.table input:
dat %>%
tbl_df %>%
group_by(yy = with(rle(x), rep(seq_along(lengths), lengths))) %>%
summarize(mean(y))
Error: cannot coerce type 'closure' to vector of type 'integer'
dat %>%
tbl_dt %>%
group_by(yy = with(rle(x), rep(seq_along(lengths), lengths))) %>%
summarize(mean(y))
Source: local data table [4 x 2]
yy mean(y)
(int) (dbl)
1 1 2.0
2 2 4.5
3 3 6.0
4 4 7.0
I reported this as an issue on dplyr's GitHub page.
If you explicitly create a grouping variable g, it more or less works:
dat %>%
transform(g = with(rle(dat$x), rep(seq_along(lengths), lengths))) %>%
group_by(g) %>%
summarize(mean(y))
Source: local data frame [4 x 2]
g mean(y)
(int) (dbl)
1 1 2.0
2 2 4.5
3 3 6.0
4 4 7.0
I used transform here because mutate throws an error.