Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Use rle to group by runs when using dplyr

In R, I want to summarize my data after grouping it based on the runs of a variable x (aka each group of the data corresponds to a subset of the data where consecutive x values are the same). For instance, consider the following data frame, where I want to compute the average y value within each run of x:

(dat <- data.frame(x=c(1, 1, 1, 2, 2, 1, 2), y=1:7))
#   x y
# 1 1 1
# 2 1 2
# 3 1 3
# 4 2 4
# 5 2 5
# 6 1 6
# 7 2 7

In this example, the x variable has runs of length 3, then 2, then 1, and finally 1, taking values 1, 2, 1, and 2 in those four runs. The corresponding means of y in those groups are 2, 4.5, 6, and 7.

It is easy to carry out this grouped operation in base R using tapply, passing dat$y as the data, using rle to compute the run number from dat$x, and passing the desired summary function:

tapply(dat$y, with(rle(dat$x), rep(seq_along(lengths), lengths)), mean)
#   1   2   3   4 
# 2.0 4.5 6.0 7.0 

I figured I would be able to pretty directly carry over this logic to dplyr, but my attempts so far have all ended in errors:

library(dplyr)
# First attempt
dat %>%
  group_by(with(rle(x), rep(seq_along(lengths), lengths))) %>%
  summarize(mean(y))
# Error: cannot coerce type 'closure' to vector of type 'integer'

# Attempt 2 -- maybe "with" is the problem?
dat %>%
  group_by(rep(seq_along(rle(x)$lengths), rle(x)$lengths)) %>%
  summarize(mean(y))
# Error: invalid subscript type 'closure'

For completeness, I could reimplement the rle run id myself using cumsum, head, and tail to get around this, but it makes the grouping code tougher to read and involves a bit of reinventing the wheel:

dat %>%
  group_by(run=cumsum(c(1, head(x, -1) != tail(x, -1)))) %>%
  summarize(mean(y))
#     run mean(y)
#   (dbl)   (dbl)
# 1     1     2.0
# 2     2     4.5
# 3     3     6.0
# 4     4     7.0

What is causing my rle-based grouping code to fail in dplyr, and is there any solution that enables me to keep using rle when grouping by run id?

like image 929
josliber Avatar asked Feb 06 '16 21:02

josliber


2 Answers

One option seems to be the use of {} as in:

dat %>%
    group_by(yy = {yy = rle(x); rep(seq_along(yy$lengths), yy$lengths)}) %>%
    summarize(mean(y))
#Source: local data frame [4 x 2]
#
#     yy mean(y)
#  (int)   (dbl)
#1     1     2.0
#2     2     4.5
#3     3     6.0
#4     4     7.0

It would be nice if future dplyr versions also had an equivalent of data.table's rleid function.


I noticed that this problem occurs when using a data.frame or tbl_df input but not, when using a tbl_dt or data.table input:

dat %>% 
    tbl_df %>% 
    group_by(yy = with(rle(x), rep(seq_along(lengths), lengths))) %>%
    summarize(mean(y))
Error: cannot coerce type 'closure' to vector of type 'integer'

dat %>% 
    tbl_dt %>% 
    group_by(yy = with(rle(x), rep(seq_along(lengths), lengths))) %>%
    summarize(mean(y))
Source: local data table [4 x 2]

     yy mean(y)
  (int)   (dbl)
1     1     2.0
2     2     4.5
3     3     6.0
4     4     7.0

I reported this as an issue on dplyr's github page.

like image 93
talat Avatar answered Nov 10 '22 14:11

talat


If you explicitly create a grouping variable g it more or less works:

> dat %>% transform(g=with(rle(dat$x),{ rep(seq_along(lengths), lengths)}))%>%                                   
 group_by(g) %>% summarize(mean(y))
Source: local data frame [4 x 2]

      g mean(y)
  (int)   (dbl)
1     1     2.0
2     2     4.5
3     3     6.0
4     4     7.0

I used transform here because mutate throws an error.

like image 2
Neal Fultz Avatar answered Nov 10 '22 15:11

Neal Fultz