Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Grouping of R dataframe by connected values

I didn't find a solution for this common grouping problem in R:

This is my original dataset

ID  State
1   A
2   A
3   B
4   B
5   B
6   A
7   A
8   A
9   C
10  C

This should be my grouped resulting dataset

State   min(ID) max(ID)
A       1       2
B       3       5
A       6       8
C       9       10

So the idea is to sort the dataset first by the ID column (or a timestamp column). Then all connected states with no gaps should be grouped together and the min and max ID value should be returned. It's related to the rle method, but this doesn't allow the calculation of min, max values for the groups.

Any ideas?

like image 885
HansHupe Avatar asked Sep 15 '16 12:09

HansHupe


1 Answers

You could try:

library(dplyr)
df %>%
  mutate(rleid = cumsum(State != lag(State, default = ""))) %>%
  group_by(rleid) %>%
  summarise(State = first(State), min = min(ID), max = max(ID)) %>%
  select(-rleid)

Or as per mentioned by @alistaire in the comments, you can actually mutate within group_by() with the same syntax, combining the first two steps. Stealing data.table::rleid() and using summarise_all() to simplify:

df %>% 
  group_by(State, rleid = data.table::rleid(State)) %>% 
  summarise_all(funs(min, max)) %>% 
  select(-rleid)

Which gives:

## A tibble: 4 × 3
#   State   min   max
#  <fctr> <int> <int>
#1      A     1     2
#2      B     3     5
#3      A     6     8
#4      C     9    10
like image 94
Steven Beaupré Avatar answered Sep 19 '22 18:09

Steven Beaupré