I didn't find a solution for this common grouping problem in R:
This is my original dataset
ID State
1 A
2 A
3 B
4 B
5 B
6 A
7 A
8 A
9 C
10 C
This should be my grouped resulting dataset
State min(ID) max(ID)
A 1 2
B 3 5
A 6 8
C 9 10
So the idea is to sort the dataset first by the ID column (or a timestamp column). Then all connected states with no gaps should be grouped together and the min and max ID value should be returned. It's related to the rle method, but this doesn't allow the calculation of min, max values for the groups.
Any ideas?
You could try:
library(dplyr)
df %>%
mutate(rleid = cumsum(State != lag(State, default = ""))) %>%
group_by(rleid) %>%
summarise(State = first(State), min = min(ID), max = max(ID)) %>%
select(-rleid)
Or as per mentioned by @alistaire in the comments, you can actually mutate within group_by()
with the same syntax, combining the first two steps. Stealing data.table::rleid()
and using summarise_all()
to simplify:
df %>%
group_by(State, rleid = data.table::rleid(State)) %>%
summarise_all(funs(min, max)) %>%
select(-rleid)
Which gives:
## A tibble: 4 × 3
# State min max
# <fctr> <int> <int>
#1 A 1 2
#2 B 3 5
#3 A 6 8
#4 C 9 10
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With