Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Group a dataframe based on sequence breaks in a column?

Tags:

r

dplyr

I have a data.frame, which has a column of integer values. I need to form a grouping variable that identifies sequence breaks in that column. For instance, I could create another column of ascending integers that adds one whenever the original column's value is not greater than its lagged value. How do I do this?

E.g. if I have a data.frame like this:

df <- data.frame(A = c(1,2,4,6,78,3,56,78,23))

I need some way to produce new table with column B:

df$B <- c(1,1,1,1,1,2,2,2,3)

I have tried e.g. with dplyr:

df %>% mutate(B = 1,
              B = case_when(A < lag(A), B + 1))

That is not quite correct.

like image 679
Antti Avatar asked Mar 27 '19 11:03

Antti


1 Answers

We can use cumsum and diff which will increment the value every time the sequence is broken

cumsum(c(-1, diff(df$A)) < 0)
#[1] 1 1 1 1 1 2 2 2 3

We can also integrate into dplyr chain to get

library(dplyr)

df %>%
  mutate(B = cumsum(c(-1, diff(A)) < 0))

#   A B
#1  1 1
#2  2 1
#3  4 1
#4  6 1
#5 78 1
#6  3 2
#7 56 2
#8 78 2
#9 23 3

A hacky way using lag could be

df %>%
  mutate(B = cumsum(c(-1, (A - lag(A))[-1]) < 0))
like image 52
Ronak Shah Avatar answered Oct 03 '22 21:10

Ronak Shah