Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

R - Identify a sequence of row elements by groups in a dataframe

Tags:

dataframe

r

dplyr

Consider the following sample dataframe:

> df
   id name time
1   1    b   10
2   1    b   12
3   1    a    0
4   2    a    5
5   2    b   11
6   2    a    9
7   2    b    7
8   1    a   15
9   2    b    1
10  1    a    3

df = structure(list(id = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 2L, 1L), 
    name = c("b", "b", "a", "a", "b", "a", "b", "a", "b", "a"
    ), time = c(10L, 12L, 0L, 5L, 11L, 9L, 7L, 15L, 1L, 3L)), .Names = c("id", 
"name", "time"), row.names = c(NA, -10L), class = "data.frame")

I need to identify and record all sequences seq <- c("a","b"), where "a" precedes "b" based on "time" column, for each id. No other names between "a" and "b" are permitted. Real sequence length is at least 5. The expected result for the sample data is

  a  b
1 3 10
2 5  7
3 9 11

There is a similar question Finding rows in R dataframe where a column value follows a sequence. However, it is not clear to me how to deal with "id" column in my case. Is it a way to solve the problem using "dplyr"?

like image 941
dmitriy873 Avatar asked Dec 13 '16 21:12

dmitriy873


2 Answers

library(dplyr); library(tidyr)

# sort data frame by id and time
df %>% arrange(id, time) %>% group_by(id) %>% 

       # get logical vector indicating rows of a followed by b and mark each pair as unique
       # by cumsum
       mutate(ab = name == "a" & lead(name) == "b", g = cumsum(ab)) %>% 

       # subset rows where conditions are met
       filter(ab | lag(ab)) %>% 

       # reshape your data frame to wide format
       select(-ab) %>% spread(name, time)


#Source: local data frame [3 x 4]
#Groups: id [2]

#     id     g     a     b
#* <int> <int> <int> <int>
#1     1     1     3    10
#2     2     1     5     7
#3     2     2     9    11

If length of the sequence is larger than two, then you will need to check multiple lags, and one option of this is to use shift function(which accepts a vector as lag/lead steps) from data.table combined with Reduce, say if we need to check pattern abb:

library(dplyr); library(tidyr); library(data.table)
pattern = c("a", "b", "b")
len_pattern = length(pattern)

df %>% arrange(id, time) %>% group_by(id) %>% 

       # same logic as before but use Reduce function to check multiple lags condition
       mutate(ab = Reduce("&", Map("==", shift(name, n = 0:(len_pattern - 1), type = "lead"), pattern)), 
              g = cumsum(ab)) %>% 

       # use reduce or to subset sequence rows having the same length as the pattern
       filter(Reduce("|", shift(ab, n = 0:(len_pattern - 1), type = "lag"))) %>% 

       # make unique names
       group_by(g, add = TRUE) %>% mutate(name = paste(name, 1:n(), sep = "_")) %>% 

       # pivoting the table to wide format
       select(-ab) %>% spread(name, time) 

#Source: local data frame [1 x 5]
#Groups: id, g [1]

#     id     g   a_1   b_2   b_3
#* <int> <int> <int> <int> <int>
#1     1     1     3    10    12
like image 164
Psidom Avatar answered Oct 29 '22 23:10

Psidom


It's somewhat convoluted, but how about a rolling join?

library(data.table)
setorder(setDT(df), id, time)

df[ name == "b" ][
    df[, if(name == "a") .(time = last(time)), by=.(id, name, r = rleid(id,name))],
    on = .(id, time),
    roll = -Inf,
    nomatch = 0,
    .(a = i.time, b = x.time)
]

   a  b
1: 3 10
2: 5  7
3: 9 11
like image 6
Frank Avatar answered Oct 29 '22 22:10

Frank