Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Select rows from a dataframe based on a pattern formed by consecutive values in a column

Tags:

I'm average at R, and I would like some help with the following operation.

Let's say I have the following dataframe:

    >df
    ID   Label
    P1   M
    P1   S
    P2   M
    P2   M
    P2   S
    P3   M
    P3   S
    P3   M
    P4   S
    P4   M
    P5   M
    P5   M
    P5   S

I want to be able to select rows that occur in a particular sequence of the variable Label with respect to each ID.

For a pattern "MS", the expected output would be

    ID   Label
    P1   M
    P1   S
    P2   M
    P2   S
    P3   M
    P3   S

and for a pattern "MMS", the expected output would be

    ID   Label
    P2   M
    P2   M
    P2   S
    P5   M
    P5   M
    P5   S

and for pattern "SM", the expected output is:

    ID   Label
    P3   S
    P3   M
    P4   S
    P4   M

Please consider the fact that the data I am working on has a lot of rows and the solution I need to build needs to work for patterns of arbitrary lengths. (ex. "MSS", "SM", "MMSSMS", etc). I humbly request for your assistance.

Edit: I have updated this question (the example dataframe and the output example of pattern "MMS". I want to add that I want the pattern matching to happen after grouping the data using ID variable so that patterns can be found in the groups of data grouped by ID. Sorry for not being clear the first time.

Final Edit: Answers from @akrun, @boski, and @tmfmnk are working for me. @boski's and @akrun's solutions were faster in execution time (~2-10 seconds on 400k rows of data) compared to @tmfmnk's solution (~29 seconds on 400k rows of data). I advise the reader to refer all three of these solutions.

like image 945
jerome Avatar asked Feb 18 '19 14:02

jerome


2 Answers

One option would be to compare the lead values and get the index grouped by 'ID'

library(data.table)
i1 <- unique(setDT(df)[, lapply(which(Reduce(`&`, 
  Map(`==`, shift(Label, n = 0:2, type = "lead"), c("M", "M", "S")))), 
       function(i) .I[i:(i+2)]) , by = ID]$V1)
df[i1]
#    ID Label
#1: P2     M
#2: P2     M
#3: P2     S
#4: P5     M
#5: P5     M
#6: P5     S

data

df <- structure(list(ID = c("P1", "P1", "P2", "P2", "P2", "P3", "P3", 
"P3", "P4", "P4", "P5", "P5", "P5"), Label = c("M", "S", "M", 
"M", "S", "M", "S", "M", "S", "M", "M", "M", "S")), 
class = "data.frame", row.names = c(NA, -13L))
like image 147
akrun Avatar answered Sep 29 '22 08:09

akrun


You can try using gregexpr(). First paste all labels and find the start position of the pattern you are looking for.

> df
   ID Label
1  P1     M
2  P1     S
3  P2     M
4  P2     M
5  P2     S
6  P3     M
7  P3     S
8  P3     S
9  P4     S
10 P4     M
11 P5     M
12 P5     M
13 P5     S

Edit

My previous solution did not retrieve the whole pattern (just the start).

pattern="SM"
starts=gregexpr(pattern=pattern,paste(df$Label,collapse=""))[[1]]
positions=as.vector(sapply(starts,function(x){ 
  s=seq(x,x+nchar(pattern)-1)
  if (all(df$ID[s]==df$ID[x])){
    return(s)
  } else {return(rep(NA,nchar(pattern)))}
  }))
positions=positions[which(!is.na(positions))]

df[positions,]
df[positions,]
   ID Label
1  P1     M
2  P1     S
4  P2     M
5  P2     S
6  P3     M
7  P3     S
12 P5     M
13 P5     S

pattern="MMS"
   ID Label
3  P2     M
4  P2     M
5  P2     S
11 P5     M
12 P5     M
13 P5     S

pattern="SM"
   ID Label
9  P4     S
10 P4     M
like image 28
boski Avatar answered Sep 29 '22 06:09

boski