I'm average at R, and I would like some help with the following operation.
Let's say I have the following dataframe:
>df
ID Label
P1 M
P1 S
P2 M
P2 M
P2 S
P3 M
P3 S
P3 M
P4 S
P4 M
P5 M
P5 M
P5 S
I want to be able to select rows that occur in a particular sequence of the variable Label
with respect to each ID.
For a pattern "MS"
, the expected output would be
ID Label
P1 M
P1 S
P2 M
P2 S
P3 M
P3 S
and for a pattern "MMS"
, the expected output would be
ID Label
P2 M
P2 M
P2 S
P5 M
P5 M
P5 S
and for pattern "SM"
, the expected output is:
ID Label
P3 S
P3 M
P4 S
P4 M
Please consider the fact that the data I am working on has a lot of rows and the solution I need to build needs to work for patterns of arbitrary lengths. (ex. "MSS", "SM", "MMSSMS", etc). I humbly request for your assistance.
Edit: I have updated this question (the example dataframe and the output example of pattern "MMS"
. I want to add that I want the pattern matching to happen after grouping the data using ID
variable so that patterns can be found in the groups of data grouped by ID
. Sorry for not being clear the first time.
Final Edit: Answers from @akrun, @boski, and @tmfmnk are working for me. @boski's and @akrun's solutions were faster in execution time (~2-10 seconds on 400k rows of data) compared to @tmfmnk's solution (~29 seconds on 400k rows of data). I advise the reader to refer all three of these solutions.
One option would be to compare the lead
values and get the index grouped by 'ID'
library(data.table)
i1 <- unique(setDT(df)[, lapply(which(Reduce(`&`,
Map(`==`, shift(Label, n = 0:2, type = "lead"), c("M", "M", "S")))),
function(i) .I[i:(i+2)]) , by = ID]$V1)
df[i1]
# ID Label
#1: P2 M
#2: P2 M
#3: P2 S
#4: P5 M
#5: P5 M
#6: P5 S
df <- structure(list(ID = c("P1", "P1", "P2", "P2", "P2", "P3", "P3",
"P3", "P4", "P4", "P5", "P5", "P5"), Label = c("M", "S", "M",
"M", "S", "M", "S", "M", "S", "M", "M", "M", "S")),
class = "data.frame", row.names = c(NA, -13L))
You can try using gregexpr()
. First paste all labels and find the start position of the pattern you are looking for.
> df
ID Label
1 P1 M
2 P1 S
3 P2 M
4 P2 M
5 P2 S
6 P3 M
7 P3 S
8 P3 S
9 P4 S
10 P4 M
11 P5 M
12 P5 M
13 P5 S
Edit
My previous solution did not retrieve the whole pattern (just the start).
pattern="SM"
starts=gregexpr(pattern=pattern,paste(df$Label,collapse=""))[[1]]
positions=as.vector(sapply(starts,function(x){
s=seq(x,x+nchar(pattern)-1)
if (all(df$ID[s]==df$ID[x])){
return(s)
} else {return(rep(NA,nchar(pattern)))}
}))
positions=positions[which(!is.na(positions))]
df[positions,]
df[positions,]
ID Label
1 P1 M
2 P1 S
4 P2 M
5 P2 S
6 P3 M
7 P3 S
12 P5 M
13 P5 S
pattern="MMS"
ID Label
3 P2 M
4 P2 M
5 P2 S
11 P5 M
12 P5 M
13 P5 S
pattern="SM"
ID Label
9 P4 S
10 P4 M
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With