Remove duplicates within consecutive runs of characters

Question

I have strings containing lots of duplicates, like this:

tst <- c("C>C>C>B>B>B>B>C>C>*>*>*>*>*>C", "A>A>A", "*>B>B", 
     "A>A>A>A>A>*>A>A>A>*>*>*>*>A>A", "*>C>C", "A")

I'd like to remove all consecutive duplicated upper-case and "*" characters, so the expected result is this:

[1] "CBC*C" "A"     "*B"    "A*A*A" "*C"    "A"

For the first string, I've managed to extract the duplicated characters:

library(stringr)
unlist(str_extract_all(gsub(">", "", tst[1]), "(.)(?=\\1)"))
[1] "C" "C" "B" "B" "B" "C" "*" "*" "*" "*"

But I'm somewhat stuck here. My hunch is that the function which(), which returns indices, might help, but I don't know how to apply it in this case.

Any ideas?

EDIT:

I wasn't that far from the solution myself - just using a negative lookahead (instead of the positive lookahead) does the trick:

str_extract_all(gsub(">", "", tst), "(.)(?!\\1)")
[[1]]
[1] "C" "B" "C" "*" "C"

[[2]]
[1] "A"

[[3]]
[1] "*" "B"

[[4]]
[1] "A" "*" "A" "*" "A"

[[5]]
[1] "*" "C"

[[6]]
[1] "A"
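
The result above is a list of single characters, while the expected output is a character vector, so the matches still need to be pasted back together per element. A minimal sketch of that step, assuming base R's regmatches()/gregexpr() with perl = TRUE as a stand-in for str_extract_all() (my substitution, so the example needs no packages); the names flat, kept and res are only illustrative:

```r
tst <- c("C>C>C>B>B>B>B>C>C>*>*>*>*>*>C", "A>A>A", "*>B>B",
         "A>A>A>A>A>*>A>A>A>*>*>*>*>A>A", "*>C>C", "A")

# Drop the separators first
flat <- gsub(">", "", tst, fixed = TRUE)

# Keep every character that is NOT followed by a copy of itself,
# i.e. the last character of each run (negative lookahead needs perl = TRUE)
kept <- regmatches(flat, gregexpr("(.)(?!\\1)", flat, perl = TRUE))

# Paste the kept characters of each element back into one string
res <- vapply(kept, paste, character(1), collapse = "")
print(res)
# [1] "CBC*C" "A"     "*B"    "A*A*A" "*C"    "A"
```

With stringr, the same collapse step should work on the list returned by str_extract_all().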

akrun · Accepted Answer

We can use gsub

gsub("([A-Z*]>)\\1+", "\\1", tst[1])
#[1] "C>B>C>*>C"

To get the result without the separators, remove the >

gsub(">", "", gsub("([A-Z*]>)\\1+", "\\1", tst[1]), fixed = TRUE)
#[1] "CBC*C"

Based on the OP's comments below, maybe remove the > first and then collapse each run of a repeated character

gsub("(.)\\1+", "\\1", gsub(">", "", tst))
#[1] "CBC*C" "A"     "*B"    "A*A*A" "*C"    "A"
gsub("(.)\\1+", "\\1", gsub(">", "", "A>"))
#[1] "A"
gsub("(.)\\1+", "\\1", gsub(">", "", "A>A"))
#[1] "A"
gsub("(.)\\1+", "\\1", gsub(">", "", "A>A>A>A"))
#[1] "A"
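
For comparison, the same collapsing can be done without a backreference by splitting on > and using run-length encoding. This is a different technique from the gsub calls above, shown only as a sketch; the helper name dedupe_runs is hypothetical:

```r
tst <- c("C>C>C>B>B>B>B>C>C>*>*>*>*>*>C", "A>A>A", "*>B>B",
         "A>A>A>A>A>*>A>A>A>*>*>*>*>A>A", "*>C>C", "A")

# Hypothetical helper: collapse each run of identical tokens to one token
dedupe_runs <- function(x, sep = ">") {
  tokens <- strsplit(x, sep, fixed = TRUE)
  vapply(tokens, function(tok) {
    # rle() returns one value per consecutive run
    paste(rle(tok)$values, collapse = "")
  }, character(1))
}

out <- dedupe_runs(tst)
print(out)
# [1] "CBC*C" "A"     "*B"    "A*A*A" "*C"    "A"
```

Because it works on whole tokens rather than single characters, this version should also cope with tokens longer than one character, which the "(.)\\1+" pattern does not.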

Tags: regex, indexing, r

Chris Ruehlemann
