I have strings containing lots of duplicates, like this:
tst <- c("C>C>C>B>B>B>B>C>C>*>*>*>*>*>C", "A>A>A", "*>B>B",
"A>A>A>A>A>*>A>A>A>*>*>*>*>A>A", "*>C>C", "A")
I'd like to remove all consecutive duplicated upper-case and "*" characters, so the expected result is this:
[1] "CBC*C" "A" "*B" "A*A*A" "*C" "A"
I've successfully extracted the duplicated capitals:
library(stringr)
unlist(str_extract_all(gsub(">", "", tst), "(.)(?=\\1)"))
[1] "C" "C" "B" "B" "B" "C" "*" "*" "*" "*"
but am somewhat stuck here. My hunch is that the function which, which returns indices, might be of help but don't know how to implement it in this case.
Any ideas?
EDIT:
I wasn't that far from the solution myself - just using a negative lookahead (instead of the positive lookahead) does the trick:
str_extract_all(gsub(">", "", tst), "(.)(?!\\1)")
[[1]]
[1] "C" "B" "C" "*" "C"
[[2]]
[1] "A"
[[3]]
[1] "*" "B"
[[4]]
[1] "A" "*" "A" "*" "A"
[[5]]
[1] "*" "C"
[[6]]
[1] "A"
We can use gsub
gsub("([A-Z*]>)\\1+", "\\1", tst)
#[1] "C>B>C>*>C"
In order to get the second result, remove the >
gsub(">", "", gsub("([A-Z*]\\>)\\1+", "\\1", tst) ,fixed = TRUE)
#[1] "CBC*C"
Based on the OP's comments below, may be
gsub("(.)\\1+", "\\1", gsub(">", "", tst))
#[1] "CBC*C"
gsub("(.)\\1+", "\\1", gsub(">", "", "A>"))
#[1] "A"
gsub("(.)\\1+", "\\1", gsub(">", "", "A>A"))
#[1] "A"
gsub("(.)\\1+", "\\1", gsub(">", "", "A>A>A>A"))
#[1] "A"
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With