
Remove duplicates within consecutive runs of characters

Tags: regex, indexing, r

I have strings containing lots of duplicates, like this:

tst <- c("C>C>C>B>B>B>B>C>C>*>*>*>*>*>C", "A>A>A", "*>B>B", 
     "A>A>A>A>A>*>A>A>A>*>*>*>*>A>A", "*>C>C", "A")

I'd like to remove all consecutive duplicated upper-case and "*" characters, so the expected result is this:

[1] "CBC*C" "A"     "*B"    "A*A*A" "*C"    "A"

I've successfully extracted the duplicated capitals:

library(stringr)
unlist(str_extract_all(gsub(">", "", tst), "(.)(?=\\1)"))
[1] "C" "C" "B" "B" "B" "C" "*" "*" "*" "*"

but I am somewhat stuck here. My hunch is that the function which(), which returns indices, might help, but I don't know how to apply it in this case.

Any ideas?

EDIT:

I wasn't that far from the solution myself - just using a negative lookahead (instead of the positive lookahead) does the trick:

str_extract_all(gsub(">", "", tst), "(.)(?!\\1)")
[[1]]
[1] "C" "B" "C" "*" "C"

[[2]]
[1] "A"

[[3]]
[1] "*" "B"

[[4]]
[1] "A" "*" "A" "*" "A"

[[5]]
[1] "*" "C"

[[6]]
[1] "A"
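
The list above still has to be pasted back into one string per input. A minimal sketch of that last step, written in base R rather than stringr so that it runs without extra packages (the names flat, pieces and result are illustrative and not from the original post):

```r
# Input vector from the question.
tst <- c("C>C>C>B>B>B>B>C>C>*>*>*>*>*>C", "A>A>A", "*>B>B",
         "A>A>A>A>A>*>A>A>A>*>*>*>*>A>A", "*>C>C", "A")

# Drop the ">" separators first.
flat <- gsub(">", "", tst, fixed = TRUE)

# Keep every character that is NOT followed by a copy of itself.
# perl = TRUE is needed because lookaheads are a PCRE feature.
pieces <- regmatches(flat, gregexpr("(.)(?!\\1)", flat, perl = TRUE))

# Paste each string's surviving characters back together.
result <- vapply(pieces, paste, character(1), collapse = "")
print(result)
```

With stringr, the same collapsing step should work by wrapping the str_extract_all() call in sapply(..., paste, collapse = "").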

Asked by Chris Ruehlemann, Dec 31 '22 16:12


1 Answer

We can use gsub to collapse each repeated "letter plus >" unit into a single one (shown for the first element):

gsub("([A-Z*]>)\\1+", "\\1", tst[1])
#[1] "C>B>C>*>C"

To get the expected result, remove the > afterwards:

gsub(">", "", gsub("([A-Z*]>)\\1+", "\\1", tst[1]), fixed = TRUE)
#[1] "CBC*C"

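Based on the OP's comments below, maybe strip the > first and then collapse repeated characters. This also handles a run at the end of a string, which has no trailing > and so is not fully collapsed by the pattern "([A-Z*]>)\\1+" (for example, "A>A>A" becomes "A>A"):

gsub("(.)\\1+", "\\1", gsub(">", "", tst))
#[1] "CBC*C" "A"     "*B"    "A*A*A" "*C"    "A"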
gsub("(.)\\1+", "\\1", gsub(">", "", "A>"))
#[1] "A"
gsub("(.)\\1+", "\\1", gsub(">", "", "A>A"))
#[1] "A"
gsub("(.)\\1+", "\\1", gsub(">", "", "A>A>A>A"))
#[1] "A"
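
For comparison, here is a regex-free sketch of the same idea. It is a swapped-in technique, not part of the answer above: split on > and keep one value per run with rle(). The helper name collapse_runs is hypothetical.

```r
# Input vector from the question.
tst <- c("C>C>C>B>B>B>B>C>C>*>*>*>*>*>C", "A>A>A", "*>B>B",
         "A>A>A>A>A>*>A>A>A>*>*>*>*>A>A", "*>C>C", "A")

# Hypothetical helper: split each string on ">" and keep one token per run.
collapse_runs <- function(x) {
  parts <- strsplit(x, ">", fixed = TRUE)
  # rle() reports each run of identical neighbours once in $values.
  vapply(parts, function(p) paste(rle(p)$values, collapse = ""), character(1))
}

result <- collapse_runs(tst)
print(result)
```

Note that this compares whole tokens between > separators, so it would only differ from the character-level regex if a token were longer than one character.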

Answered by akrun, Jan 14 '23 06:01