Match and remove duplicated characters: Replace multiple (3+) non-consecutive occurrences

I am looking for a regex pattern that matches the third, fourth, and every later occurrence of each character, even when the occurrences are not consecutive.

For example, given the following string:

111aabbccxccybbzaa1

I want to replace every occurrence of a character after its second one with '-'. The output should be:

11-aabbccx--y--z---

Some regex patterns that I tried so far:

The following regex matches every character that occurs again later in the string, that is, every occurrence of a character except its last one:

(.)(?=.*\1)

This one handles runs of three or more consecutive identical characters, but not duplicates that are spread across the string:

([a-zA-Z1-9])\1{2,}
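
For reference, here is a minimal Python sketch (Python and the variable names are assumptions added for illustration, not part of the original attempts) showing what each of the two patterns does to the sample string. The first replaces every character that recurs later; the second collapses only the leading run of three 1s:

```python
import re

s = '111aabbccxccybbzaa1'

# Pattern 1: a character followed somewhere later by the same character,
# so every occurrence except the last one of each character is replaced.
all_but_last = re.sub(r'(.)(?=.*\1)', '-', s)
print(all_but_last)  # ---------x-cy-bz-a1

# Pattern 2: a run of three or more identical consecutive characters.
# Only the leading '111' qualifies, and the whole run becomes one '-'.
consecutive_runs = re.sub(r'([a-zA-Z1-9])\1{2,}', '-', s)
print(consecutive_runs)  # -aabbccxccybbzaa1
```

Neither result is the desired 11-aabbccx--y--z---.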

M-- asked Dec 11 '19 20:12


3 Answers

A non-regex R solution: split the string into single characters, replace every element of the resulting vector that has rowid >= 3 * with '-', then paste the vector back together.

x <- '111aabbccxccybbzaa1'

xsplit <- strsplit(x, '')[[1]]
xsplit[data.table::rowid(xsplit) >= 3] <- '-'
paste(xsplit, collapse = '')

# [1] "11-aabbccx--y--z---"

* rowid(x) is an integer vector in which each element gives the number of times the value of the corresponding element of x has occurred so far. So if the last element of x is 1, and that is the fourth time 1 has occurred in x, the last element of rowid(x) is 4.
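
The running count that rowid produces can be sketched in Python as follows. This is an illustrative re-implementation under the description above, not data.table's code; the helper name rowid is reused here only for readability:

```python
from collections import defaultdict

def rowid(values):
    """For each element, return how many times its value has occurred so far (1-based)."""
    seen = defaultdict(int)
    ids = []
    for v in values:
        seen[v] += 1
        ids.append(seen[v])
    return ids

x = '111aabbccxccybbzaa1'
ids = rowid(x)

# Keep the first two occurrences of each character, mask the rest.
result = ''.join('-' if n >= 3 else ch for ch, n in zip(x, ids))

print(ids)     # [1, 2, 3, 1, 2, 1, 2, 1, 2, 1, 3, 4, 1, 3, 4, 1, 3, 4, 4]
print(result)  # 11-aabbccx--y--z---
```

This makes a single pass over the string, so the cost grows linearly with its length.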

IceCreamToucan answered Oct 31 '22 23:10


You can easily accomplish this without regex:

s = '111aabbccxccybbzaa1'

for u in set(s):
    for i in [i for i in range(len(s)) if s[i]==u][2:]:
        s = s[:i]+'-'+s[i+1:]

print(s)

Result:

11-aabbccx--y--z---

How this works:

  1. for u in set(s) iterates over the unique characters in the string, in arbitrary order: {'c','a','b','y','1','z','x'}
  2. for i in ... loops over the indices gathered in step 3.
  3. [i for i in range(len(s)) if s[i]==u][2:] collects the index of every character in the string that equals u (from step 1), then slices that list from index 2 onward, dropping the first two indices if they exist.
  4. s = s[:i]+'-'+s[i+1:] rebuilds the string from the substring before index i, a '-', and the substring after index i, effectively replacing the original character. Each replacement keeps the string length unchanged, so the remaining indices stay valid.
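
To make steps 2 to 4 concrete, here is a small trace for a single character. The choice of u = 'c' and the names positions and extra are illustrative only:

```python
s = '111aabbccxccybbzaa1'
u = 'c'

# Step 3: every index where u occurs, then drop the first two.
positions = [i for i in range(len(s)) if s[i] == u]
extra = positions[2:]
print(positions)  # [7, 8, 10, 11]
print(extra)      # [10, 11]

# Step 4: overwrite each remaining occurrence with '-'.
for i in extra:
    s = s[:i] + '-' + s[i + 1:]
print(s)  # 111aabbccx--ybbzaa1
```

The full solution repeats this for every unique character, scanning the string once per character.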

ctwheels answered Oct 31 '22 23:10


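An option with gsubfn, using a proto object whose count variable (maintained by gsubfn) holds the number of matches seen so far. The loop handles one candidate character at a time, so each character gets its own count: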

library(gsubfn)
p <- proto(fun = function(this, x) if (count >=3) '-' else x)
for(i in c(0:9, letters)) x <- gsubfn(i, p, x)
x
#[1] "11-aabbccx--y--z---"

data

x <- '111aabbccxccybbzaa1'
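
The same count-per-match idea can be sketched in Python with re.sub and a replacement callback. This is an analogue written for illustration, not a translation of gsubfn internals; the helper name mask_from_third is hypothetical:

```python
import re
import string

def mask_from_third(text, ch):
    """Replace the third and later occurrences of ch in text with '-'."""
    count = 0

    def repl(match):
        nonlocal count
        count += 1  # number of matches of ch seen so far
        return '-' if count >= 3 else match.group(0)

    return re.sub(re.escape(ch), repl, text)

x = '111aabbccxccybbzaa1'
# Process one candidate character at a time, as the R loop does.
for ch in string.digits + string.ascii_lowercase:
    x = mask_from_third(x, ch)

print(x)  # 11-aabbccx--y--z---
```

Because '-' is not among the candidate characters, the hyphens inserted in earlier iterations are never counted or replaced again.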

akrun answered Nov 01 '22 00:11