I would like to implement a function in R that removes repeated characters in a string. For instance, say my function is named removeRS; it is supposed to work this way:
removeRS('Buenaaaaaaaaa Suerrrrte')
Buena Suerte
removeRS('Hoy estoy tristeeeeeee')
Hoy estoy triste
My function is going to be used with strings written in Spanish, where it is not common (or at least not correct) to find words with three or more successive identical vowels. I am not concerned with the sentiment that such repetition might convey. There are words with two successive identical consonants (especially ll and rr), but the function can ignore those cases.
So, to sum up, this function should replace any letter that appears at least three times in a row with a single occurrence of that letter. In one of the examples above, aaaaaaaaa is replaced with a.
Could you give me any hints on how to carry out this task in R?
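To remove duplicates in R, you can use duplicated(), which identifies duplicate elements; unique(), which extracts the unique elements; or the dplyr package's distinct() function, which removes duplicate rows from a data frame. Note that these operate on the elements of a vector or the rows of a data frame, not on the characters within a single string.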
How do you remove one or more characters from a string in R? You can use the base R function gsub(), or str_replace() from the stringr package (str_replace_all() if every match should be replaced, since str_replace() only replaces the first one).
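As a minimal sketch of the gsub() approach just mentioned, using base R only (the sample string and variable names are made up for illustration):

```r
x <- "Hoy estoy triste"

# Remove every occurrence of one literal character.
# fixed = TRUE makes gsub() treat the pattern as plain text, not a regex.
no_spaces <- gsub(" ", "", x, fixed = TRUE)

# Remove any of several characters by listing them in a regex character class.
no_vowels <- gsub("[aeiou]", "", x)

print(no_spaces)
print(no_vowels)
```

Replacing with the empty string is what makes this a removal; any other replacement string would substitute instead.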
I have not thought about this very carefully, but here is a quick solution using backreferences in regular expressions:
gsub('([[:alpha:]])\\1+', '\\1', 'Buenaaaaaaaaa Suerrrrte')
# [1] "Buena Suerte"
The parentheses () capture a letter, \\1 refers back to that captured letter, and + means to match it one or more times. Putting these pieces together, the pattern matches a letter that occurs two or more times in a row.
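Here is a sketch of how this could be wrapped into the removeRS function from the question. The min_run argument is my own addition for illustration, not part of the solution above: with min_run = 3 the quantifier becomes {2,} instead of +, so only runs of three or more letters are collapsed and doubles such as ll and rr are left alone, which is closer to what the question describes.

```r
# Collapse runs of a repeated letter into a single letter.
# min_run is a hypothetical parameter: the shortest run length that gets collapsed.
removeRS <- function(x, min_run = 3) {
  stopifnot(min_run >= 2)
  # One captured letter followed by at least (min_run - 1) more copies of it.
  pattern <- sprintf("([[:alpha:]])\\1{%d,}", min_run - 1)
  gsub(pattern, "\\1", x)
}

removeRS("Buenaaaaaaaaa Suerrrrte")     # runs of three or more are collapsed
removeRS("Hoy estoy tristeeeeeee")
removeRS("calle perro")                 # doubles are kept with the default
removeRS("calle perro", min_run = 2)    # behaves like the \\1+ pattern
```

Because gsub() is vectorised, the same function also works on a character vector of several strings. Matching is case-sensitive, so a run such as "Aaa" is treated as "A" followed by two "a" characters.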
To handle characters other than letters, replace [[:alpha:]] with a regex matching whatever you wish to include.
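For example, a sketch of two possible substitutions: [[:alnum:]] to also collapse repeated digits, and . to collapse any repeated character at all. The input strings and variable names are invented for illustration.

```r
# Letters and digits: repeated digits are collapsed as well.
alnum_result <- gsub("([[:alnum:]])\\1+", "\\1", "Tengo 1000000 razonesss")

# Any character, including punctuation and spaces.
any_result <- gsub("(.)\\1+", "\\1", "Holaaa!!!   Adios???")

print(alnum_result)
print(any_result)
```

The first call shows why a broad class needs care: collapsing digits changes the number itself, so the class should be only as wide as the data requires.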