I have a large text that contains expressions such as "aaaahahahahaha that was a good joke". After processing, I want the "aaaahahahahaha" to disappear or, at the very least, to be reduced to simply "ha".
At the moment, I am using this:
gsub('(.+?)\\1', '', str)
This works when the repeated pattern is at the beginning of the sentence, but not when it is located anywhere else. So:
str <- "aaaahahahahaha that was a good joke"
gsub('(.+?)\\1', '', str)
#[1] "ha that was a good joke"
But
str <- "that was aaaahahahahaha a good joke"
gsub('(.+?)\\1', '', str)
#[1] "that was aaaahahahahaha a good joke"
This question might be related to this one: "find repeated pattern in python", but I can't find the equivalent in R.
I assume this is very simple and that I am missing something trivial, but regular expressions are not my strength and I have already tried a bunch of things that did not work, so I was wondering if someone could help me. The question is: how do I find, and substitute, repeated patterns in a character string in R?
Thanks in advance for your time.
\b(\S+?)\1\S*\b
Use this. See demo:
https://regex101.com/r/sJ9gM7/46
For R, use \\b(\\S+?)\\1\\S*\\b with the perl=TRUE option.
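As a rough sketch of how this could be applied to the two strings from the question (the variable names and the whitespace clean-up step are my own additions, not part of the pattern above):

```r
# Pattern from the answer: a whitespace-delimited token that begins with an
# immediately repeated unit ("aa...", "haha...", and so on).
pattern <- "\\b(\\S+?)\\1\\S*\\b"

x <- c("aaaahahahahaha that was a good joke",
       "that was aaaahahahahaha a good joke")

# Option 1: drop the token entirely. perl = TRUE selects the PCRE engine.
removed <- gsub(pattern, "", x, perl = TRUE)

# Assumed extra step: removing a token leaves stray spaces behind,
# so collapse runs of spaces and trim the ends.
cleaned <- trimws(gsub(" +", " ", removed))
print(cleaned)
#[1] "that was a good joke" "that was a good joke"

# Option 2: replace the token with a literal "ha" instead.
# Note this is a fixed string; the captured group here is "a", not "ha".
collapsed <- gsub(pattern, "ha", x, perl = TRUE)
print(collapsed)
#[1] "ha that was a good joke" "that was ha a good joke"
```

One caveat worth keeping in mind: the pattern matches any token that starts with a doubled unit, so an ordinary word such as "aardvark" would be removed too. A word like "good" is left alone because the doubled letters are not at the start of the token.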