My question is a direct extension of this earlier question about detecting consecutive duplicate words (unigrams) in a string.
In the previous question, the duplicate in `Not that that is related` could be detected via this regex: `\b(\w+)\s+\1\b`
Here, I want to detect consecutive duplicate bigrams (pairs of words):
are blue and then and then very bright
Ideally, I also want to know how to replace the detected duplicate with a single occurrence, so as to obtain in the end:
are blue and then very bright
(For this application, if it matters, I am using `gsub` in R.)
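As a quick illustration of the gap, here is a minimal sketch in R (the variable name `unigram_pattern` is only chosen here for convenience). It applies the unigram regex with `gsub`: the earlier example is fixed, but the repeated bigram is left in place:

```r
# Unigram case from the earlier question: collapse a word repeated twice in a row.
unigram_pattern <- "\\b(\\w+)\\s+\\1\\b"

gsub(unigram_pattern, "\\1", "Not that that is related", perl = TRUE)
## [1] "Not that is related"

# The same pattern leaves the repeated bigram untouched,
# because no single word is immediately followed by itself.
gsub(unigram_pattern, "\\1", "are blue and then and then very bright", perl = TRUE)
## [1] "are blue and then and then very bright"
```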
The point here is that in some cases, there will be repeating substrings that include shorter repeated substrings. So, to match the longer ones, you would use
(\b.+\b)\1\b
(see the regex demo), and to find the shorter ones, I'd rely on lazy dot matching:
(\b.+?\b)\1\b
See this regex demo. The replacement string will be `\1`, the backreference to the part captured by the grouping construct `(...)`.
You need a PCRE regex to make it work, since there are documented issues with matching multiple word boundaries with `gsub` (so, add the `perl=TRUE` argument).
POSIX 1003.2 mode of gsub and gregexpr does not work correctly with repeated word-boundaries (e.g., pattern = "\b"). Use perl = TRUE for such matches (but that may not work as expected with non-ASCII inputs, as the meaning of ‘word’ is system-dependent).
Note that if your repeated substrings can span multiple lines, you can use the PCRE regex with the DOTALL modifier `(?s)` at the start of the pattern (so that a `.` can also match a newline symbol).
So, the R code would look like
gsub("(?s)(\\b.+\\b)\\1\\b", "\\1", s, perl=TRUE)
or
gsub("(?s)(\\b.+?\\b)\\1\\b", "\\1", s, perl=TRUE)
See the IDEONE demo:
text <- "are blue and then and then more and then and then more very bright"
gsub("(?s)(\\b.+?\\b)\\1\\b", "\\1", text, perl=TRUE) ## shorter repeated substrings
## [1] "are blue and then more and then more very bright"
gsub("(?s)(\\b.+\\b)\\1\\b", "\\1", text, perl=TRUE) ## longer repeated substrings
## [1] "are blue and then and then more very bright"
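To illustrate the DOTALL note above, here is a small sketch on a hypothetical input (the string `s` and the variable names are made up for this example) in which the repeated chunk contains a line break. Under `perl = TRUE` the dot does not match a newline unless `(?s)` is set, so only the second call should remove the duplicate:

```r
# Hypothetical input: the repeated chunk "and then\n" contains a line break.
s <- "and then\nand then\nmore"

# Without (?s), "." cannot match the newline under perl = TRUE,
# so no repeated substring is found and the string is returned unchanged.
no_dotall <- gsub("(\\b.+\\b)\\1\\b", "\\1", s, perl = TRUE)

# With (?s), the capture group can include the newline,
# so the duplicated chunk is collapsed to a single copy.
with_dotall <- gsub("(?s)(\\b.+\\b)\\1\\b", "\\1", s, perl = TRUE)

identical(no_dotall, s)
## [1] TRUE
identical(with_dotall, "and then\nmore")
## [1] TRUE
```

Note that the separator (here the newline) has to be part of the repeated chunk, because `\1` must follow the captured group immediately.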
Try the following RegEx:
(\b.+?\b)\1\b
The RegEx will match at a word boundary, then capture the data up to another word boundary. The `\1` will refer to what was captured, and select that again. It will then check for a word boundary at the end to prevent `a and` and `z zoo` from being selected.
As for the replacement, use `\1`. This will contain the data from the 1st capture group (the first copy of the repeated bigram), and that first copy will be used to replace the whole match.
Live Demo on Regex101
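To see why the trailing `\b` matters, here is a minimal sketch in R on a made-up input, `"see the then"`, where the text after a word happens to start with the same letters. The first call drops the trailing boundary check and the second keeps it:

```r
text <- "see the then"

# Without the trailing \b, " the" is followed by " the" (the start of " then"),
# so the pattern matches and part of the next word is swallowed.
gsub("(\\b.+?\\b)\\1", "\\1", text, perl = TRUE)
## [1] "see then"

# With the trailing \b, the second copy must end at a word boundary,
# so nothing matches and the text is left as it is.
gsub("(\\b.+?\\b)\\1\\b", "\\1", text, perl = TRUE)
## [1] "see the then"
```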