Regular Expression For Consecutive Duplicate Bigrams

Tags: regex, r, gsub

My question is a direct extension of this earlier question about detecting consecutive duplicate words (unigrams) in a string.

In the previous question,

Not that that is related

could be detected via this regex: \b(\w+)\s+\1\b
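For reference, a minimal sketch of how that unigram pattern might be applied with gsub in R. The perl = TRUE argument and the variable names are assumptions for illustration and are not part of the earlier question.

```r
# Example sentence taken from the text above
x <- "Not that that is related"

# \b(\w+)\s+\1\b matches a word immediately followed by itself;
# replacing the match with \1 keeps a single copy.
# perl = TRUE is an assumption here, so that \b, \w and \s follow PCRE semantics.
dedup_unigram <- gsub("\\b(\\w+)\\s+\\1\\b", "\\1", x, perl = TRUE)
print(dedup_unigram)
## [1] "Not that is related"
```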

Here, I want to detect consecutive duplicate bigrams (pairs of words):

are blue and then and then very bright

Ideally, I also want to know how to replace the detected duplicate with a single occurrence, so as to obtain in the end:

are blue and then very bright

(for this application, if it matters, I am using gsub in R)

asked Apr 20 '16 15:04 by Antoine


2 Answers

The point here is that in some cases, there will be repeating substrings that include shorter repeated substrings. So, to match the longer ones, you would use

(\b.+\b)\1\b

(see the regex demo). To find the shorter repeated substrings instead, I'd rely on lazy dot matching:

(\b.+?\b)\1\b

See this regex demo. The replacement string will be \1, the backreference to the text captured by the first grouping construct (...).

You need a PCRE regex to make it work, since there are documented issues with matching multiple word boundaries with gsub's default regex engine (so, add the perl=TRUE argument).

POSIX 1003.2 mode of gsub and gregexpr does not work correctly with repeated word-boundaries (e.g., pattern = "\b"). Use perl = TRUE for such matches (but that may not work as expected with non-ASCII inputs, as the meaning of ‘word’ is system-dependent).

Note that in case your repeated substrings can span across multiple lines, you can use the PCRE regex with the DOTALL modifier (?s) at the start of the pattern (so that a . could also match a newline symbol).

So, the R code would look like

gsub("(?s)(\\b.+\\b)\\1\\b", "\\1", s, perl=TRUE)

or

gsub("(?s)(\\b.+?\\b)\\1\\b", "\\1", s, perl=TRUE)

See the IDEONE demo:

text <- "are blue and then and then more and then and then more very bright"
## Lazy .+? collapses the shorter repeated substrings
gsub("(?s)(\\b.+?\\b)\\1\\b", "\\1", text, perl=TRUE)
## [1] "are blue and then more and then more very bright"
## Greedy .+ collapses the longer repeated substrings
gsub("(?s)(\\b.+\\b)\\1\\b", "\\1", text, perl=TRUE)
## [1] "are blue and then and then more very bright"
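To illustrate the DOTALL note above, here is a small sketch in which the repeated substring itself contains a newline. The input string and variable names are made up for illustration.

```r
# Hypothetical input: the repeated chunk "one\ntwo " spans a line break
multi <- "one\ntwo one\ntwo end"

# Without (?s), "." cannot match "\n", so the group cannot span the break
no_dotall <- gsub("(\\b.+?\\b)\\1\\b", "\\1", multi, perl = TRUE)

# With (?s), "." also matches "\n" and the duplicate is collapsed
with_dotall <- gsub("(?s)(\\b.+?\\b)\\1\\b", "\\1", multi, perl = TRUE)

print(identical(no_dotall, multi))  # the input is left unchanged
## [1] TRUE
print(with_dotall)
## [1] "one\ntwo end"
```

If the input never contains line breaks, the (?s) prefix makes no difference and can be dropped.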
answered Sep 29 '22 20:09 by Wiktor Stribiżew


Try the following RegEx:

(\b.+?\b)\1\b

The RegEx will capture a word boundary, followed by the data and then another word boundary. The \1 refers to what was captured and matches it again. It then checks for a word boundary at the end to prevent partial-word matches such as "a and" or "z zoo" from being selected.

As for the replacement, use \1. This contains the text from the 1st capture group (the first occurrence of the bigram), and it replaces the whole match.
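A small sketch of why the trailing \b matters, comparing the pattern with and without it in R. The "go z zoo" input and the variable names are made up for illustration, and perl = TRUE is assumed, as recommended in the other answer.

```r
# Hypothetical input where " z" is followed by " z" inside " zoo"
x <- "go z zoo"

# Without the trailing \b, " z" + " z" matches into the middle of "zoo"
without_b <- gsub("(\\b.+?\\b)\\1", "\\1", x, perl = TRUE)
print(without_b)
## [1] "go zoo"

# With the trailing \b, the repeat must end at a word boundary, so nothing matches
with_b <- gsub("(\\b.+?\\b)\\1\\b", "\\1", x, perl = TRUE)
print(with_b)
## [1] "go z zoo"

# The same pattern applied to the sentence from the question
bigram <- gsub("(\\b.+?\\b)\\1\\b", "\\1",
               "are blue and then and then very bright", perl = TRUE)
print(bigram)
## [1] "are blue and then very bright"
```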

Live Demo on Regex101

answered Sep 29 '22 20:09 by Kaspar Lee