
Single JS regex for matching repeating substrings?

Say I have a string, like:

where is mummy where is daddy

I want to replace any set of repeating substrings with empty strings - so in this case the where and is elements would be removed and the resulting string would be:

mummy daddy

I was wondering if there was any single regex that could achieve this. The regex I tried (which doesn't work) looked like the following:

/(\w+)(?=.*)\1/gi

Here the first capture group matches a run of word characters, (?=.*) is a positive lookahead over any characters that follow (intended to keep those characters out of the match), and \1 is a backreference to the first captured substring.

Any help would be great. Thanks in advance!

asked Mar 21 '16 09:03 by jonny



1 Answer

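Your regex does not work for two reasons. The \w+ is not restricted with word boundaries, so it can match part of a word. And because the (?=.*) lookahead consumes no characters, the \1 backreference is attempted immediately after the "original" match, where the repeated word almost never sits.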
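
To make that failure concrete, here is a small sketch that runs the attempted pattern against the sample string from the question. The variable names are only illustrative. Since \1 has to match right next to the captured text, the only hits are doubled letters inside words:

```javascript
// The pattern from the question, unchanged.
var attempted = /(\w+)(?=.*)\1/gi;
var str = "where is mummy where is daddy";

// (?=.*) is zero-width, so \1 is tried directly after the captured text.
// Only adjacent repeats can match: the "mm" in "mummy" and the "dd" in "daddy".
var adjacent = str.match(attempted);
var stripped = str.replace(attempted, "");

console.log(adjacent); // [ 'mm', 'dd' ]
console.log(stripped); // where is muy where is day
```

The words "where" and "is" are never matched, because their second occurrences are not adjacent to the first.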

First collect the words that are duplicated, then build a RegExp that matches every occurrence of them together with any preceding whitespace (or punctuation, etc.; adjust the pattern as needed), and replace the matches with an empty string:

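var re = /(\b\w+\b)(?=.*\b\1\b)/gi;   // Find whole words that occur again later in the string
var str = 'where is mummy where is daddy';
var patts = str.match(re) || [];      // match() returns null when no word repeats
var res = patts.length
  // Build the pattern that removes every occurrence of the found words;
  // trim() drops the space left in front of the first surviving word
  ? str.replace(new RegExp("\\s*\\b(?:" + patts.join("|") + ")\\b", "gi"), "").trim()
  : str;
document.body.innerHTML = res;        // mummy daddy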

The first pattern is (\b\w+\b)(?=.*\b\1\b):

  • (\b\w+\b) - match and capture into Group 1 a whole word consisting of [A-Za-z0-9_] characters
  • (?=.*\b\1\b) - make sure this value captured into Group 1 is repeated somewhere to the right of the current location (not necessarily right after the word). If the string is multiline, use [\s\S] instead of the dot. To make sure we match original and dupe words as whole words, \b word boundaries should be used around both \w+ and \1.
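
As a quick check of what this first pattern collects, the sketch below applies it to three inputs. The test strings other than the one from the question are made up for illustration:

```javascript
var re = /(\b\w+\b)(?=.*\b\1\b)/gi;

// Every occurrence except the last one of a repeated word is matched.
var once = "where is mummy where is daddy".match(re);
// A word that appears three times is reported twice; the duplicate entry
// is harmless once the words are joined into an alternation.
var thrice = "a b a c a".match(re);
// With the g flag, match() returns null (not an empty array) when nothing repeats.
var none = "mummy daddy".match(re);

console.log(once);   // [ 'where', 'is' ]
console.log(thrice); // [ 'a', 'a' ]
console.log(none);   // null
```

The null case matters for the second step: calling join() on it would throw, so it is worth guarding against.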

The second pattern will look different each time, but in your current scenario, it will be /\s*\b(?:where|is)\b/gi:

  • \s* - zero or more whitespace characters
  • \b(?:where|is)\b - a whole word from the alternation group (?:...|...): either where or is (case-insensitive due to /i modifier).
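
Putting both steps together, a minimal sketch of a reusable helper could look like the following. The function name removeRepeatedWords is hypothetical, and the use of [\s\S] (for multiline input) and trim() (to drop the leftover leading space) are assumptions layered on top of the approach described above:

```javascript
// Hypothetical helper; the name and the trim() call are illustrative choices.
function removeRepeatedWords(str) {
  // [\s\S] instead of the dot lets the lookahead cross line breaks.
  var findRepeated = /(\b\w+\b)(?=[\s\S]*\b\1\b)/gi;
  var repeated = str.match(findRepeated);
  if (repeated === null) {
    return str; // no word repeats, so there is nothing to remove
  }
  // The matches contain only [A-Za-z0-9_], so they need no regex escaping.
  var remover = new RegExp("\\s*\\b(?:" + repeated.join("|") + ")\\b", "gi");
  return str.replace(remover, "").trim();
}

console.log(removeRepeatedWords("where is mummy where is daddy")); // mummy daddy
console.log(removeRepeatedWords("no repeats here"));               // no repeats here
```

If the words could contain characters outside \w (after adjusting the first pattern), they would have to be escaped before being joined into the alternation.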

answered Oct 24 '22 18:10 by Wiktor Stribiżew