I'm trying to find (and replace) repeated substrings in a string.
My string can look like this:
Lorem ipsum dolor sit amet sit amet sit amet sit nostrud exercitation amit sit ullamco laboris nisi ut aliquip ex ea commodo consequat.
This should become:
Lorem ipsum dolor sit amet sit nostrud exercitation amit sit ullamco laboris nisi ut aliquip ex ea commodo consequat.
Note how "amit sit" isn't removed, since it's not repeated.
Or the string can be like this:
Lorem ipsum dolor sit amet () sit amet () sit amet () sit nostrud exercitation ullamco laboris nisi ut aliquip aliquip ex ea commodo consequat.
which should become:
Lorem ipsum dolor sit amet () sit nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
So it's not just a-z; the repeated part can also contain other (ASCII) characters. I'd be very happy if someone could help me with this.
The next step would be to match (and replace) something like this:
2 questions 3 questions 4 questions 5 questions
which would become:
2 questions
The number kept in the final output can be any of them (2, 3, 4, ...); it doesn't matter. In this last example only the numbers differ; the words are always the same.
To replace all occurrences of a substring with a new one in JavaScript, you can use either replace() or replaceAll(). With replace(), turn the substring into a regular expression and use the g flag; replaceAll() is more straightforward.
String.prototype.replace(): the replace() method returns a new string with one, some, or all matches of a pattern replaced by a replacement. The pattern can be a string or a RegExp, and the replacement can be a string or a function called for each match.
Use the replace() method to replace multiple characters in a string, e.g. str.replace(/[. _-]/g, ' '). The first parameter is a regular expression; here a character class matches any one of several characters.
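As a rough illustration of the three variants just described, here is a minimal JavaScript sketch. The helper names `escapeRegExp` and `replaceEvery` are made up for this example and are not part of any library; `replaceAll()` assumes a reasonably recent runtime.

```javascript
// Hypothetical helper: escape regex metacharacters so a plain
// substring can be used safely inside a RegExp.
function escapeRegExp(literal) {
  return literal.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: replace every occurrence of `search`
// using replace() with a global (g) regular expression.
function replaceEvery(text, search, replacement) {
  return text.replace(new RegExp(escapeRegExp(search), "g"), replacement);
}

// 1) replace() with the g flag.
const viaRegex = replaceEvery("a.b.c", ".", "-");

// 2) replaceAll() with a plain string pattern.
const viaReplaceAll = "a.b.c".replaceAll(".", "-");

// 3) A character class that matches any of ".", " ", "_" or "-".
const viaCharClass = "a.b_c-d e".replace(/[. _-]/g, " ");

console.log(viaRegex, viaReplaceAll, viaCharClass);
```

None of these handles repeated phrases on its own; that needs backreferences.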
If it helps, \1, \2, etc. refer back to earlier capture groups. So, for example, the following picks out repeated words and reduces each run to a single occurrence:
$string =~ s/\b(\w+)(?: \1\b)+/$1/g;
Repeated phrases could be handled similarly.
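The same backreference idea can be tried in JavaScript. This is only a sketch: the function name `collapseRepeatedWords` is invented here, and it handles single repeated words, not multi-word phrases.

```javascript
// Collapse runs of the same word ("aliquip aliquip") into one.
// \b keeps the match on whole words, so "this is" is left alone.
function collapseRepeatedWords(text) {
  return text.replace(/\b(\w+)(?:\s+\1\b)+/g, "$1");
}

console.log(collapseRepeatedWords("ut aliquip aliquip ex ea"));
// prints "ut aliquip ex ea"
```

Note that "sit amet sit amet" is left untouched, because no single word is immediately followed by itself.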
Interesting question. This can be solved with a single preg_replace() statement, but the length of the repeated phrase must be limited to avoid excessive backtracking. Here is a solution with a commented regex that works for the test data and fixes doubled, tripled (or repeated n times) phrases having a max length of 50 chars:
Solution to part 1:
$result = preg_replace('/
# Match a doubled "phrase" having length up to 50 chars.
( # $1: Phrase having whitespace boundaries.
(?<=\s|^) # Assert phrase preceded by ws or BOL.
\S # First char of phrase is non-whitespace.
.{0,49}? # Lazily match phrase (50 chars max).
) # End $1: Phrase
(?: # Group for one or more duplicate phrases.
\s+ # Doubled phrase separated by whitespace.
\1 # Match duplicate of phrase.
){1,} # Require one or more duplicate phrases.
/x', '$1', $text);
Note that with this solution, a "phrase" can consist of a single word, and there are legitimate cases where doubled words are valid grammar and should not be fixed. If the above solution is not the desired behavior, the regex can be easily modified to define a "phrase" as being two or more "words".
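For readers who want to experiment with the part 1 pattern outside PHP, here is a JavaScript port as a sketch. It assumes a runtime with lookbehind support, and the function name `collapseRepeatedPhrases` is made up for this example. JavaScript has no /x flag, so the comments move out of the pattern.

```javascript
// Port of the part 1 regex:
//   (?<=\s|^)      phrase must start at the beginning or after whitespace
//   (\S.{0,49}?)   $1: phrase, lazily matched, at most 50 chars
//   (?:\s+\1)+     one or more whitespace-separated duplicates of $1
function collapseRepeatedPhrases(text) {
  return text.replace(/(?<=\s|^)(\S.{0,49}?)(?:\s+\1)+/g, "$1");
}

const sample =
  "Lorem ipsum dolor sit amet () sit amet () sit amet () sit nostrud " +
  "exercitation ullamco laboris nisi ut aliquip aliquip ex ea commodo consequat.";

console.log(collapseRepeatedPhrases(sample));
```

As in the PHP version, a single doubled word counts as a phrase, so "aliquip aliquip" is reduced too.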
Edit: Modified above regex to handle any number of phrase repetitions. Also added solution to the second part of the question below.
And here is a similar solution where the phrase begins with a word of digits and the repeating phrases must also begin with a word of digits (but the digits word of each repeating phrase does not need to match the original):
Solution to part 2:
$result = preg_replace('/
# Match doubled "phrases" with wildcard digits first word.
( # $1: 1st word of phrase (digits).
\b # Anchor 1st phrase word to word boundary.
\d+ # Phrase 1st word is string of digits.
\s+ # 1st and 2nd words separated by whitespace.
) # End $1: 1st word of phrase (digits).
( # $2: Part of phrase after 1st digits word.
\S # First char of phrase is non-whitespace.
.{0,49}? # Lazily match phrase (50 chars max).
) # End $2: Part of phrase after 1st digits word.
(?: # Group for one or more duplicate phrases.
\s+ # Whitespace before the duplicate.
\d+ # Digits word of the duplicate (any number).
\s+ # Whitespace between digits word and rest.
\2 # Match duplicate of rest of phrase ($2).
){1,} # Require one or more duplicate phrases.
/x', '$1$2', $text);
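The part 2 pattern can be ported to JavaScript in the same way. Again this is a sketch, and `collapseNumberedPhrases` is a name invented for this example.

```javascript
// Port of the part 2 regex:
//   (\b\d+\s+)          $1: leading digits word plus whitespace
//   (\S.{0,49}?)        $2: rest of the phrase, at most 50 chars
//   (?:\s+\d+\s+\2)+    duplicates: any digits word followed by $2
function collapseNumberedPhrases(text) {
  return text.replace(/(\b\d+\s+)(\S.{0,49}?)(?:\s+\d+\s+\2)+/g, "$1$2");
}

console.log(collapseNumberedPhrases("2 questions 3 questions 4 questions 5 questions"));
// prints "2 questions"
```

The first number is the one that survives, which fits the requirement that any of the numbers may be kept.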