Replace repeating strings in a string

I'm trying to find (and replace) repeated strings in a string.

My string can look like this:

Lorem ipsum dolor sit amet sit amet sit amet sit nostrud exercitation amit sit ullamco laboris nisi ut aliquip ex ea commodo consequat.

This should become:

Lorem ipsum dolor sit amet sit nostrud exercitation amit sit ullamco laboris nisi ut aliquip ex ea commodo consequat.

Note how the amit sit isn't removed, since it's not repeated.

Or the string can be like this:

Lorem ipsum dolor sit amet () sit amet () sit amet () sit nostrud exercitation ullamco laboris nisi ut aliquip aliquip ex ea commodo consequat.

which should become:

Lorem ipsum dolor sit amet () sit nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.

So it's not just a-z; the string can also contain other (ASCII) chars. I'd be very happy if someone could help me with this.

The next step would be to match (and replace) something like this:

2 questions 3 questions 4 questions 5 questions

which would become:

2 questions

The number in the final output can be any of the numbers (2, 3, 4, ...); it doesn't matter. In this last example only the numbers differ; the words are always the same.

Nin asked Jul 21 '11 19:07


2 Answers

If it helps, \1, \2, etc. are used to reference previously captured groups. So, for example, the following would pick out repeated words and reduce them to a single occurrence:

$string =~ s/\b(\w+)( \1\b)+/$1/g;

Repeated phrases could be handled similarly.
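For readers working in PHP (the language of the other answer), the same backreference idea might look like the sketch below. The function name collapse_repeated_words is hypothetical, and the \b word-boundary anchors are an assumption added here so that a word is not matched as a prefix of a longer one.

```php
<?php
// Hypothetical helper (not part of the original answer): collapse runs of
// one repeated word, separated by single spaces, into a single occurrence.
function collapse_repeated_words(string $text): string
{
    // \b(\w+)     $1: one whole word, anchored at a word boundary
    // (?: \1\b)+  one or more further copies of that word, each after a space
    // Fall back to the input if preg_replace() reports an error (null).
    return preg_replace('/\b(\w+)(?: \1\b)+/', '$1', $text) ?? $text;
}

echo collapse_repeated_words('ut aliquip aliquip ex ea'), "\n";
// prints "ut aliquip ex ea"
```

This only handles single repeated words; a multi-word repeat such as "sit amet sit amet" is left untouched, which is why the phrase-based approach is needed for the question's first two samples.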

djhaskin987 answered Sep 28 '22 15:09


Interesting question. This can be solved with a single preg_replace() statement, but the length of the repeated phrase must be limited to avoid excessive backtracking. Here is a solution with a commented regex that works for the test data and collapses doubled, tripled (or n-times repeated) phrases with a max length of 50 chars:

Solution to part 1:

$result = preg_replace('/
    # Match a doubled "phrase" having length up to 50 chars.
    (            # $1: Phrase having whitespace boundaries.
      (?<=\s|^)  # Assert phrase preceded by ws or BOL.
      \S         # First char of phrase is non-whitespace.
      .{0,49}?   # Lazily match phrase (50 chars max).
    )            # End $1: Phrase
    (?:          # Group for one or more duplicate phrases.
      \s+        # Doubled phrase separated by whitespace.
      \1         # Match duplicate of phrase.
    ){1,}        # Require one or more duplicate phrases.
    /x', '$1', $text);

Note that with this solution, a "phrase" can consist of a single word, and there are legitimate cases where doubled words are valid grammar and should not be fixed. If the above solution is not the desired behavior, the regex can be easily modified to define a "phrase" as being two or more "words".
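One way the "two or more words" modification could look is sketched below. This is an illustration rather than the answer's own code: the function name collapse_repeated_phrases is hypothetical, and the trailing (?=\s|$) assertion is an extra assumption added so that the last duplicate cannot end in the middle of a longer word.

```php
<?php
// Hypothetical variant: a "phrase" must contain at least two words, so
// legitimately doubled single words (e.g. "that that") are left alone.
function collapse_repeated_phrases(string $text): string
{
    $re = '/
        (            # $1: Phrase of two or more words.
          (?<=\s|^)  # Assert phrase preceded by ws or BOL.
          \S+        # First word of phrase.
          \s+        # Whitespace between first and second word.
          \S         # First char of second word.
          .{0,49}?   # Lazily match rest of phrase.
        )            # End $1: Phrase
        (?:          # Group for one or more duplicate phrases.
          \s+        # Duplicate separated by whitespace.
          \1         # Match duplicate of phrase.
        )+           # Require one or more duplicate phrases.
        (?=\s|$)     # Added assumption: duplicate ends at ws or EOL.
        /x';
    // Fall back to the input if preg_replace() reports an error (null).
    return preg_replace($re, '$1', $text) ?? $text;
}

echo collapse_repeated_phrases('dolor sit amet sit amet sit amet sit nostrud'), "\n";
// prints "dolor sit amet sit nostrud"
echo collapse_repeated_phrases('ut aliquip aliquip ex'), "\n";
// prints "ut aliquip aliquip ex" (single doubled word is kept)
```

The trade-off is that the second sample in the question still contains "aliquip aliquip", so this variant would not reduce it; which behavior is wanted depends on the data.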

Edit: Modified above regex to handle any number of phrase repetitions. Also added solution to the second part of the question below.

And here is a similar solution where the phrase begins with a word of digits and the repeating phrases must also begin with a word of digits (but the digits word of each repeating phrase does not need to match the original):

Solution to part 2:

$result = preg_replace('/
    # Match doubled "phrases" with wildcard digits first word.
    (            # $1: 1st word of phrase (digits).
    \b           # Anchor 1st phrase word to word boundary.
    \d+          # Phrase 1st word is string of digits.
    \s+          # 1st and 2nd words separated by whitespace.
    )            # End $1:  1st word of phrase (digits).
    (            # $2: Part of phrase after 1st digits word.
      \S         # First char of phrase is non-whitespace.
      .{0,49}?   # Lazily match phrase (50 chars max).
    )            # End $2: Part of phrase after 1st digits word.
    (?:          # Group for one or more duplicate phrases.
      \s+        # Duplicate separated from previous phrase by whitespace.
      \d+        # Duplicate begins with any word of digits.
      \s+        # Digits word and rest of duplicate separated by whitespace.
      \2         # Match duplicate of rest of phrase ($2).
    ){1,}        # Require one or more duplicate phrases.
    /x', '$1$2', $text);
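
As a quick check, the part 2 pattern can be condensed onto one line and run against the sample from the question. The wrapper name collapse_numbered_phrases and the second input string are made up for illustration; the regex itself is the one above with the comments removed.

```php
<?php
// Hypothetical wrapper around the part 2 regex, written without comments.
function collapse_numbered_phrases(string $text): string
{
    // $1: digits word plus trailing whitespace, $2: rest of the phrase.
    // Each duplicate may start with a different number but must repeat $2.
    $re = '/(\b\d+\s+)(\S.{0,49}?)(?:\s+\d+\s+\2){1,}/';
    // Fall back to the input if preg_replace() reports an error (null).
    return preg_replace($re, '$1$2', $text) ?? $text;
}

echo collapse_numbered_phrases('2 questions 3 questions 4 questions 5 questions'), "\n";
// prints "2 questions"
echo collapse_numbered_phrases('10 apples 20 apples and 3 pears'), "\n";
// prints "10 apples and 3 pears"
```

Because the replacement is '$1$2', the number that survives is always the first one in the run.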

ridgerunner answered Sep 28 '22 15:09